Data types
List of types
Miller's types are:
- Scalars:
  - string: such as "abcdefg", supporting concatenation, one-up indexing and slicing, and library functions. See the pages on strings and regular expressions.
  - float and int: such as 1.2 and 3, which are double-precision and 64-bit signed, respectively. See the section on arithmetic operators and math-related library functions as well as the Arithmetic page.
  - dates/times are not a separate data type; Miller uses ints for seconds since the epoch and strings for formatted date/times. See the DSL datetime/timezone functions page for more information.
  - boolean: literals true and false; results of ==, <, >, etc. See the section on boolean operators.
- Collections:
  - map: such as {"a":1,"b":[2,3,4]}, supporting key-indexing, preservation of insertion order, library functions, etc. See the Maps page.
  - array: such as ["a", 2, true], supporting one-up indexing and slicing, library functions, etc. See the Arrays page.
- Nulls and error:
  - absent-null: such as on reads of unset right-hand sides, or fall-through non-explicit return values from user-defined functions. See the null-data page.
  - JSON-null: for null in JSON files; also used in gapped auto-extend of arrays. See the null-data page.
  - error: for various results which cannot be computed, often when the input to a built-in function is of the wrong type. For example, doing strlen or substr on a non-string, sec2gmt on a non-integer, etc.
See also the list of type-checking functions for the Miller programming language.
See also Differences from other programming languages.
Type inference for literal and record data
Miller's input and output are all text-oriented: all the file formats supported by Miller are human-readable text, such as CSV, TSV, and JSON; binary formats such as BSON and Parquet are not supported (as of mid-2021). In this sense, everything is a string in and out of Miller -- be it in data files, or in DSL expressions you key in.
In the DSL, 7 is an int and 8.9 is a float, as one would expect. Likewise, on input from data files, string values representable as numbers, e.g. 1.2 or 3, are treated as float or int, respectively. If a record has x=1,y=2 then mlr put '$z=$x+$y' will produce x=1,y=2,z=3.
Numbers retain their original string representation, so if x is 1.2 on one record and 1.200 on another, they'll print out that way on output (unless of course they've been modified during processing, e.g. mlr put '$x = $x + 10').
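As an illustration of the idea only (this is not Miller's implementation), the following Python sketch models a number that remembers the text it was read from and prints that text until a computation replaces it. The class name Number and its methods are hypothetical.

```python
class Number:
    """A numeric field that remembers the text it was read from (hypothetical)."""

    def __init__(self, value, original_text=None):
        self.value = value
        self.original_text = original_text

    @classmethod
    def from_text(cls, text):
        # Keep the input text alongside the parsed value.
        return cls(float(text), original_text=text)

    def __add__(self, other):
        # A computed result has no original text to preserve.
        return Number(self.value + other)

    def __str__(self):
        # Unmodified values print exactly as they were read.
        if self.original_text is not None:
            return self.original_text
        return str(self.value)


untouched = Number.from_text("1.200")
modified = Number.from_text("1.200") + 10
```

Here str(untouched) gives back the input text with its trailing zeros, while str(modified) is formatted from the computed value.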
Generally strings, numbers, and booleans don't mix; use type-casting like string($x) to convert. However, the dot (string-concatenation) operator has been special-cased: mlr put '$z=$x.$y' does not give an error, because the dot operator has been generalized to stringify non-strings.
Examples:
mlr --csv cat data/type-infer.csv
a,b,c
1.2,3,true
4,5.6,buongiorno

mlr --icsv --oxtab --from data/type-infer.csv put '
  $d = $a . $c;
  $e = 7;
  $f = 8.9;
  $g = $e + $f;
  $ta = typeof($a);
  $tb = typeof($b);
  $tc = typeof($c);
  $td = typeof($d);
  $te = typeof($e);
  $tf = typeof($f);
  $tg = typeof($g);
' then reorder -f a,ta,b,tb,c,tc,d,td,e,te,f,tf,g,tg
a  1.2
ta float
b  3
tb int
c  true
tc string
d  1.2true
td string
e  7
te int
f  8.9
tf float
g  15.9
tg float

a  4
ta int
b  5.6
tb float
c  buongiorno
tc string
d  4buongiorno
td string
e  7
te int
f  8.9
tf float
g  15.9
tg float

On input, string values representable as boolean (e.g. "true", "false") are not automatically treated as boolean. This is because "true" and "false" are ordinary words, and auto string-to-boolean on a column consisting of words would result in some strings mixed with some booleans. Use the boolean function to coerce: for example, give the record x=1,y=2,w=false to mlr filter '$z=($x<$y) || boolean($w)'.
The same is true for inf, +inf, -inf, infinity, +infinity, -infinity, NaN, and all upper-cased/lower-cased/mixed-case variants of those. These are valid IEEE floating-point numbers, but Miller treats these as strings. You can explicitly force conversion: if x=infinity in a data file, then typeof($x) is string but typeof(float($x)) is float.
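The inference rules in this section can be summarized with a small Python sketch. It is an illustration, not Miller's code: the function names infer and to_float are hypothetical, and the sketch recognizes only plain decimal ints and floats, which is a simplifying assumption.

```python
import re

# Plain decimal forms only. Treating nothing else as numeric is an
# assumption of this sketch, not a statement of Miller's full grammar.
_INT_RE = re.compile(r"[+-]?[0-9]+")
_FLOAT_RE = re.compile(r"[+-]?([0-9]+\.[0-9]*|\.[0-9]+|[0-9]+)([eE][+-]?[0-9]+)?")


def infer(text):
    """Return (type_name, value) for one field of input text."""
    if _INT_RE.fullmatch(text):
        return "int", int(text)
    if _FLOAT_RE.fullmatch(text):
        return "float", float(text)
    # Everything else stays a string: ordinary words such as "true" and
    # "false", and IEEE names such as "inf" or "NaN" that Python's own
    # float() would otherwise accept.
    return "string", text


def to_float(text):
    """Explicit conversion, standing in for the float() call described above."""
    return "float", float(text)
```

The point of the final branch is that inference is deliberately conservative: a value such as infinity becomes a float only when conversion is requested explicitly, as with to_float("infinity") here.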
JSON parse and stringify
If you have, say, a CSV file whose columns contain strings which are well-formatted JSON, they will not be auto-converted, but you can use the json-parse verb or the json_parse DSL function:
mlr --csv --from data/json-in-csv.csv cat
id,blob
100,"{""a"":1,""b"":[2,3,4]}"
105,"{""a"":6,""b"":[7,8,9]}"

mlr --icsv --ojson --from data/json-in-csv.csv cat
[
{
  "id": 100,
  "blob": "{\"a\":1,\"b\":[2,3,4]}"
},
{
  "id": 105,
  "blob": "{\"a\":6,\"b\":[7,8,9]}"
}
]

mlr --icsv --ojson --from data/json-in-csv.csv json-parse -f blob
[
{
  "id": 100,
  "blob": {
    "a": 1,
    "b": [2, 3, 4]
  }
},
{
  "id": 105,
  "blob": {
    "a": 6,
    "b": [7, 8, 9]
  }
}
]

mlr --icsv --ojson --from data/json-in-csv.csv put '$blob = json_parse($blob)'
[
{
  "id": 100,
  "blob": {
    "a": 1,
    "b": [2, 3, 4]
  }
},
{
  "id": 105,
  "blob": {
    "a": 6,
    "b": [7, 8, 9]
  }
}
]

These have their respective operations to convert back to string: the json-stringify verb and json_stringify DSL function.
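For comparison, here is a rough Python equivalent of the parse and stringify round trip, using only the standard csv and json modules. It sketches the concept rather than how Miller does it: the helper names are hypothetical, and unlike Miller, Python's csv reader leaves id as a string.

```python
import csv
import io
import json

# Same content as data/json-in-csv.csv above, inlined so the sketch is
# self-contained.
CSV_TEXT = '''id,blob
100,"{""a"":1,""b"":[2,3,4]}"
105,"{""a"":6,""b"":[7,8,9]}"
'''


def read_records(text):
    """Read CSV text into a list of dicts; every field is a string."""
    return list(csv.DictReader(io.StringIO(text)))


def parse_field(records, name):
    """Replace the JSON text in one field with the decoded value."""
    return [{**rec, name: json.loads(rec[name])} for rec in records]


def stringify_field(records, name):
    """Encode one field back to compact JSON text."""
    return [
        {**rec, name: json.dumps(rec[name], separators=(",", ":"))}
        for rec in records
    ]


records = read_records(CSV_TEXT)
parsed = parse_field(records, "blob")
restored = stringify_field(parsed, "blob")
```

After parsing, the blob field holds a map with an array inside it. Stringifying with compact separators returns the original text, so restored equals records.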