Overview
Miller handles name-indexed data using several formats: some you probably
know by name, such as CSV, TSV, and JSON, and others you are
likely already seeing and using in your structured data. Additionally, Miller gives
you the option of including comments within your data.
Examples

$ mlr --usage-data-format-examples
DKVP: delimited key-value pairs (Miller default format)
+---------------------+
| apple=1,bat=2,cog=3 | Record 1: "apple" => "1", "bat" => "2", "cog" => "3"
| dish=7,egg=8,flint  | Record 2: "dish" => "7", "egg" => "8", "3" => "flint"
+---------------------+

NIDX: implicitly numerically indexed (Unix-toolkit style)
+---------------------+
| the quick brown     | Record 1: "1" => "the", "2" => "quick", "3" => "brown"
| fox jumped          | Record 2: "1" => "fox", "2" => "jumped"
+---------------------+

CSV/CSV-lite: comma-separated values with separate header line
+---------------------+
| apple,bat,cog       |
| 1,2,3               | Record 1: "apple => "1", "bat" => "2", "cog" => "3"
| 4,5,6               | Record 2: "apple" => "4", "bat" => "5", "cog" => "6"
+---------------------+

Tabular JSON: nested objects are supported, although arrays within them are not:
+---------------------+
| {                   |
|   "apple": 1,       | Record 1: "apple" => "1", "bat" => "2", "cog" => "3"
|   "bat": 2,         |
|   "cog": 3          |
| }                   |
| {                   |
|   "dish": {         | Record 2: "dish:egg" => "7", "dish:flint" => "8", "garlic" => ""
|     "egg": 7,       |
|     "flint": 8      |
|   },                |
|   "garlic": ""      |
| }                   |
+---------------------+

PPRINT: pretty-printed tabular
+---------------------+
| apple bat cog       |
| 1     2   3         | Record 1: "apple => "1", "bat" => "2", "cog" => "3"
| 4     5   6         | Record 2: "apple" => "4", "bat" => "5", "cog" => "6"
+---------------------+

XTAB: pretty-printed transposed tabular
+---------------------+
| apple 1             | Record 1: "apple" => "1", "bat" => "2", "cog" => "3"
| bat   2             |
| cog   3             |
|                     |
| dish 7              | Record 2: "dish" => "7", "egg" => "8"
| egg  8              |
+---------------------+

Markdown tabular (supported for output only):
+-----------------------+
| | apple | bat | cog | |
| | --- | --- | --- |   |
| | 1 | 2 | 3 |         | Record 1: "apple => "1", "bat" => "2", "cog" => "3"
| | 4 | 5 | 6 |         | Record 2: "apple" => "4", "bat" => "5", "cog" => "6"
+-----------------------+

CSV/TSV/etc.
When mlr is invoked with the --csv or --csvlite option,
key names are found on the first record and values are taken from subsequent
records. This includes the case of CSV-formatted files. See
Record-heterogeneity for how Miller handles
changes of field names within a single data stream.
Miller has record separator RS and field separator FS,
just as awk does. For TSV, use --fs tab; to convert TSV to
CSV, use --ifs tab --ofs comma, etc. (See also
Reference.)
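As a rough illustration of what the header-line convention and the --ifs tab --ofs comma conversion amount to, here is a minimal Python sketch built on the standard csv module. It is not Miller's implementation; the function names tsv_to_csv and records are made up for this example.

```python
import csv
import io


def tsv_to_csv(tsv_text: str) -> str:
    """Re-emit tab-separated text as comma-separated text.

    Loosely mirrors the idea behind `--ifs tab --ofs comma`:
    only the field separator changes, the fields themselves do not.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, delimiter=",", lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


def records(csv_text: str) -> list:
    """Key names come from the first line; values from the lines after it."""
    return [dict(row) for row in csv.DictReader(io.StringIO(csv_text))]


if __name__ == "__main__":
    converted = tsv_to_csv("apple\tbat\tcog\n1\t2\t3\n4\t5\t6\n")
    print(converted, end="")
    print(records(converted))
```

Note that the csv module applies its own quoting rules, which may differ in detail from Miller's CSV and CSV-lite handling.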
The following are synonymous pairs:
DKVP: Key-value pairs
Miller’s default file format is DKVP, for delimited key-value pairs. Example:
$ mlr cat data/small
a=pan,b=pan,i=1,x=0.3467901443380824,y=0.7268028627434533
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797
a=wye,b=wye,i=3,x=0.20460330576630303,y=0.33831852551664776
a=eks,b=wye,i=4,x=0.38139939387114097,y=0.13418874328430463
a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729

puts "host=#{hostname},seconds=#{t2-t1},message=#{msg}"

puts mymap.collect{|k,v| "#{k}=#{v}"}.join(',')

echo "type=3,user=$USER,date=$date\n";
logger.log("type=3,user=$USER,date=$date\n");

resource=/path/to/file,loadsec=0.45,ok=true
record_count=100, resource=/path/to/file
resource=/some/other/path,loadsec=0.97,ok=false

NIDX: Index-numbered (toolkit style)
With --inidx --ifs ' ' --repifs, Miller splits lines on whitespace and
assigns integer field names starting with 1. This recapitulates Unix-toolkit
behavior.
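The splitting rule can be pictured with a small Python sketch. This is only an approximation for illustration: the helper name nidx_record is hypothetical, and Python's str.split() also treats tabs and leading whitespace as separators, which may not match Miller exactly.

```python
def nidx_record(line: str) -> dict:
    """Split a line on runs of whitespace and key the fields 1, 2, 3, ...

    Keys are strings, matching the way the examples above show
    "1" => "the", "2" => "quick", and so on.
    """
    # split() with no argument collapses repeated whitespace,
    # which is the role --repifs plays for a space separator.
    fields = line.split()
    return {str(position): field for position, field in enumerate(fields, start=1)}


if __name__ == "__main__":
    for text in ["the quick brown", "fox   jumped"]:
        print(nidx_record(text))
```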
Example with index-numbered output:
Tabular JSON
JSON is a format which supports arbitrarily deep nesting of
“objects” (hashmaps) and “arrays” (lists), while Miller
is a tool for handling
Single-level JSON objects

An

$ mlr --json head -n 2 then cut -f color,shape data/json-example-1.json
{ "color": "yellow", "shape": "triangle" }
{ "color": "red", "shape": "square" }

$ mlr --json --jvstack head -n 2 then cut -f color,u,v data/json-example-1.json
{
  "color": "yellow",
  "u": 0.6321695890307647,
  "v": 0.9887207810889004
}
{
  "color": "red",
  "u": 0.21966833570651523,
  "v": 0.001257332190235938
}

$ mlr --ijson --opprint stats1 -a mean,stddev,count -f u -g shape data/json-example-1.json
shape    u_mean   u_stddev u_count
triangle 0.583995 0.131184 3
square   0.409355 0.365428 4
circle   0.366013 0.209094 3

Nested JSON objects

Additionally, Miller can

$ mlr --json --jvstack head -n 2 data/json-example-2.json
{
  "flag": 1,
  "i": 11,
  "attributes": {
    "color": "yellow",
    "shape": "triangle"
  },
  "values": {
    "u": 0.632170,
    "v": 0.988721,
    "w": 0.436498,
    "x": 5.798188
  }
}
{
  "flag": 1,
  "i": 15,
  "attributes": {
    "color": "red",
    "shape": "square"
  },
  "values": {
    "u": 0.219668,
    "v": 0.001257,
    "w": 0.792778,
    "x": 2.944117
  }
}

$ mlr --ijson --opprint head -n 4 data/json-example-2.json
flag i  attributes:color attributes:shape values:u values:v values:w values:x
1    11 yellow           triangle         0.632170 0.988721 0.436498 5.798188
1    15 red              square           0.219668 0.001257 0.792778 2.944117
1    16 red              circle           0.209017 0.290052 0.138103 5.065034
0    48 red              square           0.956274 0.746720 0.775542 7.117831

$ mlr --json --jvstack head -n 1 then put '${values:uv} = ${values:u} * ${values:v}' data/json-example-2.json
{
  "flag": 1,
  "i": 11,
  "attributes": {
    "color": "yellow",
    "shape": "triangle"
  },
  "values": {
    "u": 0.632170,
    "v": 0.988721,
    "w": 0.436498,
    "x": 5.798188,
    "uv": 0.625040
  }
}

Arrays

Arrays aren’t supported in Miller’s put/filter DSL. By default, JSON arrays are read in as integer-keyed maps.
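The flattening seen in this section, where nested keys are joined with a colon and array elements are keyed by integers starting at 0, can be imitated with a short Python sketch. The function flatten below is a hypothetical helper written for illustration and is not how Miller itself is implemented.

```python
def flatten(value, prefix="", sep=":"):
    """Flatten nested dicts (and lists, as integer-keyed maps) into one level.

    {"values": {"u": 1}}   becomes {"values:u": 1}
    {"values": [12.2, 27]} becomes {"values:0": 12.2, "values:1": 27}
    """
    if isinstance(value, list):
        # Treat an array as a map whose keys are "0", "1", "2", ...
        value = {str(index): item for index, item in enumerate(value)}

    flat = {}
    for key, item in value.items():
        name = f"{prefix}{sep}{key}" if prefix else str(key)
        if isinstance(item, (dict, list)):
            flat.update(flatten(item, name, sep))
        else:
            flat[name] = item
    return flat


if __name__ == "__main__":
    print(flatten({"label": "orange", "values": [12.2, 13.8, 17.2]}))
    print(flatten({"flag": 1, "attributes": {"color": "yellow", "shape": "triangle"}}))
```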
Suppose you have arrays like this in your input data:

$ cat data/json-example-3.json
{
  "label": "orange",
  "values": [12.2, 13.8, 17.2]
}
{
  "label": "purple",
  "values": [27.0, 32.4]
}

$ mlr --ijson --oxtab cat data/json-example-3.json
label    orange
values:0 12.2
values:1 13.8
values:2 17.2

label    purple
values:0 27.0
values:1 32.4

$ mlr --json --jvstack cat data/json-example-3.json
{
  "label": "orange",
  "values": {
    "0": 12.2,
    "1": 13.8,
    "2": 17.2
  }
}
{
  "label": "purple",
  "values": {
    "0": 27.0,
    "1": 32.4
  }
}

Formatting JSON options

JSON isn’t a parameterized format, so RS, FS, PS aren’t specifiable. Nonetheless, you can do the following:
JSON non-streaming

The JSON parser Miller uses does not return until all input is parsed: in particular this means that, unlike for other file formats, Miller does not (at present) handle JSON files in tail -f contexts.

PPRINT: Pretty-printed tabular
Miller’s pretty-print format is like CSV, but column-aligned. For example, compare
$ mlr --opprint --barred cat data/small
+-----+-----+---+---------------------+---------------------+
| a   | b   | i | x                   | y                   |
+-----+-----+---+---------------------+---------------------+
| pan | pan | 1 | 0.3467901443380824  | 0.7268028627434533  |
| eks | pan | 2 | 0.7586799647899636  | 0.5221511083334797  |
| wye | wye | 3 | 0.20460330576630303 | 0.33831852551664776 |
| eks | wye | 4 | 0.38139939387114097 | 0.13418874328430463 |
| wye | pan | 5 | 0.5732889198020006  | 0.8636244699032729  |
+-----+-----+---+---------------------+---------------------+

XTAB: Vertical tabular
This is perhaps most useful for looking at very wide and/or multi-column
data which would otherwise line-wrap on the screen (but see also https://github.com/twosigma/ngrid
for an entirely different, very powerful option). Namely:
Markdown tabular
Markdown format looks like this:
$ mlr --omd cat data/small
| a | b | i | x | y |
| --- | --- | --- | --- | --- |
| pan | pan | 1 | 0.3467901443380824 | 0.7268028627434533 |
| eks | pan | 2 | 0.7586799647899636 | 0.5221511083334797 |
| wye | wye | 3 | 0.20460330576630303 | 0.33831852551664776 |
| eks | wye | 4 | 0.38139939387114097 | 0.13418874328430463 |
| wye | pan | 5 | 0.5732889198020006 | 0.8636244699032729 |

Data-conversion keystroke-savers
While you can do format conversion using mlr --icsv --ojson cat myfile.csv,
there are also keystroke-savers for this purpose, such as mlr --c2j cat myfile.csv.
For a complete list:
$ mlr --usage-format-conversion-keystroke-saver-options
As keystroke-savers for format-conversion you may use the following:
--c2t --c2d --c2n --c2j --c2x --c2p --c2m
--t2c --t2d --t2n --t2j --t2x --t2p --t2m
--d2c --d2t --d2n --d2j --d2x --d2p --d2m
--n2c --n2t --n2d --n2j --n2x --n2p --n2m
--j2c --j2t --j2d --j2n --j2x --j2p --j2m
--x2c --x2t --x2d --x2n --x2j --x2p --x2m
--p2c --p2t --p2d --p2n --p2j --p2x --p2m
The letters c t d n j x p m refer to formats CSV, TSV, DKVP, NIDX, JSON, XTAB,
PPRINT, and markdown, respectively. Note that markdown format is available for
output only.

Autodetect of line endings
Default line endings (--irs and --ors) are 'auto'
which means
Comments in data
You can include comments within your data files, and either have them ignored, or passed directly
through to the standard output as soon as they are encountered:
$ mlr --usage-comments-in-data
--skip-comments                 Ignore commented lines (prefixed by "#") within the input.
--skip-comments-with {string}   Ignore commented lines within input, with specified prefix.
--pass-comments                 Immediately print commented lines (prefixed by "#") within the input.
--pass-comments-with {string}   Immediately print commented lines within input, with specified prefix.
Notes:
* Comments are only honored at the start of a line.
* In the absence of any of the above four options, comments are data like
  any other text.
* When pass-comments is used, comment lines are written to standard output
  immediately upon being read; they are not part of the record stream.
  Results may be counterintuitive. A suggestion is to place comments at the
  start of data files.

$ cat data/budget.csv
# Asana -- here are the budget figures you asked for!
type,quantity
purple,456.78
green,678.12
orange,123.45

$ mlr --skip-comments --icsv --opprint sort -nr quantity data/budget.csv
type   quantity
green  678.12
purple 456.78
orange 123.45

$ mlr --pass-comments --icsv --opprint sort -nr quantity data/budget.csv
# Asana -- here are the budget figures you asked for!
type   quantity
green  678.12
purple 456.78
orange 123.45
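The difference between skipping and passing comments can be sketched in Python. This is a simplified model for illustration only, not Miller's code: the function sort_budget and its comments parameter are hypothetical, and the output omits the header line and column alignment that PPRINT would add.

```python
import csv


def sort_budget(lines, comments="skip", prefix="#"):
    """Sort CSV records by descending quantity, handling comment lines.

    comments="skip": lines starting with the prefix are dropped.
    comments="pass": such lines are emitted as soon as they are read,
                     outside the record stream, so they precede the
                     sorted records in the output.
    Any other value: comment lines are treated as ordinary data.
    """
    output, data = [], []
    for line in lines:
        # Comments are only recognized at the start of a line.
        if comments in ("skip", "pass") and line.startswith(prefix):
            if comments == "pass":
                output.append(line)
            continue
        data.append(line)

    rows = list(csv.DictReader(data))
    rows.sort(key=lambda row: float(row["quantity"]), reverse=True)
    output.extend(f'{row["type"]} {row["quantity"]}' for row in rows)
    return output


if __name__ == "__main__":
    budget = [
        "# Asana -- here are the budget figures you asked for!",
        "type,quantity",
        "purple,456.78",
        "green,678.12",
        "orange,123.45",
    ]
    print("\n".join(sort_budget(budget, comments="skip")))
    print("\n".join(sort_budget(budget, comments="pass")))
```

Because passed comments bypass the record stream, a comment placed in the middle of a file would still appear before the sorted records in this model, which is one way the results can be counterintuitive.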