FAQ

Number one FAQ

Please use mlr --csv --rs lf for native Un*x (linefeed-terminated) CSV files.
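For example (the file and field names here are hypothetical):

$ mlr --csv --rs lf cut -f id,name mydata.csv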

No output at all

Check the line-terminators of the data, e.g. with the command-line file program. Example: for CSV, Miller’s default line terminator is CR/LF (carriage return followed by linefeed, following RFC4180). Yet if your CSV has *nix-standard LF line endings, Miller will keep reading the file looking for a CR/LF which never appears. Solution in this case: tell Miller the input has LF line-terminator, e.g. mlr --csv --rs lf {remaining arguments ...}.

Also try od -xcv and/or cat -e on your file to check for non-printable characters.
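For instance, cat -e marks each linefeed with $ and renders a carriage return as ^M, so the two cases are easy to tell apart (file names and contents here are hypothetical):

$ cat -e unix-endings.csv
a,b,c$
1,2,3$

$ cat -e dos-endings.csv
a,b,c^M$
1,2,3^M$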

Fields not selected

Check the field-separators of the data, e.g. with the command-line head program. Example: for CSV, Miller’s default field separator is comma; if your data is tab-delimited, e.g. aTABbTABc, then Miller won’t find three fields named a, b, and c but rather just one named aTABbTABc. Solution in this case: mlr --fs tab {remaining arguments ...}.
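For instance, a sketch with a hypothetical tab-delimited file (printf writes LF line endings, hence --rs lf as above):

$ printf 'a\tb\tc\n1\t2\t3\n' > tabbed.csv
$ mlr --csv --rs lf --fs tab cut -f b tabbed.csv
b
2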

Also try od -xcv and/or cat -e on your file to check for non-printable characters.

Diagnosing delimiter specifications

# Use the `file` command to see if there are CR/LF terminators (in this case,
# there are not):
$ file colours.csv
colours.csv: UTF-8 Unicode text

# Look at the file to find names of fields
$ cat colours.csv
KEY;DE;EN;ES;FI;FR;IT;NL;PL;RO;TR
masterdata_colourcode_1;Weiß;White;Blanco;Valkoinen;Blanc;Bianco;Wit;Biały;Alb;Beyaz
masterdata_colourcode_2;Schwarz;Black;Negro;Musta;Noir;Nero;Zwart;Czarny;Negru;Siyah

# Try (unsuccessfully) to extract a few fields:
$ mlr --csv cut -f KEY,PL,RO colours.csv
(no output)

# Use LF record separator (--rs lf) since the file doesn't have CR/LF line
# endings -- but still unsuccessfully:
$ mlr --csv --rs lf cut -f KEY,PL,RO colours.csv
(only blank lines appear)

# Use XTAB output format to get a sharper picture of where records/fields
# are being split:
$ mlr --icsv --irs lf --oxtab cat colours.csv
KEY;DE;EN;ES;FI;FR;IT;NL;PL;RO;TR masterdata_colourcode_1;Weiß;White;Blanco;Valkoinen;Blanc;Bianco;Wit;Biały;Alb;Beyaz

KEY;DE;EN;ES;FI;FR;IT;NL;PL;RO;TR masterdata_colourcode_2;Schwarz;Black;Negro;Musta;Noir;Nero;Zwart;Czarny;Negru;Siyah

# Using XTAB output format makes it clearer that KEY;DE;...;RO;TR is being
# treated as a single field name in the CSV header, and likewise each
# subsequent line is being treated as a single field value. This is because
# the default field separator is a comma but we have semicolons here.
# Use XTAB again with different field separator (--fs semicolon):
$ mlr --icsv --irs lf --ifs semicolon --oxtab cat colours.csv
KEY masterdata_colourcode_1
DE  Weiß
EN  White
ES  Blanco
FI  Valkoinen
FR  Blanc
IT  Bianco
NL  Wit
PL  Biały
RO  Alb
TR  Beyaz

KEY masterdata_colourcode_2
DE  Schwarz
EN  Black
ES  Negro
FI  Musta
FR  Noir
IT  Nero
NL  Zwart
PL  Czarny
RO  Negru
TR  Siyah

# Using the new field-separator, retry the cut:
$ mlr --csv --rs lf --fs semicolon cut -f KEY,PL,RO colours.csv
KEY;PL;RO
masterdata_colourcode_1;Biały;Alb
masterdata_colourcode_2;Czarny;Negru

Error-output in certain string cases

mlr put '$y = string($x); $z=$y.$y' gives (error) on numeric data such as x=123, while mlr put '$z=string($x).string($x)' does not. This is because in the former case y is computed and stored as a string, then re-read and parsed as an integer, for which string concatenation is an invalid operation.
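Here is a sketch of the difference on a one-record input (the (error) rendering is illustrative and may vary by Miller version):

$ echo 'x=123' | mlr put '$y = string($x); $z=$y.$y'
x=123,y=123,z=(error)

$ echo 'x=123' | mlr put '$z=string($x).string($x)'
x=123,z=123123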

How do I examine then-chaining?

Miller’s then-chaining is intended to function just like a Unix pipeline. You can print your data one pipeline step at a time, to see how the intermediate output of one step becomes the input to the next.

First, review the input data:

$ cat data/then-example.csv
Status,Payment_Type,Amount
paid,cash,10.00
pending,debit,20.00
paid,cash,50.00
pending,credit,40.00
paid,debit,30.00

Next, run the first step of your command, omitting anything from the first then onward:

$ mlr --icsv --rs lf --opprint count-distinct -f Status,Payment_Type data/then-example.csv
Status  Payment_Type count
paid    cash         2
pending debit        1
pending credit       1
paid    debit        1

After that, run it with the next then step included:

$ mlr --icsv --rs lf --opprint count-distinct -f Status,Payment_Type then sort -nr count data/then-example.csv
Status  Payment_Type count
paid    cash         2
pending debit        1
pending credit       1
paid    debit        1

Now if you include another then step after this, the columns Status, Payment_Type, and count will be its input.
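For instance, a sketch appending one more step (head -n 1 keeps just the first record of the sorted output):

$ mlr --icsv --rs lf --opprint count-distinct -f Status,Payment_Type then sort -nr count then head -n 1 data/then-example.csv
Status Payment_Type count
paid   cash         2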

Note, by the way, that you’ll get the same results using pipes:

$ mlr --csv --rs lf count-distinct -f Status,Payment_Type data/then-example.csv | mlr --icsv --rs lf --opprint sort -nr count
Status  Payment_Type count
paid    cash         2
pending debit        1
pending credit       1
paid    debit        1

Why doesn’t mlr cut put fields in the order I want?

Example: columns x,i,a were requested but they appear here in the order a,i,x:

$ cat data/small
a=pan,b=pan,i=1,x=0.3467901443380824,y=0.7268028627434533
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797
a=wye,b=wye,i=3,x=0.20460330576630303,y=0.33831852551664776
a=eks,b=wye,i=4,x=0.38139939387114097,y=0.13418874328430463
a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729

$ mlr cut -f x,i,a data/small
a=pan,i=1,x=0.3467901443380824
a=eks,i=2,x=0.7586799647899636
a=wye,i=3,x=0.20460330576630303
a=eks,i=4,x=0.38139939387114097
a=wye,i=5,x=0.5732889198020006

The issue is that Miller’s cut, by default, outputs cut fields in the order they appear in the input data. This design decision was made intentionally to parallel the *nix system cut command, which has the same semantics.

The solution is to use the -o option:

$ mlr cut -o -f x,i,a data/small
x=0.3467901443380824,i=1,a=pan
x=0.7586799647899636,i=2,a=eks
x=0.20460330576630303,i=3,a=wye
x=0.38139939387114097,i=4,a=eks
x=0.5732889198020006,i=5,a=wye

Why am I not seeing all possible joins occur?

For example, the right file here has nine records, and the left file should add in the hostname column, so the join output should also have nine records:

$ mlr --icsvlite --opprint cat data/join-u-left.csv
hostname              ipaddr
nadir.east.our.org    10.3.1.18
zenith.west.our.org   10.3.1.27
apoapsis.east.our.org 10.4.5.94

$ mlr --icsvlite --opprint cat data/join-u-right.csv
ipaddr    timestamp  bytes
10.3.1.27 1448762579 4568
10.3.1.18 1448762578 8729
10.4.5.94 1448762579 17445
10.3.1.27 1448762589 12
10.3.1.18 1448762588 44558
10.4.5.94 1448762589 8899
10.3.1.27 1448762599 0
10.3.1.18 1448762598 73425
10.4.5.94 1448762599 12200

$ mlr --icsvlite --opprint join -j ipaddr -f data/join-u-left.csv data/join-u-right.csv
ipaddr    hostname              timestamp  bytes
10.3.1.27 zenith.west.our.org   1448762579 4568
10.4.5.94 apoapsis.east.our.org 1448762579 17445
10.4.5.94 apoapsis.east.our.org 1448762589 8899
10.4.5.94 apoapsis.east.our.org 1448762599 12200

The issue is that Miller’s join, by default, takes input sorted (lexically ascending) by the join keys on both the left and right files. This design decision was made intentionally to parallel the *nix system join command, which has the same semantics. The benefit of this default is that the joiner program can stream through the left and right files, needing to load neither entirely into memory. The drawback, of course, is that it requires sorted input.

The solution (besides pre-sorting the input files on the join keys) is to simply use mlr join -u. This loads the left file entirely into memory (while the right file is still streamed one line at a time) and does all possible joins without requiring sorted input:

$ mlr --icsvlite --opprint join -u -j ipaddr -f data/join-u-left.csv data/join-u-right.csv
ipaddr    hostname              timestamp  bytes
10.3.1.27 zenith.west.our.org   1448762579 4568
10.3.1.18 nadir.east.our.org    1448762578 8729
10.4.5.94 apoapsis.east.our.org 1448762579 17445
10.3.1.27 zenith.west.our.org   1448762589 12
10.3.1.18 nadir.east.our.org    1448762588 44558
10.4.5.94 apoapsis.east.our.org 1448762589 8899
10.3.1.27 zenith.west.our.org   1448762599 0
10.3.1.18 nadir.east.our.org    1448762598 73425
10.4.5.94 apoapsis.east.our.org 1448762599 12200
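If you prefer the sorted-input behavior, a sketch of the pre-sorting alternative follows; the sorted-file name is hypothetical, and since the left file above already happens to be in ipaddr order, only the right file needs sorting:

$ mlr --csvlite sort -f ipaddr data/join-u-right.csv > data/join-u-right-sorted.csv
$ mlr --icsvlite --opprint join -j ipaddr -f data/join-u-left.csv data/join-u-right-sorted.csv
(all nine joined records now appear, grouped by ipaddr)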

General advice is to make sure the left file is relatively small, e.g. containing name-to-number mappings, while saving large amounts of data for the right file.

What about XML or JSON file formats?

Miller handles tabular data, which is a list of records each having fields which are key-value pairs. Miller also doesn’t require that each record have the same field names. Regardless, tabular data is a non-recursive data structure.

XML, JSON, etc. are, by contrast, all recursive or nested data structures. For example, in JSON you can represent a hash map whose values are lists of lists.
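For instance, here is a small JSON object (illustrative only) nesting a map whose values are lists of lists, which no flat list of key-value pairs can express:

{"warm": [["red","orange"],["yellow"]], "cool": [["blue"],["green","teal"]]}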

Now, you can put tabular data into these formats — since list-of-key-value-pairs is one of the things representable in XML or JSON. Example:

# DKVP
x=1,y=2
z=3

# XML
<table>
  <record>
    <field>
      <key> x </key> <value> 1 </value>
    </field>
    <field>
      <key> y </key> <value> 2 </value>
    </field>
  </record>
  <record>
    <field>
      <key> z </key> <value> 3 </value>
    </field>
  </record>
</table>

# JSON
[{"x":1,"y":2},{"z":3}]

However, a tool like Miller which handles non-recursive data is never going to be able to handle full XML/JSON semantics, only a small subset. If tabular data represented in XML/JSON/etc. is sufficiently well-structured, it may be easy to grep/sed out the data into a simpler text form, but this is a general text-processing problem.

My preference is to keep Miller doing what it does well, and to leave XML to XML tools such as xmllint and JSON to JSON tools such as jq or recs. Putting (necessarily) limited support for these file formats into Miller seems like a slippery slope wherein inadequate solutions would be delivered for an inherently unattainable goal.
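That said, when JSON input really is tabular (scalar values and, in this sketch, uniform keys across records), a JSON tool can flatten it into CSV for Miller to consume. A hedged sketch using jq, with inline sample data:

$ echo '[{"x":1,"y":2},{"x":3,"y":4}]' \
    | jq -r '(.[0] | keys_unsorted) as $k | ($k | @csv), (.[] | [.[$k[]]] | @csv)' \
    | mlr --icsv --irs lf --opprint cat
x y
1 2
3 4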