Overview: • About Miller • Miller in 10 minutes • File formats • Miller features in the context of the Unix toolkit • Record-heterogeneity • Internationalization Using Miller: • FAQ • Sharing data with other languages • Cookbook part 1 • Cookbook part 2 • Cookbook part 3 • Data-diving examples • Manpage • Reference • Reference: Verbs • Reference: DSL • Documents by release • Installation, portability, dependencies, and testing Background: • Why? • Why C? • Why call it Miller? • How original is Miller? • Performance Repository: • Things to do • Contact information • GitHub repo |
• Fields not selected • Diagnosing delimiter specifications • How do I suppress numeric conversion? • How do I examine then-chaining? • I assigned $9 and it’s not 9th • How can I handle field names with special symbols in them? • How can I put single-quotes into strings? • Why doesn’t mlr cut put fields in the order I want? • NR is not consecutive after then-chaining • Why am I not seeing all possible joins occur? • How to rectangularize after joins with unpaired? • What about XML or JSON file formats? No output at all
Try od -xcv and/or cat -e on your file to check for non-printable characters.
If you’re using Miller version less than 5.0.0 (try
mlr --version on your system to find out), when the
line-ending-autodetect feature was introduced, please see
here.
Fields not selected
Check the field-separators of the data, e.g. with the command-line
head program. Example: for CSV, Miller’s default record
separator is comma; if your data is tab-delimited, e.g. aTABbTABc,
then Miller won’t find three fields named a, b, and
c but rather just one named aTABbTABc. Solution in this
case: mlr --fs tab {remaining arguments ...}.
Also try od -xcv and/or cat -e on your file to check for non-printable characters.
Diagnosing delimiter specifications# Use the `file` command to see if there are CR/LF terminators (in this case, # there are not): $ file data/colours.csv data/colours.csv: UTF-8 Unicode text # Look at the file to find names of fields $ cat data/colours.csv KEY;DE;EN;ES;FI;FR;IT;NL;PL;RO;TR masterdata_colourcode_1;Weiß;White;Blanco;Valkoinen;Blanc;Bianco;Wit;Biały;Alb;Beyaz masterdata_colourcode_2;Schwarz;Black;Negro;Musta;Noir;Nero;Zwart;Czarny;Negru;Siyah # Extract a few fields: $ mlr --csv cut -f KEY,PL,RO data/colours.csv (only blank lines appear) # Use XTAB output format to get a sharper picture of where records/fields # are being split: $ mlr --icsv --oxtab cat data/colours.csv KEY;DE;EN;ES;FI;FR;IT;NL;PL;RO;TR masterdata_colourcode_1;Weiß;White;Blanco;Valkoinen;Blanc;Bianco;Wit;Biały;Alb;Beyaz KEY;DE;EN;ES;FI;FR;IT;NL;PL;RO;TR masterdata_colourcode_2;Schwarz;Black;Negro;Musta;Noir;Nero;Zwart;Czarny;Negru;Siyah # Using XTAB output format makes it clearer that KEY;DE;...;RO;TR is being # treated as a single field name in the CSV header, and likewise each # subsequent line is being treated as a single field value. This is because # the default field separator is a comma but we have semicolons here. # Use XTAB again with different field separator (--fs semicolon): mlr --icsv --ifs semicolon --oxtab cat data/colours.csv KEY masterdata_colourcode_1 DE Weiß EN White ES Blanco FI Valkoinen FR Blanc IT Bianco NL Wit PL Biały RO Alb TR Beyaz KEY masterdata_colourcode_2 DE Schwarz EN Black ES Negro FI Musta FR Noir IT Nero NL Zwart PL Czarny RO Negru TR Siyah # Using the new field-separator, retry the cut: mlr --csv --fs semicolon cut -f KEY,PL,RO data/colours.csv KEY;PL;RO masterdata_colourcode_1;Biały;Alb masterdata_colourcode_2;Czarny;Negru How do I suppress numeric conversion?
TL;DR use put -S.
Within mlr put and mlr filter, the default behavior for
scanning input records is to parse them as integer, if possible, then as float,
if possible, else leave them as string:
$ cat data/scan-example-1.tbl value 1 2.0 3x hello $ mlr --pprint put '$copy = $value; $type = typeof($value)' data/scan-example-1.tbl value copy type 1 1 int 2.0 2.000000 float 3x 3x string hello hello string
$ cat data/scan-example-2.tbl value 0001 0002 0005 0005WA 0006 0007 0007WA 0008 0009 0010 $ mlr --pprint put '$copy = $value; $type = typeof($value)' data/scan-example-2.tbl value copy type 0001 1 int 0002 2 int 0005 5 int 0005WA 0005WA string 0006 6 int 0007 7 int 0007WA 0007WA string 0008 8.000000 float 0009 9.000000 float 0010 8 int
$ mlr --pprint put -S '$copy = $value; $type = typeof($value)' data/scan-example-2.tbl value copy type 0001 0001 string 0002 0002 string 0005 0005 string 0005WA 0005WA string 0006 0006 string 0007 0007 string 0007WA 0007WA string 0008 0008 string 0009 0009 string 0010 0010 string How do I examine then-chaining?
Then-chaining found in Miller is intended to function the same as Unix
pipes, but with less keystroking. You can print your data one pipeline step at
a time, to see what intermediate output at one step becomes the input to the
next step.
First, look at the input data:
$ cat data/then-example.csv Status,Payment_Type,Amount paid,cash,10.00 pending,debit,20.00 paid,cash,50.00 pending,credit,40.00 paid,debit,30.00 $ mlr --icsv --opprint count-distinct -f Status,Payment_Type data/then-example.csv Status Payment_Type count paid cash 2 pending debit 1 pending credit 1 paid debit 1 $ mlr --icsv --opprint count-distinct -f Status,Payment_Type then sort -nr count data/then-example.csv Status Payment_Type count paid cash 2 pending debit 1 pending credit 1 paid debit 1 $ mlr --csv count-distinct -f Status,Payment_Type data/then-example.csv | mlr --icsv --opprint sort -nr count Status Payment_Type count paid cash 2 pending debit 1 pending credit 1 paid debit 1 I assigned $9 and it’s not 9th
Miller records are ordered lists of key-value pairs. For NIDX format, DKVP
format when keys are missing, or CSV/CSV-lite format with
--implicit-csv-header, Miller will sequentially assign keys of the
form 1, 2, etc. But these are not integer array indices:
they’re just field names taken from the initial field ordering in the
input data.
$ echo x,y,z | mlr --dkvp cat 1=x,2=y,3=z $ echo x,y,z | mlr --dkvp put '$6="a";$4="b";$55="cde"' 1=x,2=y,3=z,6=a,4=b,55=cde $ echo x,y,z | mlr --nidx cat x,y,z $ echo x,y,z | mlr --csv --implicit-csv-header cat 1,2,3 x,y,z $ echo x,y,z | mlr --dkvp rename 2,999 1=x,999=y,3=z $ echo x,y,z | mlr --dkvp rename 2,newname 1=x,newname=y,3=z $ echo x,y,z | mlr --csv --implicit-csv-header reorder -f 3,1,2 3,1,2 z,x,y How can I handle field names with special symbols in them?
Simply surround the field names with curly braces:
$ echo 'x.a=3,y:b=4,z/c=5' | mlr put '${product.all} = ${x.a} * ${y:b} * ${z/c}' x.a=3,y:b=4,z/c=5,product.all=60 How can I put single-quotes into strings?
This is a little tricky due to the shell’s handling of quotes. For simplicity, let’s first put
an update script into a file:
$a = "It's OK, I said, then 'for now'." $ echo a=bcd | mlr put -f data/single-quote-example.mlr a=It's OK, I said, then 'for now'. $ echo a=bcd | mlr put '$a="It'\''s OK, I said, '\''for now'\''."' a=It's OK, I said, 'for now'.
Why doesn’t mlr cut put fields in the order I want?
Example: columns x,i,a were requested but they appear here in the order a,i,x:
$ cat data/small a=pan,b=pan,i=1,x=0.3467901443380824,y=0.7268028627434533 a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797 a=wye,b=wye,i=3,x=0.20460330576630303,y=0.33831852551664776 a=eks,b=wye,i=4,x=0.38139939387114097,y=0.13418874328430463 a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729 $ mlr cut -f x,i,a data/small a=pan,i=1,x=0.3467901443380824 a=eks,i=2,x=0.7586799647899636 a=wye,i=3,x=0.20460330576630303 a=eks,i=4,x=0.38139939387114097 a=wye,i=5,x=0.5732889198020006 $ mlr cut -o -f x,i,a data/small x=0.3467901443380824,i=1,a=pan x=0.7586799647899636,i=2,a=eks x=0.20460330576630303,i=3,a=wye x=0.38139939387114097,i=4,a=eks x=0.5732889198020006,i=5,a=wye NR is not consecutive after then-chaining
Given this input data:
$ cat data/small a=pan,b=pan,i=1,x=0.3467901443380824,y=0.7268028627434533 a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797 a=wye,b=wye,i=3,x=0.20460330576630303,y=0.33831852551664776 a=eks,b=wye,i=4,x=0.38139939387114097,y=0.13418874328430463 a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729 $ mlr filter '$x > 0.5' then put '$NR = NR' data/small a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797,NR=2 a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729,NR=5 $ echo x=1,y=2,z=3 | mlr put '$nf1 = NF; $u = 4; $nf2 = NF; unset $x,$y,$z; $nf3 = NF' nf1=3,u=4,nf2=5,nf3=3 $ mlr --opprint --from data/small put ' begin{ @nr1 = 0 } @nr1 += 1; $nr1 = @nr1 ' \ then filter '$x>0.5' \ then put ' begin{ @nr2 = 0 } @nr2 += 1; $nr2 = @nr2 ' a b i x y nr1 nr2 eks pan 2 0.7586799647899636 0.5221511083334797 2 1 wye pan 5 0.5732889198020006 0.8636244699032729 5 2 $ mlr filter '$x > 0.5' then cat -n data/small n=1,a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797 n=2,a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729 Why am I not seeing all possible joins occur?
This section describes behavior before Miller 5.1.0. As of 5.1.0, -u is the default.
For example, the right file here has nine records, and the left file should
add in the hostname column — so the join output should also have
9 records:
$ mlr --icsvlite --opprint cat data/join-u-left.csv hostname ipaddr nadir.east.our.org 10.3.1.18 zenith.west.our.org 10.3.1.27 apoapsis.east.our.org 10.4.5.94 $ mlr --icsvlite --opprint cat data/join-u-right.csv ipaddr timestamp bytes 10.3.1.27 1448762579 4568 10.3.1.18 1448762578 8729 10.4.5.94 1448762579 17445 10.3.1.27 1448762589 12 10.3.1.18 1448762588 44558 10.4.5.94 1448762589 8899 10.3.1.27 1448762599 0 10.3.1.18 1448762598 73425 10.4.5.94 1448762599 12200 $ mlr --icsvlite --opprint join -s -j ipaddr -f data/join-u-left.csv data/join-u-right.csv ipaddr hostname timestamp bytes 10.3.1.27 zenith.west.our.org 1448762579 4568 10.4.5.94 apoapsis.east.our.org 1448762579 17445 10.4.5.94 apoapsis.east.our.org 1448762589 8899 10.4.5.94 apoapsis.east.our.org 1448762599 12200 $ mlr --icsvlite --opprint join -u -j ipaddr -f data/join-u-left.csv data/join-u-right.csv ipaddr hostname timestamp bytes 10.3.1.27 zenith.west.our.org 1448762579 4568 10.3.1.18 nadir.east.our.org 1448762578 8729 10.4.5.94 apoapsis.east.our.org 1448762579 17445 10.3.1.27 zenith.west.our.org 1448762589 12 10.3.1.18 nadir.east.our.org 1448762588 44558 10.4.5.94 apoapsis.east.our.org 1448762589 8899 10.3.1.27 zenith.west.our.org 1448762599 0 10.3.1.18 nadir.east.our.org 1448762598 73425 10.4.5.94 apoapsis.east.our.org 1448762599 12200 How to rectangularize after joins with unpaired?
Suppose you have the following two data files:
id,code 3,0000ff 2,00ff00 4,ff0000 id,color 4,red 2,green $ mlr --csv join -j id -f data/color-codes.csv data/color-names.csv id,code,color 4,ff0000,red 2,00ff00,green $ mlr --csv join --ul -j id -f data/color-codes.csv data/color-names.csv id,code,color 4,ff0000,red 2,00ff00,green id,code 3,0000ff $ mlr --csv join --ul -j id -f data/color-codes.csv then unsparsify --fill-with "" data/color-names.csv id,code,color 4,ff0000,red 2,00ff00,green 3,0000ff, What about XML or JSON file formats?
Miller handles
# DKVP x=1,y=2 z=3 # XML <table> <record> <field> <key> x </key> <value> 1 </value> </field> <field> <key> y </key> <value> 2 </value> </field> </record> <field> <key> z </key> <value> 3 </value> </field> <record> </record> </table> # JSON [{"x":1,"y":2},{"z":3}] |