File formats

CSV/TSV

When mlr is invoked with the --csv option (that is, for CSV-formatted files), key names are taken from the first line of input, the header line, and values are taken from the lines that follow. See Record-heterogeneity for how Miller handles changes of field names within a single data stream.

Miller has a record separator RS and a field separator FS, just as awk does. For TSV, use --fs tab; to convert TSV to CSV, use --ifs tab --ofs comma, and so on. (See also Reference.)
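To make the header-line convention and the role of the field separator concrete, here is a minimal Python sketch. It is not Miller's implementation. The helper names read_records and write_records are hypothetical, and the sample data is a made-up three-column excerpt in the style of data/small.

```python
import csv
import io

def read_records(text, fs=","):
    """Yield one dict per data line, keyed by the names on the header line."""
    reader = csv.reader(io.StringIO(text), delimiter=fs)
    header = next(reader)              # first line supplies the key names
    for row in reader:
        yield dict(zip(header, row))   # later lines supply the values

def write_records(records, ofs=","):
    """Serialize records with one header line, using ofs as the separator."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter=ofs, lineterminator="\n")
    wrote_header = False
    for rec in records:
        if not wrote_header:
            writer.writerow(rec.keys())
            wrote_header = True
        writer.writerow(rec.values())
    return out.getvalue()

# Roughly the idea behind --ifs tab --ofs comma: read TSV, write CSV.
tsv_text = "a\tb\ti\npan\tpan\t1\neks\tpan\t2\n"
records = list(read_records(tsv_text, fs="\t"))
csv_text = write_records(records, ofs=",")
print(records[0])
print(csv_text, end="")
```

The sketch assumes every data line has the same fields as the header line; it does not attempt what Record-heterogeneity describes.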

Pretty-printed

Miller’s pretty-print format is like CSV, but column-aligned. For example, compare

$ mlr --ocsv cat data/small
a,b,i,x,y
pan,pan,1,0.3467901443380824,0.7268028627434533
eks,pan,2,0.7586799647899636,0.5221511083334797
wye,wye,3,0.20460330576630303,0.33831852551664776
eks,wye,4,0.38139939387114097,0.13418874328430463
wye,pan,5,0.5732889198020006,0.8636244699032729

$ mlr --opprint cat data/small
a   b   i x                   y
pan pan 1 0.3467901443380824  0.7268028627434533
eks pan 2 0.7586799647899636  0.5221511083334797
wye wye 3 0.20460330576630303 0.33831852551664776
eks wye 4 0.38139939387114097 0.13418874328430463
wye pan 5 0.5732889198020006  0.8636244699032729

Note that while Miller is a line-at-a-time processor and retains input lines in memory only where necessary (e.g. for sort), pretty-print output requires it to accumulate all input lines before producing any output, so that it can compute maximum column widths. This has two consequences: (a) pretty-print output won’t work in tail -f contexts, where Miller will be waiting for an end-of-file marker which never arrives; (b) pretty-print output for large files is constrained by available machine memory.
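The reason for the accumulation can be seen in a small Python sketch of column alignment. It only illustrates the idea and is not Miller's code. The function name pprint_lines is hypothetical, and the sketch assumes all records share the keys of the first record.

```python
def pprint_lines(records):
    """Return column-aligned lines for records that share the same keys."""
    records = list(records)   # every record must be in memory before output
    if not records:
        return []
    keys = list(records[0].keys())
    # A column's width is known only after the last record has been seen.
    widths = {
        k: max([len(k)] + [len(str(r.get(k, ""))) for r in records])
        for k in keys
    }
    lines = [" ".join(k.ljust(widths[k]) for k in keys).rstrip()]
    for rec in records:
        cells = (str(rec.get(k, "")).ljust(widths[k]) for k in keys)
        lines.append(" ".join(cells).rstrip())
    return lines

sample = [
    {"a": "pan", "b": "pan", "i": "1", "x": "0.3467901443380824"},
    {"a": "eks", "b": "pan", "i": "2", "x": "0.7586799647899636"},
    {"a": "wye", "b": "wye", "i": "3", "x": "0.20460330576630303"},
]
lines = pprint_lines(sample)
print("\n".join(lines))
```

Because the widths dictionary depends on the longest value in each column, no line can be emitted until the input ends, which is why an unbounded stream such as tail -f never produces output in this mode.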

See Record-heterogeneity for how Miller handles changes of field names within a single data stream.

Key-value pairs

Miller’s default file format is DKVP, for delimited key-value pairs. Example:

$ mlr cat data/small
a=pan,b=pan,i=1,x=0.3467901443380824,y=0.7268028627434533
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797
a=wye,b=wye,i=3,x=0.20460330576630303,y=0.33831852551664776
a=eks,b=wye,i=4,x=0.38139939387114097,y=0.13418874328430463
a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729

Such data are easy to generate, e.g. in Ruby with

puts "host=#{hostname},seconds=#{t2-t1},message=#{msg}"

puts mymap.collect{|k,v| "#{k}=#{v}"}.join(',')

or print statements in various languages, e.g.

echo "type=3,user=$USER,date=$date\n";

logger.log("type=3,user=$USER,date=$date\n");

As discussed in Record-heterogeneity, Miller handles changes of field names within the same data stream, and with DKVP format this is particularly natural. One of my favorite use-cases for Miller is in application/server logs, where I log all sorts of lines such as

resource=/path/to/file,loadsec=0.45,ok=true
record_count=100,resource=/path/to/file
resource=/some/other/path,loadsec=0.97,ok=false

etc. and I just log them as needed. Then later, I can use grep, mlr --opprint group-like, etc. to analyze my logs.
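As a rough illustration of what batching heterogeneous log lines by field names involves, the following Python sketch groups records that have identical key lists. It is a simplified stand-in for the idea behind group-like, not its implementation. The helper names are hypothetical, and the log lines are written without spaces after the commas.

```python
def parse_dkvp(line, ifs=",", ips="="):
    """Split one DKVP line into a dict, preserving field order."""
    return dict(pair.split(ips, 1) for pair in line.split(ifs))

def group_like(records):
    """Batch records whose field names are identical and in the same order."""
    groups = {}
    for rec in records:
        groups.setdefault(tuple(rec.keys()), []).append(rec)
    return list(groups.values())

log_lines = [
    "resource=/path/to/file,loadsec=0.45,ok=true",
    "record_count=100,resource=/path/to/file",
    "resource=/some/other/path,loadsec=0.97,ok=false",
]
batches = group_like(parse_dkvp(line) for line in log_lines)
for batch in batches:
    for rec in batch:
        print(rec)
    print()   # blank line between batches
```

Each batch is homogeneous, so it can be printed as its own aligned table even though the original stream mixed record shapes.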

See Reference regarding how to specify separators other than the default equals-sign and comma.
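The two separators involved in DKVP, one between fields and one between each key and its value, can be sketched in Python as follows. This is an illustrative sketch rather than Miller's parser. The function names and the parameter names ifs, ips, ofs and ops are chosen here for the example, and a field that lacks the pair separator is not handled.

```python
def parse_dkvp(line, ifs=",", ips="="):
    """Parse one line of delimited key-value pairs into an ordered dict."""
    record = {}
    for pair in line.rstrip("\n").split(ifs):
        key, value = pair.split(ips, 1)   # split on the first ips only
        record[key] = value
    return record

def format_dkvp(record, ofs=",", ops="="):
    """Serialize a record back into a single DKVP line."""
    return ofs.join(f"{k}{ops}{v}" for k, v in record.items())

rec = parse_dkvp("a=pan,b=pan,i=1,x=0.3467901443380824,y=0.7268028627434533")
alt = format_dkvp(rec, ofs=";", ops=":")
print(rec["x"])
print(alt)
```

Parsing alt again with ifs=";" and ips=":" yields the same record, which shows that the separators are purely a matter of presentation.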

Index-numbered (toolkit style)

With --inidx --ifs ' ' --repifs, Miller splits lines on whitespace and assigns integer field names starting with 1. This recapitulates Unix-toolkit behavior.

Example with index-numbered output:

$ cat data/small
a=pan,b=pan,i=1,x=0.3467901443380824,y=0.7268028627434533
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797
a=wye,b=wye,i=3,x=0.20460330576630303,y=0.33831852551664776
a=eks,b=wye,i=4,x=0.38139939387114097,y=0.13418874328430463
a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729

$ mlr --onidx --ofs ' ' cat data/small
pan pan 1 0.3467901443380824 0.7268028627434533
eks pan 2 0.7586799647899636 0.5221511083334797
wye wye 3 0.20460330576630303 0.33831852551664776
eks wye 4 0.38139939387114097 0.13418874328430463
wye pan 5 0.5732889198020006 0.8636244699032729

Example with index-numbered input:

$ cat data/mydata.txt
oh say can you see
by the dawn's
early light

$ mlr --inidx --ifs ' ' --odkvp cat data/mydata.txt
1=oh,2=say,3=can,4=you,5=see
1=by,2=the,3=dawn's
1=early,2=light

Example with index-numbered input and output:

$ cat data/mydata.txt
oh say can you see
by the dawn's
early light

$ mlr --nidx --fs ' ' --repifs cut -f 2,3 data/mydata.txt
say can
the dawn's
light
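The index-numbered behavior shown in these examples can be approximated in a few lines of Python. This is a sketch under stated assumptions, not Miller's code: the function names are hypothetical, keys are kept as strings, and the repifs flag is modeled as collapsing runs of the separator and dropping empty leading or trailing fields.

```python
import re

def parse_nidx(line, ifs=" ", repifs=True):
    """Split a line on ifs and key the fields by position, starting at 1."""
    line = line.rstrip("\n")
    if repifs:
        # Treat a run of separators as a single separator.
        fields = [f for f in re.split(re.escape(ifs) + "+", line) if f != ""]
    else:
        fields = line.split(ifs)
    return {str(i): f for i, f in enumerate(fields, start=1)}

def cut(record, names):
    """Keep only the named fields, in their original order."""
    return {k: v for k, v in record.items() if k in names}

def format_nidx(record, ofs=" "):
    """Write values only; the positional keys are implied."""
    return ofs.join(record.values())

text_lines = ["oh say can you see", "by the dawn's", "early light"]
out = [format_nidx(cut(parse_nidx(line), {"2", "3"})) for line in text_lines]
print("\n".join(out))
```

The third line has no field 3, so only field 2 survives the cut, as in the cut -f 2,3 example above.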

Vertical tabular

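The vertical-tabular format (--oxtab for output) is perhaps most useful for looking at very wide and/or multi-column data which would otherwise cause line-wraps on the screen (but see also https://github.com/twosigma/ngrid for an entirely different, very powerful option). For example, compare the pretty-printed and vertical views of the same data: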

$ grep -v '^#' /etc/passwd | head -n 6 | mlr --nidx --fs : --opprint cat
1          2 3  4  5                          6               7
nobody     * -2 -2 Unprivileged User          /var/empty      /usr/bin/false
root       * 0  0  System Administrator       /var/root       /bin/sh
daemon     * 1  1  System Services            /var/root       /usr/bin/false
_uucp      * 4  4  Unix to Unix Copy Protocol /var/spool/uucp /usr/sbin/uucico
_taskgated * 13 13 Task Gate Daemon           /var/empty      /usr/bin/false
_networkd  * 24 24 Network Services           /var/networkd   /usr/bin/false

$ grep -v '^#' /etc/passwd | head -n 2 | mlr --nidx --fs : --oxtab cat
1 nobody
2 *
3 -2
4 -2
5 Unprivileged User
6 /var/empty
7 /usr/bin/false

1 root
2 *
3 0
4 0
5 System Administrator
6 /var/root
7 /bin/sh
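A vertical layout of this kind is simple to sketch in Python: one key-value pair per line, with a blank line between records. The function name format_xtab is hypothetical, and the rule of padding keys to the widest key within each record is an assumption of this sketch rather than a statement about Miller's exact alignment.

```python
def format_xtab(records):
    """Render each record as one 'key value' line per field."""
    blocks = []
    for rec in records:
        width = max((len(k) for k in rec), default=0)
        rows = [f"{k.ljust(width)} {v}" for k, v in rec.items()]
        blocks.append("\n".join(rows))
    return "\n\n".join(blocks)   # blank line separates records

passwd_lines = [
    "nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false",
    "root:*:0:0:System Administrator:/var/root:/bin/sh",
]
# Index-numbered keys, as with --nidx --fs : in the example above.
records = [
    {str(i): field for i, field in enumerate(line.split(":"), start=1)}
    for line in passwd_lines
]
text = format_xtab(records)
print(text)
```

Unlike the pretty-print format, each record here can be written as soon as it is read, since alignment never depends on later records.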