File formats
CSV/TSV
When mlr is invoked with the --csv option,
key names are taken from the first record (the header line) and values from
each subsequent record. The same holds for CSV-like files such as TSV, given
the appropriate field separator. See
Record-heterogeneity for how Miller handles
changes of field names within a single data stream.
Miller has a record separator RS and a field separator FS,
just as awk does. For TSV, use --fs tab; to convert TSV to
CSV, use --ifs tab --ofs ',' and so on. (See also
Reference.)
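To make the header-then-values rule concrete, here is a minimal awk sketch that turns simple CSV into key-value records. It is an illustration only, not how Miller is implemented; the function name csv_to_dkvp is hypothetical, and it assumes fields contain no quotes, embedded commas, or embedded newlines.

```shell
# csv_to_dkvp (hypothetical helper): the first line supplies the key names,
# every later line supplies values for those keys.
# Assumes simple CSV: no quoted fields, no embedded commas or newlines.
csv_to_dkvp() {
  awk -F, '
    NR == 1 { for (i = 1; i <= NF; i++) key[i] = $i; next }
    {
      line = ""; sep = ""
      for (i = 1; i <= NF; i++) { line = line sep key[i] "=" $i; sep = "," }
      print line
    }
  '
}

# Usage: prints a=pan,b=pan,i=1 and then a=eks,b=pan,i=2
printf 'a,b,i\npan,pan,1\neks,pan,2\n' | csv_to_dkvp
```

Changing -F, to a tab separator would give the TSV analogue of the same idea.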
Pretty-printed
Miller’s pretty-print format is like CSV, but column-aligned. For example, compare
$ mlr --ocsv cat data/small
a,b,i,x,y
pan,pan,1,0.3467901443380824,0.7268028627434533
eks,pan,2,0.7586799647899636,0.5221511083334797
wye,wye,3,0.20460330576630303,0.33831852551664776
eks,wye,4,0.38139939387114097,0.13418874328430463
wye,pan,5,0.5732889198020006,0.8636244699032729
$ mlr --opprint cat data/small
a   b   i x                   y
pan pan 1 0.3467901443380824  0.7268028627434533
eks pan 2 0.7586799647899636  0.5221511083334797
wye wye 3 0.20460330576630303 0.33831852551664776
eks wye 4 0.38139939387114097 0.13418874328430463
wye pan 5 0.5732889198020006  0.8636244699032729
Note that while Miller is a line-at-a-time processor and retains input lines in
memory only where necessary (e.g. for sort), pretty-print output requires it to
accumulate all input lines (so that it can compute maximum column widths)
before producing any output. This has two consequences: (a) pretty-print output
won’t work in tail -f contexts, where Miller waits for
an end-of-file marker that never arrives; (b) pretty-print output for large
files is constrained by available machine memory.
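The reason for the buffering can be seen in a small awk sketch of column alignment. This is an assumption-laden illustration, not Miller's code: the function name align_columns is hypothetical, and it assumes space-separated input with no embedded spaces and the same number of fields on every line. Nothing can be printed until the END block, because the widest value in each column is not known before the last line has been read.

```shell
# align_columns (hypothetical helper): buffer every line, track the widest
# value seen in each column, and print only once input is exhausted.
align_columns() {
  awk '
    {
      for (i = 1; i <= NF; i++) {
        cell[NR, i] = $i
        if (length($i) > width[i]) width[i] = length($i)
      }
      if (NF > ncols) ncols = NF
    }
    END {
      for (r = 1; r <= NR; r++) {
        line = ""
        for (i = 1; i <= ncols; i++) {
          # Pad every column except the last to its maximum width.
          if (i < ncols) line = line sprintf("%-" width[i] "s ", cell[r, i])
          else line = line cell[r, i]
        }
        print line
      }
    }
  '
}

# Usage: the header is padded to match the widest value in each column.
printf 'a b i\npan pan 1\neks wye 10\n' | align_columns
```

Fed from tail -f, the END block would never run, which is consequence (a) above; the cell array growing with the input is consequence (b).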
See Record-heterogeneity for how Miller
handles changes of field names within a single data stream.
Key-value pairs
Miller’s default file format is DKVP, for delimited key-value pairs. Example:
$ mlr cat data/small
a=pan,b=pan,i=1,x=0.3467901443380824,y=0.7268028627434533
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797
a=wye,b=wye,i=3,x=0.20460330576630303,y=0.33831852551664776
a=eks,b=wye,i=4,x=0.38139939387114097,y=0.13418874328430463
a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729
Such data are easy to generate, e.g. in Ruby with
puts "host=#{hostname},seconds=#{t2-t1},message=#{msg}"
puts mymap.collect{|k,v| "#{k}=#{v}"}.join(',')
or print statements in various languages, e.g.
echo "type=3,user=$USER,date=$date\n";
logger.log("type=3,user=$USER,date=$date\n");
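From a shell script, the same idea can be sketched with printf. The helper name log_dkvp is hypothetical, and the sketch assumes keys and values contain no commas or newlines. printf is used rather than echo because the handling of \n by echo differs between shells.

```shell
# log_dkvp (hypothetical helper): join its key=value arguments with commas
# and print them as a single DKVP record.
# Assumes no commas or newlines inside keys or values.
log_dkvp() {
  line=""
  for pair in "$@"; do
    if [ -z "$line" ]; then line="$pair"; else line="$line,$pair"; fi
  done
  printf '%s\n' "$line"
}

# Usage: one record per call, with whatever fields are at hand.
log_dkvp type=3 "user=${USER:-unknown}" "date=$(date +%Y-%m-%d)"
```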
As discussed in Record-heterogeneity, Miller handles
changes of field names within the same data stream, and with DKVP format this is particularly
natural. One of my favorite use-cases for Miller is in application/server logs, where I log all sorts
of lines such as
resource=/path/to/file,loadsec=0.45,ok=true
record_count=100,resource=/path/to/file
resource=/some/other/path,loadsec=0.97,ok=false
etc. and I just log them as needed. Then later, I can use grep, mlr --opprint group-like, etc.
to analyze my logs.
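As a rough picture of what batching like records together means for such logs, the awk sketch below prints lines that share the same ordered list of key names next to each other. It is only a stand-in for illustration, not Miller's implementation; the name group_by_keys is hypothetical, and it assumes values contain no commas or equals signs.

```shell
# group_by_keys (hypothetical helper): emit records grouped by their ordered
# list of key names, with groups in order of first appearance.
# Assumes no commas or equals signs inside values.
group_by_keys() {
  awk -F, '
    {
      sig = ""; sep = ""
      for (i = 1; i <= NF; i++) {
        split($i, kv, "=")
        sig = sig sep kv[1]; sep = ","
      }
      if (!(sig in lines)) order[++n] = sig
      lines[sig] = lines[sig] $0 "\n"
    }
    END { for (j = 1; j <= n; j++) printf "%s", lines[order[j]] }
  '
}

# Usage: the two resource/loadsec/ok lines come out together,
# followed by the record_count line.
printf '%s\n' \
  'resource=/path/to/file,loadsec=0.45,ok=true' \
  'record_count=100,resource=/path/to/file' \
  'resource=/some/other/path,loadsec=0.97,ok=false' | group_by_keys
```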
See Reference regarding how to specify separators other than
the default equals-sign and comma.
Index-numbered (toolkit style)
With --inidx --ifs ' ' --repifs, Miller splits lines on whitespace and
assigns integer field names starting with 1. This recapitulates Unix-toolkit
behavior.
Example with index-numbered output:
$ cat data/small
a=pan,b=pan,i=1,x=0.3467901443380824,y=0.7268028627434533
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797
a=wye,b=wye,i=3,x=0.20460330576630303,y=0.33831852551664776
a=eks,b=wye,i=4,x=0.38139939387114097,y=0.13418874328430463
a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729
$ mlr --onidx --ofs ' ' cat data/small
pan pan 1 0.3467901443380824 0.7268028627434533
eks pan 2 0.7586799647899636 0.5221511083334797
wye wye 3 0.20460330576630303 0.33831852551664776
eks wye 4 0.38139939387114097 0.13418874328430463
wye pan 5 0.5732889198020006 0.8636244699032729
Example with index-numbered input:
$ cat data/mydata.txt
oh say can you see
by the dawn's
early light
$ mlr --inidx --ifs ' ' --odkvp cat data/mydata.txt
1=oh,2=say,3=can,4=you,5=see
1=by,2=the,3=dawn's
1=early,2=light
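For comparison, the splitting and numbering shown above can be approximated in plain awk. This is a sketch of the idea rather than Miller's implementation, and the function name nidx_to_dkvp is hypothetical. awk's default field splitting already treats runs of whitespace as a single separator.

```shell
# nidx_to_dkvp (hypothetical helper): split each line on runs of whitespace
# and label the fields 1, 2, 3, ... in order.
nidx_to_dkvp() {
  awk '{
    line = ""; sep = ""
    for (i = 1; i <= NF; i++) { line = line sep i "=" $i; sep = "," }
    print line
  }'
}

# Usage: same input text as data/mydata.txt above.
printf "oh say can you see\nby the dawn's\nearly light\n" | nidx_to_dkvp
```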
Example with index-numbered input and output:
$ cat data/mydata.txt
oh say can you see
by the dawn's
early light
$ mlr --nidx --fs ' ' --repifs cut -f 2,3 data/mydata.txt
say can
the dawn's
light
Vertical tabular
This is perhaps most useful for looking at very wide and/or multi-column
data which causes line-wraps on the screen (but see also https://github.com/twosigma/ngrid
for an entirely different, very powerful option). For example:
$ grep -v '^#' /etc/passwd | head -n 6 | mlr --nidx --fs : --opprint cat
1          2 3  4  5                          6               7
nobody     * -2 -2 Unprivileged User          /var/empty      /usr/bin/false
root       * 0  0  System Administrator       /var/root       /bin/sh
daemon     * 1  1  System Services            /var/root       /usr/bin/false
_uucp      * 4  4  Unix to Unix Copy Protocol /var/spool/uucp /usr/sbin/uucico
_taskgated * 13 13 Task Gate Daemon           /var/empty      /usr/bin/false
_networkd  * 24 24 Network Services           /var/networkd   /usr/bin/false
$ grep -v '^#' /etc/passwd | head -n 2 | mlr --nidx --fs : --oxtab cat
1 nobody
2 *
3 -2
4 -2
5 Unprivileged User
6 /var/empty
7 /usr/bin/false

1 root
2 *
3 0
4 0
5 System Administrator
6 /var/root
7 /bin/sh