Likewise, if you need to produce CSV that lacks its header line, you can pipe Miller’s output
to the system command sed 1d, or you can use Miller’s --headerless-csv-output option:
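For example, both of the following print only the data rows of a file (data/example.csv here is a stand-in for your own input):
$ mlr --csv cat data/example.csv | sed 1d
$ mlr --csv --headerless-csv-output cat data/example.csv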
Lastly, often we say “CSV” or “TSV” when we have
positionally indexed data in columns which are separated by commas or tabs,
respectively. In this case it’s perhaps simpler to just use NIDX
format which was designed for this purpose. (See also
File formats.) For example:
$ mlr --inidx --ifs comma --oxtab cut -f 1,3 data/headerless.csv
1 John
3 present
1 Fred
3 present
1 Alice
3 missing
1 Carol
3 present
Doing multiple joins
Suppose we want to join an input file against two different lookup tables. We can run the input file through multiple join commands in a then-chain:
$ mlr --icsv --opprint join -f multi-join/name-lookup.csv -j id then join -f multi-join/status-lookup.csv -j id multi-join/input.csv
id status name task
10 idle Bob chop
20 idle Carol puree
20 idle Carol wash
30 occupied Alice fold
10 idle Bob bake
20 idle Carol mix
10 idle Bob knead
30 occupied Alice clean
Bulk rename of fields
Suppose you want to replace spaces with underscores in your column names:
$ cat data/spaces.csv
a b c,def,g h i
123,4567,890
2468,1357,3579
9987,3312,4543
The simplest way is to use mlr rename with -g (for global
replace, not just the first occurrence of a space within each field name) and -r
for pattern-matching (rather than explicit single-column renames):
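For instance, the following replaces each space in each header name with an underscore, passing the data lines through unchanged:
$ mlr --csv rename -g -r ' ,_' data/spaces.csv
a_b_c,def,g_h_i
123,4567,890
2468,1357,3579
9987,3312,4543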
Options for dealing with duplicate rows
If your data has records appearing multiple times, you can use
mlr uniq to show and/or count the unique
records.
If you want to look at partial uniqueness — for example, show only
the first record for each unique combination of the account_id and
account_status fields — you might use mlr head -n 1 -g
account_id,account_status. Please also see mlr head.
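For example, with a hypothetical file data/dupes.csv, the first command below counts occurrences of each account_id,account_status pair, while the second keeps only each pair’s first record:
$ mlr --icsv --opprint uniq -c -g account_id,account_status data/dupes.csv
$ mlr --icsv --opprint head -n 1 -g account_id,account_status data/dupes.csv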
Data-cleaning examples
Here are some ways to use the type-checking options described in
the DSL reference.
Suppose you have a data file with inconsistent typing for a boolean
field. (Also imagine that, for the sake of discussion, we have a million-line file
rather than a four-line file, so we can’t see it all at once and some
automation is called for.)
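As a sketch of the kind of automation meant here (the field name reachable and the file name are illustrative), one pass with the put DSL can normalize a column that mixes string and numeric booleans:
$ mlr --icsv put '$reachable = ($reachable == "true" || $reachable == 1) ? 1 : 0' data/het-bool.csv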
Splitting nested fields
Suppose you have a TSV file in which the field b holds colon-delimited sub-values (for example, b=u:v:w). The nest verb can explode those values across records, or across fields:
$ mlr --tsv nest --explode --values --across-records -f b --nested-fs : data/nested.tsv
a b
x z
s u
s v
s w
$ mlr --tsv nest --explode --values --across-fields -f b --nested-fs : data/nested.tsv
a b_1
x z
a b_1 b_2 b_3
s u v w
While mlr nest is simplest, let’s also take a look at a few ways to do this using the
put DSL.
One option to split out the colon-delimited values in the b
column is to use splitnv to create an integer-indexed map and loop
over it, adding new fields to the current record:
$ mlr --from data/nested.tsv --itsv --oxtab put 'o=splitnv($b, ":"); for (k,v in o) {$["p".k]=v}'
a x
b z
p1 z
a s
b u:v:w
p1 u
p2 v
p3 w
while another is to loop over the same map from splitnv and use
it (with put -q to suppress printing the original record) to produce
multiple records:
$ mlr --from data/nested.tsv --itsv --oxtab put -q 'o=splitnv($b, ":"); for (k,v in o) {emit mapsum($*, {"b":v})}'
a x
b z
a s
b u
a s
b v
a s
b w
$ mlr --from data/nested.tsv --tsv put -q 'o=splitnv($b, ":"); for (k,v in o) {emit mapsum($*, {"b":v})}'
a b
x z
s u
s v
s w
Showing differences between successive queries
Suppose you have a database query which you run at one point in time, producing the output saved
in data/prevtemp.csv, then run again later, producing data/currtemp.csv.
And, suppose you want to compute the differences in the counters between
the two runs, matched up by key. Since the color names aren’t all in the same order, nor
are they all present on both sides, we can’t just paste the two files
side-by-side and do some column-four-minus-column-two arithmetic.
First, rename counter columns to make them distinct:
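A sketch of that step, assuming the raw query outputs have columns color,count (the query-output file names are illustrative), producing the data/prevtemp.csv and data/currtemp.csv used below:
$ mlr --csv rename count,previous_count query-output-1.csv > data/prevtemp.csv
$ mlr --csv rename count,current_count query-output-2.csv > data/currtemp.csv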
Then, join on the key field(s), and use unsparsify to zero-fill counters
absent on one side but present on the other. Use --ul and
--ur to emit unpaired records (namely, purple on the left and yellow on the right):
$ mlr --icsv --opprint \
join -j color --ul --ur -f data/prevtemp.csv \
then unsparsify --fill-with 0 \
then put '$count_delta = $current_count - $previous_count' \
data/currtemp.csv
color previous_count current_count count_delta
red 3472 3467 -5
orange 694 670 -24
yellow 0 27 27
blue 6838 6944 106
purple 12 0 -12
Finding missing dates
Suppose you have some date-stamped data which may (or may not) be missing entries for one or more dates:
Since there are 1372 lines in the data file, some automation is called for.
To find the missing dates, you can convert the dates to seconds since the epoch
using strptime, then compute adjacent differences (the cat -n
simply inserts record-counters):
$ mlr --from data/miss-date.csv --icsv \
cat -n \
then put '$datestamp = strptime($date, "%Y-%m-%d")' \
then step -a delta -f datestamp \
| head
n=1,date=2012-03-05,qoh=10055,datestamp=1330905600.000000,datestamp_delta=0
n=2,date=2012-03-06,qoh=10486,datestamp=1330992000.000000,datestamp_delta=86400.000000
n=3,date=2012-03-07,qoh=10430,datestamp=1331078400.000000,datestamp_delta=86400.000000
n=4,date=2012-03-08,qoh=10674,datestamp=1331164800.000000,datestamp_delta=86400.000000
n=5,date=2012-03-09,qoh=10880,datestamp=1331251200.000000,datestamp_delta=86400.000000
n=6,date=2012-03-10,qoh=10718,datestamp=1331337600.000000,datestamp_delta=86400.000000
n=7,date=2012-03-11,qoh=10795,datestamp=1331424000.000000,datestamp_delta=86400.000000
n=8,date=2012-03-12,qoh=11043,datestamp=1331510400.000000,datestamp_delta=86400.000000
n=9,date=2012-03-13,qoh=11177,datestamp=1331596800.000000,datestamp_delta=86400.000000
n=10,date=2012-03-14,qoh=11498,datestamp=1331683200.000000,datestamp_delta=86400.000000
Then, filter for the adjacent difference not being 86400 (the number of seconds in a day), excluding the first record, whose delta is trivially zero:
$ mlr --from data/miss-date.csv --icsv \
cat -n \
then put '$datestamp = strptime($date, "%Y-%m-%d")' \
then step -a delta -f datestamp \
then filter '$datestamp_delta != 86400 && $n != 1'
n=774,date=2014-04-19,qoh=130140,datestamp=1397865600.000000,datestamp_delta=259200.000000
n=1119,date=2015-03-31,qoh=181625,datestamp=1427760000.000000,datestamp_delta=172800.000000
Given this, it’s now easy to see where the gaps are:
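A sketch of one way to list them, building on the pipeline above: step -a shift retains each previous datestamp as datestamp_shift, which strftime can format back into a date, giving the last date present before each gap:
$ mlr --from data/miss-date.csv --icsv \
  cat -n \
  then put '$datestamp = strptime($date, "%Y-%m-%d")' \
  then step -a shift,delta -f datestamp \
  then filter '$datestamp_delta != 86400 && $n != 1' \
  then put '$last_date_present = strftime($datestamp_shift, "%Y-%m-%d")' \
  then cut -o -f last_date_present,date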
Miller is a streaming record processor; commands are performed once per
record. This makes Miller particularly suitable for single-pass algorithms,
allowing many of its verbs to process files that are (much) larger than the
amount of RAM present in your system. (Of course, Miller verbs such as
sort, tac, etc. all must ingest and retain all input records
before emitting any output records.) You can also use out-of-stream variables
to perform multi-pass computations, at the price of retaining all input records
in memory.
Two-pass algorithms: computation of percentages
For example, mapping numeric values down a column to the percentage
between their min and max values is two-pass: on the first pass you find the
min and max values, then on the second, map each record’s value to a
percentage.
$ mlr --from data/small --opprint put -q '
  # These are executed once per record, which is the first pass.
  # The key is to use NR to index an out-of-stream variable to
  # retain all the x-field values.
  @x_min = min($x, @x_min);
  @x_max = max($x, @x_max);
  @x[NR] = $x;

  # The second pass is in a for-loop in an end-block.
  end {
    for (nr, x in @x) {
      @x_pct[nr] = 100 * (x - @x_min) / (@x_max - @x_min);
    }
    emit (@x, @x_pct), "NR"
  }
'
NR x x_pct
1 0.346790 25.661943
2 0.758680 100.000000
3 0.204603 0.000000
4 0.381399 31.908236
5 0.573289 66.540542
Two-pass algorithms: line-number ratios
Similarly, finding the total record count requires first reading through
all the data:
$ mlr --opprint --from data/small put -q '
  @records[NR] = $*;
  end {
    for ((I,k),v in @records) {
      @records[I]["I"] = I;
      @records[I]["N"] = NR;
      @records[I]["PCT"] = 100*I/NR
    }
    emit @records, "I"
  }
' then reorder -f I,N,PCT
I N PCT a b i x y
1 5 20 pan pan 1 0.3467901443380824 0.7268028627434533
2 5 40 eks pan 2 0.7586799647899636 0.5221511083334797
3 5 60 wye wye 3 0.20460330576630303 0.33831852551664776
4 5 80 eks wye 4 0.38139939387114097 0.13418874328430463
5 5 100 wye pan 5 0.5732889198020006 0.8636244699032729
Two-pass algorithms: records having max value
The idea is to retain records having the largest value of n in the
following data:
$ mlr --itsv --opprint cat data/maxrows.tsv
a b n score
purple red 5 0.743231
blue purple 2 0.093710
red purple 2 0.802103
purple red 5 0.389055
red purple 2 0.880457
orange red 2 0.540349
purple purple 1 0.634451
orange purple 5 0.257223
orange purple 5 0.693499
red red 4 0.981355
blue purple 5 0.157052
purple purple 1 0.441784
red purple 1 0.124912
orange blue 1 0.921944
blue purple 4 0.490909
purple red 5 0.454779
green purple 4 0.198278
orange blue 5 0.705700
red red 3 0.940705
purple red 5 0.072936
orange blue 3 0.389463
orange purple 2 0.664985
blue purple 1 0.371813
red purple 4 0.984571
green purple 5 0.203577
green purple 3 0.900873
purple purple 0 0.965677
blue purple 2 0.208785
purple purple 1 0.455077
red purple 4 0.477187
blue red 4 0.007487
Of course, the largest value of n isn’t known until after
all data have been read. Using an out-of-stream variable we can retain all
records as they are read, then filter them at the end:
$ cat data/maxrows.mlr
# Retain all records:
@records[NR] = $*;
# Track the max value of n:
@maxn = max(@maxn, $n);

# After all records have been read, loop through the retained records
# and emit those having the max n value.
end {
  for (int nr in @records) {
    map record = @records[nr];
    if (record["n"] == @maxn) {
      emit record;
    }
  }
}
$ mlr --itsv --opprint put -q -f data/maxrows.mlr data/maxrows.tsv
a b n score
purple red 5 0.743231
purple red 5 0.389055
orange purple 5 0.257223
orange purple 5 0.693499
blue purple 5 0.157052
purple red 5 0.454779
orange blue 5 0.705700
purple red 5 0.072936
green purple 5 0.203577
Rectangularizing data
Suppose you have a method (in whatever language) which is printing things of the form
outer=1
outer=2
outer=3
and then calls other methods which print things of the form middle=10 and inner1=100,inner2=101.
The idea here is that middles starting with a 1 belong to the outer value of 1,
and so on. (For example, the outer values might be account IDs, the middle
values might be invoice IDs, and the inner values might be invoice line-items.)
If you want all the middle and inner lines to have the context of which outers
they belong to, you can modify your software to pass all those through your
methods. Alternatively, don’t refactor your code just to handle some
ad-hoc log-data formatting — instead, use the following to rectangularize
the data. The idea is to use an out-of-stream variable to accumulate fields
across records. Clear that variable when you see an outer ID; accumulate
fields; emit output when you see the inner IDs.
$ mlr --from data/rect.txt put -q '
  is_present($outer) {
    unset @r
  }
  for (k, v in $*) {
    @r[k] = v
  }
  is_present($inner1) {
    emit @r
  }'
outer=1,middle=10,inner1=100,inner2=101
outer=1,middle=12,inner1=120,inner2=121
outer=2,middle=20,inner1=200,inner2=201
outer=2,middle=21,inner1=210,inner2=211
outer=3,middle=30,inner1=300,inner2=301
outer=3,middle=31,inner1=312,inner2=301
outer=3,middle=31,inner1=313,inner2=314
Regularizing ragged CSV
Miller handles compliant CSV: in particular, it’s an error if the
number of data fields in a given data line doesn’t match the number of
fields in the header line. But in the event that you have a CSV file in which some lines
have fewer than the full number of fields, you can use Miller to pad them out.
The trick is to use NIDX format, in which each line stands on its own without
reference to a header line.
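For example, here is a sketch (assuming a comma-separated data/ragged.csv) which pads each record with empty strings out to the maximum field count seen so far, using the built-in NF:
$ mlr --from data/ragged.csv --fs comma --nidx put '
  @maxnf = max(@maxnf, NF);
  @nf = NF;
  while (@nf < @maxnf) {
    @nf += 1;
    $[@nf] = "";
  }
'
Since this is streaming, records read before the widest one aren’t padded retroactively; a two-pass variant with retained records handles that case.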
A reasonable question to ask is: how many occurrences of each field name are
there? And what fraction of the total record count contains each of them? Since the
denominator of the fraction is not known until the end, this is a two-pass
algorithm:
for (key in $*) {
  @key_counts[key] += 1;
}
@record_count += 1;

end {
  for (key in @key_counts) {
    @key_fraction[key] = @key_counts[key] / @record_count
  }
  emit @record_count;
  emit @key_counts, "key";
  emit @key_fraction, "key"
}
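Saved to a file, this can be run along the lines of the following (the script name is illustrative):
$ mlr --inidx --ifs comma --opprint --from data/ragged.csv put -q -f key-counts.mlr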
The previous section discussed how to fill out missing data fields within
CSV having a full header line, so the list of all field names is present
in the header line. Next, let’s look at a related problem: we have
data where each record has various key names, but we want to produce rectangular
output having the union of all key names.
For example, suppose you have JSON input like this:
There are field names a, b, v, u,
x, w in the data — but not all in every record. Since
we don’t know the names of all the keys until we’ve read them all,
this needs to be a two-pass algorithm. On the first pass, remember all the
unique key names and all the records; on the second pass, loop through the
records, filling in absent values, then produce the output. Use put -q
since we don’t want per-record output, emitting only in
the end block:
$ cat data/unsparsify.mlr
# First pass:
# Remember all unique key names:
for (k in $*) {
  @all_keys[k] = 1;
}
# Remember all input records:
@records[NR] = $*;

# Second pass:
end {
  for (nr in @records) {
    # Get the sparsely keyed input record:
    irecord = @records[nr];
    # Fill in missing keys with empty string:
    map orecord = {};
    for (k in @all_keys) {
      if (haskey(irecord, k)) {
        orecord[k] = irecord[k];
      } else {
        orecord[k] = "";
      }
    }
    # Produce the output:
    emit orecord;
  }
}
$ mlr --ijson --opprint put -q -f data/unsparsify.mlr data/sparse.json
a b v u x w
1 2 3 - - -
- 2 - 1 - -
1 - 2 - 3 -
- - 1 - - 2
There is a keystroke-saving verb for this: mlr unsparsify.
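For the same input, it reduces the above to a one-liner, filling absent values with the empty string by default:
$ mlr --ijson --opprint unsparsify data/sparse.json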
Parsing log-file output
This, of course, depends highly on what’s in your log files. But, as
an example, suppose you have log-file lines such as
2015-10-08 08:29:09,445 INFO com.company.path.to.ClassName @ [sometext] various/sorts/of data {& punctuation} hits=1 status=0 time=2.378
I prefer to pre-filter with grep and/or sed to extract the structured text, then hand that to Miller. Example:
grep 'various sorts' *.log | sed 's/.*} //' | mlr --fs space --repifs --oxtab stats1 -a min,p10,p50,p90,max -f time -g status
Memoization with out-of-stream variables
The recursive function for the Fibonacci sequence is famous for its computational complexity.
Namely, using f(0)=1, f(1)=1, and f(n)=f(n-1)+f(n-2) for n≥2,
the evaluation tree branches both left and right at each non-trivial level, resulting in millions
or more paths down to the 0/1 base cases for larger n. This program
$ mlr --ofmt '%.9lf' --opprint seqgen --start 1 --stop 28 then put '
  func f(n) {
    @fcount += 1;              # count number of calls to the function
    if (n < 2) {
      return 1
    } else {
      return f(n-1) + f(n-2)   # recurse
    }
  }

  @fcount = 0;
  $o = f($i);
  $fcount = @fcount;
' then put '$seconds=systime()' then step -a delta -f seconds then cut -x -f seconds
Note that the time it takes to evaluate the function blows up exponentially as the input argument
increases. Using @-variables, which persist across records, we can cache and reuse the results
of previous computations:
$ mlr --ofmt '%.9lf' --opprint seqgen --start 1 --stop 28 then put '
  func f(n) {
    @fcount += 1;                  # count number of calls to the function
    if (is_present(@fcache[n])) {  # cache hit
      return @fcache[n]
    } else {                       # cache miss
      num rv = 1;
      if (n >= 2) {
        rv = f(n-1) + f(n-2)       # recurse
      }
      @fcache[n] = rv;
      return rv
    }
  }

  @fcount = 0;
  $o = f($i);
  $fcount = @fcount;
' then put '$seconds=systime()' then step -a delta -f seconds then cut -x -f seconds