Two-pass algorithms

Overview

Miller is a streaming record processor: verbs are applied once per record as the input is read. This makes Miller particularly suitable for single-pass algorithms, allowing many of its verbs to process files that are (much) larger than the amount of RAM present in your system. (Of course, Miller verbs such as sort, tac, etc. must ingest and retain all input records before emitting any output records -- see the page on streaming processing and memory usage.) You can also use out-of-stream variables to perform multi-pass computations, at the price of retaining all input records in memory.
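
As a tiny example of carrying state across records without retaining the records themselves, here is a running count kept in an out-of-stream variable (a sketch using the data/small example file which appears throughout this page):

mlr --opprint put '
  @count += 1;        # out-of-stream variables persist across records
  $running_count = @count
' data/small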

One of Miller's strengths is its compact notation: for example, given input of the form

head -n 5 ./data/medium
a=pan,b=pan,i=1,x=0.3467901443380824,y=0.7268028627434533
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797
a=wye,b=wye,i=3,x=0.20460330576630303,y=0.33831852551664776
a=eks,b=wye,i=4,x=0.38139939387114097,y=0.13418874328430463
a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729

you can simply do

mlr --oxtab stats1 -a sum -f x ./data/medium
x_sum 4986.019681679581

or

mlr --opprint stats1 -a sum -f x -g b ./data/medium
b   x_sum
pan 965.7636699425815
wye 1023.5484702619565
zee 979.7420161495838
eks 1016.7728571314786
hat 1000.192668193983

rather than the more tedious

mlr --oxtab put -q '
  @x_sum += $x;
  end {
    emit @x_sum
  }
' data/medium
x_sum 4986.019681679581

or

mlr --opprint put -q '
  @x_sum[$b] += $x;
  end {
    emit @x_sum, "b"
  }
' data/medium
b   x_sum
pan 965.7636699425815
wye 1023.5484702619565
zee 979.7420161495838
eks 1016.7728571314786
hat 1000.192668193983

The former (mlr stats1 and the like) has the advantages of being easier and less error-prone to type, as well as running faster.

Nonetheless, out-of-stream variables (which I whimsically call oosvars), begin/end blocks, and emit statements give you the ability to implement logic, if you wish to do so, which isn't present in other Miller verbs. (If you find yourself using the same out-of-stream-variable logic over and over, please file a request at https://github.com/johnkerl/miller/issues to get it implemented directly in Go as a Miller verb of its own.)

By way of providing food for thought, the following examples use oosvars to compute things which are also computable using Miller verbs.

Computation of percentages

For example, mapping the numeric values down a column to percentages between their min and max values is a two-pass algorithm: on the first pass you find the min and max values; on the second, you map each record's value to a percentage between them.

mlr --from data/small --opprint put -q '
  # These are executed once per record, which is the first pass.
  # The key is to use NR to index an out-of-stream variable to
  # retain all the x-field values.
  @x_min = min($x, @x_min);
  @x_max = max($x, @x_max);
  @x[NR] = $x;

  # The second pass is in a for-loop in an end-block.
  end {
    for (nr, x in @x) {
      @x_pct[nr] = 100 * (x - @x_min) / (@x_max - @x_min);
    }
    # The parenthesized ("lashed") emit writes @x and @x_pct side by side,
    # matched up by their shared NR keys.
    emit (@x, @x_pct), "NR"
  }
'
NR x        x_pct
1  0.346791 25.66218352716956
2  0.758679 100
3  0.204603 0
4  0.381399 31.90825807289974
5  0.573288 66.54051068806446
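
If you want the original records preserved alongside the percentage, a variant along the same lines retains the full records rather than just the x values (a sketch, not output-verified):

mlr --from data/small --opprint put -q '
  # First pass: retain full records, and track the min and max of x.
  @records[NR] = $*;
  @x_min = min($x, @x_min);
  @x_max = max($x, @x_max);

  # Second pass: attach x_pct to each retained record, then emit.
  end {
    for (nr in @records) {
      @records[nr]["x_pct"] = 100 * (@records[nr]["x"] - @x_min) / (@x_max - @x_min);
    }
    emit @records, "NR"
  }
'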

Line-number ratios

Similarly, finding the total record count requires first reading through all the data:

mlr --opprint --from data/small put -q '
  # First pass: retain each record, indexed by record number.
  @records[NR] = $*;
  end {
    # Loop over the two-level map: Istring is the record number (map keys
    # iterate as strings) and k is a field name within that record.
    for ((Istring, k), v in @records) {
      # Convert the string key back to int for arithmetic:
      int I = int(Istring);
      @records[I]["I"] = I;
      @records[I]["N"] = NR;    # in the end block, NR is the total record count
      @records[I]["PCT"] = 100 * I / NR
    }
    emit @records, "I"
  }
' then reorder -f I,N,PCT
I N PCT a   b   i x        y
1 5 20  pan pan 1 0.346791 0.726802
2 5 40  eks pan 2 0.758679 0.522151
3 5 60  wye wye 3 0.204603 0.338318
4 5 80  eks wye 4 0.381399 0.134188
5 5 100 wye pan 5 0.573288 0.863624
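
If retaining all records in memory is a concern, the same fields can be computed with two passes over the file at the shell level: one mlr invocation to count the records, and a second which streams through the data using that count. A sketch, assuming a POSIX shell; put's -s flag predefines an out-of-stream variable, and the name total is just illustrative:

n=$(mlr --from data/small put -q 'end { print NR }')
mlr --opprint put -s total="$n" '
  $I = NR;
  $N = int(@total);   # -s values arrive as scanned from the command line
  $PCT = 100 * $I / $N
' data/small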

Records having max value

The idea is to retain records having the largest value of n in the following data:

mlr --itsv --opprint cat data/maxrows.tsv
a      b      n score
purple red    5 0.743231
blue   purple 2 0.093710
red    purple 2 0.802103
purple red    5 0.389055
red    purple 2 0.880457
orange red    2 0.540349
purple purple 1 0.634451
orange purple 5 0.257223
orange purple 5 0.693499
red    red    4 0.981355
blue   purple 5 0.157052
purple purple 1 0.441784
red    purple 1 0.124912
orange blue   1 0.921944
blue   purple 4 0.490909
purple red    5 0.454779
green  purple 4 0.198278
orange blue   5 0.705700
red    red    3 0.940705
purple red    5 0.072936
orange blue   3 0.389463
orange purple 2 0.664985
blue   purple 1 0.371813
red    purple 4 0.984571
green  purple 5 0.203577
green  purple 3 0.900873
purple purple 0 0.965677
blue   purple 2 0.208785
purple purple 1 0.455077
red    purple 4 0.477187
blue   red    4 0.007487

Of course, the largest value of n isn't known until after all data have been read. Using an out-of-stream variable we can retain all records as they are read, then filter them at the end:

cat data/maxrows.mlr
# Retain all records
@records[NR] = $*;
# Track max value of n
@maxn = max(@maxn, $n);

# After all records have been read, loop through retained records
# and print those with the max n value.
end {
  for (nr in @records) {
    map record = @records[nr];
    if (record["n"] == @maxn) {
      emit record;
    }
  }
}
mlr --itsv --opprint put -q -f data/maxrows.mlr data/maxrows.tsv
a      b      n score
purple red    5 0.743231
purple red    5 0.389055
orange purple 5 0.257223
orange purple 5 0.693499
blue   purple 5 0.157052
purple red    5 0.454779
orange blue   5 0.705700
purple red    5 0.072936
green  purple 5 0.203577
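
Retaining every record costs memory proportional to the file size, even though only the max-valued records are wanted. A memory-friendlier variant (a sketch, with illustrative names @maxn, @tied, and @count) keeps only the records tied for the max seen so far, resetting whenever a new max appears:

mlr --itsv --opprint put -q '
  # Reset the accumulator whenever a new max value of n appears.
  if (is_absent(@maxn) || $n > @maxn) {
    @maxn = $n;
    @tied = {};
    @count = 0;
  }
  # Retain only records tied for the current max.
  if ($n == @maxn) {
    @count += 1;
    @tied[@count] = $*;
  }
  end {
    for (k in @tied) {
      emit @tied[k];
    }
  }
' data/maxrows.tsv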

Feature-counting

Suppose you have some heterogeneous data like this:

cat data/features.json
{ "qoh": 29874, "rate": 1.68, "latency": 0.02 }
{ "name": "alice", "uid": 572 }
{ "qoh": 1227, "rate": 1.01, "latency": 0.07 }
{ "qoh": 13458, "rate": 1.72, "latency": 0.04 }
{ "qoh": 56782, "rate": 1.64 }
{ "qoh": 23512, "rate": 1.71, "latency": 0.03 }
{ "qoh": 9876, "rate": 1.89, "latency": 0.08 }
{ "name": "bill", "uid": 684 }
{ "name": "chuck", "uid2": 908 }
{ "name": "dottie", "uid": 440 }
{ "qoh": 0, "rate": 0.40, "latency": 0.01 }
{ "qoh": 5438, "rate": 1.56, "latency": 0.17 }

A reasonable question to ask is: how many occurrences of each field name are there? And in what fraction of rows does each of them appear? Since the denominator of the fraction is not known until the end, this is a two-pass algorithm:

cat data/feature-count.mlr
# First pass: count occurrences of each field name, and the total record count.
for (key in $*) {
  @key_counts[key] += 1;
}
@record_count += 1;

# Second pass: compute the fractions once the denominator is known.
end {
  for (key in @key_counts) {
    @key_fraction[key] = @key_counts[key] / @record_count
  }
  emit @record_count;
  emit @key_counts, "key";
  emit @key_fraction, "key"
}

Then

mlr --json put -q -f data/feature-count.mlr data/features.json
[
{
  "record_count": 12
},
{
  "key": "qoh",
  "key_counts": 8
},
{
  "key": "rate",
  "key_counts": 8
},
{
  "key": "latency",
  "key_counts": 7
},
{
  "key": "name",
  "key_counts": 4
},
{
  "key": "uid",
  "key_counts": 3
},
{
  "key": "uid2",
  "key_counts": 1
},
{
  "key": "qoh",
  "key_fraction": 0.6666666666666666
},
{
  "key": "rate",
  "key_fraction": 0.6666666666666666
},
{
  "key": "latency",
  "key_fraction": 0.5833333333333334
},
{
  "key": "name",
  "key_fraction": 0.3333333333333333
},
{
  "key": "uid",
  "key_fraction": 0.25
},
{
  "key": "uid2",
  "key_fraction": 0.08333333333333333
}
]
mlr --ijson --opprint put -q -f data/feature-count.mlr data/features.json
record_count
12

key     key_counts
qoh     8
rate    8
latency 7
name    4
uid     3
uid2    1

key     key_fraction
qoh     0.6666666666666666
rate    0.6666666666666666
latency 0.5833333333333334
name    0.3333333333333333
uid     0.25
uid2    0.08333333333333333
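
Since @key_counts and @key_fraction share the same keys, a parenthesized ("lashed") emit -- as in the percentage example above -- can present them side by side, one output record per key. A sketch of an alternative end block:

end {
  for (key in @key_counts) {
    @key_fraction[key] = @key_counts[key] / @record_count
  }
  emit @record_count;
  # Lashed emit: each output record carries both count and fraction.
  emit (@key_counts, @key_fraction), "key"
}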

Unsparsing

The previous section discussed how to fill out missing data fields within CSV with a full header line, where the list of all field names is present in the header. Next, let's look at a related problem: we have data where each record has various key names, and we want to produce rectangular output having the union of all key names.

There is a keystroke-saving verb for this: unsparsify. Here, we look at how to implement that in the DSL.

For example, suppose you have JSON input like this:

cat data/sparse.json
{"a":1,"b":2,"v":3}
{"u":1,"b":2}
{"a":1,"v":2,"x":3}
{"v":1,"w":2}

There are field names a, b, v, u, x, w in the data -- but not all in every record. Since we don't know the names of all the keys until we've read them all, this needs to be a two-pass algorithm: on the first pass, remember all the unique key names as well as all the records; on the second pass, loop through the records, filling in absent values, then produce output. We use put -q since we don't want per-record output, emitting records only in the end block:

cat data/unsparsify.mlr
# First pass:
# Remember all unique key names:
for (k in $*) {
  @all_keys[k] = 1;
}
# Remember all input records:
@records[NR] = $*;

# Second pass:
end {
  for (nr in @records) {
    # Get the sparsely keyed input record:
    irecord = @records[nr];
    # Fill in missing keys with empty string:
    map orecord = {};
    for (k in @all_keys) {
      if (haskey(irecord, k)) {
        orecord[k] = irecord[k];
      } else {
        orecord[k] = "";
      }
    }
    # Produce the output:
    emit orecord;
  }
}
mlr --json put -q -f data/unsparsify.mlr data/sparse.json
[
{
  "a": 1,
  "b": 2,
  "v": 3,
  "u": "",
  "x": "",
  "w": ""
},
{
  "a": "",
  "b": 2,
  "v": "",
  "u": 1,
  "x": "",
  "w": ""
},
{
  "a": 1,
  "b": "",
  "v": 2,
  "u": "",
  "x": 3,
  "w": ""
},
{
  "a": "",
  "b": "",
  "v": 1,
  "u": "",
  "x": "",
  "w": 2
}
]
mlr --ijson --ocsv put -q -f data/unsparsify.mlr data/sparse.json
a,b,v,u,x,w
1,2,3,,,
,2,,1,,
1,,2,,3,
,,1,,,2
mlr --ijson --opprint put -q -f data/unsparsify.mlr data/sparse.json
a b v u x w
1 2 3 - - -
- 2 - 1 - -
1 - 2 - 3 -
- - 1 - - 2
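
For comparison, the unsparsify verb mentioned above does all of this in one step, and should produce the same output as the DSL version:

mlr --ijson --opprint unsparsify data/sparse.json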

Mean without/with oosvars

mlr --opprint stats1 -a mean -f x data/medium
x_mean
0.49860196816795804
mlr --opprint put -q '
  @x_sum += $x;
  @x_count += 1;
  end {
    @x_mean = @x_sum / @x_count;
    emit @x_mean
  }
' data/medium
x_mean
0.49860196816795804

Keyed mean without/with oosvars

mlr --opprint stats1 -a mean -f x -g a,b data/medium
a   b   x_mean
pan pan 0.5133141190437597
eks pan 0.48507555383425127
wye wye 0.49150092785839306
eks wye 0.4838950517724162
wye pan 0.4996119901034838
zee pan 0.5198298297816007
eks zee 0.49546320772681596
zee wye 0.5142667998230479
hat wye 0.49381326184632596
pan wye 0.5023618498923658
zee eks 0.4883932942792647
hat zee 0.5099985721987774
hat eks 0.48587864619953547
wye hat 0.4977304763723314
pan eks 0.5036718595143479
eks eks 0.5227992666570941
hat hat 0.47993053101017374
hat pan 0.4643355557376876
zee zee 0.5127559183726382
pan hat 0.492140950155604
pan zee 0.4966041598627583
zee hat 0.46772617655014515
wye zee 0.5059066170573692
eks hat 0.5006790659966355
wye eks 0.5306035254809106
mlr --opprint put -q '
  @x_sum[$a][$b] += $x;
  @x_count[$a][$b] += 1;
  end{
    for ((a, b), v in @x_sum) {
      @x_mean[a][b] = @x_sum[a][b] / @x_count[a][b];
    }
    emit @x_mean, "a", "b"
  }
' data/medium
a   b   x_mean
pan pan 0.5133141190437597
pan wye 0.5023618498923658
pan eks 0.5036718595143479
pan hat 0.492140950155604
pan zee 0.4966041598627583
eks pan 0.48507555383425127
eks wye 0.4838950517724162
eks zee 0.49546320772681596
eks eks 0.5227992666570941
eks hat 0.5006790659966355
wye wye 0.49150092785839306
wye pan 0.4996119901034838
wye hat 0.4977304763723314
wye zee 0.5059066170573692
wye eks 0.5306035254809106
zee pan 0.5198298297816007
zee wye 0.5142667998230479
zee eks 0.4883932942792647
zee zee 0.5127559183726382
zee hat 0.46772617655014515
hat wye 0.49381326184632596
hat zee 0.5099985721987774
hat eks 0.48587864619953547
hat hat 0.47993053101017374
hat pan 0.4643355557376876

Variance and standard deviation without/with oosvars

mlr --oxtab stats1 -a count,sum,mean,var,stddev -f x data/medium
x_count  10000
x_sum    4986.019681679581
x_mean   0.49860196816795804
x_var    0.08426974433144456
x_stddev 0.2902925151144007
cat variance.mlr
# Accumulate count, sum, and sum of squares in a single pass.
@n += 1;
@sumx += $x;
@sumx2 += $x**2;
end {
  @mean = @sumx / @n;
  # Sample variance via the sum-of-squares identity:
  #   var = (sum(x^2) - 2*mean*sum(x) + n*mean^2) / (n - 1)
  @var = (@sumx2 - @mean * (2 * @sumx - @n * @mean)) / (@n - 1);
  @stddev = sqrt(@var);
  emitf @n, @sumx, @sumx2, @mean, @var, @stddev
}
mlr --oxtab put -q -f variance.mlr data/medium
n      10000
sumx   4986.019681679581
sumx2  3328.652400179729
mean   0.49860196816795804
var    0.08426974433144456
stddev 0.2902925151144007

You can also do this keyed, of course, imitating the keyed-mean example above.
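
For instance, a keyed variant might look like the following sketch, indexing each accumulator by the a field and using a lashed emit at the end:

@n[$a] += 1;
@sumx[$a] += $x;
@sumx2[$a] += $x**2;
end {
  for (a in @n) {
    @mean[a] = @sumx[a] / @n[a];
    @var[a] = (@sumx2[a] - @mean[a] * (2 * @sumx[a] - @n[a] * @mean[a])) / (@n[a] - 1);
    @stddev[a] = sqrt(@var[a]);
  }
  emit (@mean, @var, @stddev), "a"
}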

Min/max without/with oosvars

mlr --oxtab stats1 -a min,max -f x data/medium
x_min 0.00004509679127584487
x_max 0.999952670371898
mlr --oxtab put -q '
  @x_min = min(@x_min, $x);
  @x_max = max(@x_max, $x);
  end{emitf @x_min, @x_max}
' data/medium
x_min 0.00004509679127584487
x_max 0.999952670371898
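
Note that min and max ignore absent arguments, which is why no is_present guard is needed on the first record:

mlr -n put 'end {
  print min(@never_assigned, 5);   # prints 5: the absent oosvar is ignored
  print max(@never_assigned, 5);   # likewise prints 5
}'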

Keyed min/max without/with oosvars

mlr --opprint stats1 -a min,max -f x -g a data/medium
a   x_min                  x_max
pan 0.00020390740306253097 0.9994029107062516
eks 0.0006917972627396018  0.9988110946859143
wye 0.0001874794831505655  0.9998228522652893
zee 0.0005486114815762555  0.9994904324789629
hat 0.00004509679127584487 0.999952670371898
mlr --opprint --from data/medium put -q '
  @min[$a] = min(@min[$a], $x);
  @max[$a] = max(@max[$a], $x);
  end{
    emit (@min, @max), "a";
  }
'
a   min                    max
pan 0.00020390740306253097 0.9994029107062516
eks 0.0006917972627396018  0.9988110946859143
wye 0.0001874794831505655  0.9998228522652893
zee 0.0005486114815762555  0.9994904324789629
hat 0.00004509679127584487 0.999952670371898

Delta without/with oosvars

mlr --opprint step -a delta -f x data/small
a   b   i x        y        x_delta
pan pan 1 0.346791 0.726802 0
eks pan 2 0.758679 0.522151 0.411888
wye wye 3 0.204603 0.338318 -0.554076
eks wye 4 0.381399 0.134188 0.17679599999999998
wye pan 5 0.573288 0.863624 0.19188900000000003
mlr --opprint put '
  # On the first record, @last is absent, so emit a delta of 0.
  $x_delta = is_present(@last) ? $x - @last : 0;
  @last = $x
' data/small
a   b   i x        y        x_delta
pan pan 1 0.346791 0.726802 0
eks pan 2 0.758679 0.522151 0.411888
wye wye 3 0.204603 0.338318 -0.554076
eks wye 4 0.381399 0.134188 0.17679599999999998
wye pan 5 0.573288 0.863624 0.19188900000000003

Keyed delta without/with oosvars

mlr --opprint step -a delta -f x -g a data/small
a   b   i x        y        x_delta
pan pan 1 0.346791 0.726802 0
eks pan 2 0.758679 0.522151 0
wye wye 3 0.204603 0.338318 0
eks wye 4 0.381399 0.134188 -0.37728
wye pan 5 0.573288 0.863624 0.36868500000000004
mlr --opprint put '
  # Track the last-seen x separately for each value of the a field.
  $x_delta = is_present(@last[$a]) ? $x - @last[$a] : 0;
  @last[$a] = $x
' data/small
a   b   i x        y        x_delta
pan pan 1 0.346791 0.726802 0
eks pan 2 0.758679 0.522151 0
wye wye 3 0.204603 0.338318 0
eks wye 4 0.381399 0.134188 -0.37728
wye pan 5 0.573288 0.863624 0.36868500000000004

Exponentially weighted moving averages without/with oosvars

mlr --opprint step -a ewma -d 0.1 -f x data/small
a   b   i x        y        x_ewma_0.1
pan pan 1 0.346791 0.726802 0.346791
eks pan 2 0.758679 0.522151 0.3879798
wye wye 3 0.204603 0.338318 0.36964211999999996
eks wye 4 0.381399 0.134188 0.37081780799999997
wye pan 5 0.573288 0.863624 0.3910648272
mlr --opprint put '
  begin { @a = 0.1 }
  # EWMA recurrence: e[1] = x[1]; e[n] = @a * x[n] + (1 - @a) * e[n-1].
  $e = NR == 1 ? $x : @a * $x + (1 - @a) * @e;
  @e = $e
' data/small
a   b   i x        y        e
pan pan 1 0.346791 0.726802 0.346791
eks pan 2 0.758679 0.522151 0.3879798
wye wye 3 0.204603 0.338318 0.36964211999999996
eks wye 4 0.381399 0.134188 0.37081780799999997
wye pan 5 0.573288 0.863624 0.3910648272
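
The smoothing parameter can also be passed in from the command line rather than hard-coded in a begin block, using put's -s flag to predefine an out-of-stream variable (a sketch; the name alpha is just illustrative, and this assumes the -s value is type-inferred as a number the same way data fields are):

mlr --opprint put -s alpha=0.1 '
  $e = NR == 1 ? $x : @alpha * $x + (1 - @alpha) * @e;
  @e = $e
' data/small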