Cookbook

Parsing log-file output

How you do this depends heavily on what’s in your log files. As an example, suppose you have log-file lines such as

2015-10-08 08:29:09,445 INFO com.company.path.to.ClassName @ [sometext] various/sorts/of data {& punctuation} hits=1 status=0 time=2.378

I prefer to pre-filter with grep and/or sed to extract the structured text, then hand that to Miller. Example:

grep 'various/sorts' *.log | sed 's/.*} //' | mlr --fs space --repifs --oxtab stats1 -a min,p10,p50,p90,max -f time -g status
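
If you would rather do the pre-filtering in a script, the same idea can be sketched in Python using the sample line above. This is an illustration only, not part of Miller: the helper name extract_pairs is hypothetical, and it assumes the key=value fields follow the last "} " on the line and contain no embedded spaces.

```python
import re

# Sample line copied from the text above.
LINE = (
    "2015-10-08 08:29:09,445 INFO com.company.path.to.ClassName @ [sometext] "
    "various/sorts/of data {& punctuation} hits=1 status=0 time=2.378"
)


def extract_pairs(line):
    """Drop everything up to the last '} ' and parse the rest as key=value pairs."""
    # Same idea as sed 's/.*} //': the greedy .* runs to the last '} '.
    tail = re.sub(r".*\} ", "", line, count=1)
    # Split on runs of whitespace, then on the first '=' of each field.
    return dict(field.split("=", 1) for field in tail.split())


print(extract_pairs(LINE))
# prints {'hits': '1', 'status': '0', 'time': '2.378'}
```

The values stay as strings here; Miller infers numeric types on its own when it computes statistics.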

Bulk rename of field names

The rename verb's -g option applies a regular-expression substitution globally within every field name. Here the old,new pair ' ,_' replaces each space with an underscore.

$ cat spaces.csv
a b c,d e f
123,4567
890,2468
9987,3312

$ mlr --csv --rs lf rename -r -g ' ,_'  spaces.csv
a_b_c,d_e_f
123,4567
890,2468
9987,3312

$ mlr --csv --irs lf --opprint rename -r -g ' ,_'  spaces.csv
a_b_c d_e_f
123   4567
890   2468
9987  3312

Filtering paragraphs of text

The idea is to use a pair of newlines as the record separator. Then, if you want each paragraph to be a record with a single value, use a field separator that isn’t present in the input data (e.g. control-A, which is octal 001). Alternatively, if you want each line of a paragraph to be a separate value, use newline as the field separator.

$ cat paragraphs.txt
The quick brown fox jumped over the lazy dogs. The quick brown fox jumped
over the lazy dogs. The quick brown fox jumped over the lazy dogs. The quick
brown fox jumped over the lazy dogs. The quick brown fox jumped over the
lazy dogs.

Now is the time for all good people to come to the aid of their country.  Now
is the time for all good people to come to the aid of their country.  Now is
the time for all good people to come to the aid of their country.  Now is the
time for all good people to come to the aid of their country.  Now is the
time for all good people to come to the aid of their country.

Sphynx of black quartz, judge my vow. Sphynx of black quartz, judge my vow.
Sphynx of black quartz, judge my vow. Sphynx of black quartz, judge my vow.
Sphynx of black quartz, judge my vow.

The rain in Spain falls mainly on the plain. The rain in Spain falls mainly
on the plain. The rain in Spain falls mainly on the plain. The rain in Spain
falls mainly on the plain. The rain in Spain falls mainly on the plain. The
rain in Spain falls mainly on the plain. The rain in Spain falls mainly on
the plain. The rain in Spain falls mainly on the plain.

$ mlr --from paragraphs.txt --nidx --rs '\n\n' --fs '\001' filter '$1 =~ "the"'
The quick brown fox jumped over the lazy dogs. The quick brown fox jumped
over the lazy dogs. The quick brown fox jumped over the lazy dogs. The quick
brown fox jumped over the lazy dogs. The quick brown fox jumped over the
lazy dogs.

Now is the time for all good people to come to the aid of their country.  Now
is the time for all good people to come to the aid of their country.  Now is
the time for all good people to come to the aid of their country.  Now is the
time for all good people to come to the aid of their country.  Now is the
time for all good people to come to the aid of their country.

The rain in Spain falls mainly on the plain. The rain in Spain falls mainly
on the plain. The rain in Spain falls mainly on the plain. The rain in Spain
falls mainly on the plain. The rain in Spain falls mainly on the plain. The
rain in Spain falls mainly on the plain. The rain in Spain falls mainly on
the plain. The rain in Spain falls mainly on the plain.

$ mlr --from paragraphs.txt --nidx --rs '\n\n' --fs '\n' cut  -f 1,3
The quick brown fox jumped over the lazy dogs. The quick brown fox jumped
brown fox jumped over the lazy dogs. The quick brown fox jumped over the

Now is the time for all good people to come to the aid of their country.  Now
the time for all good people to come to the aid of their country.  Now is the

Sphynx of black quartz, judge my vow. Sphynx of black quartz, judge my vow.
Sphynx of black quartz, judge my vow.

The rain in Spain falls mainly on the plain. The rain in Spain falls mainly
falls mainly on the plain. The rain in Spain falls mainly on the plain. The
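
The two separator choices above can be sketched in plain Python to make the record/field model concrete. This is a rough analogue under stated assumptions, not Miller's implementation: the function names and the short SAMPLE text are hypothetical, and the substring test with "in" stands in for Miller's regex match =~ (the two agree for a literal pattern such as "the").

```python
# Short stand-in text; paragraphs are separated by one blank line.
SAMPLE = (
    "the cat\nsat on\na mat\n"
    "\n"
    "Sphynx of black quartz,\njudge my vow.\n"
    "\n"
    "rain in\nthe plain\nfalls\n"
)


def paragraphs(text):
    """Split text into records, using a pair of newlines as record separator."""
    return [p for p in text.strip("\n").split("\n\n") if p]


def filter_paragraphs(text, needle):
    """Treat each paragraph as one value and keep those containing needle."""
    return [p for p in paragraphs(text) if needle in p]


def cut_lines(text, wanted):
    """Treat each line as a field and keep the 1-based positions in wanted."""
    result = []
    for p in paragraphs(text):
        lines = p.split("\n")
        # Positions past the end of a short paragraph are skipped.
        result.append([lines[i - 1] for i in wanted if i <= len(lines)])
    return result


print(filter_paragraphs(SAMPLE, "the"))
print(cut_lines(SAMPLE, [1, 3]))
```

As in the Miller output above, a paragraph with fewer than three lines contributes only its first line to the cut.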

Doing arithmetic on fields with currency symbols

$ cat sample.csv
EventOccurred,EventType,Description,Status,PaymentType,NameonAccount,TransactionNumber,Amount
10/1/2015,Charged Back,Reason: Authorization Revoked By Customer,Disputed,Checking,John,1,$230.36
10/1/2015,Charged Back,Reason: Authorization Revoked By Customer,Disputed,Checking,Fred,2,$32.25
10/1/2015,Charged Back,Reason: Customer Advises Not Authorized,Disputed,Checking,Bob,3,$39.02
10/1/2015,Charged Back,Reason: Authorization Revoked By Customer,Disputed,Checking,Alice,4,$57.54
10/1/2015,Charged Back,Reason: Authorization Revoked By Customer,Disputed,Checking,Jungle,5,$230.36
10/1/2015,Charged Back,Reason: Payment Stopped,Disputed,Checking,Joe,6,$281.96
10/2/2015,Charged Back,Reason: Customer Advises Not Authorized,Disputed,Checking,Joseph,7,$188.19
10/2/2015,Charged Back,Reason: Customer Advises Not Authorized,Disputed,Checking,Joseph,8,$188.19
10/2/2015,Charged Back,Reason: Payment Stopped,Disputed,Checking,Anthony,9,$250.00

$ mlr --icsv --opprint cat sample.csv
EventOccurred EventType    Description                               Status   PaymentType NameonAccount TransactionNumber Amount
10/1/2015     Charged Back Reason: Authorization Revoked By Customer Disputed Checking    John          1                 $230.36
10/1/2015     Charged Back Reason: Authorization Revoked By Customer Disputed Checking    Fred          2                 $32.25
10/1/2015     Charged Back Reason: Customer Advises Not Authorized   Disputed Checking    Bob           3                 $39.02
10/1/2015     Charged Back Reason: Authorization Revoked By Customer Disputed Checking    Alice         4                 $57.54
10/1/2015     Charged Back Reason: Authorization Revoked By Customer Disputed Checking    Jungle        5                 $230.36
10/1/2015     Charged Back Reason: Payment Stopped                   Disputed Checking    Joe           6                 $281.96
10/2/2015     Charged Back Reason: Customer Advises Not Authorized   Disputed Checking    Joseph        7                 $188.19
10/2/2015     Charged Back Reason: Customer Advises Not Authorized   Disputed Checking    Joseph        8                 $188.19
10/2/2015     Charged Back Reason: Payment Stopped                   Disputed Checking    Anthony       9                 $250.00

$ mlr --csv put '$Amount = sub(string($Amount), "\$", "")' then stats1 -a sum -f Amount sample.csv
Amount_sum
1497.870000

$ mlr --csv --ofmt '%.2lf' put '$Amount = sub(string($Amount), "\$", "")' then stats1 -a sum -f Amount sample.csv
Amount_sum
1497.87
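
As a cross-check on the sum, here is a small Python sketch that strips the leading dollar sign and adds up the Amount column from sample.csv. It is only an illustration: the function name total is hypothetical, and Decimal is used as a design choice so that currency values add exactly, whereas Miller works in floating point.

```python
from decimal import Decimal

# Amount column copied from sample.csv above.
AMOUNTS = [
    "$230.36", "$32.25", "$39.02", "$57.54", "$230.36",
    "$281.96", "$188.19", "$188.19", "$250.00",
]


def total(amounts):
    """Remove the first '$' from each amount and sum the results exactly."""
    return sum(Decimal(a.replace("$", "", 1)) for a in amounts)


print(total(AMOUNTS))
# prints 1497.87
```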

Program timing

This admittedly artificial example demonstrates using Miller’s time and stats functions to introspectively acquire some information about Miller’s own runtime. The delta function computes the difference between successive timestamps.

$ ruby -e '10000.times{|i|puts "i=#{i+1}"}' > lines.txt

$ head -n 5 lines.txt
i=1
i=2
i=3
i=4
i=5

$ mlr --ofmt '%.9le' --opprint put '$t=systime()' then step -a delta -f t lines.txt | head -n 7
i     t                 t_delta
1     1430603027.018016 1.430603027e+09
2     1430603027.018043 2.694129944e-05
3     1430603027.018048 5.006790161e-06
4     1430603027.018052 4.053115845e-06
5     1430603027.018055 2.861022949e-06
6     1430603027.018058 3.099441528e-06

The first record has no predecessor, so its t_delta is just the raw timestamp rather than an interval. The next command drops it with filter '$i>1' before computing statistics.

$ mlr --ofmt '%.9le' --oxtab \
  put '$t=systime()' then \
  step -a delta -f t then \
  filter '$i>1' then \
  stats1 -a min,mean,max -f t_delta \
  lines.txt
t_delta_min  2.861022949e-06
t_delta_mean 4.077508505e-06
t_delta_max  5.388259888e-05

Using out-of-stream variables

One of Miller’s strengths is its compact notation: for example, given input of the form

$ head -n 5 data/medium
a=pan,b=pan,i=1,x=0.3467901443380824,y=0.7268028627434533
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797
a=wye,b=wye,i=3,x=0.20460330576630303,y=0.33831852551664776
a=eks,b=wye,i=4,x=0.38139939387114097,y=0.13418874328430463
a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729

you can simply do

$ mlr --oxtab stats1 -a sum -f x data/medium
x_sum 4986.019682

or

$ mlr --opprint stats1 -a sum -f x -g b data/medium
b   x_sum
pan 965.763670
wye 1023.548470
zee 979.742016
eks 1016.772857
hat 1000.192668

rather than the more tedious

$ mlr --oxtab put -q '
  @x_sum += $x;
  end {
    emit @x_sum
  }
' data/medium
x_sum 4986.019682

or

$ mlr --opprint put -q '
  @x_sum[$b] += $x;
  end {
    emit @x_sum, "b"
  }
' data/medium
b   x_sum
pan 965.763670
wye 1023.548470
zee 979.742016
eks 1016.772857
hat 1000.192668

The former (mlr stats1 et al.) is easier to type, less error-prone, and faster to run.

Nonetheless, out-of-stream variables (which I whimsically call oosvars), begin/end blocks, and emit statements give you the ability to implement logic, if you wish to do so, which isn’t present in other Miller verbs. (If you find yourself using the same out-of-stream-variable logic over and over, please file a request at https://github.com/johnkerl/miller/issues to get it implemented directly in C as a Miller verb of its own.)
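
For readers new to the idea, the oosvar pattern corresponds roughly to state that outlives any single record, updated once per record and reported at end of stream. The Python sketch below mirrors the keyed-sum example above using the first five records of data/medium. It is an analogy only: the function name sum_by is hypothetical, and Miller's emit does more than return a dictionary.

```python
from collections import defaultdict

# First five records of data/medium, as shown by head above (y omitted).
RECORDS = [
    {"a": "pan", "b": "pan", "i": 1, "x": 0.3467901443380824},
    {"a": "eks", "b": "pan", "i": 2, "x": 0.7586799647899636},
    {"a": "wye", "b": "wye", "i": 3, "x": 0.20460330576630303},
    {"a": "eks", "b": "wye", "i": 4, "x": 0.38139939387114097},
    {"a": "wye", "b": "pan", "i": 5, "x": 0.5732889198020006},
]


def sum_by(records, value_field, group_field):
    """Accumulate value_field per distinct group_field across the stream."""
    totals = defaultdict(float)  # plays the role of @x_sum[$b]
    for rec in records:  # main block: runs once per record
        totals[rec[group_field]] += rec[value_field]
    return dict(totals)  # end block: analogous to emit @x_sum, "b"


for group, x_sum in sum_by(RECORDS, "x", "b").items():
    print(group, x_sum)
```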

The following examples use oosvars to compute things that Miller verbs can already compute, by way of providing food for thought.

Mean with/without oosvars

$ mlr stats1 -a mean -f x data/medium
x_mean=0.498602

$ mlr put -q '@x_sum += $x; @x_count += 1; end{@x_mean=@x_sum/@x_count; emit @x_mean}' data/medium
x_mean=0.498602

Variance and standard deviation with/without oosvars

$ mlr --oxtab stats1 -a count,sum,mean,var,stddev -f x data/medium
x_count  10000
x_sum    4986.019682
x_mean   0.498602
x_var    0.084270
x_stddev 0.290293

$ cat variance.mlr
@n += 1;
@sumx += $x;
@sumx2 += $x**2;
end {
  @mean = @sumx / @n;
  @var = (@sumx2 - @mean * (2 * @sumx - @n * @mean)) / (@n - 1);
  @stddev = sqrt(@var);
  emitf @n, @sumx, @sumx2, @mean, @var, @stddev
}

$ mlr --oxtab put -q -f variance.mlr data/medium
n      10000
sumx   4986.019682
sumx2  3328.652400
mean   0.498602
var    0.084270
stddev 0.290293
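
The expression for @var in variance.mlr is the expanded form of the sum of squared deviations: sumx2 - 2*mean*sumx + n*mean**2, divided by n - 1. The Python sketch below checks that one-pass formula against the textbook two-pass definition on the five x values from data/small. The function names are hypothetical and the code is a cross-check, not Miller's implementation.

```python
import math

# x values from the five records of data/small shown elsewhere on this page.
XS = [
    0.3467901443380824,
    0.7586799647899636,
    0.20460330576630303,
    0.38139939387114097,
    0.5732889198020006,
]


def streaming_stats(xs):
    """One pass over xs, keeping only n, sum(x), and sum(x**2)."""
    n, sumx, sumx2 = 0, 0.0, 0.0
    for x in xs:
        n += 1
        sumx += x
        sumx2 += x ** 2
    mean = sumx / n
    # Same expression as @var in variance.mlr.
    var = (sumx2 - mean * (2.0 * sumx - n * mean)) / (n - 1)
    return n, mean, var, math.sqrt(var)


def two_pass_variance(xs):
    """Sample variance from the definition: sum of squared deviations / (n - 1)."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)


n, mean, var, stddev = streaming_stats(XS)
print(n, mean, var, stddev)
print(two_pass_variance(XS))
```

One caveat worth knowing: sum-of-squares formulas can lose precision when the mean is large compared with the spread of the data. Welford's algorithm is a well-known one-pass alternative for that case.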

Min/max with/without oosvars

$ mlr --oxtab stats1 -a min,max -f x data/medium
x_min 0.000045
x_max 0.999953

$ mlr --oxtab put -q '@min = min(@min, $x); @max = max(@max, $x); end{emitf @min, @max}' data/medium
min 0.000045
max 0.999953

Delta with/without oosvars

$ mlr --opprint step -a delta -f x data/small
a   b   i x                   y                   x_delta
pan pan 1 0.3467901443380824  0.7268028627434533  0
eks pan 2 0.7586799647899636  0.5221511083334797  0.411890
wye wye 3 0.20460330576630303 0.33831852551664776 -0.554077
eks wye 4 0.38139939387114097 0.13418874328430463 0.176796
wye pan 5 0.5732889198020006  0.8636244699032729  0.191890

$ mlr --opprint put '$x_delta = ispresent(@last) ? $x - @last : 0; @last = $x' data/small
a   b   i x                   y                   x_delta
pan pan 1 0.3467901443380824  0.7268028627434533  0
eks pan 2 0.7586799647899636  0.5221511083334797  0.411890
wye wye 3 0.20460330576630303 0.33831852551664776 -0.554077
eks wye 4 0.38139939387114097 0.13418874328430463 0.176796
wye pan 5 0.5732889198020006  0.8636244699032729  0.191890

Keyed delta with/without oosvars

$ mlr --opprint step -a delta -f x -g a data/small
a   b   i x                   y                   x_delta
pan pan 1 0.3467901443380824  0.7268028627434533  0
eks pan 2 0.7586799647899636  0.5221511083334797  0
wye wye 3 0.20460330576630303 0.33831852551664776 0
eks wye 4 0.38139939387114097 0.13418874328430463 -0.377281
wye pan 5 0.5732889198020006  0.8636244699032729  0.368686

$ mlr --opprint put '$x_delta = ispresent(@last[$a]) ? $x - @last[$a] : 0; @last[$a]=$x' data/small
a   b   i x                   y                   x_delta
pan pan 1 0.3467901443380824  0.7268028627434533  0
eks pan 2 0.7586799647899636  0.5221511083334797  0
wye wye 3 0.20460330576630303 0.33831852551664776 0
eks wye 4 0.38139939387114097 0.13418874328430463 -0.377281
wye pan 5 0.5732889198020006  0.8636244699032729  0.368686
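
The keyed-delta logic above amounts to remembering the last value seen for each key. Here is a minimal Python sketch of that idea using the a and x fields from data/small. The function name keyed_delta is hypothetical, and the "k in last" test plays the role of ispresent(@last[$a]).

```python
# (a, x) pairs from the five records of data/small shown above.
RECORDS = [
    {"a": "pan", "x": 0.3467901443380824},
    {"a": "eks", "x": 0.7586799647899636},
    {"a": "wye", "x": 0.20460330576630303},
    {"a": "eks", "x": 0.38139939387114097},
    {"a": "wye", "x": 0.5732889198020006},
]


def keyed_delta(records, field, key):
    """Difference between each value and the previous value with the same key."""
    last = {}  # plays the role of @last[$a]
    deltas = []
    for rec in records:
        k = rec[key]
        # The first record for a key has no predecessor, so its delta is 0.
        deltas.append(rec[field] - last[k] if k in last else 0)
        last[k] = rec[field]
    return deltas


print([round(d, 6) for d in keyed_delta(RECORDS, "x", "a")])
# prints [0, 0, 0, -0.377281, 0.368686]
```

Rounded to six decimal places, this matches the x_delta column in the Miller output above.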

Exponentially weighted moving averages with/without oosvars

$ mlr --opprint step -a ewma -d 0.1 -f x data/small
a   b   i x                   y                   x_ewma_0.1
pan pan 1 0.3467901443380824  0.7268028627434533  0.346790
eks pan 2 0.7586799647899636  0.5221511083334797  0.387979
wye wye 3 0.20460330576630303 0.33831852551664776 0.369642
eks wye 4 0.38139939387114097 0.13418874328430463 0.370817
wye pan 5 0.5732889198020006  0.8636244699032729  0.391064

$ mlr --opprint put '
  begin{ @a=0.1 };
  $e = NR==1 ? $x : @a * $x + (1 - @a) * @e;
  @e=$e
' data/small
a   b   i x                   y                   e
pan pan 1 0.3467901443380824  0.7268028627434533  0.346790
eks pan 2 0.7586799647899636  0.5221511083334797  0.387979
wye wye 3 0.20460330576630303 0.33831852551664776 0.369642
eks wye 4 0.38139939387114097 0.13418874328430463 0.370817
wye pan 5 0.5732889198020006  0.8636244699032729  0.391064
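
The recurrence in the put expression above is e[1] = x[1] and e[n] = a * x[n] + (1 - a) * e[n-1]. A short Python sketch of it, using the x column of data/small, can serve as a cross-check. The function name ewma is hypothetical, and the code illustrates the recurrence rather than Miller's own code.

```python
# x values from the five records of data/small shown above.
XS = [
    0.3467901443380824,
    0.7586799647899636,
    0.20460330576630303,
    0.38139939387114097,
    0.5732889198020006,
]


def ewma(xs, alpha):
    """Exponentially weighted moving average, seeded with the first value."""
    averages = []
    prev = None
    for x in xs:
        # The first record seeds the average; later ones blend in the new value.
        prev = x if prev is None else alpha * x + (1 - alpha) * prev
        averages.append(prev)
    return averages


print([round(e, 6) for e in ewma(XS, 0.1)])
# prints [0.34679, 0.387979, 0.369642, 0.370817, 0.391064]
```

These agree with the x_ewma_0.1 and e columns above to six decimal places (Python prints 0.346790 as 0.34679).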