• Bulk rename of field names
• Filtering paragraphs of text
• Doing arithmetic on fields with currency symbols
• Program timing
• Using out-of-stream variables
• Mean with/without oosvars
• Variance and standard deviation with/without oosvars
• Min/max with/without oosvars
• Delta with/without oosvars
• Keyed delta with/without oosvars
• Exponentially weighted moving averages with/without oosvars

Parsing log-file output

This, of course, depends highly on what’s in your log files. But, as an example, suppose you have log-file lines such as

2015-10-08 08:29:09,445 INFO com.company.path.to.ClassName @ [sometext] various/sorts/of data {& punctuation} hits=1 status=0 time=2.378

Then a pipeline such as the following selects the matching lines, strips everything up to and including the closing brace, and summarizes the time field by status:

grep 'various/sorts' *.log | sed 's/.*} //' | mlr --fs space --repifs --oxtab stats1 -a min,p10,p50,p90,max -f time -g status

Bulk rename of field names

$ cat spaces.csv
a b c,d e f
123,4567
890,2468
9987,3312

$ mlr --csv --rs lf rename -r -g ' ,_' spaces.csv
a_b_c,d_e_f
123,4567
890,2468
9987,3312

$ mlr --csv --irs lf --opprint rename -r -g ' ,_' spaces.csv
a_b_c d_e_f
123   4567
890   2468
9987  3312

Filtering paragraphs of text

The idea is to use a record separator which is a pair of newlines. Then, if you want each paragraph to be a record with a single value, use a field separator which isn’t present in the input data (e.g. a control-A, which is octal 001). Or, if you want each paragraph to have its lines as separate values, use newline as the field separator.

$ cat paragraphs.txt
The quick brown fox jumped over the lazy dogs. The quick brown fox jumped over the lazy dogs. The quick brown fox jumped over the lazy dogs. The quick brown fox jumped over the lazy dogs. The quick brown fox jumped over the lazy dogs.

Now is the time for all good people to come to the aid of their country. Now is the time for all good people to come to the aid of their country. Now is the time for all good people to come to the aid of their country. Now is the time for all good people to come to the aid of their country.
Now is the time for all good people to come to the aid of their country.

Sphynx of black quartz, judge my vow. Sphynx of black quartz, judge my vow. Sphynx of black quartz, judge my vow. Sphynx of black quartz, judge my vow. Sphynx of black quartz, judge my vow.

The rain in Spain falls mainly on the plain. The rain in Spain falls mainly on the plain. The rain in Spain falls mainly on the plain. The rain in Spain falls mainly on the plain. The rain in Spain falls mainly on the plain. The rain in Spain falls mainly on the plain. The rain in Spain falls mainly on the plain. The rain in Spain falls mainly on the plain.

$ mlr --from paragraphs.txt --nidx --rs '\n\n' --fs '\001' filter '$1 =~ "the"'
The quick brown fox jumped over the lazy dogs. The quick brown fox jumped over the lazy dogs. The quick brown fox jumped over the lazy dogs. The quick brown fox jumped over the lazy dogs. The quick brown fox jumped over the lazy dogs.

Now is the time for all good people to come to the aid of their country. Now is the time for all good people to come to the aid of their country. Now is the time for all good people to come to the aid of their country. Now is the time for all good people to come to the aid of their country. Now is the time for all good people to come to the aid of their country.

The rain in Spain falls mainly on the plain. The rain in Spain falls mainly on the plain. The rain in Spain falls mainly on the plain. The rain in Spain falls mainly on the plain. The rain in Spain falls mainly on the plain. The rain in Spain falls mainly on the plain. The rain in Spain falls mainly on the plain. The rain in Spain falls mainly on the plain.

$ mlr --from paragraphs.txt --nidx --rs '\n\n' --fs '\n' cut -f 1,3
The quick brown fox jumped over the lazy dogs. The quick brown fox jumped brown fox jumped over the lazy dogs. The quick brown fox jumped over the

Now is the time for all good people to come to the aid of their country.
Now the time for all good people to come to the aid of their country. Now is the

Sphynx of black quartz, judge my vow. Sphynx of black quartz, judge my vow. Sphynx of black quartz, judge my vow.

The rain in Spain falls mainly on the plain. The rain in Spain falls mainly falls mainly on the plain. The rain in Spain falls mainly on the plain. The

Doing arithmetic on fields with currency symbols

$ cat sample.csv
EventOccurred,EventType,Description,Status,PaymentType,NameonAccount,TransactionNumber,Amount
10/1/2015,Charged Back,Reason: Authorization Revoked By Customer,Disputed,Checking,John,1,$230.36
10/1/2015,Charged Back,Reason: Authorization Revoked By Customer,Disputed,Checking,Fred,2,$32.25
10/1/2015,Charged Back,Reason: Customer Advises Not Authorized,Disputed,Checking,Bob,3,$39.02
10/1/2015,Charged Back,Reason: Authorization Revoked By Customer,Disputed,Checking,Alice,4,$57.54
10/1/2015,Charged Back,Reason: Authorization Revoked By Customer,Disputed,Checking,Jungle,5,$230.36
10/1/2015,Charged Back,Reason: Payment Stopped,Disputed,Checking,Joe,6,$281.96
10/2/2015,Charged Back,Reason: Customer Advises Not Authorized,Disputed,Checking,Joseph,7,$188.19
10/2/2015,Charged Back,Reason: Customer Advises Not Authorized,Disputed,Checking,Joseph,8,$188.19
10/2/2015,Charged Back,Reason: Payment Stopped,Disputed,Checking,Anthony,9,$250.00

$ mlr --icsv --opprint cat sample.csv
EventOccurred EventType    Description                               Status   PaymentType NameonAccount TransactionNumber Amount
10/1/2015     Charged Back Reason: Authorization Revoked By Customer Disputed Checking    John          1                 $230.36
10/1/2015     Charged Back Reason: Authorization Revoked By Customer Disputed Checking    Fred          2                 $32.25
10/1/2015     Charged Back Reason: Customer Advises Not Authorized   Disputed Checking    Bob           3                 $39.02
10/1/2015     Charged Back Reason: Authorization Revoked By Customer Disputed Checking    Alice         4                 $57.54
10/1/2015     Charged Back Reason: Authorization Revoked By Customer Disputed Checking    Jungle        5                 $230.36
10/1/2015     Charged Back Reason: Payment Stopped                   Disputed Checking    Joe           6                 $281.96
10/2/2015     Charged Back Reason: Customer Advises Not Authorized   Disputed Checking    Joseph        7                 $188.19
10/2/2015     Charged Back Reason: Customer Advises Not Authorized   Disputed Checking    Joseph        8                 $188.19
10/2/2015     Charged Back Reason: Payment Stopped                   Disputed Checking    Anthony       9                 $250.00

$ mlr --csv put '$Amount = sub(string($Amount), "\$", "")' then stats1 -a sum -f Amount sample.csv
Amount_sum
1497.870000

$ mlr --csv --ofmt '%.2lf' put '$Amount = sub(string($Amount), "\$", "")' then stats1 -a sum -f Amount sample.csv
Amount_sum
1497.87

Program timing

This admittedly artificial example demonstrates using Miller’s time and stats functions to introspectively acquire some information about Miller’s own runtime. The delta stepper (step -a delta) computes the difference between successive timestamps.

$ ruby -e '10000.times{|i|puts "i=#{i+1}"}' > lines.txt

$ head -n 5 lines.txt
i=1
i=2
i=3
i=4
i=5

$ mlr --ofmt '%.9le' --opprint put '$t=systime()' then step -a delta -f t lines.txt | head -n 7
i t                 t_delta
1 1430603027.018016 1.430603027e+09
2 1430603027.018043 2.694129944e-05
3 1430603027.018048 5.006790161e-06
4 1430603027.018052 4.053115845e-06
5 1430603027.018055 2.861022949e-06
6 1430603027.018058 3.099441528e-06

$ mlr --ofmt '%.9le' --oxtab \
  put '$t=systime()' then \
  step -a delta -f t then \
  filter '$i>1' then \
  stats1 -a min,mean,max -f t_delta \
  lines.txt
t_delta_min  2.861022949e-06
t_delta_mean 4.077508505e-06
t_delta_max  5.388259888e-05

Using out-of-stream variables

One of Miller’s strengths is its compact notation: for example, given input of the form

$ head -n 5 ../data/medium
a=pan,b=pan,i=1,x=0.3467901443380824,y=0.7268028627434533
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797
a=wye,b=wye,i=3,x=0.20460330576630303,y=0.33831852551664776
a=eks,b=wye,i=4,x=0.38139939387114097,y=0.13418874328430463
a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729

you can compute sums with a single stats1 call, with or without grouping:

$ mlr --oxtab stats1 -a sum -f x ../data/medium
x_sum 4986.019682

$ mlr \
  --opprint stats1 -a sum -f x -g b ../data/medium
b   x_sum
pan 965.763670
wye 1023.548470
zee 979.742016
eks 1016.772857
hat 1000.192668

The same sums can also be accumulated by hand using out-of-stream variables:

$ mlr --oxtab put -q '
  @x_sum += $x;
  end { emit @x_sum }
' data/medium
x_sum 4986.019682

$ mlr --opprint put -q '
  @x_sum[$b] += $x;
  end { emit @x_sum, "b" }
' data/medium
b   x_sum
pan 965.763670
wye 1023.548470
zee 979.742016
eks 1016.772857
hat 1000.192668

Mean with/without oosvars

$ mlr stats1 -a mean -f x data/medium
x_mean=0.498602

$ mlr put -q '@x_sum += $x; @x_count += 1; end{@x_mean=@x_sum/@x_count; emit @x_mean}' data/medium
x_mean=0.498602

Variance and standard deviation with/without oosvars

$ mlr --oxtab stats1 -a count,sum,mean,var,stddev -f x data/medium
x_count  10000
x_sum    4986.019682
x_mean   0.498602
x_var    0.084270
x_stddev 0.290293

$ cat variance.mlr
@n += 1;
@sumx += $x;
@sumx2 += $x**2;
end {
  @mean = @sumx / @n;
  @var = (@sumx2 - @mean * (2 * @sumx - @n * @mean)) / (@n - 1);
  @stddev = sqrt(@var);
  emitf @n, @sumx, @sumx2, @mean, @var, @stddev
}

$ mlr --oxtab put -q -f variance.mlr data/medium
n      10000
sumx   4986.019682
sumx2  3328.652400
mean   0.498602
var    0.084270
stddev 0.290293

Min/max with/without oosvars

$ mlr --oxtab stats1 -a min,max -f x data/medium
x_min 0.000045
x_max 0.999953

$ mlr --oxtab put -q '@min = min(@min, $x); @max = max(@max, $x); end{emitf @min, @max}' data/medium
min 0.000045
max 0.999953

Delta with/without oosvars

$ mlr --opprint step -a delta -f x data/small
a   b   i x                   y                   x_delta
pan pan 1 0.3467901443380824  0.7268028627434533  0
eks pan 2 0.7586799647899636  0.5221511083334797  0.411890
wye wye 3 0.20460330576630303 0.33831852551664776 -0.554077
eks wye 4 0.38139939387114097 0.13418874328430463 0.176796
wye pan 5 0.5732889198020006  0.8636244699032729  0.191890

$ mlr --opprint put '$x_delta = ispresent(@last) ?
    $x - @last : 0; @last = $x' data/small
a   b   i x                   y                   x_delta
pan pan 1 0.3467901443380824  0.7268028627434533  0
eks pan 2 0.7586799647899636  0.5221511083334797  0.411890
wye wye 3 0.20460330576630303 0.33831852551664776 -0.554077
eks wye 4 0.38139939387114097 0.13418874328430463 0.176796
wye pan 5 0.5732889198020006  0.8636244699032729  0.191890

Keyed delta with/without oosvars

$ mlr --opprint step -a delta -f x -g a data/small
a   b   i x                   y                   x_delta
pan pan 1 0.3467901443380824  0.7268028627434533  0
eks pan 2 0.7586799647899636  0.5221511083334797  0
wye wye 3 0.20460330576630303 0.33831852551664776 0
eks wye 4 0.38139939387114097 0.13418874328430463 -0.377281
wye pan 5 0.5732889198020006  0.8636244699032729  0.368686

$ mlr --opprint put '$x_delta = ispresent(@last[$a]) ? $x - @last[$a] : 0; @last[$a]=$x' data/small
a   b   i x                   y                   x_delta
pan pan 1 0.3467901443380824  0.7268028627434533  0
eks pan 2 0.7586799647899636  0.5221511083334797  0
wye wye 3 0.20460330576630303 0.33831852551664776 0
eks wye 4 0.38139939387114097 0.13418874328430463 -0.377281
wye pan 5 0.5732889198020006  0.8636244699032729  0.368686

Exponentially weighted moving averages with/without oosvars

$ mlr --opprint step -a ewma -d 0.1 -f x data/small
a   b   i x                   y                   x_ewma_0.1
pan pan 1 0.3467901443380824  0.7268028627434533  0.346790
eks pan 2 0.7586799647899636  0.5221511083334797  0.387979
wye wye 3 0.20460330576630303 0.33831852551664776 0.369642
eks wye 4 0.38139939387114097 0.13418874328430463 0.370817
wye pan 5 0.5732889198020006  0.8636244699032729  0.391064

$ mlr --opprint put '
  begin{ @a=0.1 };
  $e = NR==1 ? $x : @a * $x + (1 - @a) * @e;
  @e=$e
' data/small
a   b   i x                   y                   e
pan pan 1 0.3467901443380824  0.7268028627434533  0.346790
eks pan 2 0.7586799647899636  0.5221511083334797  0.387979
wye wye 3 0.20460330576630303 0.33831852551664776 0.369642
eks wye 4 0.38139939387114097 0.13418874328430463 0.370817
wye pan 5 0.5732889198020006  0.8636244699032729  0.391064
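
The EWMA recurrence in the put expression above can be cross-checked outside Miller. What follows is a minimal sketch in plain Python, not Miller code: the function name ewma and the hard-coded list xs are illustrative assumptions, with the x values copied from the data/small tables above. It seeds the average with the first value and then applies e = a * x + (1 - a) * e_prev with a = 0.1.

```python
# Sketch only: a plain-Python cross-check of the EWMA recurrence, not Miller code.
# The names ewma and xs are hypothetical and chosen for illustration.

def ewma(values, alpha):
    """Return the running EWMA of values, seeded with the first value."""
    out = []
    prev = None
    for x in values:
        # First record: the average is the value itself (the NR==1 case).
        # Later records: blend the new value with the previous average.
        prev = x if prev is None else alpha * x + (1 - alpha) * prev
        out.append(prev)
    return out


# x column of data/small, copied from the tables above.
xs = [
    0.3467901443380824,
    0.7586799647899636,
    0.20460330576630303,
    0.38139939387114097,
    0.5732889198020006,
]

smoothed = ewma(xs, alpha=0.1)
for e in smoothed:
    print(f"{e:.6f}")
```

Printed to six decimal places, these values agree with the x_ewma_0.1 and e columns shown above.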
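
As a side note on the variance.mlr script in the variance section: its @var expression appears to be the usual unbiased sample variance written in terms of running sums. A short derivation follows, using the shorthand n, S_1, and S_2 for @n, @sumx, and @sumx2; these symbols are introduced here for illustration and do not appear in the original.

```latex
\begin{align*}
S_1 &= \sum_{i=1}^{n} x_i, \qquad S_2 = \sum_{i=1}^{n} x_i^2, \qquad \bar{x} = \frac{S_1}{n} \\
\sum_{i=1}^{n} (x_i - \bar{x})^2 &= S_2 - 2\bar{x} S_1 + n\bar{x}^2 = S_2 - \bar{x}\,(2 S_1 - n\bar{x}) \\
\mathrm{var} &= \frac{S_2 - \bar{x}\,(2 S_1 - n\bar{x})}{n - 1}, \qquad \mathrm{stddev} = \sqrt{\mathrm{var}}
\end{align*}
```

The last line matches the script term by term, which is why only the three accumulators @n, @sumx, and @sumx2 need to be carried across records.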