Performance

Contents:
• Data
• Comparands
• Raw results
• Analysis
• Conclusion

Data

Test data were of the form

a=pan,b=pan,i=1,x=0.3467901443380824,y=0.7268028627434533
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797
a=wye,b=wye,i=3,x=0.20460330576630303,y=0.33831852551664776
a=eks,b=wye,i=4,x=0.38139939387114097,y=0.13418874328430463
a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729

a,b,i,x,y
pan,pan,1,0.3467901443380824,0.7268028627434533
eks,pan,2,0.7586799647899636,0.5221511083334797
wye,wye,3,0.20460330576630303,0.33831852551664776
eks,wye,4,0.38139939387114097,0.13418874328430463
wye,pan,5,0.5732889198020006,0.8636244699032729

for DKVP and CSV, respectively, where fields a and b take one of five text values, uniformly distributed; i is a 1-up line counter; x and y are independent uniformly distributed floating-point numbers in the unit interval.

Data files of one million lines (totalling about 50MB for CSV and 60MB for DKVP) were used. In experiments not shown here, I also varied the file sizes; the size-dependent results were the expected, completely unsurprising linearities, and so I produced no file-size-dependent plots for your viewing pleasure.
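Test data of this form can be reproduced with a short generator. The following is a sketch of my own, not the script used to produce the actual test data: only pan, eks, and wye appear in the excerpt above, so the other two of the five text values (zee, hat here) and the random seed are assumptions.

```shell
# Hypothetical generator for DKVP test data of the form shown above.
# The full value list and the seed are assumptions.
awk -v n=5 'BEGIN {
  srand(1)
  split("pan,eks,wye,zee,hat", v, ",")
  for (i = 1; i <= n; i++) {
    printf "a=%s,b=%s,i=%d,x=%.16f,y=%.16f\n",
      v[int(rand()*5)+1], v[int(rand()*5)+1], i, rand(), rand()
  }
}'
# For the CSV variant, pipe through: mlr --idkvp --ocsv cat
```

With n set to one million, this produces files of roughly the sizes quoted above.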

Comparands

The cat, cut, awk, sed, sort tools were compared to mlr on an 8-core Darwin laptop; RAM capacity was nowhere near challenged. The catc program is a simple line-oriented line-printer (source here) which is intermediate between Miller (which is record-aware as well as line-aware) and cat (which is only byte-aware).

Raw results

Note that for CSV data, the command is mlr --csvlite ... rather than mlr ....

   Mac     Mac         Comparand
   DKVP    CSV
  seconds seconds

   0.016   0.013       cat
   0.189   0.189       catc
   3.657   4.388       awk -F, '{print}'
   2.027   1.795       mlr cat

   2.292   1.940       cut -d , -f 1,4
   3.540   4.516       awk -F, '{print $1,$4}'
   1.600   1.390       mlr cut -f a,x
   1.694   1.648       mlr cut -x -f a,x

   0.845   0.643       sed -e 's/x=/EKS=/' -e 's/b=/BEE=/'
   2.076   1.842       mlr rename x,EKS,b,BEE

   5.643   5.031       awk -F, '{gsub("x=","",$4);gsub("y=","",$5);print $4+$5}'
   4.019   3.679       mlr put '$z=$x+$y'

   2.481   2.628       mlr stats1 -a mean -f x,y -g a,b

   2.587   2.389       mlr stats2 -a corr -f x,y -g a,b

  23.247  14.466       sort -t, -k 1,2
   3.023   5.658       mlr sort -f a,b

  17.224  15.523       sort -t, -k 4,5
   5.807   5.194       mlr sort -nf x,y
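Timings of this kind can be collected with a loop like the following. This is a hedged sketch, not the script actually used for the table above: the file name data.dkvp and the use of the shell's time keyword are my assumptions.

```shell
# Hypothetical timing harness; the original measurement script is not shown.
# Assumes the million-line input is named data.dkvp. Add the Miller commands
# (e.g. 'mlr cat data.dkvp') to the list once mlr is on the PATH.
for cmd in 'cat data.dkvp' "awk -F, '{print}' data.dkvp" 'sed -e s/x=/EKS=/ data.dkvp'; do
  echo "== $cmd"
  time eval "$cmd" > /dev/null   # wall-clock time is printed on stderr
done
```

Repeating each run a few times helps smooth out filesystem-cache effects.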

Analysis

  • As expected, cat is very fast — it needs only to stream bytes as quickly as possible; it doesn’t even need to touch individual bytes.
  • My catc is also faster than Miller: it needs to read and write lines, but it doesn’t segment lines into records; in fact it does no iteration over bytes in each line.
  • Miller does not outperform sed, which is string-oriented rather than record-oriented.
  • For the tools which do need to pick apart fields (cut, awk, sort), Miller is comparable to, or outperforms, the others. As noted above, this effect persists linearly across file sizes.
  • For univariate and bivariate statistics, I didn’t attempt to compare to other tools wherein such computations are less straightforward; rather, I attempted only to show that Miller’s processing time here is comparable to its own processing time for other problems.

Conclusion

For record-oriented data transformations, Miller meets or beats the Unix toolkit in many contexts. Field renames in particular, where sed outperforms mlr rename, are worth doing as a pre-pipe or post-pipe using sed.
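Acting on that advice might look like the following pipeline — a sketch under assumptions: the input file name data.dkvp is hypothetical, and the stats1 step is just one example of downstream record-aware work.

```shell
# sed does the cheap stream-level rename; Miller then does the record-aware
# aggregation on the renamed fields. (File name data.dkvp is hypothetical.)
sed -e 's/x=/EKS=/' -e 's/b=/BEE=/' data.dkvp \
  | mlr stats1 -a mean -f EKS -g a,BEE
```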