• Data
• Comparands
• Raw results
• Analysis
• Conclusion


Data

Test data were of the form



for DKVP and CSV, respectively, where fields a and b take one of five text values, uniformly distributed; i is a 1-up line counter; x and y are independent uniformly distributed floating-point numbers in the unit interval.
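Data of this shape could be generated with a short script. The following is a minimal sketch, not the generator actually used for these benchmarks: the five text values in `VALUES` are placeholders (the values actually used are not listed here), and the function names and the fixed seed are assumptions made for illustration.

```python
import random

# Placeholder text values: the five values actually used are not given here.
VALUES = ["v1", "v2", "v3", "v4", "v5"]
FIELDS = ["a", "b", "i", "x", "y"]

def make_records(n, seed=0):
    """Yield n records: a, b uniform over VALUES; i a 1-up counter; x, y uniform in [0, 1)."""
    rng = random.Random(seed)
    for i in range(1, n + 1):
        yield {
            "a": rng.choice(VALUES),
            "b": rng.choice(VALUES),
            "i": i,
            "x": rng.random(),
            "y": rng.random(),
        }

def to_dkvp(rec):
    """Format one record as a DKVP line: key=value pairs joined by commas."""
    return ",".join(f"{k}={rec[k]}" for k in FIELDS)

def to_csv(rec):
    """Format one record as a CSV data line (values only; the header is written separately)."""
    return ",".join(str(rec[k]) for k in FIELDS)

def write_files(n, dkvp_path, csv_path, seed=0):
    """Write the same n records to a DKVP file and to a CSV file with a header line."""
    with open(dkvp_path, "w") as dkvp, open(csv_path, "w") as csv:
        csv.write(",".join(FIELDS) + "\n")
        for rec in make_records(n, seed):
            dkvp.write(to_dkvp(rec) + "\n")
            csv.write(to_csv(rec) + "\n")
```

Calling `write_files(1_000_000, "data.dkvp", "data.csv")` (hypothetical file names) would produce a pair of files with the same records in both formats.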

Data files of one million lines (totalling about 50 MB for CSV and 60 MB for DKVP) were used. In experiments not shown here, I also varied the file sizes; run times scaled linearly with file size, exactly as expected, so I produced no file-size-dependent plots.


Raw results

Note that for CSV data, the command is mlr --csvlite ... rather than mlr ....

   Mac     Mac         Comparand
   DKVP    CSV
  seconds seconds

   0.016   0.013       cat
   0.189   0.189       catc
   3.657   4.388       awk -F, '{print}'
   2.027   1.795       mlr cat

   2.292   1.940       cut -d , -f 1,4
   3.540   4.516       awk -F, '{print $1,$4}'
   1.600   1.390       mlr cut -f a,x
   1.694   1.648       mlr cut -x -f a,x

   0.845   0.643       sed -e 's/x=/EKS=/' -e 's/b=/BEE=/'
   2.076   1.842       mlr rename x,EKS,b,BEE

   5.643   5.031       awk -F, '{gsub("x=","",$4);gsub("y=","",$5);print $4+$5}'
   4.019   3.679       mlr put '$z=$x+$y'

   2.481   2.628       mlr stats1 -a mean -f x,y -g a,b

   2.587   2.389       mlr stats2 -a corr -f x,y -g a,b

  23.247  14.466       sort -t, -k 1,2
   3.023   5.658       mlr sort -f a,b

  17.224  15.523       sort -t, -k 4,5
   5.807   5.194       mlr sort -n x,y
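
Timings of this kind could be collected with a small harness. This is a sketch under stated assumptions rather than the procedure behind the numbers above: it assumes each comparand reads the data file on standard input, that output is discarded, and that the best of a few repeats is reported. The function name `time_command` and the `repeats` parameter are hypothetical.

```python
import subprocess
import time

def time_command(argv, input_path, repeats=3):
    """Return the best wall-clock time in seconds for argv reading input_path on stdin."""
    best = float("inf")
    for _ in range(repeats):
        with open(input_path, "rb") as infile:
            start = time.perf_counter()
            # Output is discarded so that terminal or disk writes do not dominate the timing.
            subprocess.run(argv, stdin=infile, stdout=subprocess.DEVNULL, check=True)
            best = min(best, time.perf_counter() - start)
    return best
```

For example, `time_command(["mlr", "cat"], "data.dkvp")` would time the DKVP case of `mlr cat`, assuming `mlr` is on the path and `data.dkvp` is a hypothetical test file.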


Analysis

  • As expected, cat is very fast: it need only stream bytes as quickly as possible, without examining individual bytes at all.
  • My catc is also faster than Miller: it reads and writes lines but does not segment them into records; in fact it does not iterate over the bytes within each line.
  • Miller does not outperform sed, which is string-oriented rather than record-oriented.
  • For the tools which do need to pick apart fields (cut, awk, sort), Miller is comparable or faster. As noted above, this holds linearly across file sizes.
  • For univariate and bivariate statistics, I didn’t attempt a comparison with other tools, in which such computations are less straightforward; rather, I attempted only to show that Miller’s processing time here is comparable to its own processing time for other problems.
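
The three levels of work described above (streaming bytes, copying lines, and parsing records) can be illustrated as follows. This is an analogy in Python, not the implementation of cat, catc, or Miller; all function names are hypothetical, and the parser assumes well-formed, non-empty DKVP lines.

```python
import shutil

def copy_bytes(src, dst):
    """cat-like: move data in large chunks without inspecting it."""
    shutil.copyfileobj(src, dst)

def copy_lines(src, dst):
    """catc-like: read and write whole lines, never looking inside a line."""
    for line in src:
        dst.write(line)

def parse_dkvp(line):
    """Split one DKVP line into an ordered mapping of field name to string value."""
    return dict(pair.split("=", 1) for pair in line.rstrip("\n").split(","))

def format_dkvp(rec):
    """Join a field mapping back into a DKVP line (without the trailing newline)."""
    return ",".join(f"{k}={v}" for k, v in rec.items())

def copy_records(src, dst):
    """Record-oriented: parse every line into fields, then write the fields back out."""
    for line in src:
        dst.write(format_dkvp(parse_dkvp(line)) + "\n")
```

All three produce the same output on well-formed input, but only `copy_records` pays for splitting each line into fields, which is the work any record-oriented tool must do.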


Conclusion

For record-oriented data transformations, Miller meets or beats the Unix toolkit in many contexts. Field renames are the exception: sed was faster there (0.845 versus 2.076 seconds on DKVP), so they are worth doing with sed as a pre-pipe or post-pipe step.
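
The difference between a string-oriented and a record-oriented rename can be sketched as below. This is an illustration only: `rename_stringwise` mimics the two sed substitutions from the results table by replacing the first match of each pattern per line, while `rename_recordwise` renames by exact field name. Both function names are hypothetical, and the sketch assumes DKVP lines whose values contain no commas.

```python
import re

def rename_stringwise(line):
    """Mimic sed -e 's/x=/EKS=/' -e 's/b=/BEE=/': replace the first match of each pattern."""
    line = re.sub(r"x=", "EKS=", line, count=1)
    return re.sub(r"b=", "BEE=", line, count=1)

def rename_recordwise(line, renames):
    """Rename fields by exact key match, leaving values and other keys alone."""
    pairs = (pair.split("=", 1) for pair in line.split(","))
    return ",".join(f"{renames.get(k, k)}={v}" for k, v in pairs)
```

On lines shaped like the test data the two agree. They can differ when a pattern also occurs inside another field name: a hypothetical key such as `max` contains `x=` once the separator is attached, so the string-oriented version would rewrite it. That is worth checking before substituting sed for a record-oriented rename.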