Data
Test data were of the form
Data files of one million lines (totalling about 50MB for CSV and 60MB for DKVP) were used. In experiments not shown here, I also varied the file sizes; the size-dependent results were the expected, completely unsurprising linearities, so I produced no file-size-dependent plots for your viewing pleasure.
Comparands
The cat, cut, awk, sed, sort tools
were compared to mlr on an 8-core Darwin laptop; RAM capacity was
nowhere near challenged. The catc program is a simple line-oriented
line-printer (source
here) which is intermediate between Miller (which is record-aware as well
as line-aware) and cat (which is only byte-aware).
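To make the difference between position-aware and record-aware tools concrete, here is a minimal sketch that runs two of the Unix-toolkit comparands from the results on a single DKVP record. The record itself is an assumption: the keys a, b, x, y are taken from the benchmark commands, while the key i and all of the values are invented for illustration.

```shell
# Hypothetical DKVP record: keys a, b, x, y appear in the benchmark commands;
# the key `i` and every value here are made up for illustration.
line='a=pan,b=wye,i=1,x=0.25,y=0.50'

# cut only knows field positions: fields 1 and 4 happen to be a and x.
picked=$(printf '%s\n' "$line" | cut -d , -f 1,4)

# awk has to strip the "x=" and "y=" keys by hand before adding the values.
sum=$(printf '%s\n' "$line" | awk -F, '{gsub("x=","",$4);gsub("y=","",$5);print $4+$5}')

printf '%s\n%s\n' "$picked" "$sum"
# → a=pan,x=0.25
# → 0.75
```

Both commands depend on x and y sitting in columns 4 and 5; if the field order changed, they would silently pick the wrong data, whereas the mlr comparands address fields by name.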
Raw results
Note that for CSV data, the command is mlr --csvlite ... rather than mlr ....
DKVP seconds (Mac)  CSV seconds (Mac)  Comparand
 0.016               0.013             cat
 0.189               0.189             catc
 3.657               4.388             awk -F, '{print}'
 2.027               1.795             mlr cat
 2.292               1.940             cut -d , -f 1,4
 3.540               4.516             awk -F, '{print $1,$4}'
 1.600               1.390             mlr cut -f a,x
 1.694               1.648             mlr cut -x -f a,x
 0.845               0.643             sed -e 's/x=/EKS=/' -e 's/b=/BEE=/'
 2.076               1.842             mlr rename x,EKS,b,BEE
 5.643               5.031             awk -F, '{gsub("x=","",$4);gsub("y=","",$5);print $4+$5}'
 4.019               3.679             mlr put '$z=$x+$y'
 2.481               2.628             mlr stats1 -a mean -f x,y -g a,b
 2.587               2.389             mlr stats2 -a corr -f x,y -g a,b
23.247              14.466             sort -t, -k 1,2
 3.023               5.658             mlr sort -f a,b
17.224              15.523             sort -t, -k 4,5
 5.807               5.194             mlr sort -n x,y
Analysis
Conclusion
For record-oriented data transformations, Miller meets or beats the Unix
toolkit in many contexts. Field renames in particular are worth doing as a
pre-pipe or post-pipe using sed.
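As a minimal sketch of the sed rename step, the following applies the sed command from the results to one DKVP record. The record is hypothetical: the keys a, b, x, y come from the benchmark commands, while the key i and the values are invented for illustration.

```shell
# Hypothetical DKVP record; only the keys a, b, x, y come from the benchmark
# commands. The key `i` and all values are illustrative assumptions.
line='a=pan,b=wye,i=1,x=0.25,y=0.50'

# Rename x to EKS and b to BEE as a plain text substitution.
renamed=$(printf '%s\n' "$line" | sed -e 's/x=/EKS=/' -e 's/b=/BEE=/')

printf '%s\n' "$renamed"
# → a=pan,BEE=wye,i=1,EKS=0.25,y=0.50
```

Note that these sed patterns are not field-aware: they replace the first occurrence of "x=" or "b=" anywhere on the line, so a key such as "max=" or a value containing "b=" could be rewritten by mistake. The speed advantage comes with that caveat.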