How original is Miller?

It isn’t. Miller is one of many, many participants in the online-analytical-processing culture. Other key participants include awk, SQL, spreadsheets, and so on. Far from being an original concept, Miller explicitly strives to imitate several existing tools:

Unix toolkit: Intentional similarities as described in Miller features in the context of the Unix toolkit.

Recipes abound for command-line data analysis using the Unix toolkit; one classic pattern is sketched below.
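As one such recipe (a sketch, assuming a comma-delimited file data.csv whose third column holds the value of interest):

    # Skip the header, then count distinct values in column 3, most frequent first:
    tail -n +2 data.csv | cut -d, -f3 | sort | uniq -c | sort -nr | head

Note the positional -f3 and the manual header handling; this is exactly the style of processing Miller sets out to improve on.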

RecordStream: Miller owes particular inspiration to RecordStream. The key difference is that RecordStream is a Perl-based tool for manipulating JSON (other formats such as CSV must be separately converted into and out of JSON), while Miller is fast C which handles its formats natively. The similarities include the sort, stats1 (an analog of RecordStream’s collate), and delta operations, as well as filter and put, and pretty-print formatting.
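To make the parallels concrete, here is a sketch (the file and field names are hypothetical) chaining those shared verbs in Miller:

    # Sort, then an awk-like computed field via put, then stats1,
    # which plays the role of RecordStream's collate:
    mlr --icsv --opprint sort -f name \
      then put '$cost = $rate * $hours' \
      then stats1 -a sum,mean -f cost -g name input.csv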

stats_m: A third source of lineage is my Python stats_m module. This includes simple single-pass algorithms which form the basis of Miller’s stats1 and stats2 subcommands.
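For instance (a sketch; the file and field names are hypothetical), stats1 handles univariate statistics and stats2 handles bivariate ones, each in a single pass over the data:

    # Univariate: mean and standard deviation of field x
    mlr --icsv --opprint stats1 -a mean,stddev -f x input.csv
    # Bivariate: ordinary-least-squares fit and R-squared for (x, y) pairs
    mlr --icsv --opprint stats2 -a linreg-ols,r2 -f x,y input.csv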

SQL: Fourthly, Miller’s group-by command name is from SQL, as is the term aggregate.
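For example (a sketch; the file and field names are hypothetical), a SQL-style aggregation and its Miller counterpart:

    # SQL:  SELECT color, COUNT(*), AVG(x) FROM t GROUP BY color;
    # Miller's equivalent, grouping by the color field:
    mlr --icsv --opprint stats1 -a count,mean -f x -g color input.csv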

Added value: Miller’s added values include:

  • Name-indexing, compared to the Unix toolkit’s positional indexing (see the sketch after this list).
  • Raw speed, compared to awk, RecordStream, stats_m, or various other kinds of Python/Ruby/etc. scripts one can easily create.
  • Ability to handle text files directly on the Unix pipe, with no need to create database tables, compared to SQL databases.
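To make the name-indexing point concrete (a sketch; the file and column names are hypothetical):

    # Unix toolkit: positional and header-unaware
    cut -d, -f2,4 data.csv
    # Miller: the same selection by column name, header-aware
    mlr --csv cut -f hostname,uptime data.csv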

jq: Miller does for name-indexed text what jq does for JSON. If you’re not already familiar with jq, please check it out!
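As a rough parallel (a sketch; the field name is hypothetical), here is the same record selection in each tool:

    # jq, on a top-level JSON array of objects:
    jq '.[] | select(.status == "up")' input.json
    # Miller, on the same data:
    mlr --json filter '$status == "up"' input.json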

What about DOTADIW? One of the key points of the Unix philosophy is that a tool should do one thing and do it well, which is why sort and cut each do just one thing. So why does Miller put awk-like processing, a few SQL-like operations, and statistical reduction all into one tool (see also Reference)? This is a fair question. First, note that many standard tools, such as awk and perl (and jq, for that matter), do quite a few things. Second, I could instead have pushed for putting format awareness and name-indexing options into cut, awk, and so on (so you could do cut -f hostname,uptime or awk '{sum += $x*$y}END{print sum}' with names rather than positions), but patching cut, sort, etc. on multiple operating systems is a non-starter in terms of uptake. Moreover, it makes sense for Miller to collect format-aware record-stream processing into one place, with good reuse of Miller-internal library code across its various features.
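For what it’s worth, Miller provides both of those hypotheticals directly (a sketch; the file and field names are invented):

    # Name-indexed cut:
    mlr --csv cut -f hostname,uptime data.csv
    # awk-like sum over named fields, emitted at end of stream:
    mlr --icsv --opprint put -q '@sum += $x * $y; end { emit @sum }' data.csv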

No, really, why one more command-line data-manipulation tool? I wrote Miller because I was frustrated with tools like grep, sed, and so on being line-aware without being format-aware. The single most poignant example I can think of is seeing people grep data lines out of their CSV files and sadly losing their header lines. While some lighter-than-SQL processing is very nice to have, at heart I wanted the format-awareness of RecordStream combined with the raw speed of the Unix toolkit. Miller does precisely that.
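To make that example concrete (a sketch; the file, field, and pattern are invented):

    # Plain grep: matching data lines survive, but the CSV header is lost
    grep 2023-07 requests.csv
    # Miller: format-aware filtering passes the header through intact
    mlr --csv filter '$date =~ "2023-07"' requests.csv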