How original is Miller?
It isn’t. Miller is one of many, many participants in the
online-analytical-processing culture. Other key participants include
awk, SQL, spreadsheets, etc. etc. etc. Far from being an original
concept, Miller explicitly strives to imitate several existing tools:
Unix toolkit: There are intentional similarities, as described in
Miller features in the context of the Unix toolkit.
Recipes abound for command-line data analysis using the Unix toolkit.
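For instance, a frequency count of one CSV column takes only a short pipeline (the sample data here is invented for illustration):

```shell
# Classic toolkit recipe: frequency count of the 2nd CSV column,
# highest counts first.
printf 'alpha,web\nbeta,db\ngamma,web\ndelta,web\n' \
  | cut -d, -f2 | sort | uniq -c | sort -rn
```

Miller's value-add, discussed below, is doing this kind of thing while knowing about field names and file formats.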
RecordStream: Miller owes particular inspiration to RecordStream. The key
difference is that RecordStream is a Perl-based tool for manipulating JSON
(other formats such as CSV must be separately converted into and out of
JSON), while Miller is fast C which handles its formats natively. The
similarities include the sort, stats1 (analog of RecordStream's collate),
and delta operations, as well as filter and put, and pretty-print
formatting.
stats_m: A third source of lineage is my Python stats_m module. It provides
the simple single-pass algorithms which form Miller's stats1 and stats2
subcommands.
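The flavor of those single-pass algorithms can be sketched in stock awk. This is Welford's streaming mean/variance update, shown as an illustration of the style rather than as stats_m's exact code; the input numbers are invented:

```shell
# Single-pass (streaming) mean and sample variance: each value updates
# the running state, so the whole input never sits in memory.
printf '1\n2\n3\n4\n' | awk '
  { n++; d = $1 - m; m += d / n; s += d * ($1 - m) }
  END { printf "mean=%g var=%g\n", m, s / (n - 1) }'
```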
SQL: Fourthly, Miller's group-by command name is from SQL, as is the term
aggregate.
Added value:
Miller’s added values include:
- Name-indexing, compared to the Unix toolkit’s positional indexing.
- Raw speed, compared to awk, RecordStream, stats_m, or various other kinds of Python/Ruby/etc. scripts one can easily create.
- Ability to handle text files on the Unix pipe, without the need to create database tables, compared to SQL databases.
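The first point is easy to demonstrate with stock tools (the sample data is invented): positional indexing silently extracts a different field once the column order changes, which is exactly what name-indexing avoids.

```shell
# The same positional "cut -d, -f2" means different things
# depending on column order:
printf 'host,uptime\nalpha,3\n' | cut -d, -f2   # the uptime column
printf 'uptime,host\n3,alpha\n' | cut -d, -f2   # now the host column
```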
jq: Miller does for name-indexed text what jq does for JSON. If you're not
already familiar with jq, please check it out!
What about DOTADIW? One of the key points of the Unix philosophy is that a
tool should do one thing and do it well. Hence sort and cut do just one
thing. Why does Miller put awk-like processing, a few SQL-like operations,
and statistical reduction all into one tool (see also Reference)? This is a
fair
question. First note that many standard tools, such as awk and perl, do
quite a few things, as does jq. But I could have pushed for putting format
awareness and name-indexing options into cut, awk, and so on (so you could
do cut -f hostname,uptime or awk '{sum += $x*$y}END{print sum}'). Patching
cut, sort, etc. on multiple operating systems is a non-starter in terms of
uptake. Moreover, it makes sense for me to have Miller be a tool which
collects format-aware record-stream processing into one place, with good
reuse of Miller-internal library code across its various features.
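As a sketch of what such a patched, name-aware awk would have to do internally, here is the name-to-position bookkeeping written out in stock awk, with invented sample data; Miller simply does this mapping for you:

```shell
# Emulate name-indexing in plain awk: read the header row, map each
# field name to its position, then compute sum($x * $y) by name.
printf 'x,y\n1,2\n3,4\n' | awk -F, '
  NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
  { sum += $(col["x"]) * $(col["y"]) }
  END { print sum }'
```

This works, but the boilerplate must be re-typed (and re-debugged) for every one-liner, which is the reuse argument for collecting it into one tool.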
No, really, why one more command-line data-manipulation tool? I wrote
Miller because I was frustrated with tools like grep, sed, and so on being
line-aware without being format-aware. The single most poignant example I
can think of is seeing people grep data lines out of their CSV files and
sadly losing their header lines. While some lighter-than-SQL processing is
very nice to have, at core I wanted the format-awareness of RecordStream
combined with the raw speed of the Unix toolkit. Miller does precisely
that.
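The header-loss problem is easy to reproduce with invented sample data, along with the stock-awk workaround people reach for before a format-aware tool:

```shell
# grep is line-aware, not format-aware: the CSV header vanishes.
printf 'host,uptime\nalpha,3\nbeta,5\n' | grep beta

# A common workaround: always pass line 1 through along with matches.
printf 'host,uptime\nalpha,3\nbeta,5\n' | awk 'NR == 1 || /beta/'
```

The first command prints only the matching data row; the second keeps the header, at the cost of remembering the idiom every time.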