How original is Miller?
It isn’t. Miller is one of many, many participants in the
online-analytical-processing culture. Other key participants include
awk, SQL, spreadsheets, and so on. Far from being an original
concept, Miller explicitly strives to imitate several existing tools:
Unix toolkit: Intentional similarities as described in "Miller features in the context of the Unix toolkit".
Recipes abound for command-line data analysis using the Unix toolkit. Here are just a couple of my favorites:
RecordStream: Miller owes particular inspiration to RecordStream. The
key difference is that RecordStream is a Perl-based tool for manipulating JSON
(other formats such as CSV must be converted into and out of JSON separately),
while Miller is written in fast C and handles its formats natively.
The similarities include the sort, stats1 (analog of RecordStream’s collate),
and delta operations, as well as filter and put, and pretty-print formatting.
stats_m: A third source of lineage is my Python stats_m module. This includes
simple single-pass algorithms which form Miller’s stats1 and stats2 subcommands.
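To make "single-pass" concrete, here is a minimal sketch of one well-known streaming technique, Welford's update for mean and variance. It is an illustration only: the function name streaming_mean_var is hypothetical, and this is not necessarily the code used in stats_m or in Miller's stats1.

```python
import math


def streaming_mean_var(values):
    """Return (count, mean, sample variance) using one pass over an iterable.

    Hypothetical helper for illustration; uses Welford's update, so the
    input never needs to be held in memory or read twice.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    var = m2 / (n - 1) if n > 1 else math.nan
    return n, mean, var
```

Because the state is just three numbers, the same loop works on a stream arriving over a Unix pipe, and one such state can be kept per group for group-by style aggregation.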
SQL: Fourthly, Miller’s group-by command name is from SQL, as is the term aggregate.
Added value:
Miller’s added values include:
- Name-indexing, compared to the Unix toolkit’s positional indexing.
- Raw speed, compared to awk, RecordStream, stats_m, or various other kinds of Python/Ruby/etc. scripts one can easily create.
- Compact keystroking for many common tasks, with a decent amount of flexibility.
- Ability to handle text files on the Unix pipe, without need for creating database tables, compared to SQL databases.
- Various file formats, and on-the-fly format conversion.
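The first point in the list above, name-indexing versus positional indexing, can be sketched in a few lines. This is not Miller code; it uses Python's standard csv module, and the helper names and sample data are hypothetical. It shows that a positional lookup silently returns the wrong column when the layout changes, while a lookup by field name does not.

```python
import csv
import io


def column_by_position(text, index):
    """Positional indexing: skip the header line, then pick the Nth field."""
    rows = list(csv.reader(io.StringIO(text)))
    return [row[index] for row in rows[1:]]


def column_by_name(text, name):
    """Name indexing: let the header line say where each field lives."""
    return [rec[name] for rec in csv.DictReader(io.StringIO(text))]


# Hypothetical sample data: the same records with the columns swapped.
old_layout = "hostname,uptime\nalpha,100\nbeta,200\n"
new_layout = "uptime,hostname\n100,alpha\n200,beta\n"

print(column_by_position(old_layout, 1))   # uptime values
print(column_by_position(new_layout, 1))   # now hostnames, with no error raised
print(column_by_name(old_layout, "uptime"))
print(column_by_name(new_layout, "uptime"))
```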
jq: Miller does for name-indexed text what jq does for JSON. If you’re
not already familiar with jq, please check it out!
What about similar tools?
Here’s a comprehensive list:
https://github.com/dbohdan/structured-text-tools.
It doesn’t mention rows, so here’s a plug for that as well.
As it turns out, I learned about most of these after writing Miller.
What about DOTADIW? One of the key points of the Unix philosophy is
that a tool should do one thing and do it well. Hence sort and cut do just
one thing. Why does Miller put awk-like processing, a few SQL-like operations,
and statistical reduction all into one tool (see also Reference)? This is a
fair question. First note that many standard tools, such as awk and perl, do
quite a few things, as does jq. But I could have pushed for putting format
awareness and name-indexing options into cut, awk, and so on (so you could do
cut -f hostname,uptime or awk '{sum += $x*$y}END{print sum}'). However,
patching cut, sort, etc. on multiple operating systems is a
non-starter in terms of uptake. Moreover, it makes sense for me to have Miller
be a tool which collects together format-aware record-stream processing into
one place, with good reuse of Miller-internal library code for its various
features.
Why not use Perl/Python/Ruby etc.? Maybe you
should. With those tools you’ll get far more expressive power, and
sufficiently quick turnaround time for small-to-medium-sized data. Using
Miller you’ll get something less than a complete programming language,
but which is fast, with moderate amounts of flexibility and much less
keystroking.
When I was first developing Miller I made a survey of several languages.
Using low-level implementation languages like C, Go, Rust, and Nim, I’d
need to create my own domain-specific language (DSL), which would always be
less fully featured than a full programming language, but I’d get better
performance. Using high-level interpreted languages such as Perl/Python/Ruby,
I’d get the language’s eval for free and I wouldn’t
need a DSL; Miller would have mainly been a set of format-specific I/O hooks.
If I’d gotten good enough performance from the latter I’d have done
it without question and Miller would be far more flexible. But C won on
performance by a landslide, so we have Miller in C with a custom DSL.
No, really, why one more command-line data-manipulation
tool? I wrote Miller because I was frustrated with tools like
grep, sed, and so on being line-aware without being
format-aware. The single most poignant example I can think of is seeing
people grep data lines out of their CSV files and sadly losing their header
lines. While some lighter-than-SQL processing is very nice to have, at core I
wanted the format-awareness of RecordStream combined
with the raw speed of the Unix toolkit. Miller does precisely that.
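The lost-header problem is easy to reproduce. The sketch below is an illustration in Python rather than Miller's implementation; the function names and sample data are hypothetical. It contrasts a line-aware filter, which behaves like grep, with a format-aware filter that treats the first line as field names and always writes it back out.

```python
import csv
import io


def line_filter(text, needle):
    """grep-like: keep any line containing needle, with no notion of a header."""
    return "".join(
        line for line in text.splitlines(keepends=True) if needle in line
    )


def record_filter(text, field, value):
    """Format-aware: match on a named field and re-emit the header line."""
    reader = csv.DictReader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=reader.fieldnames, lineterminator="\n"
    )
    writer.writeheader()
    for rec in reader:
        if rec[field] == value:
            writer.writerow(rec)
    return out.getvalue()


# Hypothetical sample data.
data = "hostname,status\nalpha,up\nbeta,down\ngamma,up\n"

print(line_filter(data, "up"))              # data lines only; header is gone
print(record_filter(data, "status", "up"))  # header kept, same data lines
```

The line-aware version also matches the needle anywhere on the line, so a hostname containing "up" would be kept by accident; the format-aware version compares only the named field.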