How original is Miller?
It isn’t. Miller is one of many, many participants in the online-analytical-processing culture. Other key participants include awk, SQL, spreadsheets, and so on. Far from being an original concept, Miller explicitly strives to imitate several existing tools:
Unix toolkit: Intentional similarities as described in Unix-toolkit context. Recipes abound for command-line data analysis using the Unix toolkit; here are just a couple of my favorites:
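One classic recipe of this kind (my own illustrative sketch with made-up data, not one of the favorites above) is the sort/uniq frequency count:

```shell
# Classic Unix-toolkit recipe: frequency count of the second field of a
# colon-delimited stream, most frequent value first.
# (Illustrative data, inlined via printf.)
printf 'web01:up\nweb02:down\nweb03:up\n' \
  | cut -d: -f2 \
  | sort \
  | uniq -c \
  | sort -rn
```

The same five-stage pipeline shape (extract, sort, count, rank) turns up constantly in command-line data analysis.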
RecordStream: Miller owes particular inspiration to RecordStream. The key difference is that RecordStream is a Perl-based tool for manipulating JSON (it requires other formats such as CSV to be converted into and out of JSON separately), while Miller is fast C which handles its formats natively. The similarities include the sort, stats1 (analog of RecordStream’s collate), and delta operations, as well as filter and put, and pretty-print formatting.
stats_m: A third source of lineage is my Python stats_m module. This includes simple single-pass algorithms which form Miller’s stats1 and stats2 subcommands.
SQL: Fourthly, Miller’s group-by command name is from SQL, as is the term aggregate.
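For reference, the SQL usage those names come from looks like this (a minimal sketch using Python’s built-in sqlite3; the table and column names are invented for the example):

```python
# Minimal illustration of SQL's GROUP BY with an aggregate function
# (AVG), the usage Miller's group-by and aggregate terminology echoes.
# (Table/column names are made up for this sketch.)
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requests (host TEXT, latency REAL)")
conn.executemany(
    "INSERT INTO requests VALUES (?, ?)",
    [("web01", 10.0), ("web01", 20.0), ("web02", 30.0)],
)
rows = conn.execute(
    "SELECT host, AVG(latency) FROM requests GROUP BY host ORDER BY host"
).fetchall()
# rows == [('web01', 15.0), ('web02', 30.0)]
```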
Added value:
Miller’s added values include:
- Name-indexing, compared to the Unix toolkit’s positional indexing.
- Raw speed, compared to awk, RecordStream, stats_m, or various other kinds of Python/Ruby/etc. scripts one can easily create.
- Compact keystroking for many common tasks, with a decent amount of flexibility.
- Ability to handle text files on the Unix pipe, without need for creating database tables, compared to SQL databases.
- Various file formats, and on-the-fly format conversion.
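The name-indexing point is easy to see with the stock toolkit (an illustrative sketch with made-up data):

```shell
# Positional indexing with the Unix toolkit: you must know that
# "uptime" is column 3, and the command silently breaks if the
# column order ever changes. With Miller one names the field
# instead, e.g.: mlr --csv cut -f uptime
printf 'hostname,os,uptime\nweb01,linux,123\nweb02,bsd,456\n' \
  | cut -d, -f3
```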
jq: Miller does for name-indexed text what jq does for JSON. If you’re not already familiar with jq, please check it out!
What about similar tools?
Here’s a comprehensive list:
https://github.com/dbohdan/structured-text-tools.
It doesn’t mention rows, so here’s a plug for that as well.
As it turns out, I learned about most of these after writing Miller.
What about DOTADIW? One of the key points of the Unix philosophy is that a tool should do one thing and do it well. Hence sort and cut do just one thing. Why does Miller put awk-like processing, a few SQL-like operations, and statistical reduction all into one tool (see also Main reference)? This is a fair question. First note that many standard tools, such as awk and perl, do quite a few things, as does jq. But I could have pushed for putting format awareness and name-indexing options into cut, awk, and so on (so you could do cut -f hostname,uptime or awk '{sum += $x*$y}END{print sum}'). Patching cut, sort, etc. on multiple operating systems is a non-starter in terms of uptake. Moreover, it makes sense for me to have Miller be a tool which collects format-aware record-stream processing into one place, with good reuse of Miller-internal library code for its various features.
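That awk '{sum += $x*$y}END{print sum}' form is hypothetical; with stock awk, name-indexing has to be built by hand from the header line, e.g. (my own sketch with made-up data):

```shell
# What name-indexing costs in stock awk: read the header line to map
# field names to column positions, then sum x*y over the data lines.
# (Illustrative sketch; real data would come from a file.)
printf 'x,y\n1,10\n2,20\n3,30\n' \
  | awk -F, '
      NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
      { sum += $col["x"] * $col["y"] }
      END { print sum }
    '
```

Every awk script that wants header-aware fields repeats this boilerplate, which is exactly the kind of format awareness Miller bakes in once.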
Why not use Perl/Python/Ruby etc.? Maybe you
should. With those tools you’ll get far more expressive power, and
sufficiently quick turnaround time for small-to-medium-sized data. Using
Miller you’ll get something less than a complete programming language,
but which is fast, with moderate amounts of flexibility and much less
keystroking.
When I was first developing Miller I surveyed several languages. Using a low-level implementation language like C, Go, Rust, or Nim, I’d need to create my own domain-specific language (DSL), which would always be less featured than a full programming language, but I’d get better performance. Using a high-level interpreted language such as Perl, Python, or Ruby, I’d get the language’s eval for free and wouldn’t need a DSL; Miller would have been mainly a set of format-specific I/O hooks. If I’d gotten good enough performance from the latter I’d have done it without question and Miller would be far more flexible. But C won the performance criterion by a landslide, so we have Miller in C with a custom DSL.
No, really, why one more command-line data-manipulation tool? I wrote Miller because I was frustrated with tools like grep, sed, and so on being line-aware without being format-aware. The single most poignant example I can think of is seeing people grep data lines out of their CSV files and sadly losing their header lines. While some lighter-than-SQL processing is very nice to have, at core I wanted the format-awareness of RecordStream combined with the raw speed of the Unix toolkit. Miller does precisely that.
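The lost-header problem looks like this in practice (an illustrative sketch with made-up data):

```shell
# grep is line-aware but not CSV-aware: selecting matching data
# lines silently drops the header line.
# A format-aware tool keeps the header, e.g. with Miller:
#   mlr --csv filter '$status == "up"'
printf 'hostname,status\nweb01,up\nweb02,down\nweb03,up\n' \
  | grep ',up$'
```

The output contains only the matching data rows; the hostname,status header is gone, so downstream CSV-consuming tools no longer know the field names.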