Quick links: Flags Verbs Functions Glossary Release docs

Streaming processing, and memory usage

What does streaming mean?

When we say that Miller is streaming, we mean that most operations need only a single record in memory at a time, rather than ingesting all input before producing any output.

This is contrast to, say, the dataframes approach where you ingest all data, wait for end of file, then start manipulating the data.

Both approaches have their advantages: the dataframes approach requires that all data fit in system memory (which, as hardware gets larger over time, is less and less of a constraint); the streaming approach requires that you sometimes need to accumulate results on records (rows) as they arrive rather than looping through them explicitly.

Since Miller takes the streaming approach when possible (see below for exceptions), you can often operate on files which are larger than your system's memory . It also means you can do tail -f some-file | mlr --some-flags and Miller will operate on records as they arrive one at a time. You don't have to wait for and end-of-file marker (which never arrives with tail-f) to start seeing partial results. This also means if you pipe Miller's output to other streaming tools (like cat, grep, sed, and so on), they will also output partial results as data arrives.

The statements in the Miller programming language (outside of optional begin/end blocks which execute before and after all records have been read, respectively) are implicit callbacks which are executed once per record. For example, using mlr --csv put '$z = $x + $y' myfile.csv, the statement $z = $x + $y will be executed 10,000 times if you myfile.csv has 10,000 records.

If you do wish to accumulate all records into memory and loop over them explicitly, you can do so -- see the page on operating on all records.

Streaming and non-streaming verbs

Most verbs, including cat, cut, etc. operate on each record independently. They have no state to retain from one record to the next, and are streaming.

For those operations which require deeper retention, Miller retains only as much data as needed. For example, the sort and tac (stream-reverse, backward spelling of cat) must ingest and retain all records in memory before emitting any -- the last input record may well end up being the first one to be emitted.

stats1 Other verbs, such as tail and top, need to retain only a fixed number of records -- 10, perhaps, even if the input data has a million records.

Yet other verbs, such as stats1 and stats2, retain only summary arithmetic on the records they visit. These are memory-friendly: memory usage is bounded. However, they only produce output at the end of the record stream.

Fully streaming verbs

These don't retain any state from one record to the next. They are memory-friendly, and they don't wait for end of input to produce their output.

altkv
bar -- if not auto-mode
cat
check
clean-whitespace
cut
decimate
fill-down
fill-empty
flatten
format-values
gap
grep
having-fields
head
json-parse
json-stringify
label
merge-fields
nest -- if not implode-values-across-records
nothing
regularize
rename
reorder
repeat
reshape -- if not long-to-wide
sec2gmt
sec2gmtdate
seqgen
skip-trivial-records
sort-within-records
step
tee
template
unflatten
unsparsify if invoked with -f

Non-streaming, retaining all records

These retain all records from one record to the next. They are memory-unfriendly, and they wait for end of input to produce their output.

bar -- if auto-mode
bootstrap
count-similar
fraction
group-by
group-like
least-frequent
most-frequent
nest -- if implode-values-across-records
remove-empty-columns
reshape -- if long-to-wide
sample
shuffle
sort
tac
uniq -- if mlr uniq -a -c
unsparsify if invoked without -f

Non-streaming, retaining some records

These retain a bounded number of records from one record to the next. They are memory-friendly, but they wait for end of input to produce their output.

tail
top

Non-streaming, retaining some state

These retain an amount of state from one record to the next, but less than if they were to retain all records in memory. They are variably memory-friendly -- depending on how many distinct values for the group-by keys exist in the input data -- and they wait for end of input to produce their output.

count-distinct
count
histogram
stats1 -- except mlr stats1 -s for incremental stats before end of stream
stats2
uniq -- if not mlr uniq -a -c

Variable

Any end blocks you provide will not be executed until end of stream; otherwise these don't want for end of stream. Similarly, if you write logic to retain all records (see also the page on operating on all records) these will be memory-unfriendly; otherwise they are memory-friendly.

Most simple operations such as mlr put '$z = $x + $y' are fully streaming.

filter
put

Half-streaming

The main input files are streamed, but the join file (using -f) is loaded into memory at the start.