Operating on all records
As we saw in the DSL-overview page, the Miller programming language has an implicit loop over records for main statements.
Miller's feature of streaming operation over records is implemented by the main statements (everything outside begin/end/func/subr) getting invoked once per record. You don't explicitly loop over records, as you would in some dataframes contexts; rather, Miller loops over records for you, and it lets you specify what to do on each record: you write the body of the loop.
That's fine for most simple use-cases, but sometimes you do want to loop over all records. Here we describe a few options.
Sums/counters
The first option is to leverage the fact that main DSL statements are already invoked in a loop over records, and use out-of-stream variables to retain sums, counters, etc.
For example, let's look at our short data file data/short.csv:
cat data/short.csv
word,value
apple,37
ball,28
cat,54
We can track count and sum using out-of-stream variables (the ones that start with the @ sigil), then emit them as a new record after all the input is read.
mlr --icsv --ojson --from data/short.csv put '
  begin {
    @count = 0;
    @sum = 0;
  }
  @count += 1;
  @sum += $value;
  end {
    emit (@count, @sum);
  }
'
[
{
  "word": "apple",
  "value": 37
},
{
  "word": "ball",
  "value": 28
},
{
  "word": "cat",
  "value": 54
},
{
  "count": 3,
  "sum": 119
}
]
And if all we want is the final output and not the input data, we can use put -q to suppress the pass-through of input records:
mlr --icsv --ojson --from data/short.csv put -q '
  begin {
    @count = 0;
    @sum = 0;
  }
  @count += 1;
  @sum += $value;
  end {
    emit (@count, @sum);
  }
'
[
{
  "count": 3,
  "sum": 119
}
]
As discussed a bit more on the page on streaming processing and memory usage, this doesn't keep all records in memory, only the count and sum variables. You can use this on very large files without running out of memory.
Retaining records in a map
The second option is to retain entire records in a map, then loop over them in an end block.
Let's use the same short data file data/short.csv:
cat data/short.csv
word,value
apple,37
ball,28
cat,54
mlr --icsv --ojson --from data/short.csv put -q '
  # map
  begin {
    @records = {};
  }
  @records[NR] = $*;
  end {
    count = length(@records);
    sum = 0;
    for (i = 1; i <= NR; i += 1) {
      sum += @records[i]["value"];
    }
    dump @records; # show the map
    emit (count, sum);
  }
'
{
  "1": {
    "word": "apple",
    "value": 37
  },
  "2": {
    "word": "ball",
    "value": 28
  },
  "3": {
    "word": "cat",
    "value": 54
  }
}
[
{
  "count": 3,
  "sum": 119
}
]
The downside to this, of course, is that this retains all records (plus data-structure overhead) in memory, so you're limited to processing files that fit in your computer's memory. The upside, though, is that you can do random access over the records using things like
output = 0;
for (i = 1; i <= NR; i += 1) {
  for (j = 1; j <= NR; j += 1) {
    for (k = 1; k <= NR; k += 1) {
      output += call_some_function_of(@records[i], @records[j], @records[k]);
    }
  }
}
# do something with the output
Retaining records in an array
The third option is to retain records in an array, then loop over them in an end block.
mlr --icsv --ojson --from data/short.csv put -q '
  # array
  begin {
    @records = [];
  }
  @records[NR] = $*;
  end {
    count = length(@records);
    sum = 0;
    for (i = 1; i <= NR; i += 1) {
      sum += @records[i]["value"];
    }
    dump @records; # show the array
    emit (count, sum);
  }
'
[
  {
    "word": "apple",
    "value": 37
  },
  {
    "word": "ball",
    "value": 28
  },
  {
    "word": "cat",
    "value": 54
  }
]
[
{
  "count": 3,
  "sum": 119
}
]
Just as with the retain-as-map approach, the downside is the overhead of retaining all records in memory, and the upside is that you get random access over records.
Using maps vs using arrays
Retaining records as a map or as an array is a matter of taste. Some things to note:
If we initialize @records = {} in the begin block (or if we don't initialize it at all and just start writing to it in the main statements), then @records is a map. If we initialize @records = [], then it's an array.
Arrays are, of course, contiguously indexed. (And, in Miller, their indices start with 1, not 0, as discussed in the Arrays page.) This means that if you are only retaining a subset of records, then your array will have null-gaps in it:
mlr --icsv --ojson --from data/short.csv put -q '
  begin {
    @records = [];
  }
  if (NR != 2) {
    @records[NR] = $*
  }
  end {
    dump @records;
  }
'
[
  {
    "word": "apple",
    "value": 37
  },
  null,
  {
    "word": "cat",
    "value": 54
  }
]
[
]
You can index @records by @count rather than NR to get a contiguous array:
mlr --icsv --ojson --from data/short.csv put -q '
  begin {
    @records = [];
    @count = 0;
  }
  # main statement
  if (NR != 2) {
    @count += 1;
    @records[@count] = $*;
  }
  end {
    dump @records;
    count = length(@records);
    sum = 0;
    for (record in @records) {
      sum += record["value"];
    }
    emit (count, sum);
  }
'
[
  {
    "word": "apple",
    "value": 37
  },
  {
    "word": "cat",
    "value": 54
  }
]
[
{
  "count": 2,
  "sum": 91
}
]
If you use a map to retain records, then this is a non-issue: map keys don't need to be contiguous, so skipped records leave no gaps:
mlr --icsv --ojson --from data/short.csv put -q '
  begin {
    @records = {};
  }
  # main statement
  if (NR != 2) {
    @records[NR] = $*;
  }
  end {
    dump @records;
    count = length(@records);
    sum = 0;
    for (key in @records) {
      sum += @records[key]["value"];
    }
    emit (count, sum);
  }
'
{
  "1": {
    "word": "apple",
    "value": 37
  },
  "3": {
    "word": "cat",
    "value": 54
  }
}
[
{
  "count": 2,
  "sum": 91
}
]
Do note that Miller maps preserve insertion order, so at the end you're guaranteed to loop over records in the same order you read them. Also note that when you index a Miller map with an integer key, this works, but the key is stringified.
Retaining partial records in map or array
If all you need is one or a few attributes out of a record, you don't need to retain full records. You can retain a map, or array, of just the fields you're interested in:
mlr --icsv --ojson --from data/short.csv put -q '
  begin {
    @values = {};
  }
  # main statement
  if (NR != 2) {
    @values[NR] = $value;
  }
  end {
    dump @values;
    count = length(@values);
    sum = 0;
    for (key in @values) {
      sum += @values[key];
    }
    emit (count, sum);
  }
'
{
  "1": 37,
  "3": 54
}
[
{
  "count": 2,
  "sum": 91
}
]
Sorting
Please see the sorting page.
For more information
Please see the page on two-pass algorithms; see also the page on higher-order functions.