
Operating on all records

As we saw in the DSL-overview page, the Miller programming language has an implicit loop over records for main statements.

Miller implements streaming operation over records by invoking the main statements (everything outside begin/end/func/subr blocks) once per record. You don't explicitly loop over records, as you would in some dataframes contexts; rather, Miller loops over records for you and lets you specify what to do on each record: you write the body of the loop.
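As a minimal sketch of that implicit loop, the following assigns the built-in record counter NR to a new field on each record. The field name nr is chosen here purely for illustration, and the input is supplied inline on standard input so the command is self-contained. No loop is written anywhere: the single main statement simply runs once per record.

```shell
# One main statement, invoked by Miller once for each input record.
mlr --icsv --ojson put '$nr = NR' <<'EOF'
word,value
apple,37
ball,28
cat,54
EOF
```

Each output record should carry its own record number in the nr field.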

That's fine for most simple use-cases, but sometimes you do want to work across all records, for example to compute a total or to compare records with one another. Here we describe a few options.

Sums/counters

The first option is to leverage the fact that main DSL statements are already invoked in a loop over records, and use out-of-stream variables to retain sums, counters, etc.

For example, let's look at our short data file data/short.csv:

cat data/short.csv

word,value
apple,37
ball,28
cat,54

We can track count and sum using out-of-stream variables -- the ones that start with the @ sigil -- then emit them as a new record after all the input is read.

mlr --icsv --ojson --from data/short.csv put '
  begin {
    @count = 0;
    @sum = 0;
  }
  @count += 1;
  @sum += $value;
  end {
    emit (@count, @sum);
  }
'

[
{
  "word": "apple",
  "value": 37
},
{
  "word": "ball",
  "value": 28
},
{
  "word": "cat",
  "value": 54
},
{
  "count": 3,
  "sum": 119
}
]

And if all we want is the final output and not the input data, we can use put -q to not pass through the input records:

mlr --icsv --ojson --from data/short.csv put -q '
  begin {
    @count = 0;
    @sum = 0;
  }
  @count += 1;
  @sum += $value;
  end {
    emit (@count, @sum);
  }
'

[
{
  "count": 3,
  "sum": 119
}
]

As discussed a bit more on the page on streaming processing and memory usage, this doesn't keep all records in memory, only the count and sum variables. You can use this on very large files without running out of memory.
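The same accumulators can be combined in the end block before anything is emitted. Here is a small sketch that derives a mean from the count and sum; the variable name @mean is an illustrative choice, and the data is passed on standard input so the example stands alone. Memory use is still limited to a handful of out-of-stream variables.

```shell
mlr --icsv --ojson put -q '
  begin {
    @count = 0;
    @sum = 0;
  }
  # Main statements: update the accumulators once per record.
  @count += 1;
  @sum += $value;
  end {
    # Combine the accumulators only after all input has been read.
    @mean = @sum / @count;
    emit @mean;
  }
' <<'EOF'
word,value
apple,37
ball,28
cat,54
EOF
```

Note that this sketch assumes at least one input record; with empty input, @count would be zero at the division.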

Retaining records in a map

The second option is to retain entire records in a map, then loop over them in an end block.

Let's use the same short data file data/short.csv:

cat data/short.csv

word,value
apple,37
ball,28
cat,54

mlr --icsv --ojson --from data/short.csv put -q '
  # map
  begin {
    @records = {};
  }
  @records[NR] = $*;
  end {
    count = length(@records);
    sum = 0;
    for (i = 1; i <= NR; i += 1) {
      sum += @records[i]["value"];
    }
    dump @records; # show the map
    emit (count, sum);
  }
'

{
  "1": {
    "word": "apple",
    "value": 37
  },
  "2": {
    "word": "ball",
    "value": 28
  },
  "3": {
    "word": "cat",
    "value": 54
  }
}
[
{
  "count": 3,
  "sum": 119
}
]

The downside to this, of course, is that this retains all records (plus data-structure overhead) in memory, so you're limited to processing files that fit in your computer's memory. The upside, though, is that you can do random access over the records using constructs like the following, where call_some_function_of is a placeholder for a function of your own:

    output = 0;
    for (i = 1; i <= NR; i += 1) {
      for (j = 1; j <= NR; j += 1) {
        for (k = 1; k <= NR; k += 1) {
          output += call_some_function_of(@records[i], @records[j], @records[k]);
        }
      }
    }
    # do something with the output
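For a concrete, runnable version of this kind of random access, here is a sketch that compares every pair of retained records and emits the absolute difference of their value fields. The output field names first, second, and difference, and the variable @pair, are made up for this illustration; the input is supplied on standard input.

```shell
mlr --icsv --ojson put -q '
  # Retain every record, keyed by record number.
  @records[NR] = $*;
  end {
    # Visit each unordered pair (i, j) with i < j exactly once.
    for (i = 1; i <= NR; i += 1) {
      for (j = i + 1; j <= NR; j += 1) {
        @pair = {
          "first": @records[i]["word"],
          "second": @records[j]["word"],
          "difference": abs(@records[i]["value"] - @records[j]["value"])
        };
        emit @pair;
      }
    }
  }
' <<'EOF'
word,value
apple,37
ball,28
cat,54
EOF
```

With three input records this should emit three pair records. The work grows quadratically with the number of records, on top of the memory cost already mentioned.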

Retaining records in an array

The third option is to retain records in an array, then loop over them in an end block.

mlr --icsv --ojson --from data/short.csv put -q '
  # array
  begin {
    @records = [];
  }
  @records[NR] = $*;
  end {
    count = length(@records);
    sum = 0;
    for (i = 1; i <= NR; i += 1) {
      sum += @records[i]["value"];
    }
    dump @records; # show the array
    emit (count, sum);
  }
'

[
  {
    "word": "apple",
    "value": 37
  },
  {
    "word": "ball",
    "value": 28
  },
  {
    "word": "cat",
    "value": 54
  }
]
[
{
  "count": 3,
  "sum": 119
}
]

Just as with the retain-as-map approach, the downside is the overhead of retaining all records in memory, and the upside is that you get random access over records.

Using maps vs using arrays

Retaining records as a map or as an array is a matter of taste. Some things to note:

If we initialize @records = {} in the begin block (or if we don't initialize it at all and just start writing to it in the main statements) then @records is a map. If we initialize @records = [] then it's an array.
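One way to check this for yourself is with the typeof function. The sketch below uses mlr -n (no input) and does its writes in an end block rather than in main statements, purely to keep the example self-contained; the variable names are illustrative.

```shell
mlr -n put 'end {
  # Never initialized: the first indexed write creates a map.
  @uninitialized[1] = "a";

  # Explicitly initialized as a map.
  @as_map = {};
  @as_map[1] = "a";

  # Explicitly initialized as an array.
  @as_array = [];
  @as_array[1] = "a";

  print typeof(@uninitialized);
  print typeof(@as_map);
  print typeof(@as_array);
}'
```

The first two lines printed should be map and the third should be array.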

Arrays are, of course, contiguously indexed. (And, in Miller, their indices start at 1, not 0, as discussed on the Arrays page.) This means that if you are only retaining a subset of records, then your array will have null-gaps in it:

mlr --icsv --ojson --from data/short.csv put -q '
  begin {
    @records = [];
  }
  if (NR != 2) {
    @records[NR] = $*;
  }
  end {
    dump @records;
  }
'

[
  {
    "word": "apple",
    "value": 37
  },
  null,
  {
    "word": "cat",
    "value": 54
  }
]
[
]

You can index @records by @count rather than NR to get a contiguous array:

mlr --icsv --ojson --from data/short.csv put -q '
  begin {
    @records = [];
    @count = 0;
  }
  # main statement
  if (NR != 2) {
    @count += 1;
    @records[@count] = $*;
  }
  end {
    dump @records;
    count = length(@records);
    sum = 0;
    for (record in @records) {
      sum += record["value"];
    }
    emit (count, sum);
  }
'

[
  {
    "word": "apple",
    "value": 37
  },
  {
    "word": "cat",
    "value": 54
  }
]
[
{
  "count": 2,
  "sum": 91
}
]

If you use a map to retain records, then this is a non-issue: maps can use whatever keys you like, so skipped record numbers leave no null entries:

mlr --icsv --ojson --from data/short.csv put -q '
  begin {
    @records = {};
  }
  # main statement
  if (NR != 2) {
    @records[NR] = $*;
  }
  end {
    dump @records;
    count = length(@records);
    sum = 0;
    for (key in @records) {
      sum += @records[key]["value"];
    }
    emit (count, sum);
  }
'

{
  "1": {
    "word": "apple",
    "value": 37
  },
  "3": {
    "word": "cat",
    "value": 54
  }
}
[
{
  "count": 2,
  "sum": 91
}
]

Do note that Miller maps preserve insertion order, so at the end you're guaranteed to loop over records in the same order you read them. Also note that when you index a Miller map with an integer key, this works, but the key is stringified.
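To see the key stringification in isolation, here is a tiny sketch using mlr -n (no input); the variable name @m is illustrative. It writes with an integer key and reads back with the corresponding string key.

```shell
mlr -n put 'end {
  # Write using an integer key ...
  @m[3] = "three";
  # ... and read back using the string form of the same key.
  print @m["3"];
  # The dump shows the key as a JSON string.
  dump @m;
}'
```

The print statement should produce three, and the dump should show a single entry whose key is the string "3".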

Retaining partial records in map or array

If all you need is one or a few attributes out of a record, you don't need to retain full records. You can retain a map, or array, of just the fields you're interested in:

mlr --icsv --ojson --from data/short.csv put -q '
  begin {
    @values = {};
  }
  # main statement
  if (NR != 2) {
    @values[NR] = $value;
  }
  end {
    dump @values;
    count = length(@values);
    sum = 0;
    for (key in @values) {
      sum += @values[key];
    }
    emit (count, sum);
  }
'

{
  "1": 37,
  "3": 54
}
[
{
  "count": 2,
  "sum": 91
}
]

Sorting

Please see the sorting page.

For more information

Please see the page on two-pass algorithms; see also the page on higher-order functions.