
DSL overview

DSL stands for domain-specific language: it's a language particular to Miller which you can use to write expressions that specify custom transformations to your data. (See Miller programming language for an introduction.)

Verbs compared to DSL

While put and filter are verbs, they differ from the rest in that they let you use the DSL. So we often contrast the DSL (things you can do in the put and filter verbs) with verbs (things you can do using the verbs other than put and filter).

Here's a comparison of verbs and put/filter DSL expressions:

Example using a verb:

mlr stats1 -a sum -f x -g a data/small
a=pan,x_sum=0.346791
a=eks,x_sum=1.140078
a=wye,x_sum=0.777891
  • Verbs are coded in Go
  • They run a bit faster
  • They take fewer keystrokes
  • There is less to learn
  • Their customization is limited to each verb's options

Example using the DSL:

mlr put -q '@x_sum[$a] += $x; end{emit @x_sum, "a"}' data/small
a=pan,x_sum=0.346791
a=eks,x_sum=1.140078
a=wye,x_sum=0.777891
  • You get to write your own DSL expressions
  • They run a bit slower
  • They take more keystrokes
  • There is more to learn
  • They are highly customizable

Please see Verbs Reference for information on verbs other than put and filter.

Implicit loop over records for main statements

The most important point about the Miller DSL is that it is designed for streaming operation over records.

DSL statements include:

  • func and subr for user-defined functions and subroutines, which we'll look at later in the separate page about them;
  • begin and end blocks, for statements you want to run before the first record, or after the last one;
  • everything else, which collectively are called main statements.

Streaming operation over records is implemented by invoking the main statements once per record. You don't explicitly loop over records, as you would in some dataframe contexts; rather, Miller loops over records for you, and it lets you specify what to do on each record: you write the body of the loop, not the loop itself.

(But you can, if you like, use those per-record statements to grow a list of records, then loop over them all in an end block. This is described in the page on operating on all records.)

To see this in action, let's take a look at the data/short.csv file:

cat data/short.csv
word,value
apple,37
ball,28
cat,54

There are three records in this file, with word=apple, word=ball, and word=cat, respectively. Let's print something in a begin statement, add a field in a main statement, and print something else in an end statement:

mlr --csv --from data/short.csv put '
  begin {
    print "begin";
  }
  $nr = NR;
  end {
    print "end";
  }
'
begin
word,value,nr
apple,37,1
ball,28,2
cat,54,3
end

The print statements for begin and end went out before the first record was seen and after the last was seen; the field-creation statement $nr = NR was invoked three times, once for each record. We didn't explicitly loop over records, since Miller was already looping over records, and invoked our main statement on each loop iteration.

For almost all simple uses of the Miller programming language, this implicit looping over records is probably all you will need. (For more involved cases you can see the pages on operating on all records, out-of-stream variables, and two-pass algorithms.)

Essential use: record-selection and record-updating

The essential uses of mlr filter and mlr put are record selection and record updating, respectively. For example, given the following input data:

cat data/small
a=pan,b=pan,i=1,x=0.346791,y=0.726802
a=eks,b=pan,i=2,x=0.758679,y=0.522151
a=wye,b=wye,i=3,x=0.204603,y=0.338318
a=eks,b=wye,i=4,x=0.381399,y=0.134188
a=wye,b=pan,i=5,x=0.573288,y=0.863624

you might retain only the records whose a field has value eks:

mlr filter '$a == "eks"' data/small
a=eks,b=pan,i=2,x=0.758679,y=0.522151
a=eks,b=wye,i=4,x=0.381399,y=0.134188

or you might add a new field which is a function of existing fields:

mlr put '$ab = $a . "_" . $b ' data/small
a=pan,b=pan,i=1,x=0.346791,y=0.726802,ab=pan_pan
a=eks,b=pan,i=2,x=0.758679,y=0.522151,ab=eks_pan
a=wye,b=wye,i=3,x=0.204603,y=0.338318,ab=wye_wye
a=eks,b=wye,i=4,x=0.381399,y=0.134188,ab=eks_wye
a=wye,b=pan,i=5,x=0.573288,y=0.863624,ab=wye_pan

Differences between put and filter

The two verbs mlr filter and mlr put are essentially the same. The only differences are:

  • Expressions sent to mlr filter should contain a boolean expression, which is the filtering criterion. (If not, all records pass through.)

  • mlr filter expressions may not reference the filter keyword within them.

Location of boolean expression for filter

You can define and invoke functions and subroutines to help produce the bare-boolean statement, and record fields may be assigned in the statements before or after the bare-boolean statement. For example:

mlr --c2p --from example.csv filter '
  # Bare-boolean filter expression: only records matching this pass through:
  $quantity >= 70;
  # For records that do pass through, set these:
  if ($rate > 8) {
    $description = "high rate";
  } else {
    $description = "low rate";
  }
'
color  shape    flag  k  index quantity rate   description
red    square   true  2  15    79.2778  0.0130 low rate
red    square   false 4  48    77.5542  7.4670 low rate
purple triangle false 5  51    81.2290  8.5910 high rate
red    square   false 6  64    77.1991  9.5310 high rate
purple triangle false 7  65    80.1405  5.8240 low rate
purple square   false 10 91    72.3735  8.2430 high rate

mlr --c2p --from example.csv filter '
  # Bare-boolean filter expression: only records matching this pass through:
  $shape =~ "^(...)(...)$";
  # For records that do pass through, capture the first "(...)" into $left and
  # the second "(...)" into $right
  $left = "\1";
  $right = "\2";
'
color  shape  flag  k  index quantity rate   left right
red    square true  2  15    79.2778  0.0130 squ  are
red    circle true  3  16    13.8103  2.9010 cir  cle
red    square false 4  48    77.5542  7.4670 squ  are
red    square false 6  64    77.1991  9.5310 squ  are
yellow circle true  8  73    63.9785  4.2370 cir  cle
yellow circle true  9  87    63.5058  8.3350 cir  cle
purple square false 10 91    72.3735  8.2430 squ  are

There are more details and more choices, of course, as detailed in the following sections.