DSL overview¶
DSL stands for domain-specific language: it's a language particular to Miller which you can use to write expressions that specify custom transformations to your data. (See Miller programming language for an introduction.)
Verbs compared to DSL¶
While put and filter are verbs, they're different from the rest in that they let you use the DSL. So we often contrast the DSL (things you can do in the put and filter verbs) with verbs (things you can do using the verbs other than put and filter).
Here's a comparison of verbs and put/filter DSL expressions:
Example using a verb:
mlr stats1 -a sum -f x -g a data/small
a=pan,x_sum=0.346791
a=eks,x_sum=1.140078
a=wye,x_sum=0.777891
- Verbs are coded in Go
- They run a bit faster
- They take fewer keystrokes
- There is less to learn
- Their customization is limited to each verb's options
Example using the DSL:
mlr put -q '@x_sum[$a] += $x; end{emit @x_sum, "a"}' data/small
a=pan,x_sum=0.346791
a=eks,x_sum=1.140078
a=wye,x_sum=0.777891
- You get to write your own DSL expressions
- They run a bit slower
- They take more keystrokes
- There is more to learn
- They are highly customizable
Please see Verbs Reference for information on verbs other than put and filter.
Implicit loop over records for main statements¶
The most important point about the Miller DSL is that it is designed for streaming operation over records.
DSL statements include:
- func and subr for user-defined functions and subroutines, which we'll look at later in the separate page about them;
- begin and end blocks, for statements you want to run before the first record, or after the last one;
- everything else, which collectively are called main statements.
The feature of streaming operation over records is implemented by the main statements getting invoked once per record. You don't explicitly loop over records, as you would in some dataframes contexts; rather, Miller loops over records for you, and it lets you specify what to do on each record: you write the body of the loop, not the loop itself.
(But you can, if you like, use those per-record statements to grow a list of records, then loop over them all in an end block. This is described in the page on operating on all records.)
To see this in action, let's take a look at the data/short.csv file:
cat data/short.csv
word,value
apple,37
ball,28
cat,54
There are three records in this file, with word=apple, word=ball, and word=cat, respectively. Let's print something in a begin statement, add a field in a main statement, and print something else in an end statement:
mlr --csv --from data/short.csv put '
  begin {
    print "begin";
  }
  $nr = NR;
  end {
    print "end";
  }
'
begin
word,value,nr
apple,37,1
ball,28,2
cat,54,3
end
The print statements for begin and end went out before the first record was seen and after the last was seen; the field-creation statement $nr = NR was invoked three times, once for each record. We didn't explicitly loop over records, since Miller was already looping over records, and invoked our main statement on each loop iteration.
For almost all simple uses of the Miller programming language, this implicit looping over records is probably all you will need. (For more involved cases you can see the pages on operating on all records, out-of-stream variables, and two-pass algorithms.)
Essential use: record-selection and record-updating¶
The essential usages of mlr filter and mlr put are for record-selection and record-updating expressions, respectively. For example, given the following input data:
cat data/small
a=pan,b=pan,i=1,x=0.346791,y=0.726802
a=eks,b=pan,i=2,x=0.758679,y=0.522151
a=wye,b=wye,i=3,x=0.204603,y=0.338318
a=eks,b=wye,i=4,x=0.381399,y=0.134188
a=wye,b=pan,i=5,x=0.573288,y=0.863624
you might retain only the records whose a field has value eks:
mlr filter '$a == "eks"' data/small
a=eks,b=pan,i=2,x=0.758679,y=0.522151
a=eks,b=wye,i=4,x=0.381399,y=0.134188
or you might add a new field which is a function of existing fields:
mlr put '$ab = $a . "_" . $b ' data/small
a=pan,b=pan,i=1,x=0.346791,y=0.726802,ab=pan_pan
a=eks,b=pan,i=2,x=0.758679,y=0.522151,ab=eks_pan
a=wye,b=wye,i=3,x=0.204603,y=0.338318,ab=wye_wye
a=eks,b=wye,i=4,x=0.381399,y=0.134188,ab=eks_wye
a=wye,b=pan,i=5,x=0.573288,y=0.863624,ab=wye_pan
Differences between put and filter¶
The two verbs mlr filter and mlr put are essentially the same. The only differences are:
- Expressions sent to mlr filter should contain a boolean expression, which is the filtering criterion. (If not, all records pass through.)
- mlr filter expressions may not reference the filter keyword within them.
Location of boolean expression for filter¶
You can define and invoke functions and subroutines to help produce the bare-boolean statement, and record fields may be assigned in the statements before or after the bare-boolean statement. For example:
mlr --c2p --from example.csv filter '
  # Bare-boolean filter expression: only records matching this pass through:
  $quantity >= 70;
  # For records that do pass through, set these:
  if ($rate > 8) {
    $description = "high rate";
  } else {
    $description = "low rate";
  }
'
color  shape    flag  k  index quantity rate   description
red    square   true  2  15    79.2778  0.0130 low rate
red    square   false 4  48    77.5542  7.4670 low rate
purple triangle false 5  51    81.2290  8.5910 high rate
red    square   false 6  64    77.1991  9.5310 high rate
purple triangle false 7  65    80.1405  5.8240 low rate
purple square   false 10 91    72.3735  8.2430 high rate
mlr --c2p --from example.csv filter '
  # Bare-boolean filter expression: only records matching this pass through:
  $shape =~ "^(...)(...)$";
  # For records that do pass through, capture the first "(...)" into $left and
  # the second "(...)" into $right
  $left = "\1";
  $right = "\2";
'
color  shape  flag  k  index quantity rate   left right
red    square true  2  15    79.2778  0.0130 squ  are
red    circle true  3  16    13.8103  2.9010 cir  cle
red    square false 4  48    77.5542  7.4670 squ  are
red    square false 6  64    77.1991  9.5310 squ  are
yellow circle true  8  73    63.9785  4.2370 cir  cle
yellow circle true  9  87    63.5058  8.3350 cir  cle
purple square false 10 91    72.3735  8.2430 squ  are
There are more details and more choices, of course, as detailed in the following sections.