Reference

Contents:
• Command overview
• On-line help
• I/O options
    • Formats
    • Compression
    • Record/field/pair separators
    • Number formatting
• Data transformations
    • bar
    • bootstrap
    • cat
    • check
    • count-distinct
    • cut
    • decimate
    • filter
        • Features which filter shares with put
    • grep
    • group-by
    • group-like
    • having-fields
    • head
    • histogram
    • join
    • label
    • least-frequent
    • merge-fields
    • most-frequent
    • nest
    • nothing
    • put
        • Features which put shares with filter
    • regularize
    • rename
    • reorder
    • repeat
    • reshape
    • sample
    • sec2gmt
    • sec2gmtdate
    • seqgen
    • shuffle
    • sort
    • stats1
    • stats2
    • step
    • tac
    • tail
    • tee
    • top
    • uniq
• Expression language for filter and put
    • Syntax
        • Expression formatting
        • Expressions from files
        • Semicolons, newlines, and curly braces
    • Variables
        • Built-in variables
        • Field names
        • Local variables
        • Out-of-stream variables for put
        • Indexed out-of-stream variables for put
        • Aggregate variable assignments for put
        • Keywords for filter and put
    • Control structures
        • Pattern-action blocks
        • If-statements
        • While and do-while loops
        • For-loops
        • Begin/end blocks for put
    • Output statements for put
        • Emit statements for put
        • Multi-emit statements for put
        • Emit-all statements for put
        • Redirected-output statements for put
    • Unset statements for put
    • Filter statements for put
    • Built-in functions for filter and put
    • User-defined functions and subroutines
        • User-defined functions
        • User-defined subroutines
    • A note on the complexity of Miller’s expression language
• then-chaining
• Data types
• Null data: empty and absent
• String literals
• Regular expressions
    • Regex captures
• Operator precedence
• Operator and function semantics
• Arithmetic
    • Input scanning
    • Conversion by math routines
    • Conversion by arithmetic operators
    • Pythonic division

Command overview

Whereas the Unix toolkit is made of the separate executables cat, tail, cut, sort, etc., Miller has subcommands, invoked as follows:

mlr tac *.dat
mlr cut --complement -f os_version *.dat
mlr sort -f hostname,uptime *.dat

These fall into categories as follows:

Commands                                    Description
cat, cut, head, sort, tac, tail, top, uniq  Analogs of their Unix-toolkit namesakes, discussed below as well as in Miller features in the context of the Unix toolkit
filter, put, step                           awk-like functionality
histogram, stats1, stats2                   Statistically oriented
group-by, group-like, having-fields         Particularly oriented toward Record-heterogeneity, although all Miller commands can handle heterogeneous records
count-distinct, label, rename, reorder      These draw from other sources (see also How original is Miller?): count-distinct is SQL-ish, and rename can be done by sed (which does it faster: see Performance).

On-line help

Examples:

$ mlr --help
Usage: mlr [I/O options] {verb} [verb-dependent options ...] {zero or more file names}

Command-line-syntax examples:
  mlr --csv cut -f hostname,uptime mydata.csv
  mlr --tsv --rs lf filter '$status != "down" && $upsec >= 10000' *.tsv
  mlr --nidx put '$sum = $7 < 0.0 ? 3.5 : $7 + 2.1*$8' *.dat
  grep -v '^#' /etc/group | mlr --ifs : --nidx --opprint label group,pass,gid,member then sort -f group
  mlr join -j account_id -f accounts.dat then group-by account_name balances.dat
  mlr --json put '$attr = sub($attr, "([0-9]+)_([0-9]+)_.*", "\1:\2")' data/*.json
  mlr stats1 -a min,mean,max,p10,p50,p90 -f flag,u,v data/*
  mlr stats2 -a linreg-pca -f u,v -g shape data/*
  mlr put -q '@sum[$a][$b] += $x; end {emit @sum, "a", "b"}' data/*
  mlr --from estimates.tbl put '
  for (k,v in $*) {
    if (isnumeric(v) && k =~ "^[t-z].*$") {
      $sum += v; $count += 1
    }
  }
  $mean = $sum / $count # no assignment if count unset'
  mlr --from infile.dat put -f analyze.mlr
  mlr --from infile.dat put 'tee > "./taps/data-".$a."-".$b, $*'
  mlr --from infile.dat put 'tee | "gzip > ./taps/data-".$a."-".$b.".gz", $*'
  mlr --from infile.dat put -q '@v=$*; dump | "jq .[]"'
  mlr --from infile.dat put  '(NR % 1000 == 0) { print > stderr, "Checkpoint ".NR}'

Data-format examples:
  DKVP: delimited key-value pairs (Miller default format)
  +---------------------+
  | apple=1,bat=2,cog=3 | Record 1: "apple" => "1", "bat" => "2", "cog" => "3"
  | dish=7,egg=8,flint  | Record 2: "dish" => "7", "egg" => "8", "3" => "flint"
  +---------------------+

  NIDX: implicitly numerically indexed (Unix-toolkit style)
  +---------------------+
  | the quick brown     | Record 1: "1" => "the", "2" => "quick", "3" => "brown"
  | fox jumped          | Record 2: "1" => "fox", "2" => "jumped"
  +---------------------+

  CSV/CSV-lite: comma-separated values with separate header line
  +---------------------+
  | apple,bat,cog       |
  | 1,2,3               | Record 1: "apple" => "1", "bat" => "2", "cog" => "3"
  | 4,5,6               | Record 2: "apple" => "4", "bat" => "5", "cog" => "6"
  +---------------------+

  Tabular JSON: nested objects are supported, although arrays within them are not:
  +---------------------+
  | {                   |
  |  "apple": 1,        | Record 1: "apple" => "1", "bat" => "2", "cog" => "3"
  |  "bat": 2,          |
  |  "cog": 3           |
  | }                   |
  | {                   |
  |   "dish": {         | Record 2: "dish:egg" => "7", "dish:flint" => "8", "garlic" => ""
  |     "egg": 7,       |
  |     "flint": 8      |
  |   },                |
  |   "garlic": ""      |
  | }                   |
  +---------------------+

  PPRINT: pretty-printed tabular
  +---------------------+
  | apple bat cog       |
  | 1     2   3         | Record 1: "apple" => "1", "bat" => "2", "cog" => "3"
  | 4     5   6         | Record 2: "apple" => "4", "bat" => "5", "cog" => "6"
  +---------------------+

  XTAB: pretty-printed transposed tabular
  +---------------------+
  | apple 1             | Record 1: "apple" => "1", "bat" => "2", "cog" => "3"
  | bat   2             |
  | cog   3             |
  |                     |
  | dish 7              | Record 2: "dish" => "7", "egg" => "8"
  | egg  8              |
  +---------------------+

  Markdown tabular (supported for output only):
  +-----------------------+
  | | apple | bat | cog | |
  | | ---   | --- | --- | |
  | | 1     | 2   | 3   | | Record 1: "apple" => "1", "bat" => "2", "cog" => "3"
  | | 4     | 5   | 6   | | Record 2: "apple" => "4", "bat" => "5", "cog" => "6"
  +-----------------------+

Help options:
  -h or --help Show this message.
  --version              Show the software version.
  {verb name} --help     Show verb-specific help.
  --list-all-verbs or -l List only verb names.
  --help-all-verbs       Show help on all verbs.

Verbs:
   bar bootstrap cat check count-distinct cut decimate filter grep group-by
   group-like having-fields head histogram join label least-frequent
   merge-fields most-frequent nest nothing put regularize rename reorder repeat
   reshape sample sec2gmt sec2gmtdate seqgen shuffle sort stats1 stats2 step
   tac tail tee top uniq

Functions for the filter and put verbs:
   + + - - * / // % ** | ^ & ~ << >> == != =~ !=~ > >= < <= && || ^^ ! ? :
   isnull isnotnull isabsent ispresent isempty isnotempty isnumeric isint
   isfloat isbool isstring boolean float fmtnum hexfmt int string typeof . gsub
   strlen sub tolower toupper abs acos acosh asin asinh atan atan2 atanh cbrt
   ceil cos cosh erf erfc exp expm1 floor invqnorm log log10 log1p logifit madd
   max mexp min mmul msub pow qnorm round roundm sgn sin sinh sqrt tan tanh
   urand urand32 urandint dhms2fsec dhms2sec fsec2dhms fsec2hms gmt2sec
   hms2fsec hms2sec sec2dhms sec2gmt sec2gmtdate sec2hms strftime strptime
   systime
Please use "mlr --help-function {function name}" for function-specific help.
Please use "mlr --help-all-functions" or "mlr -f" for help on all functions.
Please use "mlr --help-all-keywords" or "mlr -k" for help on all keywords.

Data-format options, for input, output, or both:
  --idkvp   --odkvp   --dkvp      Delimited key-value pairs, e.g "a=1,b=2"
                                  (this is Miller's default format).

  --inidx   --onidx   --nidx      Implicitly-integer-indexed fields
                                  (Unix-toolkit style).

  --icsv    --ocsv    --csv       Comma-separated value (or tab-separated
                                  with --fs tab, etc.)

  --itsv    --otsv    --tsv       Keystroke-savers for "--icsv --ifs tab",
                                  "--ocsv --ofs tab", "--csv --fs tab".

  --ipprint --opprint --pprint    Pretty-printed tabular (produces no
                                  output until all input is in).
                      --right     Right-justifies all fields for PPRINT output.
                      --barred    Prints a border around PPRINT output.

            --omd                 Markdown-tabular (only available for output).

  --ixtab   --oxtab   --xtab      Pretty-printed vertical-tabular.
                      --xvright   Right-justifies values for XTAB format.

  --ijson   --ojson   --json      JSON tabular: sequence or list of one-level
                                  maps: {...}{...} or [{...},{...}].
                      --jvstack   Put one key-value pair per line for JSON
                                  output.
                      --jlistwrap Wrap JSON output in outermost [ ].
                      --jquoteall Quote map keys in JSON output, even if they're
                                  numeric.
              --jflatsep {string} Separator for flattening multi-level JSON keys,
                                  e.g. '{"a":{"b":3}}' becomes a:b => 3 for
                                  non-JSON formats. Defaults to :.

  -p is a keystroke-saver for --nidx --fs space --repifs

  Examples: --csv for CSV-formatted input and output; --idkvp --opprint for
  DKVP-formatted input and pretty-printed output.

  PLEASE USE "mlr --csv --rs lf" FOR NATIVE UN*X (LINEFEED-TERMINATED) CSV FILES.
  You can also have MLR_CSV_DEFAULT_RS=lf in your shell environment, e.g.
  "export MLR_CSV_DEFAULT_RS=lf" or "setenv MLR_CSV_DEFAULT_RS lf" depending on
  which shell you use.

As keystroke-savers for format-conversion you may use the following:
  --c2t --c2d --c2n --c2j --c2x --c2p --c2m
  --t2c       --t2d --t2n --t2j --t2x --t2p --t2m
  --d2c --d2t       --d2n --d2j --d2x --d2p --d2m
  --n2c --n2t --n2d       --n2j --n2x --n2p --n2m
  --j2c --j2t --j2d --j2n       --j2x --j2p --j2m
  --x2c --x2t --x2d --x2n --x2j       --x2p --x2m
  --p2c --p2t --p2d --p2n --p2j --p2x       --p2m
The letters c t d n j x p m refer to formats CSV with LF, TSV with LF, DKVP,
NIDX, JSON, XTAB, PPRINT, and markdown, respectively. Note that markdown format
is available for output only.

Compressed-data options:
  --prepipe {command} This allows Miller to handle compressed inputs. You can do
  without this for single input files, e.g. "gunzip < myfile.csv.gz | mlr ...".
  However, when multiple input files are present, between-file separations are
  lost; also, the FILENAME variable doesn't iterate. Using --prepipe you can
  specify an action to be taken on each input file. This pre-pipe command must
  be able to read from standard input; it will be invoked with
    {command} < {filename}.
  Examples:
    mlr --prepipe 'gunzip'
    mlr --prepipe 'zcat -cf'
    mlr --prepipe 'xz -cd'
    mlr --prepipe cat
  Note that this feature is quite general and is not limited to decompression
  utilities. You can use it to apply per-file filters of your choice.
  For output compression (or other) utilities, simply pipe the output:
    mlr ... | {your compression command}

Separator options, for input, output, or both:
  --rs     --irs     --ors              Record separators, e.g. 'lf' or '\r\n'
  --fs     --ifs     --ofs  --repifs    Field separators, e.g. comma
  --ps     --ips     --ops              Pair separators, e.g. equals sign
  Notes:
  * IPS/OPS are only used for DKVP and XTAB formats, since only in these formats
    do key-value pairs appear juxtaposed.
  * IRS/ORS are ignored for XTAB format. Nominally IFS and OFS are newlines;
    XTAB records are separated by two or more consecutive IFS/OFS -- i.e.
    a blank line.
  * OFS must be single-character for PPRINT format. This is because it is used
    with repetition for alignment; multi-character separators would make
    alignment impossible.
  * OPS may be multi-character for XTAB format, in which case alignment is
    disabled.
  * DKVP, NIDX, CSVLITE, PPRINT, and XTAB formats are intended to handle
    platform-native text data. In particular, this means LF line-terminators
    by default on Linux/OSX. You can use "--dkvp --rs crlf" for
    CRLF-terminated DKVP files, and so on.
  * CSV is intended to handle RFC-4180-compliant data. In particular, this means
    it uses CRLF line-terminators by default. You can use "--csv --rs lf" for
    Linux-native CSV files.  You can also have "MLR_CSV_DEFAULT_RS=lf" in your
    shell environment, e.g.  "export MLR_CSV_DEFAULT_RS=lf" or "setenv
    MLR_CSV_DEFAULT_RS lf" depending on which shell you use.
  * TSV is simply CSV using tab as field separator ("--fs tab").
  * FS/PS are ignored for markdown format; RS is used.
  * All RS/FS/PS options are ignored for JSON format: JSON doesn't allow
    changing these.
  * You can specify separators in any of the following ways, shown by example:
    - Type them out, quoting as necessary for shell escapes, e.g.
      "--fs '|' --ips :"
    - C-style escape sequences, e.g. "--rs '\r\n' --fs '\t'".
    - To avoid backslashing, you can use any of the following names:
      cr crcr newline lf lflf crlf crlfcrlf tab space comma pipe slash colon semicolon equals
  * Default separators by format:
      File format  RS       FS       PS
      dkvp         \n       ,        =
      json         (N/A)    (N/A)    (N/A)
      nidx         \n       space    (N/A)
      csv          \r\n     ,        (N/A)
      csvlite      \n       ,        (N/A)
      markdown     \n       (N/A)    (N/A)
      pprint       \n       space    (N/A)
      xtab         (N/A)    \n       space

Relevant to CSV/CSV-lite input only:
  --implicit-csv-header Use 1,2,3,... as field labels, rather than from line 1
                     of input files. Tip: combine with "label" to recreate
                     missing headers.
  --headerless-csv-output   Print only CSV data lines.

Double-quoting for CSV output:
  --quote-all        Wrap all fields in double quotes
  --quote-none       Do not wrap any fields in double quotes, even if they have
                     OFS or ORS in them
  --quote-minimal    Wrap fields in double quotes only if they have OFS or ORS
                     in them (default)
  --quote-numeric    Wrap fields in double quotes only if they have numbers
                     in them
  --quote-original   Wrap fields in double quotes if and only if they were
                     quoted on input. This isn't sticky for computed fields:
                     e.g. if fields a and b were quoted on input and you do
                     "put '$c = $a . $b'" then field c won't inherit a or b's
                     was-quoted-on-input flag.

Numerical formatting:
  --ofmt {format}    E.g. %.18lf, %.0lf. Please use sprintf-style codes for
                     double-precision. Applies to verbs which compute new
                     values, e.g. put, stats1, stats2. See also the fmtnum
                     function within mlr put (mlr --help-all-functions).
                     Defaults to %lf.

Other options:
  --seed {n} with n of the form 12345678 or 0xcafefeed. For put/filter
                     urand()/urandint()/urand32().
  --nr-progress-mod {m}, with m a positive integer: print filename and record
                     count to stderr every m input records.
  --from {filename}  Use this to specify an input file before the verb(s),
                     rather than after. May be used more than once. Example:
                     "mlr --from a.dat --from b.dat cat" is the same as
                     "mlr cat a.dat b.dat".
  -n                 Process no input files, nor standard input either. Useful
                     for mlr put with begin/end statements only. (Same as --from
                     /dev/null.) Also useful in "mlr -n put -v '...'" for
                     analyzing abstract syntax trees (if that's your thing).

Then-chaining:
Output of one verb may be chained as input to another using "then", e.g.
  mlr stats1 -a min,mean,max -f flag,u,v -g color then sort -f color

For more information please see http://johnkerl.org/miller/doc and/or
http://github.com/johnkerl/miller. This is Miller version v4.5.0-dev.

$ mlr sort --help
Usage: mlr sort {flags}
Flags:
  -f  {comma-separated field names}  Lexical ascending
  -n  {comma-separated field names}  Numerical ascending; nulls sort last
  -nf {comma-separated field names}  Numerical ascending; nulls sort last
  -r  {comma-separated field names}  Lexical descending
  -nr {comma-separated field names}  Numerical descending; nulls sort first
Sorts records primarily by the first specified field, secondarily by the second
field, and so on.  Any records not having all specified sort keys will appear
at the end of the output, in the order they were encountered, regardless of the
specified sort order.
Example:
  mlr sort -f a,b -nr x,y,z
which is the same as:
  mlr sort -f a -f b -nr x -nr y -nr z

I/O options

Formats

Options:

  --dkvp    --idkvp    --odkvp
  --nidx    --inidx    --onidx
  --csv     --icsv     --ocsv
  --csvlite --icsvlite --ocsvlite
  --pprint  --ipprint  --opprint  --right
  --xtab    --ixtab    --oxtab
  --json    --ijson    --ojson

These are as discussed in File formats, with the exception of --right which makes pretty-printed output right-aligned:

$ mlr --opprint cat data/small
a   b   i x                   y
pan pan 1 0.3467901443380824  0.7268028627434533
eks pan 2 0.7586799647899636  0.5221511083334797
wye wye 3 0.20460330576630303 0.33831852551664776
eks wye 4 0.38139939387114097 0.13418874328430463
wye pan 5 0.5732889198020006  0.8636244699032729

$ mlr --opprint --right cat data/small
  a   b i                   x                   y
pan pan 1  0.3467901443380824  0.7268028627434533
eks pan 2  0.7586799647899636  0.5221511083334797
wye wye 3 0.20460330576630303 0.33831852551664776
eks wye 4 0.38139939387114097 0.13418874328430463
wye pan 5  0.5732889198020006  0.8636244699032729

Additional notes:

  • Use --csv, --pprint, etc. when the input and output formats are the same.
  • Use --icsv --opprint, etc. when you want format conversion as part of what Miller does to your data.
  • DKVP (key-value-pair) format is the default for input and output. So, --oxtab is the same as --idkvp --oxtab.

Compression

Options:

  --prepipe {command}

The prepipe command is anything which reads from standard input and produces data acceptable to Miller. Nominally this allows you to use whichever decompression utilities you have installed on your system, on a per-file basis. If the command has flags, quote them: e.g. mlr --prepipe 'zcat -cf'. Examples:

# These two produce the same output:
$ gunzip < myfile1.csv.gz | mlr cut -f hostname,uptime
$ mlr --prepipe gunzip cut -f hostname,uptime myfile1.csv.gz
# With multiple input files you need --prepipe:
$ mlr --prepipe gunzip cut -f hostname,uptime myfile1.csv.gz myfile2.csv.gz
$ mlr --prepipe gunzip --idkvp --oxtab cut -f hostname,uptime myfile1.dat.gz myfile2.dat.gz

# Similar to the above, but with compressed output as well as input:
$ gunzip < myfile1.csv.gz | mlr cut -f hostname,uptime | gzip > outfile.csv.gz
$ mlr --prepipe gunzip cut -f hostname,uptime myfile1.csv.gz | gzip > outfile.csv.gz
$ mlr --prepipe gunzip cut -f hostname,uptime myfile1.csv.gz myfile2.csv.gz | gzip > outfile.csv.gz

# Similar to the above, but with different compression tools for input and output:
$ gunzip < myfile1.csv.gz | mlr cut -f hostname,uptime | xz -z > outfile.csv.xz
$ xz -cd < myfile1.csv.xz | mlr cut -f hostname,uptime | gzip > outfile.csv.gz
$ mlr --prepipe 'xz -cd' cut -f hostname,uptime myfile1.csv.xz myfile2.csv.xz | xz -z > outfile.csv.xz

... etc.
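
As a rough sketch of the per-file behavior: the on-line help says the prepipe command is invoked as {command} < {filename} for each input file. The loop below emulates that shape in plain shell. The helper name prepipe_each and the scratch files are hypothetical, and this is an illustration of the idea rather than Miller's implementation.

```shell
# prepipe_each is a hypothetical helper, not part of Miller.
# It runs the given command once per file, with that file on standard input,
# which is the documented shape: {command} < {filename}.
prepipe_each() {
  cmd=$1; shift
  for f in "$@"; do
    sh -c "$cmd" < "$f"
  done
}

# Build two small gzipped DKVP inputs in a scratch directory.
tmp=$(mktemp -d)
printf 'hostname=a,uptime=10\n' | gzip > "$tmp/one.dat.gz"
printf 'hostname=b,uptime=20\n' | gzip > "$tmp/two.dat.gz"

# Each file is decompressed separately, one after the other.
prepipe_each 'gunzip' "$tmp/one.dat.gz" "$tmp/two.dat.gz"
```

Because the command runs once per file, anything that reads from standard input and writes to standard output fits here, not only decompressors.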

Record/field/pair separators

Miller has record separators IRS and ORS, field separators IFS and OFS, and pair separators IPS and OPS. For example, in the DKVP line a=1,b=2,c=3, the record separator is newline, field separator is comma, and pair separator is the equals sign. These are the default values.
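
To make the three separator levels concrete, here is a small awk sketch that takes the DKVP line above apart using the default separators. The helper name dkvp_pairs is hypothetical; this only mimics the splitting, it is not how Miller parses input.

```shell
# dkvp_pairs is a hypothetical helper. awk's default record separator is
# newline (RS), -F, sets the field separator (FS) to comma, and split() on
# "=" plays the role of the pair separator (PS).
dkvp_pairs() {
  awk -F, '{
    for (i = 1; i <= NF; i++) {
      split($i, kv, "=")        # kv[1] is the key, kv[2] is the value
      print kv[1] " => " kv[2]
    }
  }'
}

printf 'a=1,b=2,c=3\n' | dkvp_pairs
# a => 1
# b => 2
# c => 3
```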

Options:

  --rs --irs --ors
  --fs --ifs --ofs --repifs
  --ps --ips --ops
  • You can change a separator from input to output via e.g. --ifs = --ofs :. Or, you can specify that the same separator is to be used for input and output via e.g. --fs :.
  • The pair separator is relevant only to DKVP and XTAB formats, since only in these formats do key-value pairs appear juxtaposed.
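  • JSON format ignores the separator arguments altogether. XTAB format ignores the record separators, and pretty-print output needs a single-character OFS, since that character is repeated for alignment.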
  • The --repifs option means that multiple successive occurrences of the field separator count as one. For example, in CSV data we often signify nulls by empty strings, e.g. 2,9,,,,,6,5,4, so repeated separators must not be merged there. On the other hand, if the field separator is a space, it might be more natural to parse 2 4    5 the same as 2 4 5: --repifs --ifs ' ' lets this happen. In fact, the --ipprint option above is internally implemented in terms of --repifs.
  • Just write out the desired separator, e.g. --ofs '|'. But you may use the symbolic names newline, space, tab, pipe, or semicolon if you like.
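
The effect of repeated separators can be seen with the Unix toolkit alone. In this sketch, tr -s stands in for --repifs by squeezing runs of spaces; the helper name count_fields is hypothetical.

```shell
# count_fields is a hypothetical helper. -F'[ ]' makes awk split on each
# single space, so consecutive spaces produce empty fields.
count_fields() { awk -F'[ ]' '{ print NF }'; }

# Without squeezing: 2, 4, three empty fields, and 5.
printf '2 4    5\n' | count_fields               # 6

# With runs of spaces squeezed to one, as --repifs would treat them.
printf '2 4    5\n' | tr -s ' ' | count_fields   # 3
```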

Number formatting

The command-line option --ofmt {format string} is the global number format for commands which generate numeric output, e.g. stats1, stats2, histogram, and step, as well as mlr put. Examples:

--ofmt %.9le  --ofmt %.6lf  --ofmt %.0lf

These are just C printf formats applied to double-precision numbers. Please don’t use %s or %d. Additionally, if you use leading width (e.g. %18.12lf) then the output will contain embedded whitespace, which may not be what you want if you pipe the output to something else, particularly CSV. I use Miller’s pretty-print format (mlr --opprint) to column-align numerical data.
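
Since these are C printf formats, the shell's own printf gives a quick preview of what a format string will do, including the embedded whitespace that comes with a leading width. This is only a sketch: the helper name fmt_num is hypothetical, and the l length modifier is dropped here because shell printf implementations differ on whether they accept it (in C, %lf and %f format a double the same way).

```shell
# fmt_num is a hypothetical helper: format one number with one printf format.
# LC_ALL=C keeps the decimal point a "." regardless of locale.
fmt_num() { LC_ALL=C printf "$1\n" "$2"; }

fmt_num '%.9e' 3.1        # 3.100000000e+00
fmt_num '%.6f' 3.1        # 3.100000
fmt_num '%.0f' 3.1        # 3
fmt_num '[%18.12f]' 3.1   # [    3.100000000000]  (four leading spaces)
```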

To apply formatting to a single field, overriding the global ofmt, use the fmtnum function within mlr put. For example:

$ echo 'x=3.1,y=4.3' | mlr put '$z=fmtnum($x*$y,"%08lf")'
x=3.1,y=4.3,z=13.330000

$ echo 'x=0xffff,y=0xff' | mlr put '$z=fmtnum(int($x*$y),"%08llx")'
x=0xffff,y=0xff,z=00feff01

Input conversion from hexadecimal is done automatically on fields handled by mlr put and mlr filter as long as the field value begins with "0x". To apply output conversion to hexadecimal on a single column, you may use fmtnum, or the keystroke-saving hexfmt function. Example:

$ echo 'x=0xffff,y=0xff' | mlr put '$z=hexfmt($x*$y)'
x=0xffff,y=0xff,z=0xfeff01
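
The hexadecimal results above can be cross-checked with shell arithmetic, which also accepts 0x-prefixed literals. This is just an independent check of the arithmetic, not a view into how Miller does the conversion.

```shell
# 0xffff * 0xff = 65535 * 255 = 16711425
product=$((0xffff * 0xff))

printf '%d\n' "$product"     # 16711425
printf '%08x\n' "$product"   # 00feff01, matching fmtnum(..., "%08llx")
printf '0x%x\n' "$product"   # 0xfeff01, matching hexfmt(...)
```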

Data transformations

bar

Cheesy bar-charting.

$ mlr bar -h
Usage: mlr bar [options]
Replaces a numeric field with a number of asterisks, allowing for cheesy
bar plots. These align best with --opprint or --oxtab output format.
Options:
-f   {a,b,c}      Field names to convert to bars.
-c   {character}  Fill character: default '*'.
-x   {character}  Out-of-bounds character: default '#'.
-b   {character}  Blank character: default '.'.
--lo {lo}         Lower-limit value for min-width bar: default '0.000000'.
--hi {hi}         Upper-limit value for max-width bar: default '100.000000'.
-w   {n}          Bar-field width: default '40'.
--auto            Automatically computes limits, ignoring --lo and --hi.
                  Holds all records in memory before producing any output.

$ mlr --opprint cat data/small
a   b   i x                   y
pan pan 1 0.3467901443380824  0.7268028627434533
eks pan 2 0.7586799647899636  0.5221511083334797
wye wye 3 0.20460330576630303 0.33831852551664776
eks wye 4 0.38139939387114097 0.13418874328430463
wye pan 5 0.5732889198020006  0.8636244699032729

$ mlr --opprint bar --lo 0 --hi 1 -f x,y data/small
a   b   i x                                        y
pan pan 1 *************........................... *****************************...........
eks pan 2 ******************************.......... ********************....................
wye wye 3 ********................................ *************...........................
eks wye 4 ***************......................... *****...................................
wye pan 5 **********************.................. **********************************......

$ mlr --opprint bar --lo 0.4 --hi 0.6 -f x,y data/small
a   b   i x                                        y
pan pan 1 #....................................... ***************************************#
eks pan 2 ***************************************# ************************................
wye wye 3 #....................................... #.......................................
eks wye 4 #....................................... #.......................................
wye pan 5 **********************************...... ***************************************#

$ mlr --opprint bar --auto -f x,y data/small
a   b   i x                                                           y
pan pan 1 [0.204603]**********..............................[0.75868] [0.134189]********************************........[0.863624]
eks pan 2 [0.204603]***************************************#[0.75868] [0.134189]*********************...................[0.863624]
wye wye 3 [0.204603]#.......................................[0.75868] [0.134189]***********.............................[0.863624]
eks wye 4 [0.204603]************............................[0.75868] [0.134189]#.......................................[0.863624]
wye pan 5 [0.204603]**************************..............[0.75868] [0.134189]***************************************#[0.863624]

bootstrap

$ mlr bootstrap --help
Usage: mlr bootstrap [options]
Emits an n-sample, with replacement, of the input records.
Options:
-n {number} Number of samples to output. Defaults to number of input records.
            Must be non-negative.
See also mlr sample and mlr shuffle.

The canonical use for bootstrap sampling is to put error bars on statistical quantities, such as the mean. For example:

$ mlr --opprint stats1 -a mean,count -f u -g color data/colored-shapes.dkvp
color  u_mean   u_count
yellow 0.497129 1413
red    0.492560 4641
purple 0.494005 1142
green  0.504861 1109
blue   0.517717 1470
orange 0.490532 303

$ mlr --opprint bootstrap then stats1 -a mean,count -f u -g color data/colored-shapes.dkvp
color  u_mean   u_count
yellow 0.500651 1380
purple 0.501556 1111
green  0.503272 1068
red    0.493895 4702
blue   0.512529 1496
orange 0.521030 321

$ mlr --opprint bootstrap then stats1 -a mean,count -f u -g color data/colored-shapes.dkvp
color  u_mean   u_count
yellow 0.498046 1485
blue   0.513576 1417
red    0.492870 4595
orange 0.507697 307
green  0.496803 1075
purple 0.486337 1199

$ mlr --opprint bootstrap then stats1 -a mean,count -f u -g color data/colored-shapes.dkvp
color  u_mean   u_count
blue   0.522921 1447
red    0.490717 4617
yellow 0.496450 1419
purple 0.496523 1192
green  0.507569 1111
orange 0.468014 292
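
For readers who want to see the sampling step in isolation, here is a minimal awk sketch of drawing an n-sample with replacement and taking the mean of each resample. The helper name bootstrap_lines, the seeds, and the five input values are all hypothetical, and awk's random-number generator differs from Miller's, so the numbers will not match any Miller run.

```shell
# bootstrap_lines is a hypothetical helper: it reads N lines and prints N
# lines drawn uniformly at random with replacement, so some input lines
# repeat and others do not appear at all. $1 is the random seed.
bootstrap_lines() {
  awk -v seed="$1" '
    BEGIN { srand(seed) }
    { rec[NR] = $0 }
    END {
      for (i = 1; i <= NR; i++)
        print rec[int(rand() * NR) + 1]   # index in 1..NR
    }'
}

# Resample five values three times; the spread of the means across
# resamples is what gives a feel for the error bar on the mean.
for seed in 1 2 3; do
  printf '0.35\n0.76\n0.20\n0.38\n0.57\n' |
    bootstrap_lines "$seed" |
    awk '{ sum += $1 } END { printf "mean=%.4f n=%d\n", sum / NR, NR }'
done
```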

cat

Most useful for format conversions (see File formats), and for concatenating multiple same-schema CSV files so that the output has a single header:

$ mlr cat -h
Usage: mlr cat [options]
Passes input records directly to output. Most useful for format conversion.
Options:
-n        Prepend field "n" to each record with record-counter starting at 1
-N {name} Prepend field {name} to each record with record-counter starting at 1

$ cat data/a.csv
a,b,c
1,2,3
4,5,6

$ cat data/b.csv
a,b,c
7,8,9

$ mlr --csv cat data/a.csv data/b.csv
a,b,c
1,2,3
4,5,6
7,8,9

$ mlr --icsv --oxtab cat data/a.csv data/b.csv
a 1
b 2
c 3

a 4
b 5
c 6

a 7
b 8
c 9

$ mlr --csv cat -n data/a.csv data/b.csv
n,a,b,c
1,1,2,3
2,4,5,6
3,7,8,9
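
For comparison, the same-schema concatenation can be approximated with awk. This sketch recreates the two CSV files shown above in a scratch directory; the helper names csv_cat and csv_cat_n are hypothetical, and unlike Miller the awk version never checks that the headers actually agree.

```shell
tmp=$(mktemp -d)
printf 'a,b,c\n1,2,3\n4,5,6\n' > "$tmp/a.csv"
printf 'a,b,c\n7,8,9\n' > "$tmp/b.csv"

# Keep the header line of the first file only; skip line 1 of later files.
csv_cat() { awk 'FNR == 1 && NR != 1 { next } { print }' "$@"; }

# Same, with a record counter prepended, in the spirit of "mlr cat -n".
csv_cat_n() {
  csv_cat "$@" | awk 'NR == 1 { print "n," $0; next } { print NR - 1 "," $0 }'
}

csv_cat "$tmp/a.csv" "$tmp/b.csv"
csv_cat_n "$tmp/a.csv" "$tmp/b.csv"
```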

check

$ mlr check --help
Usage: mlr check
Consumes records without printing any output.
Useful for doing a well-formatted check on input data.

count-distinct

$ mlr count-distinct --help
Usage: mlr count-distinct [options]
-f {a,b,c}    Field names for distinct count.
-n            Show only the number of distinct values.
Prints number of records having distinct values for specified field names.
Same as uniq -c.

$ mlr count-distinct -f a,b then sort -nr count data/medium
a=zee,b=wye,count=455
a=pan,b=eks,count=429
a=pan,b=pan,count=427
a=wye,b=hat,count=426
a=hat,b=wye,count=423
a=pan,b=hat,count=417
a=eks,b=hat,count=417
a=eks,b=eks,count=413
a=pan,b=zee,count=413
a=zee,b=hat,count=409
a=eks,b=wye,count=407
a=zee,b=zee,count=403
a=pan,b=wye,count=395
a=wye,b=pan,count=392
a=zee,b=eks,count=391
a=zee,b=pan,count=389
a=hat,b=eks,count=389
a=wye,b=eks,count=386
a=hat,b=zee,count=385
a=wye,b=zee,count=385
a=hat,b=hat,count=381
a=wye,b=wye,count=377
a=eks,b=pan,count=371
a=hat,b=pan,count=363
a=eks,b=zee,count=357

$ mlr count-distinct -n -f a,b data/medium
count=25
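
The counting itself is easy to picture with awk. The sketch below makes a simplifying assumption that Miller does not: fields a and b are taken to be the first two fields of every record, whereas Miller looks fields up by name. The helper name count_distinct_ab and the three sample records are hypothetical.

```shell
# count_distinct_ab is a hypothetical helper. It tallies records per distinct
# (a,b) pair, assuming a=... and b=... are the first two DKVP fields.
# awk's for-in order is unspecified, so the result is piped through sort.
count_distinct_ab() {
  awk -F, '{ n[$1 "," $2]++ } END { for (k in n) print k ",count=" n[k] }' | sort
}

printf 'a=pan,b=pan,i=1\na=eks,b=pan,i=2\na=pan,b=pan,i=3\n' | count_distinct_ab
# a=eks,b=pan,count=1
# a=pan,b=pan,count=2
```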

cut

$ mlr cut --help
Usage: mlr cut [options]
Passes through input records with specified fields included/excluded.
-f {a,b,c}       Field names to include for cut.
-o               Retain fields in the order specified here in the argument list.
                 Default is to retain them in the order found in the input data.
-x|--complement  Exclude, rather than include, field names specified by -f.
-r               Treat field names as regular expressions. "ab", "a.*b" will
                 match any field name containing the substring "ab" or matching
                 "a.*b", respectively; anchors of the form "^ab$", "^a.*b$" may
                 be used. The -o flag is ignored when -r is present.
Examples:
  mlr cut -f hostname,status
  mlr cut -x -f hostname,status
  mlr cut -r -f '^status$,sda[0-9]'
  mlr cut -r -f '^status$,"sda[0-9]"'
  mlr cut -r -f '^status$,"sda[0-9]"i' (this is case-insensitive)

$ mlr --opprint cat data/small
a   b   i x                   y
pan pan 1 0.3467901443380824  0.7268028627434533
eks pan 2 0.7586799647899636  0.5221511083334797
wye wye 3 0.20460330576630303 0.33831852551664776
eks wye 4 0.38139939387114097 0.13418874328430463
wye pan 5 0.5732889198020006  0.8636244699032729

$ mlr --opprint cut -f y,x,i data/small
i x                   y
1 0.3467901443380824  0.7268028627434533
2 0.7586799647899636  0.5221511083334797
3 0.20460330576630303 0.33831852551664776
4 0.38139939387114097 0.13418874328430463
5 0.5732889198020006  0.8636244699032729

$ echo 'a=1,b=2,c=3' | mlr cut -f b,c,a
a=1,b=2,c=3

$ echo 'a=1,b=2,c=3' | mlr cut -o -f b,c,a
b=2,c=3,a=1

decimate

$ mlr decimate --help
Usage: mlr decimate [options]
-n {count}    Decimation factor; default 10
-b            Decimate by printing first of every n.
-e            Decimate by printing last of every n (default).
-g {a,b,c}    Optional group-by-field names for decimate counts
Passes through one of every n records, optionally by category.
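
Since no example is given for decimate, here is an awk sketch of the ungrouped case. It assumes that "first of every n" and "last of every n" refer to consecutive blocks of n records, which is my reading of the help text rather than something checked against Miller's source. The helper names are hypothetical.

```shell
# Hypothetical helpers; $1 is the decimation factor n.
decimate_last()  { awk -v n="$1" 'NR % n == 0'; }        # like -e (default)
decimate_first() { awk -v n="$1" '(NR - 1) % n == 0'; }  # like -b

seq 1 30 | decimate_last 10    # 10, 20, 30
seq 1 30 | decimate_first 10   # 1, 11, 21
```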

filter

$ mlr filter --help
Usage: mlr filter [options] {expression}
Prints records for which {expression} evaluates to true.

Options:
-v: First prints the AST (abstract syntax tree) for the expression, which gives
    full transparency on the precedence and associativity rules of Miller's
    grammar.
-t: Print low-level parser-trace to stderr.
-x: Prints records for which {expression} evaluates to false.
-S: Keeps field values, or literals in the expression, as strings with no type
    inference to int or float.
-F: Keeps field values, or literals in the expression, as strings or floats
    with no inference to int.
--oflatsep {string}: Separator to use when flattening multi-level @-variables
    to output records for emit. Default ":".
-f {filename}: the DSL expression is taken from the specified file rather
    than from the command line. Outer single quotes wrapping the expression
    should not be placed in the file. If -f is specified more than once,
    all input files specified using -f are concatenated to produce the expression.
    (For example, you can define functions in one file and call them from another.)
-e {expression}: You can use this after -f to add an expression. Example use
    case: define functions/subroutines in a file you specify with -f, then call
    them with an expression you specify with -e.
--no-fflush: for emit, tee, print, and dump, don't call fflush() after every
    record.
Any of the output-format command-line flags (see mlr -h). Example: using
  mlr --icsv --opprint ... then put --ojson 'tee > "mytap-".$a.".dat", $*' then ...
the input is CSV, the output is pretty-print tabular, but the tee-file output
is written in JSON format.

Please use a dollar sign for field names and double-quotes for string
literals. If field names have special characters such as "." then you might
use braces, e.g. '${field.name}'. Miller built-in variables are
NF NR FNR FILENUM FILENAME PI E, and ENV["namegoeshere"] to access environment
variables. The environment-variable name may be an expression, e.g. a field
value.

Use # to comment to end of line.

Examples:
  mlr filter 'log10($count) > 4.0'
  mlr filter 'FNR == 2'          (second record in each file)
  mlr filter 'urand() < 0.001'  (subsampling)
  mlr filter '$color != "blue" && $value > 4.2'
  mlr filter '($x<.5 && $y<.5) || ($x>.5 && $y>.5)'
  mlr filter '($name =~ "^sys.*east$") || ($name =~ "^dev.[0-9]+"i)'
  mlr filter '
    NR == 1 ||
   #NR == 2 ||
    NR == 3
  '

Please see http://johnkerl.org/miller/doc/reference.html for more information
including function list. Or "mlr -f". Please see also "mlr grep" which is
useful when you don't yet know which field name(s) you're looking for.

Features which filter shares with put

Please see Expression language for filter and put for more information about the expression language for mlr filter.

grep

$ mlr grep -h
Usage: mlr grep [options] {regular expression}
Passes through records which match {regex}.
Options:
-i    Use case-insensitive search.
-v    Invert: pass through records which do not match the regex.
Note that "mlr filter" is more powerful, but requires you to know field names.
By contrast, "mlr grep" allows you to regex-match the entire record. It does
this by formatting each record in memory as DKVP, using command-line-specified
ORS/OFS/OPS, and matching the resulting line against the regex specified
here. In particular, the regex is not applied to the input stream: if you
have CSV with header line "x,y,z" and data line "1,2,3" then the regex will
be matched, not against either of these lines, but against the DKVP line
"x=1,y=2,z=3".  Furthermore, not all the options to system grep are supported,
and this command is intended to be merely a keystroke-saver. To get all the
features of system grep, you can do
  "mlr --odkvp ... | grep ... | mlr --idkvp ..."

group-by

$ mlr group-by --help
Usage: mlr group-by {comma-separated field names}
Outputs records in batches having identical values at specified field names.

This is similar to sort but with less work. Namely, Miller’s sort has three steps: read through the data and append linked lists of records, one for each unique combination of the key-field values; after all records are read, sort the key-field values; then print each record-list. The group-by operation simply omits the middle sort. An example should make this clearer.

$ mlr --opprint group-by a data/small
a   b   i x                   y
pan pan 1 0.3467901443380824  0.7268028627434533
eks pan 2 0.7586799647899636  0.5221511083334797
eks wye 4 0.38139939387114097 0.13418874328430463
wye wye 3 0.20460330576630303 0.33831852551664776
wye pan 5 0.5732889198020006  0.8636244699032729

$ mlr --opprint sort -f a data/small
a   b   i x                   y
eks pan 2 0.7586799647899636  0.5221511083334797
eks wye 4 0.38139939387114097 0.13418874328430463
pan pan 1 0.3467901443380824  0.7268028627434533
wye wye 3 0.20460330576630303 0.33831852551664776
wye pan 5 0.5732889198020006  0.8636244699032729

In this example, since the sort is on field a, the first step is to group together all records having the same value for field a; the second step is to sort the distinct a-field values pan, eks, and wye into eks, pan, and wye; the third step is to print out the record-list for a=eks, then the record-list for a=pan, then the record-list for a=wye. The group-by operation omits the middle sort and just puts like records together, for those times when a sort isn’t desired. In particular, group-by emits the groups in the order in which their key values were first encountered in the data stream, which in some cases may be more interesting to you.
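
The three steps can be sketched in Python as follows. This is a model of the description above, using dicts rather than Miller's linked lists; the helper names bucket_by, group_by, and sort_by are hypothetical, and records lacking the key field are simply skipped here, which is a simplification.

```python
def bucket_by(records, name):
    """Step 1: one record-list per distinct key value, in first-seen order."""
    buckets = {}
    for rec in records:
        if name in rec:
            buckets.setdefault(rec[name], []).append(rec)
    return buckets


def group_by(records, name):
    """Steps 1 and 3 only: emit the record-lists in first-seen key order."""
    buckets = bucket_by(records, name)
    return [rec for key in buckets for rec in buckets[key]]


def sort_by(records, name):
    """Steps 1, 2, and 3: sort the distinct keys, then emit each record-list."""
    buckets = bucket_by(records, name)
    return [rec for key in sorted(buckets) for rec in buckets[key]]


# The a and i fields of data/small as shown above.
small = [
    {"a": "pan", "i": 1},
    {"a": "eks", "i": 2},
    {"a": "wye", "i": 3},
    {"a": "eks", "i": 4},
    {"a": "wye", "i": 5},
]
print([r["i"] for r in group_by(small, "a")])  # [1, 2, 4, 3, 5]
print([r["i"] for r in sort_by(small, "a")])   # [2, 4, 1, 3, 5]
```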

group-like

$ mlr group-like --help
Usage: mlr group-like
Outputs records in batches having identical field names.

This groups together records having the same schema (i.e. same ordered list of field names) which is useful for making sense of time-ordered output as described in Record-heterogeneity — in particular, in preparation for CSV or pretty-print output.

$ mlr cat data/het.dkvp
resource=/path/to/file,loadsec=0.45,ok=true
record_count=100,resource=/path/to/file
resource=/path/to/second/file,loadsec=0.32,ok=true
record_count=150,resource=/path/to/second/file
resource=/some/other/path,loadsec=0.97,ok=false

$ mlr --opprint group-like data/het.dkvp
resource             loadsec ok
/path/to/file        0.45    true
/path/to/second/file 0.32    true
/some/other/path     0.97    false

record_count resource
100          /path/to/file
150          /path/to/second/file
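
One way to picture this is to key each record by its ordered list of field names. The Python sketch below does that; group_like is a hypothetical name, and the code is an illustration rather than Miller's implementation.

```python
def group_like(records):
    """Batch records by schema, i.e. by their ordered tuple of field names.

    Batches come out in the order in which each schema was first seen.
    """
    batches = {}
    for rec in records:
        batches.setdefault(tuple(rec), []).append(rec)
    return list(batches.values())


het = [
    {"resource": "/path/to/file", "loadsec": "0.45", "ok": "true"},
    {"record_count": "100", "resource": "/path/to/file"},
    {"resource": "/path/to/second/file", "loadsec": "0.32", "ok": "true"},
    {"record_count": "150", "resource": "/path/to/second/file"},
    {"resource": "/some/other/path", "loadsec": "0.97", "ok": "false"},
]
for batch in group_like(het):
    print(list(batch[0]), len(batch))
```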

having-fields

$ mlr having-fields --help
Usage: mlr having-fields [options]
Conditionally passes through records depending on each record's field names.
Options:
  --at-least      {comma-separated names}
  --which-are     {comma-separated names}
  --at-most       {comma-separated names}
  --all-matching  {regular expression}
  --any-matching  {regular expression}
  --none-matching {regular expression}
Examples:
  mlr having-fields --which-are amount,status,owner
  mlr having-fields --any-matching 'sda[0-9]'
  mlr having-fields --any-matching '"sda[0-9]"'
  mlr having-fields --any-matching '"sda[0-9]"i' (this is case-insensitive)

Similar to group-like, this retains records with specified schema.

$ mlr cat data/het.dkvp
resource=/path/to/file,loadsec=0.45,ok=true
record_count=100,resource=/path/to/file
resource=/path/to/second/file,loadsec=0.32,ok=true
record_count=150,resource=/path/to/second/file
resource=/some/other/path,loadsec=0.97,ok=false

$ mlr having-fields --at-least resource data/het.dkvp
resource=/path/to/file,loadsec=0.45,ok=true
record_count=100,resource=/path/to/file
resource=/path/to/second/file,loadsec=0.32,ok=true
record_count=150,resource=/path/to/second/file
resource=/some/other/path,loadsec=0.97,ok=false

$ mlr having-fields --which-are resource,ok,loadsec data/het.dkvp
resource=/path/to/file,loadsec=0.45,ok=true
resource=/path/to/second/file,loadsec=0.32,ok=true
resource=/some/other/path,loadsec=0.97,ok=false
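
The three name-list options can be read as set comparisons between the requested names and each record's field names. The Python sketch below follows that reading; it is consistent with the examples above (note that --which-are matched even though the names were given in a different order), but it is an illustration with hypothetical names, not Miller's code.

```python
def having_fields(records, names, mode):
    """Keep records whose field-name set relates to names as mode requires.

    at-least:  the record has every requested name (and maybe more)
    which-are: the record has exactly the requested names
    at-most:   the record has no names outside the requested ones
    """
    wanted = set(names)
    tests = {
        "at-least": lambda have: wanted <= have,
        "which-are": lambda have: wanted == have,
        "at-most": lambda have: have <= wanted,
    }
    keep = tests[mode]
    return [rec for rec in records if keep(set(rec))]


het = [
    {"resource": "/path/to/file", "loadsec": "0.45", "ok": "true"},
    {"record_count": "100", "resource": "/path/to/file"},
    {"resource": "/path/to/second/file", "loadsec": "0.32", "ok": "true"},
    {"record_count": "150", "resource": "/path/to/second/file"},
    {"resource": "/some/other/path", "loadsec": "0.97", "ok": "false"},
]
print(len(having_fields(het, ["resource"], "at-least")))
print(len(having_fields(het, ["resource", "ok", "loadsec"], "which-are")))
```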

head

$ mlr head --help
Usage: mlr head [options]
-n {count}    Head count to print; default 10
-g {a,b,c}    Optional group-by-field names for head counts
Passes through the first n records, optionally by category.
Without -g, ceases consuming more input (i.e. is fast) when n
records have been read.

Note that head is distinct from top: head shows the records which appear first in the data stream; top shows the field values which are numerically largest (or smallest).

$ mlr --opprint head -n 4 data/medium
a   b   i x                   y
pan pan 1 0.3467901443380824  0.7268028627434533
eks pan 2 0.7586799647899636  0.5221511083334797
wye wye 3 0.20460330576630303 0.33831852551664776
eks wye 4 0.38139939387114097 0.13418874328430463

$ mlr --opprint head -n 1 -g b data/medium
a   b   i  x                   y
pan pan 1  0.3467901443380824  0.7268028627434533
wye wye 3  0.20460330576630303 0.33831852551664776
eks zee 7  0.6117840605678454  0.1878849191181694
zee eks 17 0.29081949506712723 0.054478717073354166
wye hat 24 0.7286126830627567  0.19441962592638418

histogram

$ mlr histogram --help
Usage: mlr histogram [options]
-f {a,b,c}    Value-field names for histogram counts
--lo {lo}     Histogram low value
--hi {hi}     Histogram high value
--nbins {n}   Number of histogram bins
--auto        Automatically computes limits, ignoring --lo and --hi.
              Holds all values in memory before producing any output.
Just a histogram. Input values < lo or > hi are not counted.

This is just a histogram; there’s not too much to say here. A note about binning, by example: Suppose you use --lo 0.0 --hi 1.0 --nbins 10 -f x. The input numbers less than 0 or greater than 1 aren’t counted in any bin. Input numbers equal to 1 are counted in the last bin. That is, bin 0 has 0.0 ≤ x < 0.1, bin 1 has 0.1 ≤ x < 0.2, etc., but bin 9 has 0.9 ≤ x ≤ 1.0.

$ mlr --opprint put '$x2=$x**2;$x3=$x2*$x' then histogram -f x,x2,x3 --lo 0 --hi 1 --nbins 10 data/medium
bin_lo   bin_hi   x_count x2_count x3_count
0.000000 0.100000 1072    3231     4661
0.100000 0.200000 938     1254     1184
0.200000 0.300000 1037    988      845
0.300000 0.400000 988     832      676
0.400000 0.500000 950     774      576
0.500000 0.600000 1002    692      476
0.600000 0.700000 1007    591      438
0.700000 0.800000 1007    560      420
0.800000 0.900000 986     571      383
0.900000 1.000000 1013    507      341
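
The binning rule described above can be written out as a short Python sketch. The function names bin_index and histogram are hypothetical, and the arithmetic is one straightforward way to get the stated behavior (closed upper edge on the last bin only); Miller's own computation may differ in floating-point detail.

```python
def bin_index(x, lo, hi, nbins):
    """Return the bin number for x, or None if x is outside [lo, hi]."""
    if x < lo or x > hi:
        return None
    # Clamp so that x == hi lands in the last bin rather than one past it.
    return min(int((x - lo) * nbins / (hi - lo)), nbins - 1)


def histogram(values, lo, hi, nbins):
    """Count values per bin, ignoring those below lo or above hi."""
    counts = [0] * nbins
    for x in values:
        i = bin_index(x, lo, hi, nbins)
        if i is not None:
            counts[i] += 1
    return counts


print(histogram([-0.1, 0.0, 0.05, 0.15, 0.95, 1.0, 1.1], 0.0, 1.0, 10))
```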

join

$ mlr join --help
Usage: mlr join [options]
Joins records from specified left file name with records from all file names
at the end of the Miller argument list.
Functionality is essentially the same as the system "join" command, but for
record streams.
Options:
  -f {left file name}
  -j {a,b,c}   Comma-separated join-field names for output
  -l {a,b,c}   Comma-separated join-field names for left input file;
               defaults to -j values if omitted.
  -r {a,b,c}   Comma-separated join-field names for right input file(s);
               defaults to -j values if omitted.
  --lp {text}  Additional prefix for non-join output field names from
               the left file
  --rp {text}  Additional prefix for non-join output field names from
               the right file(s)
  --np         Do not emit paired records
  --ul         Emit unpaired records from the left file
  --ur         Emit unpaired records from the right file(s)
  -u           Enable unsorted input. In this case, the entire left file will
               be loaded into memory. Without -u, records must be sorted
               lexically by their join-field values, else not all records will
               be paired.
  --prepipe {command} As in main input options; see mlr --help for details.
               If you wish to use a prepipe command for the main input as well
               as here, it must be specified there as well as here.
File-format options default to those for the right file names on the Miller
argument list, but may be overridden for the left file as follows. Please see
the main "mlr --help" for more information on syntax for these arguments.
  -i {one of csv,dkvp,nidx,pprint,xtab}
  --irs {record-separator character}
  --ifs {field-separator character}
  --ips {pair-separator character}
  --repifs
  --repips
  --use-mmap
  --no-mmap
Please use "mlr --usage-separator-options" for information on specifying separators.
Please see http://johnkerl.org/miller/doc/reference.html for more information
including examples.

Examples:

Join larger table with IDs with smaller ID-to-name lookup table, showing only paired records:

$ mlr --icsvlite --opprint cat data/join-left-example.csv
id  name
100 alice
200 bob
300 carol
400 david
500 edgar

$ mlr --icsvlite --opprint cat data/join-right-example.csv
status  idcode
present 400
present 100
missing 200
present 100
present 200
missing 100
missing 200
present 300
missing 600
present 400
present 400
present 300
present 100
missing 400
present 200
present 200
present 200
present 200
present 400
present 300

$ mlr --icsvlite --opprint join -u -j id -r idcode -f data/join-left-example.csv data/join-right-example.csv
id  name  status
400 david present
100 alice present
200 bob   missing
100 alice present
200 bob   present
100 alice missing
200 bob   missing
300 carol present
400 david present
400 david present
300 carol present
100 alice present
400 david missing
200 bob   present
200 bob   present
200 bob   present
200 bob   present
400 david present
300 carol present

Same, but with sorting the input first:

$ mlr --icsvlite --opprint sort -f idcode then join -j id -r idcode -f data/join-left-example.csv data/join-right-example.csv
id  name  status
100 alice present
100 alice present
100 alice missing
100 alice present
200 bob   missing
200 bob   present
200 bob   missing
200 bob   present
200 bob   present
200 bob   present
200 bob   present
300 carol present
300 carol present
300 carol present
400 david present
400 david present
400 david present
400 david missing
400 david present

Same, but showing only unpaired records:

$ mlr --icsvlite --opprint join --np --ul --ur -u -j id -r idcode -f data/join-left-example.csv data/join-right-example.csv
status  idcode
missing 600

id  name
500 edgar
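
To make the paired/unpaired options concrete, here is a Python sketch of an unsorted (-u style) join: the left file is held in memory, keyed by join field, and the right records are streamed past it. The function and parameter names are hypothetical, the output ordering is chosen to match the examples above, and field-name collisions between the two sides are not handled (that is what --lp and --rp are for).

```python
def join_unsorted(left, right, left_field, right_field, out_field,
                  emit_paired=True, emit_left_unpaired=False,
                  emit_right_unpaired=False):
    """Join right records against an in-memory index of left records."""
    buckets = {}
    for lrec in left:
        buckets.setdefault(lrec[left_field], []).append(lrec)

    used = set()
    out = []
    for rrec in right:
        key = rrec[right_field]
        if key in buckets:
            used.add(key)
            if emit_paired:
                for lrec in buckets[key]:
                    # Join field first, then left non-join fields, then right.
                    merged = {out_field: key}
                    merged.update((k, v) for k, v in lrec.items() if k != left_field)
                    merged.update((k, v) for k, v in rrec.items() if k != right_field)
                    out.append(merged)
        elif emit_right_unpaired:
            out.append(rrec)

    if emit_left_unpaired:
        # Left records never matched are only known at end of stream.
        for key, lrecs in buckets.items():
            if key not in used:
                out.extend(lrecs)
    return out


left = [{"id": "100", "name": "alice"}, {"id": "500", "name": "edgar"}]
right = [{"status": "present", "idcode": "100"}, {"status": "missing", "idcode": "600"}]
print(join_unsorted(left, right, "id", "idcode", "id"))
print(join_unsorted(left, right, "id", "idcode", "id", emit_paired=False,
                    emit_left_unpaired=True, emit_right_unpaired=True))
```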

Use prefixing options to disambiguate between otherwise identical non-join field names:

$ mlr --csvlite --opprint cat data/self-join.csv data/self-join.csv
a b c
1 2 3
1 4 5
1 2 3
1 4 5

$ mlr --csvlite --opprint join -j a --lp left_ --rp right_ -f data/self-join.csv data/self-join.csv
a left_b left_c right_b right_c
1 2      3      2       3
1 4      5      2       3
1 2      3      4       5
1 4      5      4       5

Use zero join columns:

$ mlr --csvlite --opprint join -j "" --lp left_ --rp right_ -f data/self-join.csv data/self-join.csv
left_a left_b left_c right_a right_b right_c
1      2      3      1       2       3
1      4      5      1       2       3
1      2      3      1       4       5
1      4      5      1       4       5

label

$ mlr label --help
Usage: mlr label {new1,new2,new3,...}
Given n comma-separated names, renames the first n fields of each record to
have the respective name. (Fields past the nth are left with their original
names.) Particularly useful with --inidx or --implicit-csv-header, to give
useful names to otherwise integer-indexed fields.
Examples:
  "echo 'a b c d' | mlr --inidx --odkvp cat"       gives "1=a,2=b,3=c,4=d"
  "echo 'a b c d' | mlr --inidx --odkvp label s,t" gives "s=a,t=b,3=c,4=d"

See also rename.
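
The positional renaming can be sketched in Python as below. The function name label is a hypothetical stand-in for the verb, and the sketch does not try to model what happens when a new name collides with an existing later field name.

```python
def label(record, new_names):
    """Rename the first len(new_names) fields by position; keep the rest."""
    out = {}
    for i, (old_name, value) in enumerate(record.items()):
        out[new_names[i] if i < len(new_names) else old_name] = value
    return out


# Models: echo 'a b c d' | mlr --inidx --odkvp label s,t
print(label({"1": "a", "2": "b", "3": "c", "4": "d"}, ["s", "t"]))
```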

Example: Files such as /etc/passwd, /etc/group, and so on have implicit field names which are found in section-5 manpages. These field names may be made explicit as follows:

% grep -v '^#' /etc/passwd | mlr --nidx --fs : --opprint label name,password,uid,gid,gecos,home_dir,shell | head
name                  password uid gid gecos                         home_dir           shell
nobody                *        -2  -2  Unprivileged User             /var/empty         /usr/bin/false
root                  *        0   0   System Administrator          /var/root          /bin/sh
daemon                *        1   1   System Services               /var/root          /usr/bin/false
_uucp                 *        4   4   Unix to Unix Copy Protocol    /var/spool/uucp    /usr/sbin/uucico
_taskgated            *        13  13  Task Gate Daemon              /var/empty         /usr/bin/false
_networkd             *        24  24  Network Services              /var/networkd      /usr/bin/false
_installassistant     *        25  25  Install Assistant             /var/empty         /usr/bin/false
_lp                   *        26  26  Printing Services             /var/spool/cups    /usr/bin/false
_postfix              *        27  27  Postfix Mail Server           /var/spool/postfix /usr/bin/false

Likewise, if you have CSV/CSV-lite input data which has somehow been bereft of its header line, you can re-add a header line using --implicit-csv-header and label:

$ cat data/headerless.csv
John,23,present
Fred,34,present
Alice,56,missing
Carol,45,present

$ mlr --csv --rs lf --implicit-csv-header cat data/headerless.csv
1,2,3
John,23,present
Fred,34,present
Alice,56,missing
Carol,45,present

$ mlr --csv --rs lf --implicit-csv-header label name,age,status data/headerless.csv
name,age,status
John,23,present
Fred,34,present
Alice,56,missing
Carol,45,present

$ mlr --icsv --rs lf --implicit-csv-header --opprint label name,age,status data/headerless.csv
name  age status
John  23  present
Fred  34  present
Alice 56  missing
Carol 45  present

least-frequent

$ mlr least-frequent -h
Usage: mlr least-frequent [options]
Shows the least frequently occurring distinct values for specified field names.
The first entry is the statistical anti-mode; the remaining are runners-up.
Options:
-f {one or more comma-separated field names}. Required flag.
-n {count}. Optional flag defaulting to 10.
-b          Suppress counts; show only field values.
See also "mlr most-frequent".

$ mlr --opprint --from data/colored-shapes.dkvp least-frequent -f shape -n 5
shape    count
circle   2591
triangle 3372
square   4115

$ mlr --opprint --from data/colored-shapes.dkvp least-frequent -f shape,color -n 5
shape    color  count
circle   orange 68
triangle orange 107
square   orange 128
circle   green  287
circle   purple 289

$ mlr --opprint --from data/colored-shapes.dkvp least-frequent -f shape,color -n 5 -b
shape    color
circle   orange
triangle orange
square   orange
circle   green
circle   purple

See also most-frequent.

merge-fields

$ mlr merge-fields --help
Usage: mlr merge-fields [options]
Computes univariate statistics for each input record, accumulated across
specified fields.
Options:
-a {sum,count,...}  Names of accumulators. One or more of:
  count     Count instances of fields
  mode      Find most-frequently-occurring values for fields; first-found wins tie
  sum       Compute sums of specified fields
  mean      Compute averages (sample means) of specified fields
  stddev    Compute sample standard deviation of specified fields
  var       Compute sample variance of specified fields
  meaneb    Estimate error bars for averages (assuming no sample autocorrelation)
  skewness  Compute sample skewness of specified fields
  kurtosis  Compute sample kurtosis of specified fields
  min       Compute minimum values of specified fields
  max       Compute maximum values of specified fields
-f {a,b,c}  Value-field names on which to compute statistics. Requires -o.
-r {a,b,c}  Regular expressions for value-field names on which to compute
            statistics. Requires -o.
-c {a,b,c}  Substrings for collapse mode. All fields which have the same names
            after removing substrings will be accumulated together. Please see
            examples below.
-i          Use interpolated percentiles, like R's type=7; default like type=1.
-o {name}   Output field basename for -f/-r.
-k          Keep the input fields which contributed to the output statistics;
            the default is to omit them.
-F          Computes integerable things (e.g. count) in floating point.
Example input data: "a_in_x=1,a_out_x=2,b_in_y=4,b_out_x=8".
Example: mlr merge-fields -a sum,count -f a_in_x,a_out_x -o foo
  produces "b_in_y=4,b_out_x=8,foo_sum=3,foo_count=2" since "a_in_x,a_out_x" are
  summed over.
Example: mlr merge-fields -a sum,count -r in_,out_ -o bar
  produces "bar_sum=15,bar_count=4" since all four fields are summed over.
Example: mlr merge-fields -a sum,count -c in_,out_
  produces "a_x_sum=3,a_x_count=2,b_y_sum=4,b_y_count=1,b_x_sum=8,b_x_count=1"
  since "a_in_x" and "a_out_x" both collapse to "a_x", "b_in_y" collapses to
  "b_y", and "b_out_x" collapses to "b_x".

This is like mlr stats1 but all accumulation is done across fields within each given record: horizontal rather than vertical statistics, if you will.

Examples:

$ mlr --csvlite --opprint cat data/inout.csv
a_in a_out b_in b_out
436  490   446  195
526  320   963  780
220  888   705  831

$ mlr --csvlite --opprint merge-fields -a min,max,sum -c _in,_out data/inout.csv
a_min a_max a_sum b_min b_max b_sum
436   490   926   195   446   641
320   526   846   780   963   1743
220   888   1108  705   831   1536

$ mlr --csvlite --opprint merge-fields -k -a sum -c _in,_out data/inout.csv
a_in a_out b_in b_out a_sum b_sum
436  490   446  195   926   641
526  320   963  780   846   1743
220  888   705  831   1108  1536
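
Collapse mode can be pictured as follows: each field name containing one of the -c substrings has that substring removed, and fields that end up with the same short name are accumulated together. The Python sketch below models only that much, for a handful of accumulators and with numeric values already parsed; the function and parameter names are hypothetical.

```python
def merge_fields_collapse(record, substrings, accumulators=("sum", "count"), keep=False):
    """Accumulate across fields whose names agree after removing a substring."""
    funcs = {"sum": sum, "count": len, "min": min, "max": max}
    groups = {}
    out = {}
    for name, value in record.items():
        hit = next((s for s in substrings if s in name), None)
        if hit is None:
            out[name] = value              # not involved: pass through
        else:
            groups.setdefault(name.replace(hit, "", 1), []).append(value)
            if keep:                       # models -k
                out[name] = value
    for short_name, values in groups.items():
        for acc in accumulators:
            out[f"{short_name}_{acc}"] = funcs[acc](values)
    return out


print(merge_fields_collapse({"a_in_x": 1, "a_out_x": 2, "b_in_y": 4, "b_out_x": 8},
                            ["in_", "out_"]))
print(merge_fields_collapse({"a_in": 436, "a_out": 490, "b_in": 446, "b_out": 195},
                            ["_in", "_out"], accumulators=("min", "max", "sum")))
```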

most-frequent

$ mlr most-frequent -h
Usage: mlr most-frequent [options]
Shows the most frequently occurring distinct values for specified field names.
The first entry is the statistical mode; the remaining are runners-up.
Options:
-f {one or more comma-separated field names}. Required flag.
-n {count}. Optional flag defaulting to 10.
-b          Suppress counts; show only field values.
See also "mlr least-frequent".

$ mlr --opprint --from data/colored-shapes.dkvp most-frequent -f shape -n 5
shape    count
square   4115
triangle 3372
circle   2591

$ mlr --opprint --from data/colored-shapes.dkvp most-frequent -f shape,color -n 5
shape    color  count
square   red    1874
triangle red    1560
circle   red    1207
square   yellow 589
square   blue   589

$ mlr --opprint --from data/colored-shapes.dkvp most-frequent -f shape,color -n 5 -b
shape    color
square   red
triangle red
circle   red
square   yellow
square   blue

See also least-frequent.
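
Both verbs amount to counting distinct value combinations and ranking by count. The Python sketch below shows the idea on made-up data; the function name frequent and its least parameter are hypothetical, and the order among tied counts is whatever Python's stable sort gives, which need not match Miller's.

```python
from collections import Counter


def frequent(records, names, n=10, least=False):
    """Rank distinct value-tuples of the named fields by how often they occur."""
    counts = Counter(
        tuple(rec[name] for name in names)
        for rec in records
        if all(name in rec for name in names)
    )
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=not least)
    return [{**dict(zip(names, key)), "count": count} for key, count in ranked[:n]]


# Illustrative data only, not data/colored-shapes.dkvp.
shapes = [{"shape": s} for s in
          ["square", "triangle", "square", "circle", "triangle", "square"]]
print(frequent(shapes, ["shape"], n=5))
print(frequent(shapes, ["shape"], n=5, least=True))
```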

nest

$ mlr nest -h
Usage: mlr nest [options]
Explodes specified field values into separate fields/records, or reverses this.
Options:
  --explode,--implode   One is required.
  --values,--pairs      One is required.
  --across-records,--across-fields One is required.
  -f {field name}       Required.
  --nested-fs {string}  Defaults to ";". Field separator for nested values.
  --nested-ps {string}  Defaults to ":". Pair separator for nested key-value pairs.
Please use "mlr --usage-separator-options" for information on specifying separators.

Examples:

  mlr nest --explode --values --across-records -f x
  with input record "x=a;b;c,y=d" produces output records
    "x=a,y=d"
    "x=b,y=d"
    "x=c,y=d"
  Use --implode to do the reverse.

  mlr nest --explode --values --across-fields -f x
  with input record "x=a;b;c,y=d" produces output records
    "x_1=a,x_2=b,x_3=c,y=d"
  Use --implode to do the reverse.

  mlr nest --explode --pairs --across-records -f x
  with input record "x=a:1;b:2;c:3,y=d" produces output records
    "a=1,y=d"
    "b=2,y=d"
    "c=3,y=d"

  mlr nest --explode --pairs --across-fields -f x
  with input record "x=a:1;b:2;c:3,y=d" produces output records
    "a=1,b=2,c=3,y=d"

Notes:
* With --pairs, --implode doesn't make sense since the original field name has
  been lost.
* The combination "--implode --values --across-records" is non-streaming:
  no output records are produced until all input records have been read. In
  particular, this means it won't work in tail -f contexts. But all other flag
  combinations result in streaming (tail -f friendly) data processing.
* It's up to you to ensure that the nested-fs is distinct from your data's IFS:
  e.g. by default the former is semicolon and the latter is comma.
See also mlr reshape.
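
The explode examples in the help text can be reproduced with the following Python sketch. The function names are hypothetical, only three of the flag combinations are modeled, and a nested pair lacking the pair separator is not given any special treatment here.

```python
def explode_values_across_records(record, name, nested_fs=";"):
    """One output record per nested value, other fields repeated."""
    if name not in record:
        return [record]
    return [
        {k: (piece if k == name else v) for k, v in record.items()}
        for piece in record[name].split(nested_fs)
    ]


def explode_values_across_fields(record, name, nested_fs=";"):
    """Replace the field, in place, with name_1, name_2, ... fields."""
    out = {}
    for k, v in record.items():
        if k == name:
            for i, piece in enumerate(v.split(nested_fs), start=1):
                out[f"{name}_{i}"] = piece
        else:
            out[k] = v
    return out


def explode_pairs_across_fields(record, name, nested_fs=";", nested_ps=":"):
    """Replace the field, in place, with one field per nested key:value pair."""
    out = {}
    for k, v in record.items():
        if k == name:
            for pair in v.split(nested_fs):
                key, _, value = pair.partition(nested_ps)
                out[key] = value
        else:
            out[k] = v
    return out


print(explode_values_across_records({"x": "a;b;c", "y": "d"}, "x"))
print(explode_values_across_fields({"x": "a;b;c", "y": "d"}, "x"))
print(explode_pairs_across_fields({"x": "a:1;b:2;c:3", "y": "d"}, "x"))
```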

nothing

$ mlr nothing -h
Usage: mlr nothing [options]
Drops all input records. Useful for testing, or after tee/print/etc. have
produced other output.

put

$ mlr put --help
Usage: mlr put [options] {expression}
Adds/updates specified field(s). Expressions are semicolon-separated and must
either be assignments, or evaluate to boolean.  Booleans with following
statements in curly braces control whether those statements are executed;
booleans without following curly braces do nothing except side effects (e.g.
regex-captures into \1, \2, etc.).

Options:
-v: First prints the AST (abstract syntax tree) for the expression, which gives
    full transparency on the precedence and associativity rules of Miller's
    grammar.
-t: Print low-level parser-trace to stderr.
-q: Does not include the modified record in the output stream. Useful for when
    all desired output is in begin and/or end blocks.
-S: Keeps field values, or literals in the expression, as strings with no type
    inference to int or float.
-F: Keeps field values, or literals in the expression, as strings or floats
    with no inference to int.
--oflatsep {string}: Separator to use when flattening multi-level @-variables
    to output records for emit. Default ":".
-f {filename}: the DSL expression is taken from the specified file rather
    than from the command line. Outer single quotes wrapping the expression
    should not be placed in the file. If -f is specified more than once,
    all input files specified using -f are concatenated to produce the expression.
    (For example, you can define functions in one file and call them from another.)
-e {expression}: You can use this after -f to add an expression. Example use
    case: define functions/subroutines in a file you specify with -f, then call
    them with an expression you specify with -e.
--no-fflush: for emit, tee, print, and dump, don't call fflush() after every
    record.
Any of the output-format command-line flags (see mlr -h). Example: using
  mlr --icsv --opprint ... then put --ojson 'tee > "mytap-".$a.".dat", $*' then ...
the input is CSV, the output is pretty-print tabular, but the tee-file output
is written in JSON format.

Please use a dollar sign for field names and double-quotes for string
literals. If field names have special characters such as "." then you might
use braces, e.g. '${field.name}'. Miller built-in variables are
NF NR FNR FILENUM FILENAME PI E, and ENV["namegoeshere"] to access environment
variables. The environment-variable name may be an expression, e.g. a field
value.

Use # to comment to end of line.

Examples:
  mlr put '$y = log10($x); $z = sqrt($y)'
  mlr put '$x>0.0 { $y=log10($x); $z=sqrt($y) }' # does {...} only if $x > 0.0
  mlr put '$x>0.0;  $y=log10($x); $z=sqrt($y)'   # does all three statements
  mlr put '$a =~ "([a-z]+)_([0-9]+)";  $b = "left_\1"; $c = "right_\2"'
  mlr put '$a =~ "([a-z]+)_([0-9]+)" { $b = "left_\1"; $c = "right_\2" }'
  mlr put '$filename = FILENAME'
  mlr put '$colored_shape = $color . "_" . $shape'
  mlr put '$y = cos($theta); $z = atan2($y, $x)'
  mlr put '$name = sub($name, "http.*com"i, "")'
  mlr put -q '@sum += $x; end {emit @sum}'
  mlr put -q '@sum[$a] += $x; end {emit @sum, "a"}'
  mlr put -q '@sum[$a][$b] += $x; end {emit @sum, "a", "b"}'
  mlr put -q '@min=min(@min,$x);@max=max(@max,$x); end{emitf @min, @max}'
  mlr put -q 'isnull(@xmax) || $x > @xmax {@xmax=$x; @recmax=$*}; end {emit @recmax}'
  mlr put '
    $x = 1;
   #$y = 2;
    $z = 3
  '

Please see also 'mlr -k' for examples using redirected output.

Please see http://johnkerl.org/miller/doc/reference.html for more information
including function list. Or "mlr -f".
Please see in particular:
  http://www.johnkerl.org/miller/doc/reference.html#put

Features which put shares with filter

Please see Expression language for filter and put for more information about the expression language for mlr put.
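
As a bridge to that section, here is a Python model of one of the help examples above, mlr put -q '@sum[$a] += $x; end {emit @sum, "a"}'. It is meant only to show the shape of the computation: a main block that runs once per record, an out-of-stream variable that persists across records, and an end block that emits after the last record. The function name put_sum_by is hypothetical, and values are assumed to be already numeric.

```python
def put_sum_by(records, group_field, value_field):
    """Model of: @sum[$a] += $x in the main block, emit @sum, "a" at end."""
    sums = {}  # plays the role of the out-of-stream variable @sum
    for rec in records:
        # Main block: runs once per record. With -q the record itself
        # is not passed along to the output stream.
        key = rec[group_field]
        sums[key] = sums.get(key, 0) + rec[value_field]
    # End block: one output record per first-level key of @sum.
    return [{group_field: key, "sum": total} for key, total in sums.items()]


records = [{"a": "pan", "x": 1}, {"a": "eks", "x": 2}, {"a": "pan", "x": 3}]
print(put_sum_by(records, "a", "x"))
```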

regularize

$ mlr regularize --help
Usage: mlr regularize
For records seen earlier in the data stream with same field names in
a different order, outputs them with field names in the previously
encountered order.
Example: input records a=1,c=2,b=3, then e=4,d=5, then c=7,a=6,b=8
output as              a=1,c=2,b=3, then e=4,d=5, then a=6,c=7,b=8

This exists since hash-map software in various languages and tools encountered in the wild does not always print similar rows with fields in the same order: mlr regularize helps clean that up.
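
A minimal Python sketch of the idea: remember the field order of the first record seen for each set of field names, and reuse it for later records having the same set. The name regularize is a hypothetical stand-in for the verb; this is an illustration, not Miller's code.

```python
def regularize(records):
    """Reorder each record's fields to match the first record seen
    with the same set of field names."""
    first_seen = {}
    for rec in records:
        order = first_seen.setdefault(tuple(sorted(rec)), list(rec))
        yield {name: rec[name] for name in order}


stream = [
    {"a": 1, "c": 2, "b": 3},
    {"e": 4, "d": 5},
    {"c": 7, "a": 6, "b": 8},
]
for rec in regularize(stream):
    print(rec)
```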

See also reorder.

rename

$ mlr rename --help
Usage: mlr rename [options] {old1,new1,old2,new2,...}
Renames specified fields.
Options:
-r         Treat old field names as regular expressions. "ab", "a.*b"
           will match any field name containing the substring "ab" or
           matching "a.*b", respectively; anchors of the form "^ab$",
           "^a.*b$" may be used. New field names may be plain strings,
           or may contain capture groups of the form "\1" through
           "\9". Wrapping the regex in double quotes is optional, but
           is required if you wish to follow it with 'i' to indicate
           case-insensitivity.
-g         Do global replacement within each field name rather than
           first-match replacement.
Examples:
mlr rename old_name,new_name
mlr rename old_name_1,new_name_1,old_name_2,new_name_2
mlr rename -r 'Date_[0-9]+,Date'   Rename all such fields to be "Date"
mlr rename -r '"Date_[0-9]+",Date' Same
mlr rename -r 'Date_([0-9]+).*,\1' Rename all such fields to be of the form 20151015
mlr rename -r '"name"i,Name'       Rename "name", "Name", "NAME", etc. to "Name"
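
The -r and -g behavior described above can be approximated in Python with the re module. Note the assumptions: Python's regex dialect is not the one Miller uses, the function and parameter names are hypothetical, and if two fields end up with the same new name the later one simply overwrites the earlier one here.

```python
import re


def rename_regex(record, pattern, replacement, global_replace=False, ignore_case=False):
    """Rename fields by regex substitution on the field names.

    global_replace models -g; ignore_case models the "..."i suffix.
    The replacement may use capture groups such as \\1.
    """
    regex = re.compile(pattern, re.IGNORECASE if ignore_case else 0)
    count = 0 if global_replace else 1  # count=0 means replace all matches
    return {regex.sub(replacement, name, count=count): value
            for name, value in record.items()}


print(rename_regex({"Date_20151015_utc": "x", "other": "y"}, r"Date_([0-9]+).*", r"\1"))
print(rename_regex({"eye": 1}, "e", "X"))
print(rename_regex({"eye": 1}, "e", "X", global_replace=True))
```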

$ mlr --opprint cat data/small
a   b   i x                   y
pan pan 1 0.3467901443380824  0.7268028627434533
eks pan 2 0.7586799647899636  0.5221511083334797
wye wye 3 0.20460330576630303 0.33831852551664776
eks wye 4 0.38139939387114097 0.13418874328430463
wye pan 5 0.5732889198020006  0.8636244699032729

$ mlr --opprint rename i,INDEX,b,COLUMN2 data/small
a   COLUMN2 INDEX x                   y
pan pan     1     0.3467901443380824  0.7268028627434533
eks pan     2     0.7586799647899636  0.5221511083334797
wye wye     3     0.20460330576630303 0.33831852551664776
eks wye     4     0.38139939387114097 0.13418874328430463
wye pan     5     0.5732889198020006  0.8636244699032729

As discussed in Performance, sed is significantly faster than Miller at doing this. However, Miller is format-aware, so it knows to do renames only within specified field keys and not any others, nor in field values which may happen to contain the same pattern. Example:

$ sed 's/y/COLUMN5/g' data/small
a=pan,b=pan,i=1,x=0.3467901443380824,COLUMN5=0.7268028627434533
a=eks,b=pan,i=2,x=0.7586799647899636,COLUMN5=0.5221511083334797
a=wCOLUMN5e,b=wCOLUMN5e,i=3,x=0.20460330576630303,COLUMN5=0.33831852551664776
a=eks,b=wCOLUMN5e,i=4,x=0.38139939387114097,COLUMN5=0.13418874328430463
a=wCOLUMN5e,b=pan,i=5,x=0.5732889198020006,COLUMN5=0.8636244699032729

$ mlr rename y,COLUMN5 data/small
a=pan,b=pan,i=1,x=0.3467901443380824,COLUMN5=0.7268028627434533
a=eks,b=pan,i=2,x=0.7586799647899636,COLUMN5=0.5221511083334797
a=wye,b=wye,i=3,x=0.20460330576630303,COLUMN5=0.33831852551664776
a=eks,b=wye,i=4,x=0.38139939387114097,COLUMN5=0.13418874328430463
a=wye,b=pan,i=5,x=0.5732889198020006,COLUMN5=0.8636244699032729
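
The difference can be sketched in a few lines of Python, assuming DKVP input where each line is a comma-separated list of key=value pairs. The function name rename_key is hypothetical and this is not Miller's implementation; it only illustrates that a format-aware rename touches keys and leaves values alone.

```python
def rename_key(line, old, new):
    """Rename field key `old` to `new` in one DKVP line, leaving values untouched."""
    pairs = []
    for pair in line.split(","):
        key, sep, value = pair.partition("=")
        if key == old:
            key = new
        pairs.append(key + sep + value)
    return ",".join(pairs)


line = "a=wye,b=wye,i=3,x=0.20460330576630303,y=0.33831852551664776"
# Key-aware rename: only the field named "y" changes.
print(rename_key(line, "y", "COLUMN5"))
# Blanket textual substitution: every "y" changes, including those inside values.
print(line.replace("y", "COLUMN5"))
```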

See also label.

reorder

$ mlr reorder --help
Usage: mlr reorder [options]
-f {a,b,c}   Field names to reorder.
-e           Put specified field names at record end: default is to put
             them at record start.
Examples:
mlr reorder    -f a,b sends input record "d=4,b=2,a=1,c=3" to "a=1,b=2,d=4,c=3".
mlr reorder -e -f a,b sends input record "d=4,b=2,a=1,c=3" to "d=4,c=3,a=1,b=2".

This pivots specified field names to the start or end of the record, for example when you have highly multi-column data and you want to bring a field or two to the front of the line where you can scan them quickly.

$ mlr --opprint cat data/small
a   b   i x                   y
pan pan 1 0.3467901443380824  0.7268028627434533
eks pan 2 0.7586799647899636  0.5221511083334797
wye wye 3 0.20460330576630303 0.33831852551664776
eks wye 4 0.38139939387114097 0.13418874328430463
wye pan 5 0.5732889198020006  0.8636244699032729

$ mlr --opprint reorder -f i,b data/small
i b   a   x                   y
1 pan pan 0.3467901443380824  0.7268028627434533
2 pan eks 0.7586799647899636  0.5221511083334797
3 wye wye 0.20460330576630303 0.33831852551664776
4 wye eks 0.38139939387114097 0.13418874328430463
5 pan wye 0.5732889198020006  0.8636244699032729

$ mlr --opprint reorder -e -f i,b data/small
a   x                   y                   i b
pan 0.3467901443380824  0.7268028627434533  1 pan
eks 0.7586799647899636  0.5221511083334797  2 pan
wye 0.20460330576630303 0.33831852551664776 3 wye
eks 0.38139939387114097 0.13418874328430463 4 wye
wye 0.5732889198020006  0.8636244699032729  5 pan
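
As a rough sketch of the behavior shown in the help examples above, here is a small Python function operating on an ordered record. The name reorder and the at_end parameter are hypothetical stand-ins for the verb and its -e flag. How the moved fields are ordered among themselves in cases not shown above is an assumption here.

```python
def reorder(record, names, at_end=False):
    """Move the named fields to the start (or end) of an ordered record."""
    # Assumption: moved fields appear in the order given in `names`.
    moved = {k: record[k] for k in names if k in record}
    rest = {k: v for k, v in record.items() if k not in moved}
    return {**rest, **moved} if at_end else {**moved, **rest}


record = {"d": 4, "b": 2, "a": 1, "c": 3}
print(list(reorder(record, ["a", "b"])))               # field order, default placement
print(list(reorder(record, ["a", "b"], at_end=True)))  # field order with at_end
```

This relies on Python dictionaries preserving insertion order (Python 3.7 and later).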

repeat

$ mlr repeat --help
Usage: mlr repeat [options]
Copies input records to output records multiple times.
Options must be exactly one of the following:
  -n {repeat count}  Repeat each input record this many times.
  -f {field name}    Same, but take the repeat count from the specified
                     field name of each input record.
Example:
  echo x=0 | mlr repeat -n 4 then put '$x=urand()'
produces:
 x=0.488189
 x=0.484973
 x=0.704983
 x=0.147311
Example:
  echo a=1,b=2,c=3 | mlr repeat -f b
produces:
  a=1,b=2,c=3
  a=1,b=2,c=3
Example:
  echo a=1,b=2,c=3 | mlr repeat -f c
produces:
  a=1,b=2,c=3
  a=1,b=2,c=3
  a=1,b=2,c=3

This is useful in at least two ways: one, as a data-generator as in the above example using urand(); two, for reconstructing individual samples from data which has been count-aggregated:

$ cat data/repeat-example.dat
color=blue,count=5
color=red,count=4
color=green,count=3

$ mlr repeat -f count then cut -x -f count data/repeat-example.dat
color=blue
color=blue
color=blue
color=blue
color=blue
color=red
color=red
color=red
color=red
color=green
color=green
color=green

After expansion with repeat, such data can then be sent on to stats1 -a mode, or (if the data are numeric) to stats1 -a p10,p50,p90, etc.
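
The expand-then-summarize idea can be sketched in Python as follows. The helper repeat_by_field is hypothetical, and the mode is computed with collections.Counter rather than with Miller; the data mirror data/repeat-example.dat above.

```python
from collections import Counter


def repeat_by_field(records, field):
    """Yield a copy of each record as many times as its `field` value says."""
    for rec in records:
        for _ in range(int(rec[field])):
            yield dict(rec)


aggregated = [
    {"color": "blue", "count": 5},
    {"color": "red", "count": 4},
    {"color": "green", "count": 3},
]
# Reconstruct individual samples, dropping the count field as `cut -x -f count` does.
samples = [rec["color"] for rec in repeat_by_field(aggregated, "count")]
# Most frequent value among the reconstructed samples.
mode, occurrences = Counter(samples).most_common(1)[0]
print(len(samples), mode, occurrences)
```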

reshape

$ mlr reshape --help
Usage: mlr reshape [options]
Wide-to-long options:
  -i {input field names}   -o {key-field name,value-field name}
  -r {input field regexes} -o {key-field name,value-field name}
  These pivot/reshape the input data such that the input fields are removed
  and separate records are emitted for each key/value pair.
  Note: this works with tail -f and produces output records for each input
  record seen.
Long-to-wide options:
  -s {key-field name,value-field name}
  These pivot/reshape the input data to undo the wide-to-long operation.
  Note: this does not work with tail -f; it produces output records only after
  all input records have been read.

Examples:

  Input file "wide.txt":
    time       X           Y
    2009-01-01 0.65473572  2.4520609
    2009-01-02 -0.89248112 0.2154713
    2009-01-03 0.98012375  1.3179287

  mlr --pprint reshape -i X,Y -o item,value wide.txt
    time       item value
    2009-01-01 X    0.65473572
    2009-01-01 Y    2.4520609
    2009-01-02 X    -0.89248112
    2009-01-02 Y    0.2154713
    2009-01-03 X    0.98012375
    2009-01-03 Y    1.3179287

  mlr --pprint reshape -r '[A-Z]' -o item,value wide.txt
    time       item value
    2009-01-01 X    0.65473572
    2009-01-01 Y    2.4520609
    2009-01-02 X    -0.89248112
    2009-01-02 Y    0.2154713
    2009-01-03 X    0.98012375
    2009-01-03 Y    1.3179287

  Input file "long.txt":
    time       item value
    2009-01-01 X    0.65473572
    2009-01-01 Y    2.4520609
    2009-01-02 X    -0.89248112
    2009-01-02 Y    0.2154713
    2009-01-03 X    0.98012375
    2009-01-03 Y    1.3179287

  mlr --pprint reshape -s item,value long.txt
    time       X           Y
    2009-01-01 0.65473572  2.4520609
    2009-01-02 -0.89248112 0.2154713
    2009-01-03 0.98012375  1.3179287
See also mlr nest.
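
For readers who think more easily in code, here is a minimal Python sketch of both directions, assuming records are ordered dictionaries. The function names are hypothetical and details such as output field order or handling of missing fields may differ from Miller. Note that the long-to-wide direction has to hold all records before emitting anything, which matches the tail -f caveat in the help output.

```python
def wide_to_long(records, input_fields, key_name, value_name):
    """Emit one record per (input field, value) pair, keeping the other fields."""
    for rec in records:
        others = {k: v for k, v in rec.items() if k not in input_fields}
        for name in input_fields:
            if name in rec:
                yield {**others, key_name: name, value_name: rec[name]}


def long_to_wide(records, key_name, value_name):
    """Group records by their remaining fields and spread the key/value pairs out."""
    wide = {}
    for rec in records:
        others = {k: v for k, v in rec.items() if k not in (key_name, value_name)}
        group = tuple(others.items())
        wide.setdefault(group, dict(others))[rec[key_name]] = rec[value_name]
    return list(wide.values())


wide = [
    {"time": "2009-01-01", "X": "0.65473572", "Y": "2.4520609"},
    {"time": "2009-01-02", "X": "-0.89248112", "Y": "0.2154713"},
]
long = list(wide_to_long(wide, ["X", "Y"], "item", "value"))
print(long[0])
# Round trip: reshaping back should recover the original wide records.
print(long_to_wide(long, "item", "value") == wide)
```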

sample

$ mlr sample --help
Usage: mlr sample [options]
Reservoir sampling (subsampling without replacement), optionally by category.
-k {count}    Required: number of records to output, total, or by group if using -g.
-g {a,b,c}    Optional: group-by-field names for samples.
See also mlr bootstrap and mlr shuffle.

This is reservoir-sampling: select k items from n with uniform probability and no repeats in the sample. (If n is less than k, then of course only n samples are produced.) With -g {field names}, produce a k-sample for each distinct value of the specified field names.
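
Here is a minimal Python sketch of reservoir sampling, using the textbook "Algorithm R". It is offered as an illustration of the technique, not as a transcription of Miller's source; the function name and the rng parameter are hypothetical.

```python
import random


def reservoir_sample(records, k, rng=random):
    """Return up to k records chosen uniformly without replacement (Algorithm R)."""
    reservoir = []
    for n, rec in enumerate(records, start=1):
        if n <= k:
            # The first k records fill the reservoir.
            reservoir.append(rec)
        else:
            # The n-th record replaces a random slot with probability k/n.
            j = rng.randrange(n)
            if j < k:
                reservoir[j] = rec
    return reservoir


picked = reservoir_sample(range(1000), 4, random.Random(1))
print(picked)
# With fewer than k inputs, every input is returned.
print(reservoir_sample(range(3), 4))
```

Only k records are held in memory at any time, but the final contents of the reservoir are not known until the last input record has been seen.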

$ mlr --opprint sample -k 4 data/colored-shapes.dkvp
color  shape    flag i     u                   v                    w                   x
purple triangle 0    90122 0.9986871176198068  0.3037738877233719   0.5154934457238382  5.365962021016529
red    circle   0    3139  0.04835898233323954 -0.03964684310055758 0.5263660881848111  5.3758779366493625
orange triangle 0    67847 0.36746306902109926 0.5161574810505635   0.5176199566173642  3.1748088656576567
yellow square   1    33576 0.3098376725521097  0.8525628505287842   0.49774122460981685 4.494754378604669

$ mlr --opprint sample -k 4 data/colored-shapes.dkvp
color  shape  flag i     u                     v                   w                   x
blue   square 1    16783 0.09974385090654347   0.7243899920872646  0.5353718443278438  4.431057737383438
orange square 1    93291 0.5944176543007182    0.17744449786454086 0.49262281749172077 3.1548117990710653
yellow square 1    54436 0.5268161165014636    0.8785588662666121  0.5058773791931063  7.019185838783636
yellow square 1    55491 0.0025440267883102274 0.05474106287787284 0.5102729153751984  3.526301273728043

$ mlr --opprint sample -k 2 -g color data/colored-shapes.dkvp
color  shape    flag i     u                    v                   w                    x
yellow triangle 1    11    0.6321695890307647   0.9887207810889004  0.4364983936735774   5.7981881667050565
yellow square   1    917   0.8547010348386344   0.7356782810796262  0.4531511689924275   5.774541777078352
red    circle   1    4000  0.05490416175132373  0.07392337815122155 0.49416101516594396  5.355725080701707
red    square   0    87506 0.6357719216821314   0.6970867759393995  0.4940826462055272   6.351579417310387
purple triangle 0    14898 0.7800986870203719   0.23998073813992293 0.5014775988383656   3.141006771777843
purple triangle 0    151   0.032614487569017414 0.7346633365041219  0.7812143304483805   2.6831992610568047
green  triangle 1    126   0.1513010528347546   0.40346767294704544 0.051213231883952326 5.955109300797182
green  circle   0    17635 0.029856606049114442 0.4724542934246524  0.49529606749929744  5.239153910272168
blue   circle   1    1020  0.414263129226617    0.8304946402876182  0.13151094520189244  4.397873687920433
blue   triangle 0    220   0.441773289968473    0.44597731903759075 0.6329360666849821   4.3064608776550894
orange square   0    1885  0.8079311983747106   0.8685956833908394  0.3116410800256374   4.390864584500387
orange triangle 0    1533  0.32904497195507487  0.23168161807490417 0.8722623057355134   5.164071635714438

$ mlr --opprint sample -k 2 -g color then sort -f color data/colored-shapes.dkvp
color  shape    flag i     u                   v                    w                   x
blue   circle   0    215   0.7803586969333292  0.33146680638888126  0.04289047852629113 5.725365736377487
blue   circle   1    3616  0.8548431579124808  0.4989623130006362   0.3339426415875795  3.696785877560498
green  square   0    356   0.7674272008085286  0.341578843118008    0.4570224877870851  4.830320062215299
green  square   0    152   0.6684429446914862  0.016056003736548696 0.4656148241291592  5.434588759225423
orange triangle 0    587   0.5175826237797857  0.08989091493635304  0.9011709461770973  4.265854207755811
orange triangle 0    1533  0.32904497195507487 0.23168161807490417  0.8722623057355134  5.164071635714438
purple triangle 0    14192 0.5196327866973567  0.7860928603468063   0.4964368415453642  4.899167143824484
purple triangle 0    65    0.6842806710360729  0.5823723856331258   0.8014053396013747  5.805148213865135
red    square   1    2431  0.38378504852300466 0.11445015005595527  0.49355539228753786 5.146756570128739
red    triangle 0    57097 0.43763430414406546 0.3355450325004481   0.5322349637512487  4.144267240289442
yellow triangle 1    11    0.6321695890307647  0.9887207810889004   0.4364983936735774  5.7981881667050565
yellow square   1    158   0.41527900739142165 0.7118027080775757   0.4200799665161291  5.33279067554884

Note that no output is produced until all inputs are in. Another way to do sampling, which works in the streaming case, is mlr filter 'urand() < 0.001' where you tune the 0.001 to meet your needs.
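
The streaming alternative keeps each record independently with a fixed probability, so the number of records kept is only approximately that fraction of the input, unlike the exact k of reservoir sampling. A minimal Python sketch, with a hypothetical function name:

```python
import random


def bernoulli_filter(records, p, rng=random):
    """Yield each record independently with probability p, as it arrives."""
    for rec in records:
        # rng.random() is uniform on [0, 1), so this test passes with probability p.
        if rng.random() < p:
            yield rec


kept = list(bernoulli_filter(range(100000), 0.001, random.Random(0)))
print(len(kept))  # close to 100, but varies with the random seed
```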

sec2gmt

$ mlr sec2gmt -h
Usage: mlr sec2gmt {comma-separated list of field names}
Replaces a numeric field representing seconds since the epoch with the
corresponding GMT timestamp; leaves non-numbers as-is. This is nothing
more than a keystroke-saver for the sec2gmt function:
  mlr sec2gmt time1,time2
is the same as
  mlr put '$time1=sec2gmt($time1);$time2=sec2gmt($time2)'

sec2gmtdate

$ mlr sec2gmtdate -h
Usage: mlr sec2gmtdate {comma-separated list of field names}
Replaces a numeric field representing seconds since the epoch with the
corresponding GMT year-month-day timestamp; leaves non-numbers as-is.
This is nothing more than a keystroke-saver for the sec2gmtdate function:
  mlr sec2gmtdate time1,time2
is the same as
  mlr put '$time1=sec2gmtdate($time1);$time2=sec2gmtdate($time2)'
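
To make the conversion in sec2gmt and sec2gmtdate concrete, here is a Python sketch. It assumes the timestamp format is ISO 8601 with a trailing Z and whole-second resolution; that format is an assumption on my part, so check it against your Miller version. Non-numeric values pass through unchanged, as the help text describes.

```python
from datetime import datetime, timezone


def _format_epoch(value, fmt):
    """Format epoch seconds in GMT using `fmt`; return non-numbers as-is."""
    try:
        moment = datetime.fromtimestamp(float(value), tz=timezone.utc)
    except (TypeError, ValueError, OverflowError, OSError):
        return value
    return moment.strftime(fmt)


def sec2gmt(value):
    # Assumed output format: YYYY-MM-DDTHH:MM:SSZ
    return _format_epoch(value, "%Y-%m-%dT%H:%M:%SZ")


def sec2gmtdate(value):
    # Year-month-day only.
    return _format_epoch(value, "%Y-%m-%d")


print(sec2gmt(1500000000))
print(sec2gmtdate(1500000000))
print(sec2gmt("abc"))
```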

seqgen

$ mlr seqgen -h
Usage: mlr seqgen [options]
Produces a sequence of counters.  Discards the input record stream. Produces
output as specified by the following options:
-f {name} Field name for counters; default "i".
--start {number} Inclusive start value; default "1".
--stop  {number} Inclusive stop value; default "100".
--step  {number} Step value; default "1".
Start, stop, and/or step may be floating-point. Output is integer if start,
stop, and step are all integers. Step may be negative. It may not be zero
unless start == stop.

$ mlr seqgen --stop 10
i=1
i=2
i=3
i=4
i=5
i=6
i=7
i=8
i=9
i=10

$ mlr seqgen --start 20 --stop 40 --step 4
i=20
i=24
i=28
i=32
i=36
i=40

$ mlr seqgen --start 40 --stop 20 --step -4
i=40
i=36
i=32
i=28
i=24
i=20
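
The inclusive-bounds and negative-step rules can be sketched as a Python generator. The function name mirrors the verb but is otherwise hypothetical. Emitting a single value when step is zero and start equals stop is an assumption about what that special case produces.

```python
def seqgen(start=1, stop=100, step=1):
    """Yield counters from start to stop inclusive; step may be negative."""
    if step == 0:
        if start == stop:
            # Assumption: the zero-step special case emits the single value once.
            yield start
            return
        raise ValueError("step may not be zero unless start == stop")
    value = start
    # Count upward for positive steps and downward for negative ones.
    while (value <= stop) if step > 0 else (value >= stop):
        yield value
        value += step


print(list(seqgen(stop=10)))
print(list(seqgen(20, 40, 4)))
print(list(seqgen(40, 20, -4)))
```

With floating-point steps, repeated addition accumulates rounding error, so the final value can fall slightly short of or past the stop value.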

shuffle

$ mlr shuffle -h
Usage: mlr shuffle {no options}
Outputs records randomly permuted. No output records are produced until
all input records are read.
See also mlr bootstrap and mlr sample.

sort

$ mlr sort --help
Usage: mlr sort {flags}
Flags:
  -f  {comma-separated field names}  Lexical ascending
  -n  {comma-separated field names}  Numerical ascending; nulls sort last
  -nf {comma-separated field names}  Numerical ascending; nulls sort last
  -r  {comma-separated field names}  Lexical descending
  -nr {comma-separated field names}  Numerical descending; nulls sort first
Sorts records primarily by the first specified field, secondarily by the second
field, and so on.  Any records not having all specified sort keys will appear
at the end of the output, in the order they were encountered, regardless of the
specified sort order.
Example:
  mlr sort -f a,b -nr x,y,z
which is the same as:
  mlr sort -f a -f b -nr x -nr y -nr z

Example:

$ mlr --opprint sort -f a -nr x data/small
a   b   i x                   y
eks pan 2 0.7586799647899636  0.5221511083334797
eks wye 4 0.38139939387114097 0.13418874328430463
pan pan 1 0.3467901443380824  0.7268028627434533
wye pan 5 0.5732889198020006  0.8636244699032729
wye wye 3 0.20460330576630303 0.33831852551664776

Here’s an example of sorting log data: suppose multiple threads (labeled here by color) are all logging progress counts to a single log file. The log file is (by nature) chronological, so the progress of various threads is interleaved:

$ head -n 10 data/multicountdown.dat
upsec=0.002,color=green,count=1203
upsec=0.083,color=red,count=3817
upsec=0.188,color=red,count=3801
upsec=0.395,color=blue,count=2697
upsec=0.526,color=purple,count=953
upsec=0.671,color=blue,count=2684
upsec=0.899,color=purple,count=926
upsec=0.912,color=red,count=3798
upsec=1.093,color=blue,count=2662
upsec=1.327,color=purple,count=917

We can group these by thread by sorting on the thread ID (here, color). Since Miller’s sort is stable, this means that timestamps within each thread’s log data are still chronological:

$ head -n 20 data/multicountdown.dat | mlr --opprint sort -f color
upsec              color  count
0.395              blue   2697
0.671              blue   2684
1.093              blue   2662
2.064              blue   2659
2.2880000000000003 blue   2647
0.002              green  1203
1.407              green  1187
1.448              green  1177
2.313              green  1161
0.526              purple 953
0.899              purple 926
1.327              purple 917
1.703              purple 908
0.083              red    3817
0.188              red    3801
0.912              red    3798
1.416              red    3788
1.587              red    3782
1.601              red    3755
1.832              red    3717

Any records not having all specified sort keys will appear at the end of the output, in the order they were encountered, regardless of the specified sort order:

$ mlr sort -n  x data/sort-missing.dkvp
x=1
x=2
x=4
a=3

$ mlr sort -nr x data/sort-missing.dkvp
x=4
x=2
x=1
a=3
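
Both properties described in this section, stability and records with missing keys going last, can be sketched in Python for a single sort key. The function name and the sample records are hypothetical; I have not reproduced the contents of data/sort-missing.dkvp.

```python
def sort_records(records, key, numeric=False, descending=False):
    """Stable sort on one field; records lacking the field go last, in input order."""
    have = [r for r in records if key in r]
    lack = [r for r in records if key not in r]
    convert = float if numeric else str
    # list.sort is stable, also with reverse=True, so ties keep their input order.
    have.sort(key=lambda r: convert(r[key]), reverse=descending)
    return have + lack


records = [{"x": "1"}, {"a": "3"}, {"x": "4"}, {"x": "2"}]
print(sort_records(records, "x", numeric=True))
print(sort_records(records, "x", numeric=True, descending=True))

log = [
    {"upsec": "0.002", "color": "green"},
    {"upsec": "0.083", "color": "red"},
    {"upsec": "0.395", "color": "blue"},
    {"upsec": "0.912", "color": "red"},
]
# Sorting by color alone leaves each thread's entries in chronological order.
print(sort_records(log, "color"))
```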

stats1

$ mlr stats1 --help
Usage: mlr stats1 [options]
Computes univariate statistics for one or more given fields, accumulated across
the input record stream.
Options:
-a {sum,count,...}  Names of accumulators: p10 p25.2 p50 p98 p100 etc. and/or
                    one or more of:
  count     Count instances of fields
  mode      Find most-frequently-occurring values for fields; first-found wins tie
  sum       Compute sums of specified fields
  mean      Compute averages (sample means) of specified fields
  stddev    Compute sample standard deviation of specified fields
  var       Compute sample variance of specified fields
  meaneb    Estimate error bars for averages (assuming no sample autocorrelation)
  skewness  Compute sample skewness of specified fields
  kurtosis  Compute sample kurtosis of specified fields
  min       Compute minimum values of specified fields
  max       Compute maximum values of specified fields
-f {a,b,c}  Value-field names on which to compute statistics
-g {d,e,f}  Optional group-by-field names
-i          Use interpolated percentiles, like R's type=7; default like type=1.
-s          Print iterative stats. Useful in tail -f contexts (in which
            case please avoid pprint-format output since end of input
            stream will never be seen).
-F          Computes integerable things (e.g. count) in floating point.
Example: mlr stats1 -a min,p10,p50,p90,max -f value -g size,shape
Example: mlr stats1 -a count,mode -f size
Example: mlr stats1 -a count,mode -f size -g shape
Notes:
* p50 is a synonym for median.
* min and max output the same results as p0 and p100, respectively, but use
  less memory.
* count and mode allow text input; the rest require numeric input.
  In particular, 1 and 1.0 are distinct text for count and mode.
* When there are mode ties, the first-encountered datum wins.

These are simple univariate statistics on one or more number-valued fields (count and mode apply to non-numeric fields as well), optionally categorized by one or more other fields.

$ mlr --oxtab stats1 -a count,sum,min,p10,p50,mean,p90,max -f x,y data/medium
x_count 10000
x_sum   4986.019682
x_min   0.000045
x_p10   0.093322
x_p50   0.501159
x_mean  0.498602
x_p90   0.900794
x_max   0.999953
y_count 10000
y_sum   5062.057445
y_min   0.000088
y_p10   0.102132
y_p50   0.506021
y_mean  0.506206
y_p90   0.905366
y_max   0.999965

$ mlr --opprint stats1 -a mean -f x,y -g b then sort -f b data/medium
b   x_mean   y_mean
eks 0.506361 0.510293
hat 0.487899 0.513118
pan 0.497304 0.499599
wye 0.497593 0.504596
zee 0.504242 0.502997

$ mlr --opprint stats1 -a p50,p99 -f u,v -g color then put '$ur=$u_p99/$u_p50;$vr=$v_p99/$v_p50' data/colored-shapes.dkvp
color  u_p50    u_p99    v_p50    v_p99    ur       vr
yellow 0.501019 0.989046 0.520630 0.987034 1.974069 1.895845
red    0.485038 0.990054 0.492586 0.994444 2.041189 2.018823
purple 0.501319 0.988893 0.504571 0.988287 1.972582 1.958668
green  0.502015 0.990764 0.505359 0.990175 1.973574 1.959350
blue   0.525226 0.992655 0.485170 0.993873 1.889958 2.048505
orange 0.483548 0.993635 0.480913 0.989102 2.054884 2.056717

$ mlr --opprint count-distinct -f shape then sort -nr count data/colored-shapes.dkvp
shape    count
square   4115
triangle 3372
circle   2591

$ mlr --opprint stats1 -a mode -f color -g shape data/colored-shapes.dkvp
shape    color_mode
triangle red
square   red
circle   red
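
For reference, here is a Python sketch of a few of the accumulators listed in the help output: count, sum, mean, var, stddev, and meaneb. The sample variance uses the usual n-1 denominator. The formula for meaneb, the square root of var/n, is my assumption about what "error bars for averages" means here. Percentiles, skewness, and kurtosis are left out.

```python
import math


def stats1(values):
    """Compute count, sum, mean, sample variance, stddev, and meaneb."""
    n = len(values)
    if n == 0:
        raise ValueError("need at least one value")
    total = math.fsum(values)
    mean = total / n
    # Sample variance uses the n-1 denominator and needs at least two values.
    var = math.fsum((v - mean) ** 2 for v in values) / (n - 1) if n > 1 else None
    stddev = math.sqrt(var) if var is not None else None
    # Assumption: the error bar on the mean is sqrt(var / n).
    meaneb = math.sqrt(var / n) if var is not None else None
    return {"count": n, "sum": total, "mean": mean,
            "var": var, "stddev": stddev, "meaneb": meaneb}


result = stats1([1.0, 2.0, 3.0, 4.0])
for name, value in result.items():
    print(name, value)
```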

stats2

$ mlr stats2 --help
Usage: mlr stats2 [options]
Computes bivariate statistics for one or more given field-name pairs,
accumulated across the input record stream.
-a {linreg-ols,corr,...}  Names of accumulators: one or more of:
  linreg-pca   Linear regression using principal component analysis
  linreg-ols   Linear regression using ordinary least squares
  r2           Quality metric for linreg-ols (linreg-pca emits its own)
  logireg      Logistic regression
  corr         Sample correlation
  cov          Sample covariance
  covx         Sample-covariance matrix
-f {a,b,c,d}   Value-field name-pairs on which to compute statistics.
               There must be an even number of names.
-g {e,f,g}     Optional group-by-field names.
-v             Print additional output for linreg-pca.
-s             Print iterative stats. Useful in tail -f contexts (in which
               case please avoid pprint-format output since end of input
               stream will never be seen).
--fit          Rather than printing regression parameters, applies them to
               the input data to compute new fit fields. All input records are
               held in memory until end of input stream. Has effect only for
               linreg-ols, linreg-pca, and logireg.
Only one of -s or --fit may be used.
Example: mlr stats2 -a linreg-pca -f x,y
Example: mlr stats2 -a linreg-ols,r2 -f x,y -g size,shape
Example: mlr stats2 -a corr -f x,y

These are simple bivariate statistics on one or more pairs of number-valued fields, optionally categorized by one or more fields.

$ mlr --oxtab put '$x2=$x*$x; $xy=$x*$y; $y2=$y**2' then stats2 -a cov,corr -f x,y,y,y,x2,xy,x2,y2 data/medium
x_y_cov    0.000043
x_y_corr   0.000504
y_y_cov    0.084611
y_y_corr   1.000000
x2_xy_cov  0.041884
x2_xy_corr 0.630174
x2_y2_cov  -0.000310
x2_y2_corr -0.003425

$ mlr --opprint put '$x2=$x*$x; $xy=$x*$y; $y2=$y**2' then stats2 -a linreg-ols,r2 -f x,y,y,y,xy,y2 -g a data/medium
a   x_y_ols_m x_y_ols_b x_y_ols_n x_y_r2   y_y_ols_m y_y_ols_b y_y_ols_n y_y_r2   xy_y2_ols_m xy_y2_ols_b xy_y2_ols_n xy_y2_r2
pan 0.017026  0.500403  2081      0.000287 1.000000  0.000000  2081      1.000000 0.878132    0.119082    2081        0.417498
eks 0.040780  0.481402  1965      0.001646 1.000000  0.000000  1965      1.000000 0.897873    0.107341    1965        0.455632
wye -0.039153 0.525510  1966      0.001505 1.000000  0.000000  1966      1.000000 0.853832    0.126745    1966        0.389917
zee 0.002781  0.504307  2047      0.000008 1.000000  0.000000  2047      1.000000 0.852444    0.124017    2047        0.393566
hat -0.018621 0.517901  1941      0.000352 1.000000  0.000000  1941      1.000000 0.841230    0.135573    1941        0.368794
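
The cov, corr, linreg-ols, and r2 outputs correspond to standard textbook formulas, sketched below in Python. The function name and the returned key names are hypothetical and do not match Miller's output field names. The sketch needs at least two points and non-constant x and y.

```python
import math


def bivariate_stats(xs, ys):
    """Sample covariance, correlation, OLS slope/intercept, and R-squared."""
    n = len(xs)
    mean_x = math.fsum(xs) / n
    mean_y = math.fsum(ys) / n
    sxx = math.fsum((x - mean_x) ** 2 for x in xs)
    syy = math.fsum((y - mean_y) ** 2 for y in ys)
    sxy = math.fsum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx                      # OLS slope
    b = mean_y - m * mean_x            # OLS intercept
    ss_res = math.fsum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    return {
        "cov": sxy / (n - 1),          # sample covariance
        "corr": sxy / math.sqrt(sxx * syy),
        "m": m,
        "b": b,
        "r2": 1.0 - ss_res / syy,      # fraction of variance explained by the fit
    }


fit = bivariate_stats([0.0, 1.0, 2.0, 3.0], [1.1, 2.9, 5.2, 6.8])
for name, value in fit.items():
    print(name, value)
```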

Here’s an example of a simple line-fit. The x and y fields of the data/medium dataset are independent and uniformly distributed on the unit interval. Here we remove half the data and fit a line to what remains.


# Prepare input data:
mlr filter '($x<.5 && $y<.5) || ($x>.5 && $y>.5)' data/medium > data/medium-squares

# Do a linear regression and examine coefficients:
mlr --ofs newline stats2 -a linreg-pca -f x,y data/medium-squares
x_y_pca_m=1.014419
x_y_pca_b=0.000308
x_y_pca_quality=0.861354

# Option 1 to apply the regression coefficients and produce a linear fit:
#   Set x_y_pca_m and x_y_pca_b as shell variables:
eval $(mlr --ofs newline stats2 -a linreg-pca -f x,y data/medium-squares)
#   In addition to x and y, make a new yfit which is the line fit, then plot
#   using your favorite tool:
mlr --onidx put '$yfit='$x_y_pca_m'*$x+'$x_y_pca_b then cut -x -f a,b,i data/medium-squares \
  | pgr -p -title 'linreg-pca example' -xmin 0 -xmax 1 -ymin 0 -ymax 1

# Option 2 to apply the regression coefficients and produce a linear fit: use --fit option
mlr --onidx stats2 -a linreg-pca --fit -f x,y then cut -x -f a,b,i data/medium-squares \
  | pgr -p -title 'linreg-pca example' -xmin 0 -xmax 1 -ymin 0 -ymax 1

I use pgr for plotting.

(Thanks Drew Kunas for a good conversation about PCA!)

Here’s an example estimating time-to-completion for a set of jobs. Input data comes from a log file, with number of work units left to do in the count field and accumulated seconds in the upsec field, labeled by the color field:

$ head -n 10 data/multicountdown.dat
upsec=0.002,color=green,count=1203
upsec=0.083,color=red,count=3817
upsec=0.188,color=red,count=3801
upsec=0.395,color=blue,count=2697
upsec=0.526,color=purple,count=953
upsec=0.671,color=blue,count=2684
upsec=0.899,color=purple,count=926
upsec=0.912,color=red,count=3798
upsec=1.093,color=blue,count=2662
upsec=1.327,color=purple,count=917

We can do a linear regression on count remaining as a function of time: with c = m*u+b we want to find the time when the count goes to zero, i.e. u=-b/m.

$ mlr --oxtab stats2 -a linreg-pca -f upsec,count -g color then put '$donesec = -$upsec_count_pca_b/$upsec_count_pca_m' data/multicountdown.dat
color                   green
upsec_count_pca_m       -32.756917
upsec_count_pca_b       1213.722730
upsec_count_pca_n       24
upsec_count_pca_quality 0.999984
donesec                 37.052410

color                   red
upsec_count_pca_m       -37.367646
upsec_count_pca_b       3810.133400
upsec_count_pca_n       30
upsec_count_pca_quality 0.999989
donesec                 101.963431

color                   blue
upsec_count_pca_m       -29.231212
upsec_count_pca_b       2698.932820
upsec_count_pca_n       25
upsec_count_pca_quality 0.999959
donesec                 92.330514

color                   purple
upsec_count_pca_m       -39.030097
upsec_count_pca_b       979.988341
upsec_count_pca_n       21
upsec_count_pca_quality 0.999991
donesec                 25.108529
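
The arithmetic behind donesec can be sketched in Python. The line fit below follows the principal axis of the 2x2 scatter matrix, which is the textbook form of a PCA (orthogonal) line fit. I am assuming this is close to what linreg-pca computes, not quoting Miller's code. The function names are hypothetical.

```python
import math


def pca_line_fit(us, cs):
    """Fit c = m*u + b along the principal axis of the (u, c) scatter."""
    n = len(us)
    mean_u = math.fsum(us) / n
    mean_c = math.fsum(cs) / n
    suu = math.fsum((u - mean_u) ** 2 for u in us)
    scc = math.fsum((c - mean_c) ** 2 for c in cs)
    suc = math.fsum((u - mean_u) * (c - mean_c) for u, c in zip(us, cs))
    if suc == 0:
        raise ValueError("this slope formula needs a nonzero covariance")
    # Slope of the eigenvector belonging to the larger eigenvalue.
    m = (scc - suu + math.hypot(scc - suu, 2 * suc)) / (2 * suc)
    b = mean_c - m * mean_u
    return m, b


def time_to_zero(m, b):
    """Solve 0 = m*u + b for u."""
    return -b / m


# Synthetic countdown: 1000 units remaining, 40 units completed per second.
m, b = pca_line_fit([0.0, 1.0, 2.0, 3.0], [1000.0, 960.0, 920.0, 880.0])
print(m, b, time_to_zero(m, b))

# The same zero-crossing formula applied to the green coefficients printed above.
print(time_to_zero(-32.756917, 1213.722730))
```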

step

$ mlr step --help
Usage: mlr step [options]
Computes values dependent on the previous record, optionally grouped
by category.

Options:
-a {delta,rsum,...}   Names of steppers: comma-separated, one or more of:
  delta    Compute differences in field(s) between successive records
  shift    Include value(s) in field(s) from previous record, if any
  from-first Compute differences in field(s) from first record
  ratio    Compute ratios in field(s) between successive records
  rsum     Compute running sums of field(s) between successive records
  counter  Count instances of field(s) between successive records
  ewma     Exponentially weighted moving average over successive records
-f {a,b,c} Value-field names on which to compute statistics
-g {d,e,f} Optional group-by-field names
-F         Computes integerable things (e.g. counter) in floating point.
-d {x,y,z} Weights for ewma. 1 means current sample gets all weight (no
           smoothing), near under 1 is light smoothing, near over 0 is
           heavy smoothing. Multiple weights may be specified, e.g.
           "mlr step -a ewma -f sys_load -d 0.01,0.1,0.9". Default if omitted
           is "-d 0.5".
-o {a,b,c} Custom suffixes for EWMA output fields. If omitted, these default to
           the -d values. If supplied, the number of -o values must be the same
           as the number of -d values.

Examples:
  mlr step -a rsum -f request_size
  mlr step -a delta -f request_size -g hostname
  mlr step -a ewma -d 0.1,0.9 -f x,y
  mlr step -a ewma -d 0.1,0.9 -o smooth,rough -f x,y
  mlr step -a ewma -d 0.1,0.9 -o smooth,rough -f x,y -g group_name

Please see http://johnkerl.org/miller/doc/reference.html#filter or
https://en.wikipedia.org/wiki/Moving_average#Exponential_moving_average
for more information on EWMA.

Most Miller commands are record-at-a-time, with the exception of stats1, stats2, and histogram, which compute aggregate output. The step command is intermediate: it allows the option of adding fields which are functions of fields from previous records. Rsum is short for running sum.

$ mlr --opprint step -a shift,delta,rsum,counter -f x data/medium | head -15
a   b   i     x                      y                      x_shift                x_delta   x_rsum      x_counter
pan pan 1     0.3467901443380824     0.7268028627434533     -                      0         0.346790    1
eks pan 2     0.7586799647899636     0.5221511083334797     0.3467901443380824     0.411890  1.105470    2
wye wye 3     0.20460330576630303    0.33831852551664776    0.7586799647899636     -0.554077 1.310073    3
eks wye 4     0.38139939387114097    0.13418874328430463    0.20460330576630303    0.176796  1.691473    4
wye pan 5     0.5732889198020006     0.8636244699032729     0.38139939387114097    0.191890  2.264762    5
zee pan 6     0.5271261600918548     0.49322128674835697    0.5732889198020006     -0.046163 2.791888    6
eks zee 7     0.6117840605678454     0.1878849191181694     0.5271261600918548     0.084658  3.403672    7
zee wye 8     0.5985540091064224     0.976181385699006      0.6117840605678454     -0.013230 4.002226    8
hat wye 9     0.03144187646093577    0.7495507603507059     0.5985540091064224     -0.567112 4.033668    9
pan wye 10    0.5026260055412137     0.9526183602969864     0.03144187646093577    0.471184  4.536294    10
pan pan 11    0.7930488423451967     0.6505816637259333     0.5026260055412137     0.290423  5.329343    11
zee pan 12    0.3676141320555616     0.23614420670296965    0.7930488423451967     -0.425435 5.696957    12
eks pan 13    0.4915175580479536     0.7709126592971468     0.3676141320555616     0.123903  6.188474    13
eks zee 14    0.5207382318405251     0.34141681118811673    0.4915175580479536     0.029221  6.709213    14

$ mlr --opprint step -a shift,delta,rsum,counter -f x -g a data/medium | head -15
a   b   i     x                      y                      x_shift                x_delta   x_rsum      x_counter
pan pan 1     0.3467901443380824     0.7268028627434533     -                      0         0.346790    1
eks pan 2     0.7586799647899636     0.5221511083334797     -                      0         0.758680    1
wye wye 3     0.20460330576630303    0.33831852551664776    -                      0         0.204603    1
eks wye 4     0.38139939387114097    0.13418874328430463    0.7586799647899636     -0.377281 1.140079    2
wye pan 5     0.5732889198020006     0.8636244699032729     0.20460330576630303    0.368686  0.777892    2
zee pan 6     0.5271261600918548     0.49322128674835697    -                      0         0.527126    1
eks zee 7     0.6117840605678454     0.1878849191181694     0.38139939387114097    0.230385  1.751863    3
zee wye 8     0.5985540091064224     0.976181385699006      0.5271261600918548     0.071428  1.125680    2
hat wye 9     0.03144187646093577    0.7495507603507059     -                      0         0.031442    1
pan wye 10    0.5026260055412137     0.9526183602969864     0.3467901443380824     0.155836  0.849416    2
pan pan 11    0.7930488423451967     0.6505816637259333     0.5026260055412137     0.290423  1.642465    3
zee pan 12    0.3676141320555616     0.23614420670296965    0.5985540091064224     -0.230940 1.493294    3
eks pan 13    0.4915175580479536     0.7709126592971468     0.6117840605678454     -0.120267 2.243381    4
eks zee 14    0.5207382318405251     0.34141681118811673    0.4915175580479536     0.029221  2.764119    5
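
The shift, delta, rsum, and counter steppers, with optional grouping, can be sketched in Python as below. The function name step and its parameters are hypothetical. The sketch keeps one small piece of state per group, which is why it can run record-at-a-time. The "-" placeholder and the zero first delta are copied from the output above.

```python
def step(records, field, group_by=None):
    """Add shift, delta, rsum, and counter fields computed from prior records."""
    state = {}  # per group: (previous value, running sum, count)
    for rec in records:
        group = rec.get(group_by) if group_by else None
        prev, rsum, count = state.get(group, (None, 0.0, 0))
        value = float(rec[field])
        out = dict(rec)
        out[field + "_shift"] = "-" if prev is None else prev
        out[field + "_delta"] = 0.0 if prev is None else value - prev
        out[field + "_rsum"] = rsum + value
        out[field + "_counter"] = count + 1
        state[group] = (value, rsum + value, count + 1)
        yield out


# First four records of the example data, keeping only the fields used here.
small = [
    {"a": "pan", "x": 0.3467901443380824},
    {"a": "eks", "x": 0.7586799647899636},
    {"a": "wye", "x": 0.20460330576630303},
    {"a": "eks", "x": 0.38139939387114097},
]
for out in step(small, "x", group_by="a"):
    print(out)
```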

$ mlr --opprint step -a ewma -f x -d 0.1,0.9 ../doc/data/medium | head -15
a   b   i     x                      y                      x_ewma_0.1 x_ewma_0.9
pan pan 1     0.3467901443380824     0.7268028627434533     0.346790   0.346790
eks pan 2     0.7586799647899636     0.5221511083334797     0.387979   0.717491
wye wye 3     0.20460330576630303    0.33831852551664776    0.369642   0.255892
eks wye 4     0.38139939387114097    0.13418874328430463    0.370817   0.368849
wye pan 5     0.5732889198020006     0.8636244699032729     0.391064   0.552845
zee pan 6     0.5271261600918548     0.49322128674835697    0.404671   0.529698
eks zee 7     0.6117840605678454     0.1878849191181694     0.425382   0.603575
zee wye 8     0.5985540091064224     0.976181385699006      0.442699   0.599056
hat wye 9     0.03144187646093577    0.7495507603507059     0.401573   0.088203
pan wye 10    0.5026260055412137     0.9526183602969864     0.411679   0.461184
pan pan 11    0.7930488423451967     0.6505816637259333     0.449816   0.759862
zee pan 12    0.3676141320555616     0.23614420670296965    0.441596   0.406839
eks pan 13    0.4915175580479536     0.7709126592971468     0.446588   0.483050
eks zee 14    0.5207382318405251     0.34141681118811673    0.454003   0.516969

$ mlr --opprint step -a ewma -f x -d 0.1,0.9 -o smooth,rough ../doc/data/medium | head -15
a   b   i     x                      y                      x_ewma_smooth x_ewma_rough
pan pan 1     0.3467901443380824     0.7268028627434533     0.346790      0.346790
eks pan 2     0.7586799647899636     0.5221511083334797     0.387979      0.717491
wye wye 3     0.20460330576630303    0.33831852551664776    0.369642      0.255892
eks wye 4     0.38139939387114097    0.13418874328430463    0.370817      0.368849
wye pan 5     0.5732889198020006     0.8636244699032729     0.391064      0.552845
zee pan 6     0.5271261600918548     0.49322128674835697    0.404671      0.529698
eks zee 7     0.6117840605678454     0.1878849191181694     0.425382      0.603575
zee wye 8     0.5985540091064224     0.976181385699006      0.442699      0.599056
hat wye 9     0.03144187646093577    0.7495507603507059     0.401573      0.088203
pan wye 10    0.5026260055412137     0.9526183602969864     0.411679      0.461184
pan pan 11    0.7930488423451967     0.6505816637259333     0.449816      0.759862
zee pan 12    0.3676141320555616     0.23614420670296965    0.441596      0.406839
eks pan 13    0.4915175580479536     0.7709126592971468     0.446588      0.483050
eks zee 14    0.5207382318405251     0.34141681118811673    0.454003      0.516969
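
The EWMA columns above are consistent with the usual recurrence: the first output equals the first input, and each later output is alpha times the current input plus (1 - alpha) times the previous output. A minimal Python sketch, with hypothetical names, that reproduces the first few values:

```python
def ewma(values, alpha):
    """Exponentially weighted moving average; alpha is the weight on the current sample."""
    smoothed = []
    prev = None
    for v in values:
        # The first sample seeds the average; later samples are blended in.
        prev = v if prev is None else alpha * v + (1 - alpha) * prev
        smoothed.append(prev)
    return smoothed


xs = [0.3467901443380824, 0.7586799647899636, 0.20460330576630303]
print(ewma(xs, 0.1))  # heavy smoothing
print(ewma(xs, 0.9))  # light smoothing
```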

Example deriving uptime-delta from system uptime:

$ each 10 uptime | mlr -p step -a delta -f 11
...
20:08 up 36 days, 10:38, 5 users, load averages: 1.42 1.62 1.73 0.000000
20:08 up 36 days, 10:38, 5 users, load averages: 1.55 1.64 1.74 0.020000
20:08 up 36 days, 10:38, 7 users, load averages: 1.58 1.65 1.74 0.010000
20:08 up 36 days, 10:38, 9 users, load averages: 1.78 1.69 1.76 0.040000
20:08 up 36 days, 10:39, 9 users, load averages: 2.12 1.76 1.78 0.070000
20:08 up 36 days, 10:39, 9 users, load averages: 2.51 1.85 1.81 0.090000
20:08 up 36 days, 10:39, 8 users, load averages: 2.79 1.92 1.83 0.070000
20:08 up 36 days, 10:39, 4 users, load averages: 2.64 1.90 1.83 -0.020000

tac

$ mlr tac --help
Usage: mlr tac
Prints records in reverse order from the order in which they were encountered.

Prints the records in the input stream in reverse order. Note: this requires Miller to retain all input records in memory before any output records are produced.

$ mlr --icsv --opprint cat data/a.csv
a b c
1 2 3
4 5 6

$ mlr --icsv --opprint cat data/b.csv
a b c
7 8 9

$ mlr --icsv --opprint tac data/a.csv data/b.csv
a b c
7 8 9
4 5 6
1 2 3

$ mlr --icsv --opprint put '$filename=FILENAME' then tac data/a.csv data/b.csv
a b c filename
7 8 9 data/b.csv
4 5 6 data/a.csv
1 2 3 data/a.csv
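
The memory note above follows from the operation itself: the last record must be seen before the first can be written. A minimal Python sketch of the idea (the function name tac and the record dicts are illustrative only, not Miller internals):

```python
def tac(records):
    """Yield records in reverse order of arrival."""
    retained = list(records)  # the whole stream is held in memory here
    for rec in reversed(retained):
        yield rec

# Records from data/a.csv followed by data/b.csv, as in the example above.
a_csv = [{"a": 1, "b": 2, "c": 3}, {"a": 4, "b": 5, "c": 6}]
b_csv = [{"a": 7, "b": 8, "c": 9}]
reversed_a_values = [rec["a"] for rec in tac(a_csv + b_csv)]
print(reversed_a_values)
```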

tail

$ mlr tail --help
Usage: mlr tail [options]
-n {count}    Tail count to print; default 10
-g {a,b,c}    Optional group-by-field names for tail counts
Passes through the last n records, optionally by category.

Prints the last n records in the input stream, optionally by category.

$ mlr --opprint tail -n 4 data/colored-shapes.dkvp
color  shape    flag i     u                    v                   w                   x
blue   square   1    99974 0.6189062525431605   0.2637962404841453  0.5311465405784674  6.210738209085753
blue   triangle 0    99976 0.008110504040268474 0.8267274952432482  0.4732962944898885  6.146956761817328
yellow triangle 0    99990 0.3839424618160777   0.55952913620132    0.5113763011485609  4.307973891915119
yellow circle   1    99994 0.764950884927175    0.25284227383991364 0.49969878539567425 5.013809741826425

$ mlr --opprint tail -n 1 -g shape data/colored-shapes.dkvp
color  shape    flag i     u                  v                   w                   x
yellow triangle 0    99990 0.3839424618160777 0.55952913620132    0.5113763011485609  4.307973891915119
blue   square   1    99974 0.6189062525431605 0.2637962404841453  0.5311465405784674  6.210738209085753
yellow circle   1    99994 0.764950884927175  0.25284227383991364 0.49969878539567425 5.013809741826425
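
The per-category behavior can be sketched in Python with one bounded queue per group. This is an illustration under assumptions, not Miller's code: the function name tail and the sample rows are hypothetical, every record is assumed to carry the group-by fields, and groups are written in order of first appearance, which is consistent with the output above.

```python
from collections import deque

def tail(records, n=10, group_by=None):
    """Return the last n records, per group when group_by is given."""
    last = {}  # group key -> bounded queue; dicts keep first-seen key order
    for rec in records:
        key = tuple(rec[f] for f in group_by) if group_by else ()
        # A deque with maxlen=n silently discards its oldest entry when full.
        last.setdefault(key, deque(maxlen=n)).append(rec)
    return [rec for queue in last.values() for rec in queue]

# Illustrative rows, not taken from data/colored-shapes.dkvp.
rows = [
    {"shape": "triangle", "i": 1},
    {"shape": "square", "i": 2},
    {"shape": "triangle", "i": 3},
    {"shape": "circle", "i": 4},
    {"shape": "square", "i": 5},
]
per_shape = tail(rows, n=1, group_by=["shape"])
print([(r["shape"], r["i"]) for r in per_shape])
```

Unlike tac, only n records per group need to be retained at any time.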

tee

$ mlr tee --help
Usage: mlr tee [options] {filename}
Passes through input records (like mlr cat) but also writes to specified output
file, using output-format flags from the command line (e.g. --ocsv). See also
the "tee" keyword within mlr put, which allows data-dependent filenames.
Options:
-a:          append to existing file, if any, rather than overwriting.
--no-fflush: don't call fflush() after every record.
Any of the output-format command-line flags (see mlr -h). Example: using
  mlr --icsv --opprint put '...' then tee --ojson ./mytap.dat then stats1 ...
the input is CSV, the output is pretty-print tabular, but the tee-file output
is written in JSON format.

top

$ mlr top --help
Usage: mlr top [options]
-f {a,b,c}    Value-field names for top counts.
-g {d,e,f}    Optional group-by-field names for top counts.
-n {count}    How many records to print per category; default 1.
-a            Print all fields for top-value records; default is
              to print only value and group-by fields. Requires a single
              value-field name only.
--min         Print top smallest values; default is top largest values.
-F            Keep top values as floats even if they look like integers.
Prints the n records with smallest/largest values at specified fields,
optionally by category.

Note that top is distinct from head: head shows the records which appear first in the data stream, whereas top shows the values which are numerically largest (or smallest) in the specified fields.

$ mlr --opprint top -n 4 -f x data/medium
top_idx x_top
1       0.999953
2       0.999823
3       0.999733
4       0.999563

$ mlr --opprint top -n 2 -f x -g a then sort -f a data/medium
a   top_idx x_top
eks 1       0.998811
eks 2       0.998534
hat 1       0.999953
hat 2       0.999733
pan 1       0.999403
pan 2       0.999044
wye 1       0.999823
wye 2       0.999264
zee 1       0.999490
zee 2       0.999438
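
As a rough Python sketch of what top computes (value fields only, without the -a option): collect the values of one field per group, then keep the n largest or smallest. The function name top, its parameters, and the sample rows are hypothetical.

```python
import heapq

def top(records, field, n=1, group_by=None, smallest=False):
    """Return {group key: [top n values of field]}."""
    values = {}
    for rec in records:
        key = tuple(rec[f] for f in group_by) if group_by else ()
        values.setdefault(key, []).append(float(rec[field]))
    pick = heapq.nsmallest if smallest else heapq.nlargest
    return {key: pick(n, vals) for key, vals in values.items()}

# Illustrative rows, not taken from data/medium.
rows = [
    {"a": "pan", "x": 0.35}, {"a": "eks", "x": 0.76},
    {"a": "pan", "x": 0.50}, {"a": "eks", "x": 0.38},
    {"a": "pan", "x": 0.79},
]
top2 = top(rows, "x", n=2, group_by=["a"])
print(top2)
```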

uniq

$ mlr uniq --help
Usage: mlr uniq [options]
-g {d,e,f}    Group-by-field names for uniq counts.
-c            Show repeat counts in addition to unique values.
-n            Show only the number of distinct values.
Prints distinct values for specified field names. With -c, same as
count-distinct. For uniq, -f is a synonym for -g.

$ wc -l data/colored-shapes.dkvp
   10078 data/colored-shapes.dkvp

$ mlr uniq -g color,shape data/colored-shapes.dkvp
color=yellow,shape=triangle
color=red,shape=square
color=red,shape=circle
color=purple,shape=triangle
color=yellow,shape=circle
color=purple,shape=square
color=yellow,shape=square
color=red,shape=triangle
color=green,shape=triangle
color=green,shape=square
color=blue,shape=circle
color=blue,shape=triangle
color=purple,shape=circle
color=blue,shape=square
color=green,shape=circle
color=orange,shape=triangle
color=orange,shape=square
color=orange,shape=circle

$ mlr --opprint uniq -g color,shape -c then sort -f color,shape data/colored-shapes.dkvp
color  shape    count
blue   circle   384
blue   square   589
blue   triangle 497
green  circle   287
green  square   454
green  triangle 368
orange circle   68
orange square   128
orange triangle 107
purple circle   289
purple square   481
purple triangle 372
red    circle   1207
red    square   1874
red    triangle 1560
yellow circle   356
yellow square   589
yellow triangle 468

$ mlr --opprint uniq -n -g color,shape data/colored-shapes.dkvp
count
18
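
The three modes shown above (distinct values, counts with -c, and number of distinct values with -n) are views of the same tally. A minimal Python sketch, with a hypothetical function name and made-up rows:

```python
from collections import Counter

def uniq_counts(records, fields):
    """Count occurrences of each distinct combination of the given fields."""
    return Counter(tuple(rec[f] for f in fields) for rec in records)

# Illustrative rows, not taken from data/colored-shapes.dkvp.
rows = [
    {"color": "red", "shape": "square"},
    {"color": "blue", "shape": "circle"},
    {"color": "red", "shape": "square"},
    {"color": "red", "shape": "circle"},
]
counts = uniq_counts(rows, ["color", "shape"])
distinct = list(counts)    # like uniq -g: the distinct combinations
n_distinct = len(counts)   # like uniq -n: how many there are
print(distinct, dict(counts), n_distinct)
```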

Expression language for filter and put

mlr filter and mlr put are for record-selection and record-updating expressions, respectively. For example:

$ cat data/small
a=pan,b=pan,i=1,x=0.3467901443380824,y=0.7268028627434533
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797
a=wye,b=wye,i=3,x=0.20460330576630303,y=0.33831852551664776
a=eks,b=wye,i=4,x=0.38139939387114097,y=0.13418874328430463
a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729

$ mlr filter '$a == "eks"' data/small
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797
a=eks,b=wye,i=4,x=0.38139939387114097,y=0.13418874328430463

$ mlr put '$ab = $a . "_" . $b ' data/small
a=pan,b=pan,i=1,x=0.3467901443380824,y=0.7268028627434533,ab=pan_pan
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797,ab=eks_pan
a=wye,b=wye,i=3,x=0.20460330576630303,y=0.33831852551664776,ab=wye_wye
a=eks,b=wye,i=4,x=0.38139939387114097,y=0.13418874328430463,ab=eks_wye
a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729,ab=wye_pan

The two are essentially the same command. The only two differences are that expressions sent to mlr filter must end with a boolean expression, which is the filtering criterion, and that mlr filter expressions may not reference the filter keyword within them. All the rest is the same: in particular, you can define and invoke functions and subroutines to help produce the final boolean expression.

There are more details and more choices, of course, as detailed in the following sections.

Syntax

mlr put

Expression formatting

Multiple expressions may be given, separated by semicolons, and each may refer to the ones before:

$ ruby -e '10.times{|i|puts "i=#{i}"}' | mlr --opprint put '$j = $i + 1; $k = $i +$j'
i j  k
0 1  1
1 2  3
2 3  5
3 4  7
4 5  9
5 6  11
6 7  13
7 8  15
8 9  17
9 10 19

Newlines within the expression are ignored, which can help increase legibility of complex expressions:

$ mlr --opprint put '
  $nf       = NF;
  $nr       = NR;
  $fnr      = FNR;
  $filenum  = FILENUM;
  $filename = FILENAME
' data/small data/small2
a   b   i     x                    y                    nf nr fnr filenum filename
pan pan 1     0.3467901443380824   0.7268028627434533   5  1  1   1       data/small
eks pan 2     0.7586799647899636   0.5221511083334797   5  2  2   1       data/small
wye wye 3     0.20460330576630303  0.33831852551664776  5  3  3   1       data/small
eks wye 4     0.38139939387114097  0.13418874328430463  5  4  4   1       data/small
wye pan 5     0.5732889198020006   0.8636244699032729   5  5  5   1       data/small
pan eks 9999  0.267481232652199086 0.557077185510228001 5  6  1   2       data/small2
wye eks 10000 0.734806020620654365 0.884788571337605134 5  7  2   2       data/small2
pan wye 10001 0.870530722602517626 0.009854780514656930 5  8  3   2       data/small2
hat wye 10002 0.321507044286237609 0.568893318795083758 5  9  4   2       data/small2
pan zee 10003 0.272054845593895200 0.425789896597056627 5  10 5   2       data/small2

$ mlr --opprint filter '($x > 0.5 && $y < 0.5) || ($x < 0.5 && $y > 0.5)' then stats2 -a corr -f x,y data/medium
x_y_corr
-0.747994

Expressions from files

The simplest way to enter expressions for put and filter is between single quotes on the command line, e.g.

$ mlr --from data/small put '$xy = sqrt($x**2 + $y**2)'
a=pan,b=pan,i=1,x=0.3467901443380824,y=0.7268028627434533,xy=0.805299
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797,xy=0.920998
a=wye,b=wye,i=3,x=0.20460330576630303,y=0.33831852551664776,xy=0.395376
a=eks,b=wye,i=4,x=0.38139939387114097,y=0.13418874328430463,xy=0.404317
a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729,xy=1.036584

$ mlr --from data/small put 'func f(a, b) { return sqrt(a**2 + b**2) } $xy = f($x, $y)'
a=pan,b=pan,i=1,x=0.3467901443380824,y=0.7268028627434533,xy=0.805299
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797,xy=0.920998
a=wye,b=wye,i=3,x=0.20460330576630303,y=0.33831852551664776,xy=0.395376
a=eks,b=wye,i=4,x=0.38139939387114097,y=0.13418874328430463,xy=0.404317
a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729,xy=1.036584

You may, though, find it convenient to put expressions into files for reuse, and read them using the -f option. For example:

$ cat data/fe-example-3.mlr
func f(a, b) {
  return sqrt(a**2 + b**2)
}
$xy = f($x, $y)

$ mlr --from data/small put -f data/fe-example-3.mlr
a=pan,b=pan,i=1,x=0.3467901443380824,y=0.7268028627434533,xy=0.805299
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797,xy=0.920998
a=wye,b=wye,i=3,x=0.20460330576630303,y=0.33831852551664776,xy=0.395376
a=eks,b=wye,i=4,x=0.38139939387114097,y=0.13418874328430463,xy=0.404317
a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729,xy=1.036584

If you have some of the logic in a file and you want to write the rest on the command line, you can use the -f and -e options:

$ cat data/fe-example-4.mlr
func f(a, b) {
  return sqrt(a**2 + b**2)
}

$ mlr --from data/small put -f data/fe-example-4.mlr -e '$xy = f($x, $y)'
a=pan,b=pan,i=1,x=0.3467901443380824,y=0.7268028627434533,xy=0.805299
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797,xy=0.920998
a=wye,b=wye,i=3,x=0.20460330576630303,y=0.33831852551664776,xy=0.395376
a=eks,b=wye,i=4,x=0.38139939387114097,y=0.13418874328430463,xy=0.404317
a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729,xy=1.036584

A suggested use-case here is defining functions in files, and calling them from command-line expressions.

Moreover, you can have one or more -f expressions (maybe one function per file, for example) and one or more -e expressions on the command line. All the -f’s will be evaluated in order, then all the -e’s will be evaluated in order.

Semicolons, newlines, and curly braces

Miller uses semicolons as statement separators, not statement terminators. This means you can write:

mlr put 'x=1'
mlr put 'x=1;$y=2'
mlr put 'x=1;$y=2;'
mlr put 'x=1;;;;$y=2;'

Semicolons are optional after closing curly braces (which close conditionals and loops as discussed below).

$ echo x=1,y=2 | mlr put 'while (NF < 10) { $[NF+1] = ""}  $foo = "bar"'
x=1,y=2,3=,4=,5=,6=,7=,8=,9=,10=,foo=bar

$ echo x=1,y=2 | mlr put 'while (NF < 10) { $[NF+1] = ""}; $foo = "bar"'
x=1,y=2,3=,4=,5=,6=,7=,8=,9=,10=,foo=bar

Semicolons are required between statements even if those statements are on separate lines. Newlines are for your convenience but have no syntactic meaning: line endings do not terminate statements. For example, adjacent assignment statements must be separated by semicolons even if those statements are on separate lines:

mlr put '
  $x = 1
  $y = 2 # Syntax error
'

mlr put '
  $x = 1;
  $y = 2 # This is OK
'

Bodies for all compound statements must be enclosed in curly braces, even if the body is a single statement:

mlr put 'if ($x == 1) $y = 2' # Syntax error

mlr put 'if ($x == 1) { $y = 2 }' # This is OK

Bodies for compound statements may be empty:

mlr put 'if ($x == 1) { }' # This no-op is syntactically acceptable

Variables

Miller has the following kinds of variables:

Built-in variables such as NF, NR, FILENAME, PI, and E. These are all capital letters and are read-only (although some of them change value from one record to another).

Fields of stream records, accessed using the $ prefix. These refer to fields of the current data-stream record. For example, in echo x=1,y=2 | mlr put '$z = $x + $y', $x and $y refer to input fields, and $z refers to a new, computed output field. In a few contexts, presented below, you can refer to the entire record as $*.

Out-of-stream variables accessed using the @ prefix. These refer to data which persist from one record to the next, including in begin and end blocks (which execute before/after the record stream is consumed, respectively). You use them to remember values across records, such as sums, differences, counters, and so on. In a few contexts, presented below, you can refer to the entire out-of-stream-variables collection as @*.

Local variables are limited in scope and extent to the current statements being executed: these include function arguments, bound variables in for loops, and explicitly declared local variables.

Keywords are not variables, but since their names are reserved, you cannot use these names for local variables.

Built-in variables

These are written all in capital letters, such as NR, NF, FILENAME, and only a small, specific set of them is defined by Miller.

Miller supports the following five built-in variables for filter and put, all awk-inspired: NF, NR, FNR, FILENUM, and FILENAME, as well as the mathematical constants PI and E. Lastly, the ENV hashmap allows read access to environment variables, e.g. ENV["HOME"] or ENV["foo_".$hostname].

$ mlr filter 'FNR == 2' data/small*
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797
1=pan,2=pan,3=1,4=0.3467901443380824,5=0.7268028627434533
a=wye,b=eks,i=10000,x=0.734806020620654365,y=0.884788571337605134

$ mlr put '$fnr = FNR' data/small*
a=pan,b=pan,i=1,x=0.3467901443380824,y=0.7268028627434533,fnr=1
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797,fnr=2
a=wye,b=wye,i=3,x=0.20460330576630303,y=0.33831852551664776,fnr=3
a=eks,b=wye,i=4,x=0.38139939387114097,y=0.13418874328430463,fnr=4
a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729,fnr=5
1=a,2=b,3=i,4=x,5=y,fnr=1
1=pan,2=pan,3=1,4=0.3467901443380824,5=0.7268028627434533,fnr=2
1=eks,2=pan,3=2,4=0.7586799647899636,5=0.5221511083334797,fnr=3
1=wye,2=wye,3=3,4=0.20460330576630303,5=0.33831852551664776,fnr=4
1=eks,2=wye,3=4,4=0.38139939387114097,5=0.13418874328430463,fnr=5
1=wye,2=pan,3=5,4=0.5732889198020006,5=0.8636244699032729,fnr=6
a=pan,b=eks,i=9999,x=0.267481232652199086,y=0.557077185510228001,fnr=1
a=wye,b=eks,i=10000,x=0.734806020620654365,y=0.884788571337605134,fnr=2
a=pan,b=wye,i=10001,x=0.870530722602517626,y=0.009854780514656930,fnr=3
a=hat,b=wye,i=10002,x=0.321507044286237609,y=0.568893318795083758,fnr=4
a=pan,b=zee,i=10003,x=0.272054845593895200,y=0.425789896597056627,fnr=5

The values of NF, NR, FNR, FILENUM, and FILENAME change from one record to the next as Miller scans through your input data stream. The mathematical constants, of course, do not change; ENV is populated from the system environment variables at the time Miller starts and is read-only for the remainder of program execution.

Their scope is global: you can refer to them in any filter or put statement. Their values are assigned by the input-record reader:

$ mlr --csv put '$nr = NR' data/a.csv
a,b,c,nr
1,2,3,1
4,5,6,2

$ mlr --csv repeat -n 3 then put '$nr = NR' data/a.csv
a,b,c,nr
1,2,3,1
1,2,3,1
1,2,3,1
4,5,6,2
4,5,6,2
4,5,6,2

The extent is for the duration of the put/filter: in a begin statement (which executes before the first input record is consumed) you will find NR=1 and in an end statement (which is executed after the last input record is consumed) you will find NR to be the total number of records ingested.

These are all read-only for the mlr put and mlr filter DSLs: they may be assigned from, e.g. $nr=NR, but they may not be assigned to: NR=100 is a syntax error.
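
The way NF, NR, FNR, FILENUM, and FILENAME evolve across input files can be sketched in Python as follows. This is a simplified model of a DKVP reader, not Miller's actual reader: the function name read_records is hypothetical and the sample lines are abbreviated versions of records from data/small and data/small2.

```python
def read_records(files):
    """files: list of (filename, list of DKVP lines). Yields (context, record)."""
    nr = 0
    for filenum, (filename, lines) in enumerate(files, start=1):
        for fnr, line in enumerate(lines, start=1):
            nr += 1  # NR keeps counting across files; FNR restarts per file
            record = dict(pair.split("=", 1) for pair in line.split(","))
            context = {"NF": len(record), "NR": nr, "FNR": fnr,
                       "FILENUM": filenum, "FILENAME": filename}
            yield context, record

files = [
    ("data/small", ["a=pan,b=pan,i=1", "a=eks,b=pan,i=2"]),
    ("data/small2", ["a=pan,b=eks,i=9999"]),
]
contexts = [ctx for ctx, _ in read_records(files)]
for ctx in contexts:
    print(ctx)
```

Because the reader assigns these values, a downstream put sees the reader's NR even if an earlier verb such as repeat has multiplied the records.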

Field names

Field names must be specified using a $ in filter and put expressions, even though the dollar signs don’t appear in the data stream. For integer-indexed data, this looks like awk’s $1,$2,$3, except that Miller allows non-numeric names such as $quantity or $hostname. (Likewise, enclose string literals in double quotes in filter expressions even though they don’t appear in file data. In particular, mlr filter '$x=="abc"' passes through the record x=abc.)

If field names have special characters such as . then you can use braces, e.g. '${field.name}'.

You may also use a computed field name in square brackets, e.g.

$ echo a=3,b=4 | mlr filter '$["x"] < 0.5'

$ echo s=green,t=blue,a=3,b=4 | mlr put '$[$s."_".$t] = $a * $b'
s=green,t=blue,a=3,b=4,green_blue=12
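
If it helps to think of a record as an ordered map from field names to values, the computed-name assignment above corresponds to the following Python sketch (an analogy only; the variable name record is illustrative):

```python
record = {"s": "green", "t": "blue", "a": 3, "b": 4}
# Equivalent in spirit to: $[$s."_".$t] = $a * $b
record[record["s"] + "_" + record["t"]] = record["a"] * record["b"]
print(record)
```

The new field is appended after the existing ones, matching the order in the output above.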

The names of record fields depend on the contents of your input data stream, and their values change from one record to the next as Miller scans through your input data stream.

Their extent is limited to the current record; their scope is the filter or put command in which they appear.

These are read-write: you can do $y=2*$x, $x=$x+1, etc.

Records are Miller’s output: field names present in the input stream are passed through to output (written to standard output) unless fields are removed with cut, or records are excluded with filter or put -q, etc. Simply assign a value to a field and it will be output.

Local variables

There are three kinds of local variables: arguments to functions/subroutines, variables bound within for-loops, and locals defined within control blocks.

This example shows all three:

$ # Here I'm using a specified random-number seed so this example always
# produces the same output for this web document: in everyday practice we
# would leave off the --seed 12345 part.
mlr --seed 12345 seqgen --start 1 --stop 10 then put '
  func f(a, b) {                          # function arguments a and b
      r = 0.0;                            # local r scoped to the function
      for (local i = 0; i < 6; i += 1) {  # local i scoped to the for-loop
          local u = urand();              # local u scoped to the for-loop
          r += u;                         # updates r from the enclosing scope
      }
      r /= 6;
      return a + (b - a) * r;
  }
  local o = f(10, 20);                    # local to the top-level scope
  $o = o;
'
i=1,o=14.662901
i=2,o=17.881983
i=3,o=14.586560
i=4,o=16.402409
i=5,o=16.336598
i=6,o=14.622701
i=7,o=15.983753
i=8,o=13.852177
i=9,o=15.472899
i=10,o=15.643912

Notes:

  • Parameter names are bound to their arguments but can be reassigned, e.g. if there is a parameter named a then you can reassign the value of a to be something else within the function if you like.
  • All argument-passing is positional rather than by name; arguments are passed by value, not by reference.
  • You can define locals (using the local keyword) at any scope (if-statements, else-statements, while-loops, for-loops, or the top-level scope), and nested scopes will have access (more details on scope in the next section). If you define a local variable with the same name inside an inner scope, then a new variable is created with the narrower scope.
  • If you assign to a local variable for the first time in a scope without declaring it as local then: if it exists in an outer scope, that outer-scope variable will be updated; if not, it will be defined in the current scope as if local had been used. I recommend always declaring variables with local to make the intended scoping clear.
  • Functions and subroutines never have access to locals from their caller (unless passed by value as arguments).
  • If a for-loop variable is defined using local then it is scoped to that for-loop.

The following example demonstrates the scope rules:

$ cat data/scope-example.mlr
func f(a) {      # argument is local to the function
  local b = 100; # local to the function
  c = 100;       # local to the function; does not overwrite outer c
  return a + 1;
}
local a = 10;    # local at top level
local b = 20;    # local at top level
c = 30;          # local at top level; there is no more-outer-scope c
if (NR == 3) {
  local a = 40;  # scoped to the if-statement; doesn't overwrite outer a
  b = 50;        # not scoped to the if-statement; overwrites outer b
  c = 60;        # not scoped to the if-statement; overwrites outer c
  d = 70;        # there is no outer d so a local d is created here

  $inner_a = a;
  $inner_b = b;
  $inner_c = c;
  $inner_d = d;
}
$outer_a = a;
$outer_b = b;
$outer_c = c;
$outer_d = d;    # there is no outer d defined so no assignment happens

$ cat data/scope-example.dat
n=1,x=123
n=2,x=456
n=3,x=789

$ mlr --oxtab --from data/scope-example.dat put -f data/scope-example.mlr
n       1
x       123
outer_a 10
outer_b 20
outer_c 30

n       2
x       456
outer_a 10
outer_b 20
outer_c 30

n       3
x       789
inner_a 40
inner_b 50
inner_c 60
inner_d 70
outer_a 10
outer_b 50
outer_c 60

Out-of-stream variables for put

These are prefixed with an at-sign, e.g. @sum. Furthermore, unlike built-in variables and stream-record fields, they are maintained in an arbitrarily nested hashmap: you can do @sum += $quantity, or @sum[$color] += $quantity, or @sum[$color][$shape] += $quantity. The keys for the multi-level hashmap can be any expression which evaluates to string or integer: e.g. @sum[NR] = $a + $b, @sum[$a."-".$b] = $x, etc.

Their names and their values are entirely under your control; they change only when you assign to them.

Just as for field names in stream records, if you want to define out-of-stream variables with special characters such as . then you can use braces, e.g. '@{variable.name}["index"]'.

You may use a computed key in square brackets, e.g.

$ echo s=green,t=blue,a=3,b=4 | mlr put -q '@[$s."_".$t] = $a * $b; emit all'
green_blue=12

Out-of-stream variables are scoped to the put command in which they appear. In particular, if you have two or more put commands separated by then, each put will have its own set of out-of-stream variables:

$ cat data/a.dkvp
a=1,b=2,c=3
a=4,b=5,c=6

$ mlr put '@sum += $a; end {emit @sum}' then put 'ispresent($a) {$a=10*$a; @sum += $a}; end {emit @sum}' data/a.dkvp
a=10,b=2,c=3
a=40,b=5,c=6
sum=5
sum=50

Out-of-stream variables are read-write: you can do $sum=@sum, @sum=$sum, etc.

Indexed out-of-stream variables for put

Using an index on the @count and @sum variables, we get the benefit of the -g (group-by) option which mlr stats1 and various other Miller commands have:

$ mlr put -q '
  @x_count[$a] += 1;
  @x_sum[$a] += $x;
  end {
    emit @x_count, "a";
    emit @x_sum, "a";
  }
' ../data/small
a=pan,x_count=2
a=eks,x_count=3
a=wye,x_count=2
a=zee,x_count=2
a=hat,x_count=1
a=pan,x_sum=0.849416
a=eks,x_sum=1.751863
a=wye,x_sum=0.777892
a=zee,x_sum=1.125680
a=hat,x_sum=0.031442

$ mlr stats1 -a count,sum -f x -g a ../data/small
a=pan,x_count=2,x_sum=0.849416
a=eks,x_count=3,x_sum=1.751863
a=wye,x_count=2,x_sum=0.777892
a=zee,x_count=2,x_sum=1.125680
a=hat,x_count=1,x_sum=0.031442

Indices can be arbitrarily deep; here there are two of them:

$ mlr --from data/medium put -q '
  @x_count[$a][$b] += 1;
  @x_sum[$a][$b] += $x;
  end {
    emit (@x_count, @x_sum), "a", "b";
  }
'
a=pan,b=pan,x_count=427,x_sum=219.185129
a=pan,b=wye,x_count=395,x_sum=198.432931
a=pan,b=eks,x_count=429,x_sum=216.075228
a=pan,b=hat,x_count=417,x_sum=205.222776
a=pan,b=zee,x_count=413,x_sum=205.097518
a=eks,b=pan,x_count=371,x_sum=179.963030
a=eks,b=wye,x_count=407,x_sum=196.945286
a=eks,b=zee,x_count=357,x_sum=176.880365
a=eks,b=eks,x_count=413,x_sum=215.916097
a=eks,b=hat,x_count=417,x_sum=208.783171
a=wye,b=wye,x_count=377,x_sum=185.295850
a=wye,b=pan,x_count=392,x_sum=195.847900
a=wye,b=hat,x_count=426,x_sum=212.033183
a=wye,b=zee,x_count=385,x_sum=194.774048
a=wye,b=eks,x_count=386,x_sum=204.812961
a=zee,b=pan,x_count=389,x_sum=202.213804
a=zee,b=wye,x_count=455,x_sum=233.991394
a=zee,b=eks,x_count=391,x_sum=190.961778
a=zee,b=zee,x_count=403,x_sum=206.640635
a=zee,b=hat,x_count=409,x_sum=191.300006
a=hat,b=wye,x_count=423,x_sum=208.883010
a=hat,b=zee,x_count=385,x_sum=196.349450
a=hat,b=eks,x_count=389,x_sum=189.006793
a=hat,b=hat,x_count=381,x_sum=182.853532
a=hat,b=pan,x_count=363,x_sum=168.553807

The idea is that stats1, and other Miller commands, encapsulate frequently-used patterns with a minimum of keystroking (and run a little faster), whereas using out-of-stream variables you have more flexibility and control in what you do.

Begin/end blocks can be mixed with pattern/action blocks. For example:

$ mlr put '
  begin {
    @num_total = 0;
    @num_positive = 0;
  };
  @num_total += 1;
  $x > 0.0 {
    @num_positive += 1;
    $y = log10($x); $z = sqrt($y)
  };
  end {
    emitf @num_total, @num_positive
  }
' data/put-gating-example-1.dkvp
x=-1
x=0
x=1,y=0.000000,z=0.000000
x=2,y=0.301030,z=0.548662
x=3,y=0.477121,z=0.690740
num_total=5,num_positive=3
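
The order of execution in the example above (begin block once, main statements once per record, end block once) can be modeled in Python as below. This is a sketch of the control flow only; the function name run and the state dict standing in for the out-of-stream variables are illustrative.

```python
import math

def run(xs):
    state = {"num_total": 0, "num_positive": 0}  # begin block
    out = []
    for x in xs:                                 # main block, once per record
        rec = {"x": x}
        state["num_total"] += 1
        if x > 0.0:                              # pattern-action block
            state["num_positive"] += 1
            rec["y"] = math.log10(x)
            rec["z"] = math.sqrt(rec["y"])
        out.append(rec)
    out.append(dict(state))                      # end block: emitf
    return out

# Same x values as data/put-gating-example-1.dkvp in the example above.
results = run([-1, 0, 1, 2, 3])
for rec in results:
    print(rec)
```

Records failing the pattern pass through unchanged, which is why x=-1 and x=0 appear above without y or z.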

Aggregate variable assignments for put

There are three remaining kinds of variable assignment using out-of-stream variables, the last two of which use the $* syntax:

  • Recursive copy of out-of-stream variables
  • Out-of-stream variable assigned to full stream record
  • Full stream record assigned to an out-of-stream variable

Example recursive copy of out-of-stream variables:

$ mlr --opprint put -q '@v["sum"] += $x; @v["count"] += 1; end{dump; @w = @v; dump}' data/small
{
  "v": {
    "sum": 2.264762,
    "count": 5
  }
}
{
  "v": {
    "sum": 2.264762,
    "count": 5
  },
  "w": {
    "sum": 2.264762,
    "count": 5
  }
}
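
A recursive copy can be pictured with Python's copy.deepcopy. The sketch below assumes that the copy is independent of the original afterwards, which is what the term suggests but is not demonstrated by the dump output above; the oosvars dict is an illustrative stand-in for the out-of-stream variables.

```python
import copy

oosvars = {"v": {"sum": 2.264762, "count": 5}}
oosvars["w"] = copy.deepcopy(oosvars["v"])  # analogous to @w = @v
oosvars["w"]["count"] += 1                  # changing the copy ...
# ... leaves the original untouched under the independence assumption.
print(oosvars["v"]["count"], oosvars["w"]["count"])
```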

Example of out-of-stream variable assigned to full stream record, where the 2nd record is stashed, and the 4th record is overwritten with that:

$ mlr put 'NR == 2 {@keep = $*}; NR == 4 {$* = @keep}' data/small
a=pan,b=pan,i=1,x=0.3467901443380824,y=0.7268028627434533
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797
a=wye,b=wye,i=3,x=0.20460330576630303,y=0.33831852551664776
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797
a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729
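
The stash-and-overwrite logic above, modeled in Python with a plain variable in place of @keep (a sketch; the function name and the abbreviated records are illustrative):

```python
def stash_and_overwrite(records):
    keep = None
    out = []
    for nr, rec in enumerate(records, start=1):
        if nr == 2:
            keep = dict(rec)  # @keep = $*
        if nr == 4:
            rec = dict(keep)  # $* = @keep
        out.append(rec)
    return out

# The a and i fields of the five data/small records.
small = [{"a": "pan", "i": 1}, {"a": "eks", "i": 2}, {"a": "wye", "i": 3},
         {"a": "eks", "i": 4}, {"a": "wye", "i": 5}]
i_values = [rec["i"] for rec in stash_and_overwrite(small)]
print(i_values)
```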

Example of full stream record assigned to an out-of-stream variable, finding the record for which the x field has the largest value in the input stream:

$ cat data/small
a=pan,b=pan,i=1,x=0.3467901443380824,y=0.7268028627434533
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797
a=wye,b=wye,i=3,x=0.20460330576630303,y=0.33831852551664776
a=eks,b=wye,i=4,x=0.38139939387114097,y=0.13418874328430463
a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729

$ mlr --opprint put -q 'isnull(@xmax) || $x > @xmax {@xmax=$x; @recmax=$*}; end {emit @recmax}' data/small
a   b   i x                  y
eks pan 2 0.7586799647899636 0.5221511083334797
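
The same running-maximum pattern in Python, with two local variables playing the roles of @xmax and @recmax (a sketch; the function name record_with_max is hypothetical):

```python
def record_with_max(records, field):
    best_value, best_record = None, None
    for rec in records:
        value = float(rec[field])
        # Mirrors: isnull(@xmax) || $x > @xmax
        if best_value is None or value > best_value:
            best_value, best_record = value, dict(rec)
    return best_record

# The i and x fields of the five data/small records.
small = [
    {"i": 1, "x": 0.3467901443380824},
    {"i": 2, "x": 0.7586799647899636},
    {"i": 3, "x": 0.20460330576630303},
    {"i": 4, "x": 0.38139939387114097},
    {"i": 5, "x": 0.5732889198020006},
]
recmax = record_with_max(small, "x")
print(recmax)
```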

Keywords for filter and put

$ mlr --help-all-keywords
filter: includes/excludes the record in the output record stream.

  Example: mlr --from f.dat put 'filter (NR == 2 || $x > 5.4)'

  Instead of put with 'filter false' you can simply use put -q.  The following
  uses the input record to accumulate data but only prints the running sum
  without printing the input record:

  Example: mlr --from f.dat put -q '@running_sum += $x * $y; emit @running_sum'

unset: clears field(s) from the current record, or an out-of-stream variable.

  Example: mlr --from f.dat put 'unset $x'
  Example: mlr --from f.dat put 'unset $*'
  Example: mlr --from f.dat put 'for (k, v in $*) { if (k =~ "a.*") { unset $[k] } }'
  Example: mlr --from f.dat put '...; unset @sums'
  Example: mlr --from f.dat put '...; unset @sums["green"]'
  Example: mlr --from f.dat put '...; unset @*'

tee: prints the current record to specified file.
  This is an immediate print to the specified file (except for pprint format
  which of course waits until the end of the input stream to format all output).

  The > and >> are for write and append, as in the shell, but (as with awk) the
  file-overwrite for > is on first write, not per record. The | is for piping to
  a process which will process the data. There will be one open file for each
  distinct file name (for > and >>) or one subordinate process for each distinct
  value of the piped-to command (for |). Output-formatting flags are taken from
  the main command line.

  You can use any of the output-format command-line flags, e.g. --ocsv, --ofs,
  etc., to control the format of the output. See also mlr -h.

  Example: mlr --from f.dat put 'tee >  "/tmp/data-".$a, $*'
  Example: mlr --from f.dat put 'tee >> "/tmp/data-".$a.$b, $*'
  Example: mlr --from f.dat put 'tee >  stderr, $*'
  Example: mlr --from f.dat put -q 'tee | "tr [a-z\] [A-Z\]", $*'
  Example: mlr --from f.dat put -q 'tee | "tr [a-z\] [A-Z\] > /tmp/data-".$a, $*'
  Example: mlr --from f.dat put -q 'tee | "gzip > /tmp/data-".$a.".gz", $*'
  Example: mlr --from f.dat put -q --ojson 'tee | "gzip > /tmp/data-".$a.".gz", $*'

emit: inserts an out-of-stream variable into the output record stream. Hashmap
  indices present in the data but not slotted by emit arguments are not output.

  With >, >>, or |, the data do not become part of the output record stream but
  are instead redirected.

  The > and >> are for write and append, as in the shell, but (as with awk) the
  file-overwrite for > is on first write, not per record. The | is for piping to
  a process which will process the data. There will be one open file for each
  distinct file name (for > and >>) or one subordinate process for each distinct
  value of the piped-to command (for |). Output-formatting flags are taken from
  the main command line.

  You can use any of the output-format command-line flags, e.g. --ocsv, --ofs,
  etc., to control the format of the output if the output is redirected. See also mlr -h.

  Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emit @sums'
  Example: mlr --from f.dat put --ojson '@sums[$a][$b]+=$x; emit > "tap-".$a.$b.".dat", @sums'
  Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emit @sums, "index1", "index2"'
  Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emit @*, "index1", "index2"'
  Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emit >  "mytap.dat", @*, "index1", "index2"'
  Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emit >> "mytap.dat", @*, "index1", "index2"'
  Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emit | "gzip > mytap.dat.gz", @*, "index1", "index2"'
  Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emit > stderr, @*, "index1", "index2"'
  Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emit | "grep somepattern", @*, "index1", "index2"'

  Please see http://johnkerl.org/miller/doc for more information.

emitp: inserts an out-of-stream variable into the output record stream.
  Hashmap indices present in the data but not slotted by emitp arguments are
  output concatenated with ":".

  With >, >>, or |, the data do not become part of the output record stream but
  are instead redirected.

  The > and >> are for write and append, as in the shell, but (as with awk) the
  file-overwrite for > is on first write, not per record. The | is for piping to
  a process which will process the data. There will be one open file for each
  distinct file name (for > and >>) or one subordinate process for each distinct
  value of the piped-to command (for |). Output-formatting flags are taken from
  the main command line.

  You can use any of the output-format command-line flags, e.g. --ocsv, --ofs,
  etc., to control the format of the output if the output is redirected. See also mlr -h.

  Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emitp @sums'
  Example: mlr --from f.dat put --opprint '@sums[$a][$b]+=$x; emitp > "tap-".$a.$b.".dat", @sums'
  Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emitp @sums, "index1", "index2"'
  Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emitp @*, "index1", "index2"'
  Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emitp >  "mytap.dat", @*, "index1", "index2"'
  Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emitp >> "mytap.dat", @*, "index1", "index2"'
  Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emitp | "gzip > mytap.dat.gz", @*, "index1", "index2"'
  Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emitp > stderr, @*, "index1", "index2"'
  Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emitp | "grep somepattern", @*, "index1", "index2"'

  Please see http://johnkerl.org/miller/doc for more information.

emitf: inserts non-indexed out-of-stream variable(s) side-by-side into the
  output record stream.

  With >, >>, or |, the data do not become part of the output record stream but
  are instead redirected.

  The > and >> are for write and append, as in the shell, but (as with awk) the
  file-overwrite for > is on first write, not per record. The | is for piping to
  a process which will process the data. There will be one open file for each
  distinct file name (for > and >>) or one subordinate process for each distinct
  value of the piped-to command (for |). Output-formatting flags are taken from
  the main command line.

  You can use any of the output-format command-line flags, e.g. --ocsv, --ofs,
  etc., to control the format of the output if the output is redirected. See also mlr -h.

  Example: mlr --from f.dat put '@a=$i;@b+=$x;@c+=$y; emitf @a'
  Example: mlr --from f.dat put --oxtab '@a=$i;@b+=$x;@c+=$y; emitf > "tap-".$i.".dat", @a'
  Example: mlr --from f.dat put '@a=$i;@b+=$x;@c+=$y; emitf @a, @b, @c'
  Example: mlr --from f.dat put '@a=$i;@b+=$x;@c+=$y; emitf > "mytap.dat", @a, @b, @c'
  Example: mlr --from f.dat put '@a=$i;@b+=$x;@c+=$y; emitf >> "mytap.dat", @a, @b, @c'
  Example: mlr --from f.dat put '@a=$i;@b+=$x;@c+=$y; emitf > stderr, @a, @b, @c'
  Example: mlr --from f.dat put '@a=$i;@b+=$x;@c+=$y; emitf | "grep somepattern", @a, @b, @c'
  Example: mlr --from f.dat put '@a=$i;@b+=$x;@c+=$y; emitf | "grep somepattern > mytap.dat", @a, @b, @c'

  Please see http://johnkerl.org/miller/doc for more information.

dump: prints all currently defined out-of-stream variables immediately
  to stdout as JSON.

  With >, >>, or |, the data do not become part of the output record stream but
  are instead redirected.

  The > and >> are for write and append, as in the shell, but (as with awk) the
  file-overwrite for > is on first write, not per record. The | is for piping to
  a process which will process the data. There will be one open file for each
  distinct file name (for > and >>) or one subordinate process for each distinct
  value of the piped-to command (for |). Output-formatting flags are taken from
  the main command line.

  Example: mlr --from f.dat put -q '@v[NR]=$*; end { dump }'
  Example: mlr --from f.dat put -q '@v[NR]=$*; end { dump >  "mytap.dat"}'
  Example: mlr --from f.dat put -q '@v[NR]=$*; end { dump >> "mytap.dat"}'
  Example: mlr --from f.dat put -q '@v[NR]=$*; end { dump | "jq .[]"}'

edump: prints all currently defined out-of-stream variables immediately
  to stderr as JSON.

  Example: mlr --from f.dat put -q '@v[NR]=$*; end { edump }'

print: prints expression immediately to stdout.
  Example: mlr --from f.dat put -q 'print "The sum of x and y is ".($x+$y)'
  Example: mlr --from f.dat put -q 'for (k, v in $*) { print k . " => " . v }'
  Example: mlr --from f.dat put  '(NR % 1000 == 0) { print > stderr, "Checkpoint ".NR}'

printn: prints expression immediately to stdout, without trailing newline.
  Example: mlr --from f.dat put -q 'printn "."; end { print "" }'

eprint: prints expression immediately to stderr.
  Example: mlr --from f.dat put -q 'eprint "The sum of x and y is ".($x+$y)'
  Example: mlr --from f.dat put -q 'for (k, v in $*) { eprint k . " => " . v }'
  Example: mlr --from f.dat put  '(NR % 1000 == 0) { eprint "Checkpoint ".NR}'

eprintn: prints expression immediately to stderr, without trailing newline.
  Example: mlr --from f.dat put -q 'eprintn "The sum of x and y is ".($x+$y); eprint ""'

stdout: Used for tee, emit, emitf, emitp, print, and dump in place of filename
  to print to standard output.

stderr: Used for tee, emit, emitf, emitp, print, and dump in place of filename
  to print to standard error.

Control structures

Pattern-action blocks

These are reminiscent of awk syntax. They can be used to allow assignments to be done only when appropriate — e.g. for math-function domain restrictions, regex-matching, and so on:

$ mlr cat data/put-gating-example-1.dkvp
x=-1
x=0
x=1
x=2
x=3

$ mlr put '$x > 0.0 { $y = log10($x); $z = sqrt($y) }' data/put-gating-example-1.dkvp
x=-1
x=0
x=1,y=0.000000,z=0.000000
x=2,y=0.301030,z=0.548662
x=3,y=0.477121,z=0.690740

$ mlr cat data/put-gating-example-2.dkvp
a=abc_123
a=some other name
a=xyz_789

$ mlr put '$a =~ "([a-z]+)_([0-9]+)" { $b = "left_\1"; $c = "right_\2" }' data/put-gating-example-2.dkvp
a=abc_123,b=left_abc,c=right_123
a=some other name
a=xyz_789,b=left_xyz,c=right_789

This produces heterogeneous output which Miller, of course, has no problems with (see Record-heterogeneity). But if you want homogeneous output, the curly braces can be replaced with a semicolon between the expression and the body statements. This causes put to evaluate the boolean expression (along with any side effects, namely, regex-captures \1, \2, etc.) but doesn’t use it as a criterion for whether subsequent assignments should be executed. Instead, subsequent assignments are done unconditionally:

$ mlr put '$x > 0.0; $y = log10($x); $z = sqrt($y)' data/put-gating-example-1.dkvp
x=-1,y=nan,z=nan
x=0,y=-inf,z=nan
x=1,y=0.000000,z=0.000000
x=2,y=0.301030,z=0.548662
x=3,y=0.477121,z=0.690740

$ mlr put '$a =~ "([a-z]+)_([0-9]+)"; $b = "left_\1"; $c = "right_\2"' data/put-gating-example-2.dkvp
a=abc_123,b=left_abc,c=right_123
a=some other name,b=left_,c=right_
a=xyz_789,b=left_xyz,c=right_789
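For readers more at home in a general-purpose language, the difference between the braced form and the semicolon form can be sketched in Python. This is only a model of the behavior shown above, not Miller's implementation: the names put_gated and put_ungated are made up for illustration, and the empty-string captures on a failed match simply mirror the b=left_ output above.

```python
import re

PATTERN = re.compile(r"([a-z]+)_([0-9]+)")

def put_gated(record):
    # Models the braced form: assignments run only when the pattern matches.
    out = dict(record)
    match = PATTERN.search(out["a"])  # unanchored search, like =~
    if match:
        out["b"] = "left_" + match.group(1)
        out["c"] = "right_" + match.group(2)
    return out

def put_ungated(record):
    # Models the semicolon form: the match is evaluated for its captures,
    # but the assignments run unconditionally.
    out = dict(record)
    match = PATTERN.search(out["a"])
    first, second = match.groups() if match else ("", "")
    out["b"] = "left_" + first
    out["c"] = "right_" + second
    return out

records = [{"a": "abc_123"}, {"a": "some other name"}, {"a": "xyz_789"}]
gated = [put_gated(r) for r in records]      # heterogeneous records
ungated = [put_ungated(r) for r in records]  # every record has a, b, c
```

The gated list leaves the middle record untouched, while the ungated list gives every record the same three keys.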

If-statements

These are again reminiscent of awk. Pattern-action blocks are a special case of if with no elif or else blocks, no if keyword, and parentheses optional around the boolean expression:

mlr put 'NR == 4 {$foo = "bar"}'

mlr put 'if (NR == 4) {$foo = "bar"}'

Compound statements use elif (rather than elsif or else if):

mlr put '
  if (NR == 2) {
    ...
  } elif (NR == 4) {
    ...
  } elif (NR == 6) {
    ...
  } else {
    ...
  }
'

While and do-while loops

Miller’s while and do-while loops behave as they do in most other languages, as do break and continue:

$ echo x=1,y=2 | mlr put '
  while (NF < 10) {
    $[NF+1] = ""
  }
  $foo = "bar"
'
x=1,y=2,3=,4=,5=,6=,7=,8=,9=,10=,foo=bar

$ echo x=1,y=2 | mlr put '
  do {
    $[NF+1] = "";
    if (NF == 5) {
      break
    }
  } while (NF < 10);
  $foo = "bar"
'
x=1,y=2,3=,4=,5=,foo=bar

A break or continue within nested conditional blocks or if-statements will, of course, propagate to the innermost loop enclosing them, if any. A break or continue outside a loop is a syntax error that will be flagged as soon as the expression is parsed, before any input records are ingested.

The existence of while, do-while, and for loops in Miller’s DSL means that you can create infinite-loop scenarios inadvertently. In particular, please recall that DSL statements are executed once if in begin or end blocks, and once per record otherwise. For example, while (NR < 10) will never terminate as NR is only incremented between records.
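As a rough cross-check, the two loop examples above can be modeled in Python with a dict standing in for the stream record and len() standing in for NF. The function names are hypothetical, and Python has no do-while, so the second loop is emulated with while True and an explicit test at the bottom.

```python
def while_example(record):
    out = dict(record)
    while len(out) < 10:               # while (NF < 10)
        out[str(len(out) + 1)] = ""    # $[NF+1] = ""
    out["foo"] = "bar"
    return out

def do_while_example(record):
    out = dict(record)
    while True:                        # do {
        out[str(len(out) + 1)] = ""    #   $[NF+1] = "";
        if len(out) == 5:              #   if (NF == 5) { break }
            break
        if not len(out) < 10:          # } while (NF < 10)
            break
    out["foo"] = "bar"
    return out

first = while_example({"x": 1, "y": 2})
second = do_while_example({"x": 1, "y": 2})
print(list(first))   # ['x', 'y', '3', '4', '5', '6', '7', '8', '9', '10', 'foo']
print(list(second))  # ['x', 'y', '3', '4', '5', 'foo']
```

Both loops terminate because the body itself changes the quantity being tested. A condition on NR would not, since nothing inside the loop body changes NR.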

For-loops

While Miller’s while and do-while statements are much as in many other languages, for loops are more idiosyncratic to Miller. They are loops over key-value pairs, whether in stream records or out-of-stream variables: more reminiscent of foreach, as in (for example) PHP.

There are three variants: for-loop over key-value pairs in the current stream record, for-loop over key-value pairs in an out-of-stream variable, and C-style triple-for loops. In each of the first two cases the in keyword specifies the hashmap being iterated over, and the variable names between for and in are bound to the keys and values, respectively, of the hashmap’s key-value pairs on each loop iteration. As with while and do-while, a break or continue within nested control structures will propagate to the innermost loop enclosing them, if any, and a break or continue outside a loop is a syntax error that will be flagged as soon as the expression is parsed, before any input records are ingested.

For-loop over the current stream record:

$ cat data/for-srec-example.tbl
label1 label2 f1  f2  f3
blue   green  100 240 350
red    green  120 11  195
yellow blue   140 0   240

$ mlr --pprint --from data/for-srec-example.tbl put '
  $sum1 = $f1 + $f2 + $f3;
  $sum2 = 0;
  $sum3 = 0;
  for (key, value in $*) {
    if (key =~ "^f[0-9]+") {
      $sum2 += value;
      $sum3 += $[key];
    }
  }
'
label1 label2 f1  f2  f3  sum1 sum2 sum3
blue   green  100 240 350 690  690  690
red    green  120 11  195 326  326  326
yellow blue   140 0   240 380  380  380

$ mlr --from data/small --opprint put 'for (k,v in $*) { $[k."_type"] = typeof(v) }'
a   b   i x                   y                   a_type    b_type    i_type x_type   y_type
pan pan 1 0.3467901443380824  0.7268028627434533  MT_STRING MT_STRING MT_INT MT_FLOAT MT_FLOAT
eks pan 2 0.7586799647899636  0.5221511083334797  MT_STRING MT_STRING MT_INT MT_FLOAT MT_FLOAT
wye wye 3 0.20460330576630303 0.33831852551664776 MT_STRING MT_STRING MT_INT MT_FLOAT MT_FLOAT
eks wye 4 0.38139939387114097 0.13418874328430463 MT_STRING MT_STRING MT_INT MT_FLOAT MT_FLOAT
wye pan 5 0.5732889198020006  0.8636244699032729  MT_STRING MT_STRING MT_INT MT_FLOAT MT_FLOAT

Note that the value of the current field in the for-loop can be gotten either using the bound variable value, or through a computed field name using square brackets as in $[key].

Important note: to avoid inconsistent looping behavior in case you’re setting new fields (and/or unsetting existing ones) while looping over the record, Miller makes a copy of the record before the loop: loop variables are bound from the copy and all other reads/writes involve the record itself:

$ mlr --from data/small --opprint put '
  $sum1 = 0;
  $sum2 = 0;
  for (k,v in $*) {
    if (isnumeric(v)) {
      $sum1 += v;
      $sum2 += $[k];
    }
  }
'
a   b   i x                   y                   sum1     sum2
pan pan 1 0.3467901443380824  0.7268028627434533  2.073593 8.294372
eks pan 2 0.7586799647899636  0.5221511083334797  3.280831 13.123324
wye wye 3 0.20460330576630303 0.33831852551664776 3.542922 14.171687
eks wye 4 0.38139939387114097 0.13418874328430463 4.515588 18.062353
wye pan 5 0.5732889198020006  0.8636244699032729  6.436913 25.747654
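The difference between sum1 and sum2 above follows from the copy semantics, and a small Python model makes the mechanism explicit. This is a sketch under the assumption that the copy is taken once, after $sum1 and $sum2 have been assigned; the function name and the sample values are made up so that the arithmetic is exact.

```python
def put_sums(record):
    rec = dict(record)
    rec["sum1"] = 0
    rec["sum2"] = 0
    snapshot = dict(rec)               # copy taken before the loop starts
    for k, v in snapshot.items():      # k, v are bound from the copy
        if isinstance(v, (int, float)):
            rec["sum1"] += v           # v is the value as of the copy
            rec["sum2"] += rec[k]      # rec[k] reads the live record
    return rec

result = put_sums({"a": "pan", "b": "pan", "i": 1, "x": 0.5, "y": 0.25})
print(result["sum1"], result["sum2"])  # 1.75 7.0
```

The copy includes sum1=0 and sum2=0, so the loop also visits those two keys. Adding v contributes 0 for them, but adding rec[k] reads the running totals, which is why sum2 comes out at four times sum1 here, just as in the Miller output above.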

It can be confusing to modify the stream record while iterating over a copy of it, so instead you might find it simpler to use an out-of-stream variable in the loop and only update the stream record after the loop:

$ mlr --from data/small --opprint put '
  @sum = 0;
  for (k,v in $*) {
    if (isnumeric(v)) {
      @sum += $[k];
    }
  }
  $sum = @sum
'
a   b   i x                   y                   sum
pan pan 1 0.3467901443380824  0.7268028627434533  2.073593
eks pan 2 0.7586799647899636  0.5221511083334797  3.280831
wye wye 3 0.20460330576630303 0.33831852551664776 3.542922
eks wye 4 0.38139939387114097 0.13418874328430463 4.515588
wye pan 5 0.5732889198020006  0.8636244699032729  6.436913

For-loop over out-of-stream variable: This is similar to looping over the current stream record except for additional degrees of freedom: you can start iterating on sub-hashmaps of an out-of-stream variable; you can loop over nested keys; you can loop over all out-of-stream variables. As with for-loops over stream records, the bound variables are bound to a copy of the sub-hashmap as it was before the loop started. The sub-hashmap is specified by square-bracketed indices after in, and additional deeper indices are bound to loop key-variables. The terminal values are bound to the loop value-variable whenever the keys are neither too shallow, nor too deep. Example indexing is as follows:

# Parentheses are optional for single key:
for (k1,           v in @a["b"]["c"]) { ... }
for ((k1),         v in @a["b"]["c"]) { ... }
# Parentheses are required for multiple keys:
for ((k1, k2),     v in @a["b"]["c"]) { ... } # Loop over subhashmap of a variable
for ((k1, k2, k3), v in @a["b"]["c"]) { ... } # Ditto
for ((k1, k2, k3), v in @a) { ... }           # Loop over variable starting from basename
for ((k1, k2, k3), v in @*) { ... }           # Loop over all variables (k1 is bound to basename)

That’s confusing in the abstract, so a concrete example is in order. Suppose the out-of-stream variable @myvar is populated as follows:

$ mlr --opprint --from data/small head -n 2 then put -q '
  begin {
    @myvar["nesting-is-too-shallow"] = 1;
    @myvar["nesting-is"]["just-right"] = 2;
    @myvar["nesting-is"]["also-just-right"] = 3;
    @myvar["nesting"]["is"]["too-deep"] = 4;
  }
  end {
    dump
  }
'
{
  "myvar": {
    "nesting-is-too-shallow": 1,
    "nesting-is": {
      "just-right": 2,
      "also-just-right": 3
    },
    "nesting": {
      "is": {
        "too-deep": 4
      }
    }
  }
}

Then the too-shallow parts — indexed by the basename myvar and the index "nesting-is-too-shallow" — have depth two (basename and one index specify a terminal value) and can be gotten as follows:

$ mlr --from data/small head -n 2 then put -q '
  begin {
    @myvar["nesting-is-too-shallow"] = 1;
    @myvar["nesting-is"]["just-right"] = 2;
    @myvar["nesting-is"]["also-just-right"] = 3;
    @myvar["nesting"]["is"]["too-deep"] = 4;
  }
  end {
    for (k, v in @myvar) {
      @terminal[k] = v
    }
    emit @terminal, "index1"
  }
'
index1=nesting-is-too-shallow,terminal=1

$ mlr --from data/small head -n 2 then put -q '
  begin {
    @myvar["nesting-is-too-shallow"] = 1;
    @myvar["nesting-is"]["just-right"] = 2;
    @myvar["nesting-is"]["also-just-right"] = 3;
    @myvar["nesting"]["is"]["too-deep"] = 4;
  }
  end {
    for ((k1, k2), v in @*) {
      @terminal[k1][k2] = v
    }
    emit @terminal, "basename", "index1"
  }
'
basename=myvar,index1=nesting-is-too-shallow,terminal=1

Note that it would take more than these two indices to reach the deeper values in the hashmap so they aren’t bound in either of these for-loops.

By contrast, the "just-right" parts have depth three (basename and two indices specify a terminal value) and can be gotten at by any of the following:

$ mlr --from data/small head -n 2 then put -q '
  begin {
    @myvar["nesting-is-too-shallow"] = 1;
    @myvar["nesting-is"]["just-right"] = 2;
    @myvar["nesting-is"]["also-just-right"] = 3;
    @myvar["nesting"]["is"]["too-deep"] = 4;
  }
  end {
    for ((k1), v in @myvar["nesting-is"]) {
      @terminal[k1] = v
    }
    emit @terminal, "index1"
  }
'
index1=just-right,terminal=2
index1=also-just-right,terminal=3

$ mlr --from data/small head -n 2 then put -q '
  begin {
    @myvar["nesting-is-too-shallow"] = 1;
    @myvar["nesting-is"]["just-right"] = 2;
    @myvar["nesting-is"]["also-just-right"] = 3;
    @myvar["nesting"]["is"]["too-deep"] = 4;
  }
  end {
    for ((k1, k2), v in @myvar) {
      @terminal[k1][k2] = v
    }
    emit @terminal, "index1", "index2"
  }
'
index1=nesting-is,index2=just-right,terminal=2
index1=nesting-is,index2=also-just-right,terminal=3

$ mlr --from data/small head -n 2 then put -q '
  begin {
    @myvar["nesting-is-too-shallow"] = 1;
    @myvar["nesting-is"]["just-right"] = 2;
    @myvar["nesting-is"]["also-just-right"] = 3;
    @myvar["nesting"]["is"]["too-deep"] = 4;
  }
  end {
    for ((k1, k2, k3), v in @*) {
      @terminal[k1][k2][k3] = v
    }
    emit @terminal, "basename", "index1", "index2"
  }
'
basename=myvar,index1=nesting-is,index2=just-right,terminal=2
basename=myvar,index1=nesting-is,index2=also-just-right,terminal=3

Note that three key levels are specified here: basename and two indices. So these for-loops don’t produce the depth-two or depth-four entries in the hashmap.
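The "neither too shallow, nor too deep" rule can be summarized with a short Python model. It is a sketch of the behavior demonstrated in the examples above rather than Miller's actual code; bind_keys is a hypothetical helper, and nested dicts stand in for out-of-stream variables.

```python
def bind_keys(node, nkeys, prefix=()):
    # Yield (keys, value) only for terminal values exactly nkeys levels down.
    for key, value in node.items():
        path = prefix + (key,)
        if len(path) == nkeys:
            if not isinstance(value, dict):    # too-deep entries are skipped
                yield path, value
        elif isinstance(value, dict):          # too-shallow terminals are skipped
            yield from bind_keys(value, nkeys, path)

myvar = {
    "nesting-is-too-shallow": 1,
    "nesting-is": {"just-right": 2, "also-just-right": 3},
    "nesting": {"is": {"too-deep": 4}},
}

one_key = list(bind_keys(myvar, 1))                # like: for (k, v in @myvar)
two_keys = list(bind_keys(myvar, 2))               # like: for ((k1, k2), v in @myvar)
all_vars = list(bind_keys({"myvar": myvar}, 3))    # like: for ((k1, k2, k3), v in @*)
```

Here one_key holds only the too-shallow entry, two_keys holds the two just-right entries, and all_vars holds the same two entries with the basename myvar as the first key.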

C-style triple-for loops are supported as follows:

$ mlr --from data/small --opprint put '
  local suma = 0;
  local sumb = 0;
  for (local a = 1, local b = 1; a <= NR; a += 1, b *= 2) {
    suma += a;
    sumb += b;
  }
  $suma = suma;
  $sumb = sumb;
'
a   b   i x                   y                   suma sumb
pan pan 1 0.3467901443380824  0.7268028627434533  1    1
eks pan 2 0.7586799647899636  0.5221511083334797  3    3
wye wye 3 0.20460330576630303 0.33831852551664776 6    7
eks wye 4 0.38139939387114097 0.13418874328430463 10   15
wye pan 5 0.5732889198020006  0.8636244699032729  15   31

Notes:

  • In for (start; continuation; update) { body }, the start and update statements may be empty, single statements, or multiple comma-separated statements.
  • In particular, you may use $-variables and/or @-variables in the start, continuation, and/or update steps (as well as the body, of course).
  • Miller has no comma operator; the continuation must be a single expression which evaluates to boolean.
  • As with all for/if/while statements in Miller, the curly braces are required even if the body is a single statement, or empty.
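For comparison, the triple-for example above unrolls into the following Python, with a while loop making the start, continuation, and update steps explicit. The function name triple_for is made up for illustration, and nr stands in for NR.

```python
def triple_for(nr):
    suma = 0
    sumb = 0
    a, b = 1, 1            # start: local a = 1, local b = 1
    while a <= nr:         # continuation: a <= NR
        suma += a
        sumb += b
        a += 1             # update: a += 1, b *= 2
        b *= 2
    return suma, sumb

results = [triple_for(nr) for nr in range(1, 6)]
print(results)  # [(1, 1), (3, 3), (6, 7), (10, 15), (15, 31)]
```

The pairs agree with the suma and sumb columns in the Miller output above for NR from 1 to 5.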

Begin/end blocks for put

Miller supports an awk-like begin/end syntax. The statements in the begin block are executed before any input records are read; the statements in the end block are executed after the last input record is read. (If you want to execute some statement at the start of each file, not at the start of the first file as with begin, you might use a pattern/action block of the form FNR == 1 { ... }.) All statements outside of begin or end are, of course, executed on every input record. Semicolons separate statements inside or outside of begin/end blocks; semicolons are required between begin/end block bodies and any subsequent statement. For example:

$ mlr put '
  begin { @x_sum = 0 };
  @x_sum += $x;
  end { emit @x_sum }
' ../data/small
a=pan,b=pan,i=1,x=0.3467901443380824,y=0.7268028627434533
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797
a=wye,b=wye,i=3,x=0.20460330576630303,y=0.33831852551664776
a=eks,b=wye,i=4,x=0.38139939387114097,y=0.13418874328430463
a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729
a=zee,b=pan,i=6,x=0.5271261600918548,y=0.49322128674835697
a=eks,b=zee,i=7,x=0.6117840605678454,y=0.1878849191181694
a=zee,b=wye,i=8,x=0.5985540091064224,y=0.976181385699006
a=hat,b=wye,i=9,x=0.03144187646093577,y=0.7495507603507059
a=pan,b=wye,i=10,x=0.5026260055412137,y=0.9526183602969864
x_sum=4.536294

Since uninitialized out-of-stream variables default to 0 for addition/subtraction and 1 for multiplication when they appear on expression right-hand sides (as in awk), the above can be written more succinctly as

$ mlr put '
  @x_sum += $x;
  end { emit @x_sum }
' ../data/small
a=pan,b=pan,i=1,x=0.3467901443380824,y=0.7268028627434533
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797
a=wye,b=wye,i=3,x=0.20460330576630303,y=0.33831852551664776
a=eks,b=wye,i=4,x=0.38139939387114097,y=0.13418874328430463
a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729
a=zee,b=pan,i=6,x=0.5271261600918548,y=0.49322128674835697
a=eks,b=zee,i=7,x=0.6117840605678454,y=0.1878849191181694
a=zee,b=wye,i=8,x=0.5985540091064224,y=0.976181385699006
a=hat,b=wye,i=9,x=0.03144187646093577,y=0.7495507603507059
a=pan,b=wye,i=10,x=0.5026260055412137,y=0.9526183602969864
x_sum=4.536294

The put -q option is a shorthand which suppresses printing of each output record, with only emit statements being output. So to get only summary outputs, one could write

$ mlr put -q '
  @x_sum += $x;
  end { emit @x_sum }
' ../data/small
x_sum=4.536294

We can do similarly with multiple out-of-stream variables:

$ mlr put -q '
  @x_count += 1;
  @x_sum += $x;
  end {
    emit @x_count;
    emit @x_sum;
  }
' ../data/small
x_count=10
x_sum=4.536294

This is of course not much different than

$ mlr stats1 -a count,sum -f x ../data/small
x_count=10,x_sum=4.536294

Note that it’s a syntax error for begin/end blocks to refer to field names (beginning with $), since these execute outside the context of input records.
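The begin/main/end lifecycle can be pictured with a small Python model of the put -q example with two accumulators. This is a sketch only: run_put is a hypothetical name, a dict stands in for the out-of-stream variables, dict.get with a default of 0 models the absent-defaults-to-zero rule for +=, and the input is the x column of the five-record data/small.

```python
def run_put(records):
    oosvars = {}     # out-of-stream variables persist across records
    # A begin block would run here, before any record is read.
    for record in records:               # main statements: once per record
        oosvars["x_count"] = oosvars.get("x_count", 0) + 1
        oosvars["x_sum"] = oosvars.get("x_sum", 0) + record["x"]
    # End block: runs once, after the last record; one record per emit.
    return [{name: oosvars[name]} for name in ("x_count", "x_sum") if name in oosvars]

xs = [0.3467901443380824, 0.7586799647899636, 0.20460330576630303,
      0.38139939387114097, 0.5732889198020006]
emitted = run_put({"x": x} for x in xs)
```

Because the records are consumed one at a time inside the loop, there is no current record once the loop has finished, which is the same reason end blocks cannot refer to $-variables.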

Output statements for put

You can output built-in variables indirectly, by assigning them to a non-built-in variable: e.g. $nr = NR adds a field named nr to each output record, containing the value of NR as of when that record was ingested.

You can output out-of-stream variables in four ways: (1) assign them to stream-record fields, e.g. $cumulative_sum = @sum; (2) use emit/emitp/emitf, e.g. @sum += $x; emit @sum which produces an extra output record such as sum=3.1648382; (3) use the dump or edump keywords, which immediately print all out-of-stream variables as a JSON data structure to the standard output or standard error (respectively), or (4) use the print or eprint keywords which immediately print an expression to standard output or standard error, respectively. Note that dump, edump, print, and eprint don’t output records which participate in then-chaining; rather, they’re just immediate prints to stdout/stderr. The printn and eprintn keywords are the same as print and eprint except that they don’t print final newlines.

Emit statements for put

As noted above, the ways to output out-of-stream variables include: (1) assigning them to stream-record fields, e.g. $cumulative_sum = @sum; (2) using emit, e.g. @sum += $x; emit @sum which produces an extra output record such as sum=3.1648382; (3) using the dump keyword, which immediately prints all out-of-stream variables to the standard output as a JSON data structure. Note that dump output doesn’t consist of records which participate in then-chaining; rather, it’s just an immediate print to stdout. This section is about emit.

There are three variants: emitf, emit, and emitp. Keep in mind that out-of-stream variables are a nested, multi-level hashmap (directly viewable as JSON using dump), whereas Miller output records are lists of single-level key-value pairs. The three emit variants allow you to control how the multilevel hashmaps are flattened down to output records.

Use emitf to output several out-of-stream variables side-by-side in the same output record. For emitf these mustn’t have indexing using @name[...]. Example:

$ mlr put -q '@count += 1; @x_sum += $x; @y_sum += $y; end { emitf @count, @x_sum, @y_sum}' data/small
count=5,x_sum=2.264762,y_sum=2.585086

Use emit to output an out-of-stream variable. If it’s non-indexed you’ll get a simple key-value pair:

$ cat data/small
a=pan,b=pan,i=1,x=0.3467901443380824,y=0.7268028627434533
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797
a=wye,b=wye,i=3,x=0.20460330576630303,y=0.33831852551664776
a=eks,b=wye,i=4,x=0.38139939387114097,y=0.13418874328430463
a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729

$ mlr put -q '@sum += $x; end { dump }' data/small
{
  "sum": 2.264762
}

$ mlr put -q '@sum += $x; end { emit @sum }' data/small
sum=2.264762

If it’s indexed then use as many names after emit as there are indices:

$ mlr put -q '@sum[$a] += $x; end { dump }' data/small
{
  "sum": {
    "pan": 0.346790,
    "eks": 1.140079,
    "wye": 0.777892
  }
}

$ mlr put -q '@sum[$a] += $x; end { emit @sum, "a" }' data/small
a=pan,sum=0.346790
a=eks,sum=1.140079
a=wye,sum=0.777892

$ mlr put -q '@sum[$a][$b] += $x; end { dump }' data/small
{
  "sum": {
    "pan": {
      "pan": 0.346790
    },
    "eks": {
      "pan": 0.758680,
      "wye": 0.381399
    },
    "wye": {
      "wye": 0.204603,
      "pan": 0.573289
    }
  }
}

$ mlr put -q '@sum[$a][$b] += $x; end { emit @sum, "a", "b" }' data/small
a=pan,b=pan,sum=0.346790
a=eks,b=pan,sum=0.758680
a=eks,b=wye,sum=0.381399
a=wye,b=wye,sum=0.204603
a=wye,b=pan,sum=0.573289

$ mlr put -q '@sum[$a][$b][$i] += $x; end { dump }' data/small
{
  "sum": {
    "pan": {
      "pan": {
        "1": 0.346790
      }
    },
    "eks": {
      "pan": {
        "2": 0.758680
      },
      "wye": {
        "4": 0.381399
      }
    },
    "wye": {
      "wye": {
        "3": 0.204603
      },
      "pan": {
        "5": 0.573289
      }
    }
  }
}

$ mlr put -q '@sum[$a][$b][$i] += $x; end { emit @sum, "a", "b", "i" }' data/small
a=pan,b=pan,i=1,sum=0.346790
a=eks,b=pan,i=2,sum=0.758680
a=eks,b=wye,i=4,sum=0.381399
a=wye,b=wye,i=3,sum=0.204603
a=wye,b=pan,i=5,sum=0.573289

Now for emitp: if you have as many names following emit as there are levels in the out-of-stream variable’s hashmap, then emit and emitp do the same thing. Where they differ is when you don’t specify as many names as there are hashmap levels. In this case, Miller needs to flatten multiple map indices down to output-record keys: emitp includes full prefixing (hence the p in emitp) while emit takes the deepest hashmap key as the output-record key:

$ mlr put -q '@sum[$a][$b] += $x; end { dump }' data/small
{
  "sum": {
    "pan": {
      "pan": 0.346790
    },
    "eks": {
      "pan": 0.758680,
      "wye": 0.381399
    },
    "wye": {
      "wye": 0.204603,
      "pan": 0.573289
    }
  }
}

$ mlr put -q '@sum[$a][$b] += $x; end { emit @sum, "a" }' data/small
a=pan,pan=0.346790
a=eks,pan=0.758680,wye=0.381399
a=wye,wye=0.204603,pan=0.573289

$ mlr put -q '@sum[$a][$b] += $x; end { emit @sum }' data/small
pan=0.346790
pan=0.758680,wye=0.381399
wye=0.204603,pan=0.573289

$ mlr put -q '@sum[$a][$b] += $x; end { emitp @sum, "a" }' data/small
a=pan,sum:pan=0.346790
a=eks,sum:pan=0.758680,sum:wye=0.381399
a=wye,sum:wye=0.204603,sum:pan=0.573289

$ mlr put -q '@sum[$a][$b] += $x; end { emitp @sum }' data/small
sum:pan:pan=0.346790,sum:eks:pan=0.758680,sum:eks:wye=0.381399,sum:wye:wye=0.204603,sum:wye:pan=0.573289

$ mlr --oxtab put -q '@sum[$a][$b] += $x; end { emitp @sum }' data/small
sum:pan:pan 0.346790
sum:eks:pan 0.758680
sum:eks:wye 0.381399
sum:wye:wye 0.204603
sum:wye:pan 0.573289

Use --oflatsep to specify the character which joins multilevel keys for emitp (it defaults to a colon):

$ mlr put -q --oflatsep / '@sum[$a][$b] += $x; end { emitp @sum, "a" }' data/small
a=pan,sum/pan=0.346790
a=eks,sum/pan=0.758680,sum/wye=0.381399
a=wye,sum/wye=0.204603,sum/pan=0.573289

$ mlr put -q --oflatsep / '@sum[$a][$b] += $x; end { emitp @sum }' data/small
sum/pan/pan=0.346790,sum/eks/pan=0.758680,sum/eks/wye=0.381399,sum/wye/wye=0.204603,sum/wye/pan=0.573289

$ mlr --oxtab put -q --oflatsep / '@sum[$a][$b] += $x; end { emitp @sum }' data/small
sum/pan/pan 0.346790
sum/eks/pan 0.758680
sum/eks/wye 0.381399
sum/wye/wye 0.204603
sum/wye/pan 0.573289
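The flattening rules for emit and emitp can be restated as a short Python model. It is a sketch that reproduces the outputs shown above for maps of uniform depth, with at least as many levels as names; it is not Miller's implementation, and the helper names (split_by_names, leaf_records, flatten) are made up for illustration.

```python
def split_by_names(node, names):
    # Consume one map level per name, yielding (index_fields, subtree).
    if not names:
        yield {}, node
        return
    for key, child in node.items():
        for fields, subtree in split_by_names(child, names[1:]):
            yield {names[0]: key, **fields}, subtree

def leaf_records(node):
    # emit: descend to the deepest maps and use their keys as record keys.
    if all(not isinstance(v, dict) for v in node.values()):
        yield dict(node)
    else:
        for child in node.values():
            if isinstance(child, dict):
                yield from leaf_records(child)

def emit(name, node, names=()):
    for fields, subtree in split_by_names(node, names):
        if isinstance(subtree, dict):
            for rec in leaf_records(subtree):
                yield {**fields, **rec}
        else:
            yield {**fields, name: subtree}

def flatten(prefix, node, sep):
    # emitp: join the variable name and remaining keys with the separator.
    if not isinstance(node, dict):
        return {prefix: node}
    out = {}
    for key, child in node.items():
        out.update(flatten(prefix + sep + str(key), child, sep))
    return out

def emitp(name, node, names=(), sep=":"):
    for fields, subtree in split_by_names(node, names):
        yield {**fields, **flatten(name, subtree, sep)}

sums = {"pan": {"pan": 0.346790},
        "eks": {"pan": 0.758680, "wye": 0.381399},
        "wye": {"wye": 0.204603, "pan": 0.573289}}

emit_a = list(emit("sum", sums, ("a",)))
emitp_a = list(emitp("sum", sums, ("a",)))
emitp_all = list(emitp("sum", sums, sep="/"))
```

With the sums map from the examples above, emit_a has keys such as a and pan, emitp_a has keys such as a and sum:pan, and emitp_all is a single record whose keys look like sum/eks/wye. The sep parameter plays the role of --oflatsep.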

Multi-emit statements for put

You can emit multiple out-of-stream variables side-by-side by including their names in parentheses:

$ mlr --from data/medium --opprint put -q '
  @x_count[$a][$b] += 1;
  @x_sum[$a][$b] += $x;
  end {
      for ((a, b), _ in @x_count) {
          @x_mean[a][b] = @x_sum[a][b] / @x_count[a][b]
      }
      emit (@x_sum, @x_count, @x_mean), "a", "b"
  }
'
a   b   x_sum      x_count x_mean
pan pan 219.185129 427     0.513314
pan wye 198.432931 395     0.502362
pan eks 216.075228 429     0.503672
pan hat 205.222776 417     0.492141
pan zee 205.097518 413     0.496604
eks pan 179.963030 371     0.485076
eks wye 196.945286 407     0.483895
eks zee 176.880365 357     0.495463
eks eks 215.916097 413     0.522799
eks hat 208.783171 417     0.500679
wye wye 185.295850 377     0.491501
wye pan 195.847900 392     0.499612
wye hat 212.033183 426     0.497730
wye zee 194.774048 385     0.505907
wye eks 204.812961 386     0.530604
zee pan 202.213804 389     0.519830
zee wye 233.991394 455     0.514267
zee eks 190.961778 391     0.488393
zee zee 206.640635 403     0.512756
zee hat 191.300006 409     0.467726
hat wye 208.883010 423     0.493813
hat zee 196.349450 385     0.509999
hat eks 189.006793 389     0.485879
hat hat 182.853532 381     0.479931
hat pan 168.553807 363     0.464336

What this does is walk through the first out-of-stream variable (@x_sum in this example) as usual, then for each keylist found (e.g. pan,wye), include the values for the remaining out-of-stream variables (here, @x_count and @x_mean). You should use this when all out-of-stream variables in the emit statement have the same shape and the same keylists.
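That walk can be modeled in a few lines of Python. This sketch covers only the case where the number of names equals the depth of the maps; emit_lashed is a hypothetical name, the toy values are invented, and nested dicts stand in for the out-of-stream variables.

```python
def emit_lashed(variables, names):
    # variables: dict of variable name -> nested map, all with the same shape.
    first = next(iter(variables.values()))

    def keylists(node, keys):
        # Enumerate keylists by walking the first variable only.
        if len(keys) == len(names):
            yield keys
        else:
            for key, child in node.items():
                yield from keylists(child, keys + (key,))

    for keys in keylists(first, ()):
        record = dict(zip(names, keys))
        for var_name, var_map in variables.items():
            value = var_map
            for key in keys:           # look up the same keylist in each variable
                value = value[key]
            record[var_name] = value
        yield record

x_sum = {"pan": {"pan": 2.0, "wye": 3.0}}
x_count = {"pan": {"pan": 4, "wye": 6}}
lashed = list(emit_lashed({"x_sum": x_sum, "x_count": x_count}, ("a", "b")))
```

The lookup value[key] raises KeyError in this model if a later variable lacks a keylist found in the first one, which illustrates why the variables should share the same shape.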

Emit-all statements for put

Use emit all (or emit @* which is synonymous) to output all out-of-stream variables. You can use the following idiom to get various accumulators output side-by-side (reminiscent of mlr stats1):

$ mlr --from data/small --opprint put -q '@v[$a][$b]["sum"] += $x; @v[$a][$b]["count"] += 1; end{emit @*,"a","b"}'
a   b   sum      count
pan pan 0.346790 1
eks pan 0.758680 1
eks wye 0.381399 1
wye wye 0.204603 1
wye pan 0.573289 1

$ mlr --from data/small --opprint put -q '@sum[$a][$b] += $x; @count[$a][$b] += 1; end{emit @*,"a","b"}'
a   b   sum
pan pan 0.346790
eks pan 0.758680
eks wye 0.381399
wye wye 0.204603
wye pan 0.573289

a   b   count
pan pan 1
eks pan 1
eks wye 1
wye wye 1
wye pan 1

$ mlr --from data/small --opprint put -q '@sum[$a][$b] += $x; @count[$a][$b] += 1; end{emit (@sum, @count),"a","b"}'
a   b   sum      count
pan pan 0.346790 1
eks pan 0.758680 1
eks wye 0.381399 1
wye wye 0.204603 1
wye pan 0.573289 1

Redirected-output statements for put

The tee, emitf, emitp, emit, print, and dump keywords all allow you to redirect output to one or more files or pipe-to commands. The filenames/commands are strings which can be constructed using record-dependent values, so you can do things like splitting a table into multiple files, one for each account ID, and so on.
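The table-splitting idea can be sketched in Python as follows. This models something like tee > "data-".$a, $* under stated assumptions: the function name tee_by_field, the directory argument, and the key=value line format are illustrative choices, not Miller's code. It keeps one open handle per distinct file name and truncates each file only when it is first opened, not on every record.

```python
import os
import tempfile

def tee_by_field(records, field, directory):
    handles = {}                               # one open file per distinct name
    try:
        for rec in records:
            path = os.path.join(directory, "data-" + str(rec[field]))
            if path not in handles:
                handles[path] = open(path, "w")    # overwrite on first write only
            line = ",".join(f"{k}={v}" for k, v in rec.items())
            handles[path].write(line + "\n")
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(handles)

records = [{"a": "pan", "x": 1}, {"a": "eks", "x": 2}, {"a": "pan", "x": 3}]
with tempfile.TemporaryDirectory() as tmp:
    paths = tee_by_field(records, "a", tmp)
    with open(os.path.join(tmp, "data-pan")) as f:
        pan_contents = f.read()
```

Two files result, data-eks and data-pan, and data-pan holds both pan records because its handle stayed open between writes.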

Details:

  • mlr put sends the current record (possibly modified by the put expression) to the output record stream. Records are then input to the following verb in a then-chain (if any), else printed to standard output (unless put -q). The tee keyword additionally writes the output record to specified file(s) or pipe-to command, or immediately to stdout/stderr.

    $ mlr --help-keyword tee
    tee: prints the current record to specified file.
      This is an immediate print to the specified file (except for pprint format
      which of course waits until the end of the input stream to format all output).
    
      The > and >> are for write and append, as in the shell, but (as with awk) the
      file-overwrite for > is on first write, not per record. The | is for piping to
      a process which will process the data. There will be one open file for each
      distinct file name (for > and >>) or one subordinate process for each distinct
      value of the piped-to command (for |). Output-formatting flags are taken from
      the main command line.
    
      You can use any of the output-format command-line flags, e.g. --ocsv, --ofs,
      etc., to control the format of the output. See also mlr -h.
    
      Example: mlr --from f.dat put 'tee >  "/tmp/data-".$a, $*'
      Example: mlr --from f.dat put 'tee >> "/tmp/data-".$a.$b, $*'
      Example: mlr --from f.dat put 'tee >  stderr, $*'
      Example: mlr --from f.dat put -q 'tee | "tr [a-z\] [A-Z\]", $*'
      Example: mlr --from f.dat put -q 'tee | "tr [a-z\] [A-Z\] > /tmp/data-".$a, $*'
      Example: mlr --from f.dat put -q 'tee | "gzip > /tmp/data-".$a.".gz", $*'
      Example: mlr --from f.dat put -q --ojson 'tee | "gzip > /tmp/data-".$a.".gz", $*'
    

  • mlr put’s emitf, emitp, and emit send out-of-stream variables to the output record stream. These are then input to the following verb in a then-chain (if any), else printed to standard output. When redirected with >, >>, or |, they instead write the out-of-stream variable(s) to specified file(s) or pipe-to command, or immediately to stdout/stderr.

    $ mlr --help-keyword emitf
    emitf: inserts non-indexed out-of-stream variable(s) side-by-side into the
      output record stream.
    
      With >, >>, or |, the data do not become part of the output record stream but
      are instead redirected.
    
      The > and >> are for write and append, as in the shell, but (as with awk) the
      file-overwrite for > is on first write, not per record. The | is for piping to
      a process which will process the data. There will be one open file for each
      distinct file name (for > and >>) or one subordinate process for each distinct
      value of the piped-to command (for |). Output-formatting flags are taken from
      the main command line.
    
      You can use any of the output-format command-line flags, e.g. --ocsv, --ofs,
      etc., to control the format of the output if the output is redirected. See also mlr -h.
    
      Example: mlr --from f.dat put '@a=$i;@b+=$x;@c+=$y; emitf @a'
      Example: mlr --from f.dat put --oxtab '@a=$i;@b+=$x;@c+=$y; emitf > "tap-".$i.".dat", @a'
      Example: mlr --from f.dat put '@a=$i;@b+=$x;@c+=$y; emitf @a, @b, @c'
      Example: mlr --from f.dat put '@a=$i;@b+=$x;@c+=$y; emitf > "mytap.dat", @a, @b, @c'
      Example: mlr --from f.dat put '@a=$i;@b+=$x;@c+=$y; emitf >> "mytap.dat", @a, @b, @c'
      Example: mlr --from f.dat put '@a=$i;@b+=$x;@c+=$y; emitf > stderr, @a, @b, @c'
      Example: mlr --from f.dat put '@a=$i;@b+=$x;@c+=$y; emitf | "grep somepattern", @a, @b, @c'
      Example: mlr --from f.dat put '@a=$i;@b+=$x;@c+=$y; emitf | "grep somepattern > mytap.dat", @a, @b, @c'
    
      Please see http://johnkerl.org/miller/doc for more information.
    

    $ mlr --help-keyword emitp
    emitp: inserts an out-of-stream variable into the output record stream.
      Hashmap indices present in the data but not slotted by emitp arguments are
      output concatenated with ":".
    
      With >, >>, or |, the data do not become part of the output record stream but
      are instead redirected.
    
      The > and >> are for write and append, as in the shell, but (as with awk) the
      file-overwrite for > is on first write, not per record. The | is for piping to
      a process which will process the data. There will be one open file for each
      distinct file name (for > and >>) or one subordinate process for each distinct
      value of the piped-to command (for |). Output-formatting flags are taken from
      the main command line.
    
      You can use any of the output-format command-line flags, e.g. --ocsv, --ofs,
      etc., to control the format of the output if the output is redirected. See also mlr -h.
    
      Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emitp @sums'
      Example: mlr --from f.dat put --opprint '@sums[$a][$b]+=$x; emitp > "tap-".$a.$b.".dat", @sums'
      Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emitp @sums, "index1", "index2"'
      Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emitp @*, "index1", "index2"'
      Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emitp >  "mytap.dat", @*, "index1", "index2"'
      Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emitp >> "mytap.dat", @*, "index1", "index2"'
      Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emitp | "gzip > mytap.dat.gz", @*, "index1", "index2"'
      Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emitp > stderr, @*, "index1", "index2"'
      Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emitp | "grep somepattern", @*, "index1", "index2"'
    
      Please see http://johnkerl.org/miller/doc for more information.
    

    $ mlr --help-keyword emit
    emit: inserts an out-of-stream variable into the output record stream. Hashmap
      indices present in the data but not slotted by emit arguments are not output.
    
      With >, >>, or |, the data do not become part of the output record stream but
      are instead redirected.
    
      The > and >> are for write and append, as in the shell, but (as with awk) the
      file-overwrite for > is on first write, not per record. The | is for piping to
      a process which will process the data. There will be one open file for each
      distinct file name (for > and >>) or one subordinate process for each distinct
      value of the piped-to command (for |). Output-formatting flags are taken from
      the main command line.
    
      You can use any of the output-format command-line flags, e.g. --ocsv, --ofs,
      etc., to control the format of the output if the output is redirected. See also mlr -h.
    
      Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emit @sums'
      Example: mlr --from f.dat put --ojson '@sums[$a][$b]+=$x; emit > "tap-".$a.$b.".dat", @sums'
      Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emit @sums, "index1", "index2"'
      Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emit @*, "index1", "index2"'
      Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emit >  "mytap.dat", @*, "index1", "index2"'
      Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emit >> "mytap.dat", @*, "index1", "index2"'
      Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emit | "gzip > mytap.dat.gz", @*, "index1", "index2"'
      Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emit > stderr, @*, "index1", "index2"'
      Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emit | "grep somepattern", @*, "index1", "index2"'
    
      Please see http://johnkerl.org/miller/doc for more information.
    

  • The print and dump keywords produce output immediately to standard output, or to specified file(s) or pipe-to command if present.

    $ mlr --help-keyword print
    print: prints expression immediately to stdout.
      Example: mlr --from f.dat put -q 'print "The sum of x and y is ".($x+$y)'
      Example: mlr --from f.dat put -q 'for (k, v in $*) { print k . " => " . v }'
      Example: mlr --from f.dat put  '(NR % 1000 == 0) { print > stderr, "Checkpoint ".NR}'
    

    $ mlr --help-keyword dump
    dump: prints all currently defined out-of-stream variables immediately
      to stdout as JSON.
    
      With >, >>, or |, the data do not become part of the output record stream but
      are instead redirected.
    
      The > and >> are for write and append, as in the shell, but (as with awk) the
      file-overwrite for > is on first write, not per record. The | is for piping to
      a process which will process the data. There will be one open file for each
      distinct file name (for > and >>) or one subordinate process for each distinct
      value of the piped-to command (for |). Output-formatting flags are taken from
      the main command line.
    
      Example: mlr --from f.dat put -q '@v[NR]=$*; end { dump }'
      Example: mlr --from f.dat put -q '@v[NR]=$*; end { dump >  "mytap.dat"}'
      Example: mlr --from f.dat put -q '@v[NR]=$*; end { dump >> "mytap.dat"}'
      Example: mlr --from f.dat put -q '@v[NR]=$*; end { dump | "jq .[]"}'
    

Unset statements for put

You can clear a map key by assigning the empty string as its value: $x="" or @x="". Using unset you can remove the key entirely. Examples:

$ cat data/small
a=pan,b=pan,i=1,x=0.3467901443380824,y=0.7268028627434533
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797
a=wye,b=wye,i=3,x=0.20460330576630303,y=0.33831852551664776
a=eks,b=wye,i=4,x=0.38139939387114097,y=0.13418874328430463
a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729

$ mlr put 'unset $x, $a' data/small
b=pan,i=1,y=0.7268028627434533
b=pan,i=2,y=0.5221511083334797
b=wye,i=3,y=0.33831852551664776
b=wye,i=4,y=0.13418874328430463
b=pan,i=5,y=0.8636244699032729

This can also be done, of course, using mlr cut -x. You can also clear out-of-stream variables, at the base name level, or at an indexed sublevel:

$ mlr put -q '@sum[$a][$b] += $x; end { dump; unset @sum; dump }' data/small
{
  "sum": {
    "pan": {
      "pan": 0.346790
    },
    "eks": {
      "pan": 0.758680,
      "wye": 0.381399
    },
    "wye": {
      "wye": 0.204603,
      "pan": 0.573289
    }
  }
}
{
}

$ mlr put -q '@sum[$a][$b] += $x; end { dump; unset @sum["eks"]; dump }' data/small
{
  "sum": {
    "pan": {
      "pan": 0.346790
    },
    "eks": {
      "pan": 0.758680,
      "wye": 0.381399
    },
    "wye": {
      "wye": 0.204603,
      "pan": 0.573289
    }
  }
}
{
  "sum": {
    "pan": {
      "pan": 0.346790
    },
    "wye": {
      "wye": 0.204603,
      "pan": 0.573289
    }
  }
}

If you use unset all (or unset @* which is synonymous), that will unset all out-of-stream variables which have been defined up to that point.
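
The difference between clearing a value and unsetting a key can be pictured with ordinary Python dictionaries. This is only an analogy, assuming records and out-of-stream variables behave like (nested) maps as described above; the variable names are made up for the sketch.

```python
record = {"a": "pan", "b": "pan", "i": "1", "x": "0.3467901443380824"}

# Assigning the empty string keeps the key, with an empty value.
cleared = dict(record)
cleared["x"] = ""

# Unsetting removes the keys entirely, like: unset $x, $a
unset_record = dict(record)
for key in ("x", "a"):
    unset_record.pop(key, None)

# Out-of-stream variables modeled as a nested dict, like @sum[$a][$b].
oosvars = {
    "sum": {
        "pan": {"pan": 0.346790},
        "eks": {"pan": 0.758680, "wye": 0.381399},
    }
}
oosvars["sum"].pop("eks", None)          # like: unset @sum["eks"]
after_sublevel = dict(oosvars["sum"])    # snapshot after the sublevel unset
oosvars.clear()                          # like: unset all, or unset @*
```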

Filter statements for put

You can use filter within put. In fact, the following two are synonymous:

$ mlr filter 'NR==2 || NR==3' data/small
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797
a=wye,b=wye,i=3,x=0.20460330576630303,y=0.33831852551664776

$ mlr put 'filter NR==2 || NR==3' data/small
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797
a=wye,b=wye,i=3,x=0.20460330576630303,y=0.33831852551664776

The former, of course, is much easier to type. But the latter allows you to define more complex expressions for the filter, and/or do other things in addition to the filter:

$ mlr put '@running_sum += $x; filter @running_sum > 1.3' data/small
a=wye,b=wye,i=3,x=0.20460330576630303,y=0.33831852551664776
a=eks,b=wye,i=4,x=0.38139939387114097,y=0.13418874328430463
a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729

$ mlr put '$z = $x * $y; filter $z > 0.3' data/small
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797,z=0.396146
a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729,z=0.495106
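
To make the running-sum example concrete, here is a small Python sketch of the same logic, using the x values from data/small. It is a model of the behavior shown above, not of Miller’s internals; the function name is hypothetical.

```python
records = [
    {"a": "pan", "i": 1, "x": 0.3467901443380824},
    {"a": "eks", "i": 2, "x": 0.7586799647899636},
    {"a": "wye", "i": 3, "x": 0.20460330576630303},
    {"a": "eks", "i": 4, "x": 0.38139939387114097},
    {"a": "wye", "i": 5, "x": 0.5732889198020006},
]


def put_running_sum_filter(records, threshold):
    """Accumulate x across records and keep records once the sum exceeds threshold."""
    running_sum = 0.0  # plays the role of @running_sum
    for record in records:
        running_sum += record["x"]
        if running_sum > threshold:  # like: filter @running_sum > 1.3
            yield record


kept = [r["i"] for r in put_running_sum_filter(records, 1.3)]
```

The accumulator is updated for every record, including those that are filtered out, which is why the first two records contribute to the sum but do not appear in the output.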

Built-in functions for filter and put

Each function takes a specific number of arguments, as shown below, except for functions marked as variadic such as min and max. (The latter compute min and max of any number of numerical arguments.) There is no notion of optional or default-on-absent arguments. All argument-passing is positional rather than by name; arguments are passed by value, not by reference.

$ mlr --help-all-functions
+ (class=arithmetic #args=2): Addition.

+ (class=arithmetic #args=1): Unary plus.

- (class=arithmetic #args=2): Subtraction.

- (class=arithmetic #args=1): Unary minus.

* (class=arithmetic #args=2): Multiplication.

/ (class=arithmetic #args=2): Division.

// (class=arithmetic #args=2): Integer division: rounds to negative (pythonic).

% (class=arithmetic #args=2): Remainder; never negative-valued (pythonic).

** (class=arithmetic #args=2): Exponentiation; same as pow, but as an infix
operator.

| (class=arithmetic #args=2): Bitwise OR.

^ (class=arithmetic #args=2): Bitwise XOR.

& (class=arithmetic #args=2): Bitwise AND.

~ (class=arithmetic #args=1): Bitwise NOT. Beware '$y=~$x' since =~ is the
regex-match operator: try '$y = ~$x'.

<< (class=arithmetic #args=2): Bitwise left-shift.

>> (class=arithmetic #args=2): Bitwise right-shift.

== (class=boolean #args=2): String/numeric equality. Mixing number and string
results in string compare.

!= (class=boolean #args=2): String/numeric inequality. Mixing number and string
results in string compare.

=~ (class=boolean #args=2): String (left-hand side) matches regex (right-hand
side), e.g. '$name =~ "^a.*b$"'.

!=~ (class=boolean #args=2): String (left-hand side) does not match regex
(right-hand side), e.g. '$name !=~ "^a.*b$"'.

> (class=boolean #args=2): String/numeric greater-than. Mixing number and string
results in string compare.

>= (class=boolean #args=2): String/numeric greater-than-or-equals. Mixing number
and string results in string compare.

< (class=boolean #args=2): String/numeric less-than. Mixing number and string
results in string compare.

<= (class=boolean #args=2): String/numeric less-than-or-equals. Mixing number
and string results in string compare.

&& (class=boolean #args=2): Logical AND.

|| (class=boolean #args=2): Logical OR.

^^ (class=boolean #args=2): Logical XOR.

! (class=boolean #args=1): Logical negation.

? : (class=boolean #args=3): Ternary operator.

isnull (class=conversion #args=1): True if argument is null (empty or absent), false otherwise

isnotnull (class=conversion #args=1): False if argument is null (empty or absent), true otherwise.

isabsent (class=conversion #args=1): False if field is present in input, true otherwise.

ispresent (class=conversion #args=1): True if field is present in input, false otherwise.

isempty (class=conversion #args=1): True if field is present in input with empty value, false otherwise.

isnotempty (class=conversion #args=1): False if field is present in input with empty value, true otherwise.

isnumeric (class=conversion #args=1): True if field is present with value inferred to be int or float

isint (class=conversion #args=1): True if field is present with value inferred to be int

isfloat (class=conversion #args=1): True if field is present with value inferred to be float

isbool (class=conversion #args=1): True if field is present with boolean value

isstring (class=conversion #args=1): True if field is present with string (including empty-string) value

boolean (class=conversion #args=1): Convert int/float/bool/string to boolean.

float (class=conversion #args=1): Convert int/float/bool/string to float.

fmtnum (class=conversion #args=2): Convert int/float/bool to string using
printf-style format string, e.g. '$s = fmtnum($n, "%06lld")'.

hexfmt (class=conversion #args=1): Convert int to string, e.g. 255 to "0xff".

int (class=conversion #args=1): Convert int/float/bool/string to int.

string (class=conversion #args=1): Convert int/float/bool/string to string.

typeof (class=conversion #args=1): Convert argument to type of argument (e.g.
MT_STRING). For debug.

. (class=string #args=2): String concatenation.

gsub (class=string #args=3): Example: '$name=gsub($name, "old", "new")'
(replace all).

strlen (class=string #args=1): String length.

sub (class=string #args=3): Example: '$name=sub($name, "old", "new")'
(replace once).

tolower (class=string #args=1): Convert string to lowercase.

toupper (class=string #args=1): Convert string to uppercase.

abs (class=math #args=1): Absolute value.

acos (class=math #args=1): Inverse trigonometric cosine.

acosh (class=math #args=1): Inverse hyperbolic cosine.

asin (class=math #args=1): Inverse trigonometric sine.

asinh (class=math #args=1): Inverse hyperbolic sine.

atan (class=math #args=1): One-argument arctangent.

atan2 (class=math #args=2): Two-argument arctangent.

atanh (class=math #args=1): Inverse hyperbolic tangent.

cbrt (class=math #args=1): Cube root.

ceil (class=math #args=1): Ceiling: nearest integer at or above.

cos (class=math #args=1): Trigonometric cosine.

cosh (class=math #args=1): Hyperbolic cosine.

erf (class=math #args=1): Error function.

erfc (class=math #args=1): Complementary error function.

exp (class=math #args=1): Exponential function e**x.

expm1 (class=math #args=1): e**x - 1.

floor (class=math #args=1): Floor: nearest integer at or below.

invqnorm (class=math #args=1): Inverse of normal cumulative distribution
function. Note that invqnorm(urand()) is normally distributed.

log (class=math #args=1): Natural (base-e) logarithm.

log10 (class=math #args=1): Base-10 logarithm.

log1p (class=math #args=1): log(1+x).

logifit (class=math #args=3): Given m and b from logistic regression, compute
fit: $yhat=logifit($x,$m,$b).

madd (class=math #args=3): a + b mod m (integers)

max (class=math variadic): max of n numbers; null loses

mexp (class=math #args=3): a ** b mod m (integers)

min (class=math variadic): Min of n numbers; null loses

mmul (class=math #args=3): a * b mod m (integers)

msub (class=math #args=3): a - b mod m (integers)

pow (class=math #args=2): Exponentiation; same as **.

qnorm (class=math #args=1): Normal cumulative distribution function.

round (class=math #args=1): Round to nearest integer.

roundm (class=math #args=2): Round to nearest multiple of m: roundm($x,$m) is
the same as round($x/$m)*$m

sgn (class=math #args=1): +1 for positive input, 0 for zero input, -1 for
negative input.

sin (class=math #args=1): Trigonometric sine.

sinh (class=math #args=1): Hyperbolic sine.

sqrt (class=math #args=1): Square root.

tan (class=math #args=1): Trigonometric tangent.

tanh (class=math #args=1): Hyperbolic tangent.

urand (class=math #args=0): Floating-point numbers on the unit interval.
Int-valued example: '$n=floor(20+urand()*11)'.

urand32 (class=math #args=0): Integer uniformly distributed between 0 and 2**32-1
inclusive.

urandint (class=math #args=2): Integer uniformly distributed between inclusive
integer endpoints.

dhms2fsec (class=time #args=1): Recovers floating-point seconds as in
dhms2fsec("5d18h53m20.250000s") = 500000.250000

dhms2sec (class=time #args=1): Recovers integer seconds as in
dhms2sec("5d18h53m20s") = 500000

fsec2dhms (class=time #args=1): Formats floating-point seconds as in
fsec2dhms(500000.25) = "5d18h53m20.250000s"

fsec2hms (class=time #args=1): Formats floating-point seconds as in
fsec2hms(5000.25) = "01:23:20.250000"

gmt2sec (class=time #args=1): Parses GMT timestamp as integer seconds since
the epoch.

hms2fsec (class=time #args=1): Recovers floating-point seconds as in
hms2fsec("01:23:20.250000") = 5000.250000

hms2sec (class=time #args=1): Recovers integer seconds as in
hms2sec("01:23:20") = 5000

sec2dhms (class=time #args=1): Formats integer seconds as in sec2dhms(500000)
= "5d18h53m20s"

sec2gmt (class=time #args=1): Formats seconds since epoch (integer part)
as GMT timestamp, e.g. sec2gmt(1440768801.7) = "2015-08-28T13:33:21Z".
Leaves non-numbers as-is.

sec2gmtdate (class=time #args=1): Formats seconds since epoch (integer part)
as GMT timestamp with year-month-date, e.g. sec2gmtdate(1440768801.7) = "2015-08-28".
Leaves non-numbers as-is.

sec2hms (class=time #args=1): Formats integer seconds as in
sec2hms(5000) = "01:23:20"

strftime (class=time #args=2): Formats seconds since epoch (integer part)
as timestamp, e.g.
strftime(1440768801.7,"%Y-%m-%dT%H:%M:%SZ") = "2015-08-28T13:33:21Z".

strptime (class=time #args=2): Parses timestamp as integer seconds since epoch,
e.g. strptime("2015-08-28T13:33:21Z","%Y-%m-%dT%H:%M:%SZ") = 1440768801.

systime (class=time #args=0): Floating-point seconds since the epoch,
e.g. 1440768801.748936.

To set the seed for urand, you may specify decimal or hexadecimal 32-bit
numbers of the form "mlr --seed 123456789" or "mlr --seed 0xcafefeed".
Miller's built-in variables are NF, NR, FNR, FILENUM, and FILENAME (awk-like)
along with the mathematical constants PI and E.
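
A few of the documented examples above can be reproduced with a short Python sketch. These helpers are stand-ins written for this illustration, not Miller’s code; in particular the zero-padding in sec2dhms is an assumption, and only the examples quoted in the function list are checked.

```python
from datetime import datetime, timezone

# Python's own // and % have the semantics the list calls "pythonic".
floor_quotient = -7 // 2   # rounds toward negative infinity
remainder = -7 % 5         # never negative for a positive modulus


def sec2dhms(secs):
    """Format integer seconds as days/hours/minutes/seconds."""
    d, rem = divmod(int(secs), 86400)
    h, rem = divmod(rem, 3600)
    m, s = divmod(rem, 60)
    return f"{d}d{h:02d}h{m:02d}m{s:02d}s"  # padding is an assumption


def sec2hms(secs):
    """Format integer seconds as HH:MM:SS."""
    h, rem = divmod(int(secs), 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"


def sec2gmt(secs):
    """Format the integer part of seconds since the epoch as a GMT timestamp."""
    stamp = datetime.fromtimestamp(int(secs), tz=timezone.utc)
    return stamp.strftime("%Y-%m-%dT%H:%M:%SZ")


def gmt2sec(stamp):
    """Parse a GMT timestamp back to integer seconds since the epoch."""
    parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())
```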

User-defined functions and subroutines

As of Miller 4.6.0 you can define your own functions, as well as subroutines.

User-defined functions

Here’s the obligatory example of a recursive function to compute the factorial function:

$ mlr --opprint --from data/small put '
    func f(n) {
        if (isnumeric(n)) {
            if (n > 0) {
                return n * f(n-1);
            } else {
                return 1;
            }
        }
        # implicitly return absent-null if non-numeric
    }
    $ox = f($x + NR);
    $oi = f($i);
'
a   b   i x                   y                   ox         oi
pan pan 1 0.3467901443380824  0.7268028627434533  0.467054   1
eks pan 2 0.7586799647899636  0.5221511083334797  3.680838   2
wye wye 3 0.20460330576630303 0.33831852551664776 1.741251   6
eks wye 4 0.38139939387114097 0.13418874328430463 18.588349  24
wye pan 5 0.5732889198020006  0.8636244699032729  211.387310 120
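
For readers who want to check the ox and oi columns by hand, the same recursion can be written in Python. This is a transliteration for illustration only; returning None stands in for Miller’s absent-null, which is an assumption of the sketch.

```python
def f(n):
    """Recursive 'factorial' that also accepts non-integer input, as above."""
    if isinstance(n, (int, float)) and not isinstance(n, bool):
        if n > 0:
            return n * f(n - 1)
        return 1
    return None  # stands in for absent-null on non-numeric input


oi = [f(i) for i in (1, 2, 3, 4, 5)]
ox_first = f(0.3467901443380824 + 1)  # first record: $x + NR with NR=1
```

For non-integer input the recursion simply keeps subtracting one until the argument is no longer positive, so the first record’s ox is 1.3467... times 0.3467... times 1.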

Properties of user-defined functions:

Uncached and cached Fibonacci examples are still to be added here.

User-defined subroutines

This section is under construction.

A note on the complexity of Miller’s expression language

One of Miller’s strengths is its brevity: it’s much quicker (and less error-prone) to type mlr stats1 -a sum -f x,y -g a,b than to track summation variables as in awk, or to use Miller’s out-of-stream variables. Paradoxically, the more language features Miller’s put-DSL has (for-loops, if-statements, nested control structures, etc.), the less powerful it begins to seem, because of the other programming-language features it doesn’t have.

When I was originally prototyping Miller in 2015, the decision I had was whether to hand-code in a low-level language like C or Rust, with my own hand-rolled DSL, or whether to use a higher-level language (like Python or Lua or Nim) and let the put statements be handled by the implementation language’s own eval: the implementation language would take the place of a DSL. Multiple performance experiments showed me I could get better throughput using the former, and using C in particular — by a wide margin. So Miller is C under the hood with a hand-rolled DSL.

I do want to keep focusing on what Miller is good at — concise notation, low latency, and high throughput — and not add too much in terms of high-level-language features to the DSL. That said, some sort of looping over field names is a basic thing to want. As of 4.1.0 we have recursive for/while/if structures on about the same complexity level as awk. While I’m excited by these powerful language features, I hope to keep new features beyond 4.1.0 focused on Miller’s sweet spot which is speed plus simplicity.

then-chaining

In accord with the Unix philosophy, you can pipe data into or out of Miller. For example:

mlr cut --complement -f os_version *.dat | mlr sort -f hostname,uptime

You can, if you like, instead simply chain commands together using the then keyword:

mlr cut --complement -f os_version then sort -f hostname,uptime *.dat

Here’s a performance comparison:

% cat piped.sh
mlr cut -x -f i,y data/big | mlr sort -n y > /dev/null

% time sh piped.sh
real 0m2.828s
user 0m3.183s
sys  0m0.137s


% cat chained.sh
mlr cut -x -f i,y then sort -n y data/big > /dev/null

% time sh chained.sh
real 0m2.082s
user 0m1.933s
sys  0m0.137s

There are two reasons to use then-chaining: one is for performance, although I don’t expect this to be a win in all cases. Using then-chaining avoids redundant string-parsing and string-formatting at each pipeline step: instead input records are parsed once, they are fed through each pipeline stage in memory, and then output records are formatted once. On the other hand, Miller is single-threaded, while modern systems are usually multi-processor, and when streaming-data programs operate through pipes, each one can use a CPU. Rest assured you get the same results either way.

The other reason to use then-chaining is for simplicity: you don’t have to re-type formatting flags (e.g. --csv --rs lf --fs tab) at every pipeline stage.
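
The parse-once, format-once idea can be sketched in Python. This is a conceptual model under simplifying assumptions (DKVP input, lexical sort, hypothetical function names), not a description of Miller’s C implementation.

```python
def parse_dkvp(lines):
    """Parse key=value,key=value lines into dicts."""
    for line in lines:
        yield dict(pair.split("=", 1) for pair in line.rstrip("\n").split(","))


def format_dkvp(records):
    """Format dicts back into key=value,key=value lines."""
    for record in records:
        yield ",".join(f"{k}={v}" for k, v in record.items())


def cut_complement(records, fields):
    """Drop the named fields from each record (streaming)."""
    for record in records:
        yield {k: v for k, v in record.items() if k not in fields}


def sort_by(records, fields):
    """Sort lexically by the named fields (must hold all records in memory)."""
    return sorted(records, key=lambda r: tuple(r.get(f, "") for f in fields))


def chain(lines, *verbs):
    records = parse_dkvp(lines)         # parse once
    for verb in verbs:
        records = verb(records)         # records stay in memory between verbs
    return list(format_dkvp(records))   # format once


lines = [
    "hostname=web2,uptime=30,os_version=10.1",
    "hostname=db1,uptime=12,os_version=9.4",
    "hostname=web1,uptime=45,os_version=10.1",
]
cut = lambda recs: cut_complement(recs, {"os_version"})
srt = lambda recs: sort_by(recs, ["hostname", "uptime"])

chained = chain(lines, cut, srt)           # like: mlr cut ... then sort ...
piped = chain(chain(lines, cut), srt)      # like: mlr cut ... | mlr sort ...
```

Both forms give the same lines; the piped form formats and re-parses the intermediate text, which is the redundant work that then-chaining avoids.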

Data types

Miller’s input and output are all string-oriented: there is (as of August 2015 anyway) no support for binary record packing. In this sense, everything is a string in and out of Miller. During processing, field names are always strings, even if they have names like "3"; field values are usually strings. Field values’ ability to be interpreted as a non-string type only has meaning when comparison or function operations are done on them. And it is an error condition if Miller encounters non-numeric (or otherwise mistyped) data in a field in which it has been asked to do numeric (or otherwise type-specific) operations.

Field values are treated as numeric for the following:

  • Numeric sort: mlr sort -n, mlr sort -nr.
  • Statistics: mlr histogram, mlr stats1, mlr stats2.
  • Cross-record arithmetic: mlr step.

For mlr put and mlr filter:

  • Miller’s types for function processing are null (empty string), error, string, float (double-precision), int (64-bit signed), and boolean.
  • On input, string values representable as numbers, e.g. "3" or "3.1", are treated as int or float, respectively. If a record has x=1,y=2 then mlr put '$z=$x+$y' will produce x=1,y=2,z=3, and mlr put '$z=$x.$y' gives an error. To coerce back to string for processing, use the string function: mlr put '$z=string($x).string($y)' will produce x=1,y=2,z=12.
  • On input, string values representable as boolean (e.g. "true", "false") are not automatically treated as boolean. (This is because "true" and "false" are ordinary words, and auto string-to-boolean on a column consisting of words would result in some strings mixed with some booleans.) Use the boolean function to coerce: e.g. giving the record x=1,y=2,w=false to mlr put '$z=($x<$y) || boolean($w)' produces z=true.
  • Functions take types as described in mlr --help-all-functions: for example, log10 takes float input and produces float output, gmt2sec maps string to int, and sec2gmt maps int to string.
  • All math functions described in mlr --help-all-functions take integer as well as float input.
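
The inference rules in the list above can be sketched in Python as follows. Python’s own int and float parsing stands in for Miller’s number recognition here, which is an assumption (the two will differ on edge cases), and the simplified boolean helper is hypothetical.

```python
def infer(value):
    """Return (type name, typed value) for one input field value."""
    if value == "":
        return ("empty", value)
    for name, convert in (("int", int), ("float", float)):
        try:
            return (name, convert(value))
        except ValueError:
            continue
    # Words such as "true" are deliberately left as strings.
    return ("string", value)


def boolean(value):
    """Explicit coercion; simplified so that anything but "true" is False."""
    return value == "true"


record = {"x": "1", "y": "2", "w": "false"}
x = infer(record["x"])[1]
y = infer(record["y"])[1]

z_sum = x + y                               # like: $z = $x + $y
z_dot = str(x) + str(y)                     # like: $z = string($x) . string($y)
z_bool = (x < y) or boolean(record["w"])    # like: $z = ($x<$y) || boolean($w)
```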

Null data: empty and absent

One of Miller’s key features is its support for heterogeneous data. For example, take mlr sort: if you try to sort on field hostname when not all records in the data stream have a field named hostname, it is not an error (although you could pre-filter the data stream using mlr having-fields --at-least hostname then sort ...). Rather, records lacking one or more sort keys are simply output contiguously by mlr sort.

Miller has two kinds of null data:

  • Empty: a field name is present in a record (or in an out-of-stream variable) with empty value: e.g. x=,y=2 in the data input stream, or assignment $x="" or @x="" in mlr put.
  • Absent: a field name is not present, e.g. input record is x=1,y=2 and a put or filter expression refers to $z. Or, reading an out-of-stream variable which hasn’t been assigned a value yet, e.g. mlr put -q '@sum += $x; end{emit @sum}' or mlr put -q '@sum[$a][$b] += $x; end{emit @sum, "a", "b"}'.

You can test these programmatically using the functions isempty/isnotempty, isabsent/ispresent, and isnull/isnotnull. For the last pair, note that null means either empty or absent.
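
As a rough model of these predicates, consider a record as a Python dict, with a sentinel object for absent. The ABSENT sentinel and the field helper are inventions of this sketch; the predicate names follow the function list.

```python
ABSENT = object()  # sentinel: the field name is not in the record


def field(record, name):
    """Look up a field, yielding ABSENT when the name is not present."""
    return record.get(name, ABSENT)


def isabsent(value):
    return value is ABSENT


def ispresent(value):
    return value is not ABSENT


def isempty(value):
    # True only for a field that is present with an empty value.
    return value is not ABSENT and value == ""


def isnotempty(value):
    return not isempty(value)


def isnull(value):
    # Null means either empty or absent.
    return isabsent(value) or isempty(value)


def isnotnull(value):
    return not isnull(value)


record = {"x": "", "y": "2"}  # like the input line x=,y=2
```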

Rules for null-handling:

  • Records with one or more empty sort-field values sort after records with all sort-field values present:

    $ mlr cat data/sort-null.dat
    a=3,b=2
    a=1,b=8
    a=,b=4
    x=9,b=10
    a=5,b=7
    

    $ mlr sort -n  a data/sort-null.dat
    a=1,b=8
    a=3,b=2
    a=5,b=7
    a=,b=4
    x=9,b=10
    

    $ mlr sort -nr a data/sort-null.dat
    a=,b=4
    a=5,b=7
    a=3,b=2
    a=1,b=8
    x=9,b=10
    

  • Functions/operators which have one or more empty arguments produce empty output: e.g.

    $ echo 'x=2,y=3' | mlr put '$a=$x+$y'
    x=2,y=3,a=5
    

    $ echo 'x=,y=3' | mlr put '$a=$x+$y'
    x=,y=3,a=
    

    $ echo 'x=,y=3' | mlr put '$a=log($x);$b=log($y)'
    x=,y=3,a=,b=1.098612
    

    with the exception that the min and max functions are special: if one argument is non-null, it wins:

    $ echo 'x=,y=3' | mlr put '$a=min($x,$y);$b=max($x,$y)'
    x=,y=3,a=3,b=3
    

  • Functions of absent variables (e.g. mlr put '$y = log10($nonesuch)') evaluate to absent, and arithmetic/bitwise/boolean operators with both operands being absent evaluate to absent. Arithmetic operators with one absent operand return the other operand. More specifically, absent values act like zero for addition/subtraction, and one for multiplication. Furthermore, any expression which evaluates to absent is not stored in the output record:

    $ echo 'x=2,y=3' | mlr put '$a=$u+$v; $b=$u+$y; $c=$x+$y'
    x=2,y=3,b=3,c=5
    

    $ echo 'x=2,y=3' | mlr put '$a=min($x,$v);$b=max($u,$y);$c=min($u,$v)'
    x=2,y=3,a=2,b=3
    

The reasoning is as follows:
  • Empty values are explicit in the data so they should explicitly affect accumulations: mlr put '@sum += $x' should accumulate numeric x values into the sum but an empty x, when encountered in the input data stream, should make the sum non-numeric. To work around this you can use the isnotnull function as follows: mlr put 'isnotnull($x) { @sum += $x }'
  • Absent stream-record values should not break accumulations, since Miller by design handles heterogeneous data: the running @sum in mlr put '@sum += $x' should not be invalidated for records which have no x.
  • Absent out-of-stream-variable values are precisely what allow you to write mlr put '@sum += $x'. Otherwise you would have to write mlr put 'begin{@sum = 0}; @sum += $x' — which is tolerable — but for mlr put 'begin{...}; @sum[$a][$b] += $x' you’d have to pre-initialize @sum for all values of $a and $b in your input data stream, which is intolerable.
  • The penalty for the absent feature is that misspelled variables can be hard to find: e.g. in mlr put 'begin{@sumx = 10}; ...; update @sumx somehow per-record; ...; end {@something = @sum * 2}' the accumulator is spelt @sumx in the begin-block but @sum in the end-block, where since it is absent, @sum*2 evaluates to 2.

Since absent plus absent is absent (and likewise for other operators), accumulations such as @sum += $x work correctly on heterogeneous data, as do within-record formulas if both operands are absent. If one operand is present, you may get behavior you don’t desire. To work around this (namely, to set an output field only for records which have all the inputs present) you can use a pattern-action block with ispresent, or a ternary expression which supplies a default:

$ mlr cat data/het.dkvp
resource=/path/to/file,loadsec=0.45,ok=true
record_count=100,resource=/path/to/file
resource=/path/to/second/file,loadsec=0.32,ok=true
record_count=150,resource=/path/to/second/file
resource=/some/other/path,loadsec=0.97,ok=false

$ mlr put 'ispresent($loadsec) { $loadmillis = $loadsec * 1000 }' data/het.dkvp
resource=/path/to/file,loadsec=0.45,ok=true,loadmillis=450.000000
record_count=100,resource=/path/to/file
resource=/path/to/second/file,loadsec=0.32,ok=true,loadmillis=320.000000
record_count=150,resource=/path/to/second/file
resource=/some/other/path,loadsec=0.97,ok=false,loadmillis=970.000000

$ mlr put '$loadmillis = (ispresent($loadsec) ? $loadsec : 0.0) * 1000' data/het.dkvp
resource=/path/to/file,loadsec=0.45,ok=true,loadmillis=450.000000
record_count=100,resource=/path/to/file,loadmillis=0.000000
resource=/path/to/second/file,loadsec=0.32,ok=true,loadmillis=320.000000
record_count=150,resource=/path/to/second/file,loadmillis=0.000000
resource=/some/other/path,loadsec=0.97,ok=false,loadmillis=970.000000

If you’re interested in a formal description of how empty and absent fields participate in arithmetic, here’s a table for plus (other arithmetic/boolean/bitwise operators are similar):

$ mlr --print-type-arithmetic-info
(+)    | error  absent empty  string int    float  bool
------ + ------ ------ ------ ------ ------ ------ ------
error  | error  error  error  error  error  error  error
absent | error  absent absent error  int    float  error
empty  | error  absent empty  error  empty  empty  error
string | error  error  error  error  error  error  error
int    | error  int    empty  error  int    float  error
float  | error  float  empty  error  float  float  error
bool   | error  error  error  error  error  error  error
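
The absent, empty, int, and float entries of this table can be expressed as a small Python function. This is a sketch of the table only, with a made-up ABSENT sentinel; the string, bool, and error rows are collapsed into raising TypeError.

```python
ABSENT = object()  # sentinel for a field that is not present
EMPTY = ""


def plus(a, b):
    """Addition following the absent/empty/int/float entries of the table."""
    for operand in (a, b):
        is_number = isinstance(operand, (int, float)) and not isinstance(operand, bool)
        if operand is not ABSENT and operand != EMPTY and not is_number:
            raise TypeError("error")  # string and bool operands give error
    if a is ABSENT or b is ABSENT:
        other = b if a is ABSENT else a
        # absent + empty is absent; otherwise the other operand is returned.
        return ABSENT if other == EMPTY else other
    if a == EMPTY or b == EMPTY:
        return EMPTY
    return a + b


def assign(record, name, value):
    """Store a value unless it is absent, as for put assignments."""
    if value is not ABSENT:
        record[name] = value


# Like: echo 'x=2,y=3' | mlr put '$a=$u+$v; $b=$u+$y; $c=$x+$y'
record = {"x": 2, "y": 3}
assign(record, "a", plus(record.get("u", ABSENT), record.get("v", ABSENT)))
assign(record, "b", plus(record.get("u", ABSENT), record["y"]))
assign(record, "c", plus(record["x"], record["y"]))
```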

String literals

You can use the following backslash escapes within double-quoted string literals in contexts such as mlr filter '$name =~ "..."', mlr put '$name = $othername . "..."', mlr put '$name = sub($name, "...", "...")', etc.:

  • \a: ASCII code 0x07 (alarm/bell)
  • \b: ASCII code 0x08 (backspace)
  • \f: ASCII code 0x0c (formfeed)
  • \n: ASCII code 0x0a (LF/linefeed/newline)
  • \r: ASCII code 0x0d (CR/carriage return)
  • \t: ASCII code 0x09 (tab)
  • \v: ASCII code 0x0b (vertical tab)
  • \\: backslash
  • \": double quote
  • \123: Octal 123, etc. for \000 up to \377
  • \x7f: Hexadecimal 7f, etc. for \x00 up to \xff

See also https://en.wikipedia.org/wiki/Escape_sequences_in_C.

These replacements apply only to strings you key in for the DSL expressions for filter and put: that is, if you type \t in a string literal for a filter/put expression, it will be turned into a tab character. If you want a backslash followed by a t, then please type \\t.

However, these replacements are not applied automatically to your data stream. To make them yourself, for example on a field named field, you can use mlr put '$field = gsub($field, "\\t", "\t")'. If you need such a replacement for all fields in your data, it is probably simplest to use the system sed command.
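
To make the escape rules listed above concrete, here is a minimal Python sketch of a decoder for them. It is an illustration only, not Miller's implementation: the names SIMPLE_ESCAPES, ESCAPE_RE, and unbackslash are hypothetical, and leaving unrecognized escapes untouched is an assumption.

```python
import re

# Single-character escapes from the list above.
SIMPLE_ESCAPES = {
    "a": "\x07", "b": "\x08", "f": "\x0c", "n": "\x0a", "r": "\x0d",
    "t": "\x09", "v": "\x0b", "\\": "\\", '"': '"',
}

# A backslash followed by three octal digits (\000 to \377),
# by x and two hex digits (\x00 to \xff), or by any single character.
ESCAPE_RE = re.compile(r"\\([0-3][0-7]{2}|x[0-9a-fA-F]{2}|.)", re.DOTALL)

def unbackslash(text):
    """Replace the backslash escapes listed above with the characters they name."""
    def replace(match):
        body = match.group(1)
        if len(body) == 3 and body[0] == "x":
            return chr(int(body[1:], 16))   # hexadecimal escape
        if len(body) == 3:
            return chr(int(body, 8))        # octal escape
        # Assumption: unrecognized escapes are left exactly as typed.
        return SIMPLE_ESCAPES.get(body, match.group(0))
    return ESCAPE_RE.sub(replace, text)

print(repr(unbackslash(r"a\tb")))    # 'a\tb'   (a, tab, b)
print(repr(unbackslash(r"a\\tb")))   # 'a\\tb'  (a, backslash, t, b)
```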

Regular expressions

Miller lets you use regular expressions (of type POSIX.2) in the following contexts:

  • In mlr filter with =~ or !=~, e.g. mlr filter '$url =~ "http.*com"'
  • In mlr put with sub or gsub, e.g. mlr put '$url = sub($url, "http.*com", "")'
  • In mlr having-fields, e.g. mlr having-fields --any-matching '^sda[0-9]'
  • In mlr cut, e.g. mlr cut -r -f '^status$,^sda[0-9]'
  • In mlr rename, e.g. mlr rename -r '^(sda[0-9]).*$,dev/\1'
  • In mlr grep, e.g. mlr --csv grep 00188555487 myfiles*.csv

Points demonstrated by the above examples:

  • There are no implicit start-of-string or end-of-string anchors; please use ^ and/or $ explicitly.
  • Miller regexes are wrapped with double quotes rather than slashes.
  • A trailing i after the closing double quote, as in "http.*com"i, indicates a case-insensitive regex.
  • Capture groups are wrapped with (...) rather than \(...\); use \( and \) to match against parentheses.

For filter and put, if the regular expression is a string literal (the normal case), it is precompiled at process start and reused thereafter, which is efficient. If the regular expression is a more complex expression, including string concatenation using ., or a column name (in which case you can take regular expressions from input data!), then regexes are compiled on each record, which works but is less efficient. Also, in this case there is no way to specify case-insensitive matching.

Example:

$ cat data/regex-in-data.dat
name=jane,regex=^j.*e$
name=bill,regex=^b[ou]ll$
name=bull,regex=^b[ou]ll$

$ mlr filter '$name =~ $regex' data/regex-in-data.dat
name=jane,regex=^j.*e$
name=bull,regex=^b[ou]ll$
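
The per-record behavior can be sketched in Python as follows. This is a model of the idea only, not Miller's code: the function name filter_by_own_regex is hypothetical, and Python's re module is not a POSIX.2 engine, although it agrees with POSIX on these particular patterns.

```python
import re

# The same three records as in data/regex-in-data.dat above.
records = [
    {"name": "jane", "regex": "^j.*e$"},
    {"name": "bill", "regex": "^b[ou]ll$"},
    {"name": "bull", "regex": "^b[ou]ll$"},
]

def filter_by_own_regex(records):
    """Keep records whose name field matches the pattern in their own regex field."""
    kept = []
    for record in records:
        # The pattern comes from the data, so it cannot be compiled once up front.
        pattern = re.compile(record["regex"])
        if pattern.search(record["name"]):
            kept.append(record)
    return kept

for record in filter_by_own_regex(records):
    print(",".join(f"{key}={value}" for key, value in record.items()))
```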

Regex captures

Regex captures of the form \0 through \9 are supported as follows:

  • Captures have in-function context for sub and gsub. For example, the first set of captures (\1 and \2) belongs to the first sub, and the second set (\1 through \3) belongs to the second sub:

    mlr put '$b = sub($a, "(..)_(...)", "\2-\1"); $c = sub($a, "(..)_(.)(..)", ":\1:\2:\3")'
    
  • Captures endure for the entirety of a put for the =~ and !=~ operators. For example, here the \1,\2 are set by the =~ operator and are used by both subsequent assignment statements:

    mlr put '$a =~ "(..)_(....)"; $b = "left_\1"; $c = "right_\2"'
    
  • The captures are not retained across multiple puts. For example, here the \1,\2 won’t be expanded from the regex capture:

    mlr put '$a =~ "(..)_(....)"' then {... something else ...} then put '$b = "left_\1"; $c = "right_\2"'
    
  • Captures are ignored in filter for the =~ and !=~ operators. For example, there is no mechanism provided to refer to the first (..) as \1 or to the second (....) as \2 in the following filter statement:

    mlr filter '$a =~ "(..)_(....)"'
    
  • Up to nine matches are supported: \1 through \9, while \0 is the entire match string; \15 is treated as \1 followed by an unrelated 5.
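
The single-digit rule in the last point can be modeled with a short Python sketch. The function name interpolate_captures is hypothetical, and substituting an empty string for a capture index beyond the number of groups is an assumption; Miller's behavior in that case is not described here.

```python
import re

def interpolate_captures(template, match):
    """Replace \\0 through \\9 in template with the corresponding capture from match."""
    def replace(reference):
        index = int(reference.group(1))
        if index > match.re.groups:
            return ""  # assumption: a capture that does not exist expands to nothing
        return match.group(index) or ""
    # Exactly one digit is consumed, so \15 is \1 followed by a literal 5.
    return re.sub(r"\\([0-9])", replace, template)

match = re.search(r"(..)_(....)", "ab_cdef")
print(interpolate_captures(r"left_\1", match))   # left_ab
print(interpolate_captures(r"right_\2", match))  # right_cdef
print(interpolate_captures(r"\15", match))       # ab5
```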

Operator precedence

Operators are listed in order of decreasing precedence, highest first.

Operators              Associativity
---------              -------------
()                     left to right
**                     right to left
! ~ unary+ unary- &    right to left
binary* / // %         left to right
binary+ binary- .      left to right
<< >>                  left to right
&                      left to right
^                      left to right
|                      left to right
< <= > >=              left to right
== != =~ !=~           left to right
&&                     left to right
^^                     left to right
||                     left to right
? :                    right to left
=                      N/A for Miller (there is no $a=$b=$c)

Operator and function semantics

  • Functions are in general pass-throughs straight to the system-standard C library.
  • The min and max functions are different from other multi-argument functions which return null if any of their inputs are null: for min and max, by contrast, if one argument is null, the other is returned.
  • Symmetrically with the bitwise OR, XOR, and AND operators |, ^, &, Miller has the logical operators ||, ^^, &&. The logical XOR ^^ has no counterpart in C.
  • The exponentiation operator ** is familiar from many languages.
  • The regex-match and regex-not-match operators =~ and !=~ are similar to those in Ruby and Perl.

Arithmetic

Input scanning

Numbers in Miller are double-precision floats or 64-bit signed integers. Anything scannable as int, e.g. 123 or 0xabcd, is treated as an integer; otherwise, input scannable as float (4.56 or 8e9) is treated as float; everything else is a string.
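
The int-then-float-then-string order can be sketched in Python. This is a model under stated assumptions, not Miller's scanner: the names INT_RE, FLOAT_RE, and infer_type are hypothetical, the exact grammars accepted are guesses chosen to cover the examples above, and the 64-bit range limit is not checked.

```python
import re

# Assumed grammars: optional sign, then decimal or 0x-prefixed hex digits for ints;
# a conventional decimal/exponent form for floats.
INT_RE = re.compile(r"[-+]?(0x[0-9a-fA-F]+|[0-9]+)")
FLOAT_RE = re.compile(r"[-+]?([0-9]+\.?[0-9]*|\.[0-9]+)([eE][-+]?[0-9]+)?")

def infer_type(text):
    """Return (type name, value), trying int first, then float, else string."""
    if INT_RE.fullmatch(text):
        base = 16 if "0x" in text else 10
        return ("int", int(text, base))
    if FLOAT_RE.fullmatch(text):
        return ("float", float(text))
    return ("string", text)

for sample in ["123", "0xabcd", "4.56", "8e9", "abc"]:
    print(sample, infer_type(sample))
```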

If you want all numbers to be treated as floats, then you may use float() in your filter/put expressions (e.g. replacing $c = $a * $b with $c = float($a) * float($b)) or, more simply, use mlr filter -F and mlr put -F, which force all numeric input, whether from expression literals or field values, to float. Likewise mlr stats1 -F and mlr step -F force integerable accumulators (such as count) to be done in floating-point.

Conversion by math routines

For most math functions, integers are cast to float on input, and produce float output: e.g. exp(0) = 1.0 rather than 1. The following, however, produce integer output if their inputs are integers: + - * / // % abs ceil floor max min round roundm sgn. As well, stats1 -a min, stats1 -a max, stats1 -a sum, step -a delta, and step -a rsum produce integer output if their inputs are integers.

Conversion by arithmetic operators

The sum, difference, and product of integers are again integers, except when the result would overflow a 64-bit integer, at which point Miller converts the result to float.

The short of it is that Miller does this transparently for you so you needn’t think about it.

Implementation details of this, for the interested: integer adds and subtracts overflow by at most one bit, so it suffices to check sign changes. Thus, Miller allows you to add and subtract arbitrary 64-bit signed integers, converting to float precisely when the result is less than -2^63 or greater than 2^63-1. Multiplies, on the other hand, can overflow by a word size, and a sign-change technique does not suffice to detect overflow. Instead Miller tests whether the floating-point product exceeds the representable integer range. Now, 64-bit integers have 64-bit precision while IEEE doubles have only 52-bit mantissas, so there are 53 bits including the implicit leading one. The following experiment explicitly demonstrates the resolution at this range:

   64-bit integer     64-bit integer     Casted to double           Back to 64-bit
       in hex           in decimal                                    integer
0x7ffffffffffff9ff 9223372036854774271 9223372036854773760.000000 0x7ffffffffffff800
0x7ffffffffffffa00 9223372036854774272 9223372036854773760.000000 0x7ffffffffffff800
0x7ffffffffffffbff 9223372036854774783 9223372036854774784.000000 0x7ffffffffffffc00
0x7ffffffffffffc00 9223372036854774784 9223372036854774784.000000 0x7ffffffffffffc00
0x7ffffffffffffdff 9223372036854775295 9223372036854774784.000000 0x7ffffffffffffc00
0x7ffffffffffffe00 9223372036854775296 9223372036854775808.000000 0x8000000000000000
0x7ffffffffffffffe 9223372036854775806 9223372036854775808.000000 0x8000000000000000
0x7fffffffffffffff 9223372036854775807 9223372036854775808.000000 0x8000000000000000
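
The experiment can be repeated in Python, whose float is an IEEE double and whose integers do not overflow. This is a sketch for checking the listing, not the program that produced it; the helper name round_trip is hypothetical.

```python
# The eight integers from the listing above.
SAMPLES = [
    0x7ffffffffffff9ff, 0x7ffffffffffffa00, 0x7ffffffffffffbff, 0x7ffffffffffffc00,
    0x7ffffffffffffdff, 0x7ffffffffffffe00, 0x7ffffffffffffffe, 0x7fffffffffffffff,
]

def round_trip(n):
    """Cast an integer to double and back, returning both intermediate values."""
    as_double = float(n)        # rounds to the nearest representable double
    return as_double, int(as_double)

for n in SAMPLES:
    as_double, back = round_trip(n)
    print(f"{n:#018x} {n} {as_double:.6f} {back:#018x}")
```

Near 2^63 adjacent doubles are 1024 apart, which is why the last column only takes values ending in 800, c00, or 000.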

That is, one cannot check an integer product to see if it is precisely greater than 2^63-1 or less than -2^63 using either integer arithmetic (it may have already overflowed) or using double-precision (due to granularity). Instead Miller checks for overflow in 64-bit integer multiplication by seeing whether the absolute value of the double-precision product exceeds the largest representable IEEE double less than 2^63, which we see from the listing above is 9223372036854774784. (An alternative would be to do all integer multiplies using handcrafted multi-word 128-bit arithmetic. This approach is not taken.)
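
The two overflow rules described above can be modeled in a few lines of Python. This is a sketch of the described behavior, not Miller's C code: the names add_ints and multiply_ints are hypothetical, and because Python integers never overflow, the addition check is written as a direct range test rather than the sign-change test.

```python
INT64_MIN = -2**63
INT64_MAX = 2**63 - 1
# Largest IEEE double below 2^63, as read off the listing above (2^63 - 1024).
MAX_DOUBLE_BELOW_2_63 = 9223372036854774784.0

def add_ints(a, b):
    """Integer sum, converted to float only when it leaves the 64-bit signed range."""
    total = a + b
    if INT64_MIN <= total <= INT64_MAX:
        return total
    return float(a) + float(b)

def multiply_ints(a, b):
    """Integer product, converted to float when the double-precision product is too large."""
    float_product = float(a) * float(b)
    if abs(float_product) > MAX_DOUBLE_BELOW_2_63:
        return float_product
    return a * b

print(add_ints(INT64_MAX, 1))    # 9.223372036854776e+18
print(multiply_ints(3, 4))       # 12
print(multiply_ints(2**62, 2))   # 9.223372036854776e+18
```

Note that the multiplication test is conservative: a product whose double-precision value rounds up to 2^63 is returned as float even if the exact integer result would have fit.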

Pythonic division

Division and remainder are pythonic:

  • Quotient of integers is floating-point: 7/2 is 3.5.
  • Integer division is done with //: 7//2 is 3. This rounds toward negative infinity.
  • Remainders are non-negative.
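
Since these rules are named after Python, Python itself gives a quick reference point. The lines below show Python's behavior only; Miller's output formatting may differ, and the remainder example assumes a positive divisor.

```python
# Python's own operators, which the rules above are named after.
quotient = 7 / 2            # true division: float result even for int operands
floor_quotient = 7 // 2     # floor division
negative_floor = -7 // 2    # floors toward negative infinity, not toward zero
remainder = -7 % 5          # non-negative for a positive divisor

print(quotient, floor_quotient, negative_floor, remainder)  # 3.5 3 -4 3
```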