Separators

Record, field, and pair separators

Miller has record separators, field separators, and pair separators. For example, given the following DKVP records:

cat data/a.dkvp
a=1,b=2,c=3
a=4,b=5,c=6

  • the record separator is newline -- it separates records from one another;
  • the field separator is , -- it separates fields (key-value pairs) from one another;
  • and the pair separator is = -- it separates the key from the value within each key-value pair.
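
To make the three levels of splitting concrete, here is a minimal Python sketch of how a DKVP stream could be taken apart with the default separators. It is an illustration only, not Miller's implementation: the function name parse_dkvp is hypothetical, and the sketch assumes every field contains the pair separator.

```python
def parse_dkvp(text, rs="\n", fs=",", ps="="):
    """Split text into records (rs), fields (fs), and key-value pairs (ps)."""
    records = []
    for line in text.split(rs):
        if line == "":
            continue  # skip the empty piece left by a trailing record separator
        record = {}
        for field in line.split(fs):
            # Split on the first pair separator only, so values may contain it.
            key, value = field.split(ps, 1)
            record[key] = value
        records.append(record)
    return records


records = parse_dkvp("a=1,b=2,c=3\na=4,b=5,c=6\n")
print(records)
```

Each record comes back as a dictionary with its keys in their original order, which mirrors how the two lines of data/a.dkvp break down into three key-value pairs each.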

These are the default values, which you can override with flags such as --ips and --ops (below).

Not all file formats have all three of these: for example, CSV does not have a pair separator, since keys are on the header line and values are on each data line.

Also, separators are not programmable for all file formats. For example, in JSON objects, the pair separator is : and the field separator is , -- we write {"a":1,"b":2,"c":3} -- but these aren't modifiable. If you do mlr --json --ips : --ops '=' cat myfile.json then you don't get {"a"=1,"b"=2,"c"=3}. This is because the pair separator : is part of the JSON specification.

Input and output separators

Miller lets you use the same separators for input and output (e.g. CSV input, CSV output), or change them between input and output (e.g. CSV input, JSON output) if you wish to transform your data in that way.

Miller uses the names IRS and ORS for the input and output record separators, IFS and OFS for the input and output field separators, and IPS and OPS for input and output pair separators.

For example:

cat data/a.dkvp
a=1,b=2,c=3
a=4,b=5,c=6

mlr --ifs , --ofs ';' --ips = --ops : cut -o -f c,a,b data/a.dkvp
c:3;a:1;b:2
c:6;a:4;b:5
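
As a rough illustration of reading with one set of separators and writing with another, the following Python sketch re-joins a single DKVP line. The helper reseparate is hypothetical and is not how Miller is implemented; it only swaps separators and does not reorder fields the way cut -o does in the example above.

```python
def reseparate(line, ifs=",", ips="=", ofs=";", ops=":"):
    """Parse one line with the input separators, then emit it with the output ones."""
    # Assumes every field contains the input pair separator.
    pairs = [field.split(ips, 1) for field in line.split(ifs)]
    return ofs.join(key + ops + value for key, value in pairs)


converted = reseparate("a=1,b=2,c=3")
print(converted)  # a:1;b:2;c:3
```

Because the output separators are just strings that get joined back in, nothing in this sketch requires them to be single characters.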

If your data has non-default separators and you don't want to change those between input and output, you can use --rs, --fs, and --ps. Setting --fs : is the same as setting --ifs : --ofs :, but with fewer keystrokes.

cat data/modsep.dkvp
a:1;b:2;c:3
a:4;b:5;c:6

mlr --fs ';' --ps : cut -o -f c,a,b data/modsep.dkvp
c:3;a:1;b:2
c:6;a:4;b:5

Multi-character and regular-expression separators

The separators default to single characters, but can be multiple characters if you like:

mlr --ifs ';' --ips : --ofs ';;;' --ops := cut -o -f c,a,b data/modsep.dkvp
c:=3;;;a:=1;;;b:=2
c:=6;;;a:=4;;;b:=5

As of September 2021:

  • IFS and IPS can be regular expressions.
  • IRS must be a single character (nominally \n).
  • OFS, OPS, and ORS can be multi-character.

Since IFS and IPS can be regular expressions, if your data has field separators which are one or more consecutive spaces, you can use --ifs '( )+'. But that gets a little tedious, so Miller has the --repifs and --repips flags you can use if you like. This wraps the IFS or IPS, say X, as (X)+.

The --repifs flag means that multiple successive occurrences of the field separator count as one. For example, in CSV data we often signify nulls by empty strings, e.g. 2,9,,,,,6,5,4, so repeated commas must not be collapsed. On the other hand, if the field separator is a space, it might be more natural to parse 2 4    5 the same as 2 4 5: --repifs --ifs ' ' lets this happen. In fact, the --ipprint option is internally implemented in terms of --repifs.

For example:

cat data/extra-spaces.txt
oh    say   can you
see   by    the dawn's
early light what so

mlr --ifs ' ' --repifs --inidx --oxtab cat data/extra-spaces.txt
1 oh
2 say
3 can
4 you

1 see
2 by
3 the
4 dawn's

1 early
2 light
3 what
4 so
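
The effect of --repifs can be approximated in Python with a regular expression. This is a sketch under assumptions, not Miller's code: split_fields is a hypothetical helper, the separator is treated as a literal string, and a non-capturing group (?:X)+ stands in for (X)+ because Python's re.split would otherwise return the captured separators as well.

```python
import re


def split_fields(line, ifs=" ", repifs=False):
    """Split a line on ifs; with repifs, runs of ifs count as one separator."""
    pattern = re.escape(ifs)  # treat the separator as a literal string
    if repifs:
        pattern = "(?:" + pattern + ")+"  # one or more consecutive separators
    return re.split(pattern, line)


# First line of data/extra-spaces.txt: four spaces, then three, then one.
line = "oh" + " " * 4 + "say" + " " * 3 + "can you"

print(split_fields(line))               # empty strings appear between repeated spaces
print(split_fields(line, repifs=True))  # ['oh', 'say', 'can', 'you']
```

Without the repeat handling, every extra space yields an empty field, which is the right behavior for CSV nulls but not for space-aligned text.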

Aliases

Many things we'd like to write as separators need to be escaped from the shell -- e.g. --ifs ';' or --ofs '|', and so on. Miller provides the following named aliases, which you can use instead (for example, --ifs semicolon or --ofs pipe):

mlr help list-separator-aliases
ascii_esc  = "\x1b"
ascii_etx  = "\x04"
ascii_fs   = "\x1c"
ascii_gs   = "\x1d"
ascii_null = "\x01"
ascii_rs   = "\x1e"
ascii_soh  = "\x02"
ascii_stx  = "\x03"
ascii_us   = "\x1f"
asv_fs     = "\x1f"
asv_rs     = "\x1e"
colon      = ":"
comma      = ","
cr         = "\r"
crcr       = "\r\r"
crlf       = "\r\n"
crlfcrlf   = "\r\n\r\n"
equals     = "="
lf         = "\n"
lflf       = "\n\n"
newline    = "\n"
pipe       = "|"
semicolon  = ";"
slash      = "/"
space      = " "
spaces     = "( )+"
tab        = "\t"
tabs       = "(\t)+"
usv_fs     = "\xe2\x90\x9f"
usv_rs     = "\xe2\x90\x9e"
whitespace = "([ \t])+"

Note that spaces, tabs, and whitespace are already regexes, so you shouldn't use --repifs with them.
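
One way to picture the aliases is as a lookup table consulted before a separator is used. The Python sketch below is an assumption about the general idea, not Miller's implementation: resolve_separator is a hypothetical helper, and the table copies only a few entries from the list above.

```python
# A small subset of the alias list shown above (not the full table).
SEPARATOR_ALIASES = {
    "comma": ",",
    "semicolon": ";",
    "pipe": "|",
    "tab": "\t",
    "space": " ",
    "spaces": "( )+",  # already a regex, so no repeat wrapping is needed
}


def resolve_separator(name_or_literal):
    """Return the aliased separator if the name is known, else the text itself."""
    return SEPARATOR_ALIASES.get(name_or_literal, name_or_literal)


print(repr(resolve_separator("semicolon")))  # ';'
print(repr(resolve_separator("::")))         # '::'
```

A name found in the table is replaced by its separator, while any other string is passed through as a literal separator.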

Command-line flags

Given the above, we have now seen the following flags:

--rs --irs --ors
--fs --ifs --ofs --repifs
--ps --ips --ops

See also the separator-flags section.

DSL built-in variables

Miller exposes read-only built-in variables named IRS, ORS, IFS, OFS, IPS, and OPS. Unlike in AWK, you can't set these in begin-blocks -- their values reflect what you specified at the command line -- so their use is limited.

mlr --ifs , --ofs ';' --ips = --ops : --from data/a.dkvp put '$d = ">>>" . IFS . "|||" . OFS . "<<<"'
a:1;b:2;c:3;d:>>>,|||;<<<
a:4;b:5;c:6;d:>>>,|||;<<<

Which separators apply to which file formats

Notes:

  • If the CSV field separator is tab, we have TSV; see more examples (ASV, USV, etc.) in the CSV section.
  • JSON: ignores all separator flags from the command line.
  • Headerless CSV overlaps quite a bit with NIDX format using comma for IFS. See also the page on CSV with and without headers.

Format            RS                                FS                                                                PS
CSV and CSV-lite  Default \n *                      Default ,                                                         None
TSV               Default \n *                      Default \t                                                        None
JSON              N/A; records are between { and }  , but not alterable                                               : but not alterable
DKVP              Default \n                        Default ,                                                         Default =
NIDX              Default \n                        Default space                                                     None
XTAB              \n\n **                           \n *                                                              Space with repeats
PPRINT            Default \n *                      Space with repeats                                                None
Markdown          \n * but not alterable            One or more spaces then | then one or more spaces; not alterable  None

* or \r\n on Windows

** or \r\n\r\n on Windows