Strings¶
Essentials¶
Miller string literals are always written with double quotes, like "abcde"
; single quotes
are not part of the grammar of Miller's programming language.
Single quotes are used for wrapping put
/filter
statements, as in mlr put '$b=$a.".suffix"' myfile.csv'
:
the single-quotes are consumed by the shell and Miller gets $b=$a.".suffix"
. (See however the
Miller on Windows page.)
A basic string operation is the .
(concatenation) operator:
mlr --c2p --from example.csv put '$output = $color . ":" . $shape'
color shape flag k index quantity rate output yellow triangle true 1 11 43.6498 9.8870 yellow:triangle red square true 2 15 79.2778 0.0130 red:square red circle true 3 16 13.8103 2.9010 red:circle red square false 4 48 77.5542 7.4670 red:square purple triangle false 5 51 81.2290 8.5910 purple:triangle red square false 6 64 77.1991 9.5310 red:square purple triangle false 7 65 80.1405 5.8240 purple:triangle yellow circle true 8 73 63.9785 4.2370 yellow:circle yellow circle true 9 87 63.5058 8.3350 yellow:circle purple square false 10 91 72.3735 8.2430 purple:square
Also see the list of string-related built-in functions.
1-up indexing¶
The most important difference between Miller's strings and strings in other languages is that indices start with 1, not 0. (The same is true for Miller arrays.) This is intentional.
1-up indices may feel like a thing of the past, belonging to Fortran and Matlab, say; or R and Julia as well, which are more modern. But the overall trend is decidedly toward 0-up. This means that if Miller does 1-up indices, it should do so for good reasons.
Miller strings are indexed 1-up simply because Miller arrays are indexed 1-up. See this section for the reasoning.
Strings have been in Miller since the beginning, but they weren't accessible
using indices or slices until Miller 6. Also, the
substr
function predates Miller
6. This function was implemented to take 0-up indices. When Miller 6 was
implemented, this became inconsistent. As a result, there are
substr0
and
substr1
functions. For backward
compatibility with existing Miller scripts, substr
is the same as substr0
.
But users starting out with Miller 6 will probably want substr1
.
Negative-index aliasing¶
Imitating Python and other languages, you can use negative indices to read
backward from the end of the string, while positive indices read forward from
the start. If a string has length n
then -n..-1
are aliases for 1..n
,
respectively; 0 is never a valid string index in Miller.
mlr -n put ' end { x = "abcde"; print x[1]; print x[-1]; print x[1:2]; print x[-2:-1]; } '
a e ab de
Slicing¶
Miller supports slicing using [lo:hi]
syntax. Either or both of the indices
in a slice can be negatively aliased as described above. Unlike in Python,
Miller string-slice indices are inclusive on both sides: x[3:5]
means x[3] . x[4] . x[5]
.
mlr -n put ' end { x = "abcde"; print x[3:4]; print x[:2]; print x[3:]; print x[1:-1]; print x[2:-2]; } '
cd ab cde abcde bcd
Out-of-bounds indexing¶
Out-of-bounds index accesses are errors, but out-of-bounds slice accesses result in trimming the indices, resulting in a short string or even the empty string. (This behavior intentionally imitates Python.)
mlr -n put ' end { x = "abcde"; print x[1]; print x[5]; print x[6]; } '
a e (error)
mlr -n put ' end { x = "abcde"; print "\"" . x[1:2] . "\""; print "\"" . x[1:6] . "\""; print "\"" . x[10:20] . "\""; } '
"ab" "abcde" ""
Escape sequences for string literals¶
You can use the following backslash escapes for strings such as between the double quotes in contexts such as mlr filter '$name =~ "..."'
, mlr put '$name = $othername . "..."'
, mlr put '$name = sub($name, "...", "...")
, etc.:
\a
: ASCII code 0x07 (alarm/bell)\b
: ASCII code 0x08 (backspace)\f
: ASCII code 0x0c (formfeed)\n
: ASCII code 0x0a (LF/linefeed/newline)\r
: ASCII code 0x0d (CR/carriage return)\t
: ASCII code 0x09 (tab)\v
: ASCII code 0x0b (vertical tab)\\
: backslash\"
: double quote\123
: Octal 123, etc. for\000
up to\377
\x7f
: Hexadecimal 7f, etc. for\x00
up to\xff
\u2766
,\U00010877:
: Unicode literals. For technical reasons, you must supply four hex digits after\u
and eight hex digits after\U
.
mlr repl
[mlr] "a\nb" "a b" [mlr] "a\tb" "a b" [mlr] "a\x62c" "abc" [mlr] "\u2766\U00010877" "❦𐡷"
See also https://en.wikipedia.org/wiki/Escape_sequences_in_C.
These replacements apply only to strings you key in for the DSL expressions for filter
and put
: that is, if you type \t
in a string literal for a filter
/put
expression, it will be turned into a tab character. If you want a backslash followed by a t
, then please type \\t
.
However, these replacements are done automatically only for string literals within DSL expressions -- they are not done automatically to fields within your data stream. If you wish to make these replacements, you can do (for example) mlr put '$field = gsub($field, "\\t", "\t")'
. If you need to make such a replacement for all fields in your data, you should probably use the system sed
command instead.