
Regular expressions

Miller lets you use regular expressions (using the syntax accepted by Go) in the following contexts:

In mlr filter with =~ or !=~, e.g. mlr filter '$url =~ "http.*com"'
In mlr put with sub or gsub, e.g. mlr put '$url = sub($url, "http.*com", "")'
In mlr having-fields, e.g. mlr having-fields --any-matching '^sda[0-9]'
In mlr cut, e.g. mlr cut -r -f '^status$,^sda[0-9]'
In mlr rename, e.g. mlr rename -r '^(sda[0-9]).*$,dev/\1'
In mlr grep, e.g. mlr --csv grep 00188555487 myfiles*.csv

Points demonstrated by the above examples:

There are no implicit start-of-string or end-of-string anchors; please use ^ and/or $ explicitly.
Miller regexes are wrapped with double quotes rather than slashes.
An i after the ending double quote, as in "http.*com"i, indicates a case-insensitive regex.
Capture groups are wrapped with (...) rather than \(...\); use \( and \) to match against parentheses.
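
As a rough illustration of the anchoring and case-insensitivity points, here is a small Python sketch. Miller itself uses Go's regexp package rather than Python's re module, but for patterns this simple the two behave alike. The sample URLs are made up for the example, and re.IGNORECASE stands in for Miller's i suffix.

```python
import re

# Hypothetical sample values; not taken from any Miller data file.
urls = [
    "http://example.com",
    "see http://example.com/page",
    "HTTP://EXAMPLE.COM",
]

# Unanchored: the pattern may match anywhere inside the string.
unanchored = [u for u in urls if re.search(r"http.*com", u)]

# Anchored with ^ and $: the whole string has to match.
anchored = [u for u in urls if re.search(r"^http.*com$", u)]

# Case-insensitive matching, standing in for Miller's i suffix.
insensitive = [u for u in urls if re.search(r"http.*com", u, re.IGNORECASE)]

print(unanchored)
print(anchored)
print(insensitive)
```

Only the first URL survives the anchored pattern, while the unanchored one also accepts the string that merely contains a URL.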

Example:

cat data/regex-in-data.dat

name=jane,regex=^j.*e$
name=bill,regex=^b[ou]ll$
name=bull,regex=^b[ou]ll$

mlr filter '$name =~ $regex' data/regex-in-data.dat

name=jane,regex=^j.*e$
name=bull,regex=^b[ou]ll$
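
The example above takes the regex from a field of each record rather than from a string literal. A minimal Python sketch of the same idea follows; it is only an analogy (Miller uses Go's regexp package), and parse_record is a hypothetical helper that assumes values contain no commas or equals signs.

```python
import re

# The three records from data/regex-in-data.dat, as key=value text.
lines = [
    "name=jane,regex=^j.*e$",
    "name=bill,regex=^b[ou]ll$",
    "name=bull,regex=^b[ou]ll$",
]

def parse_record(line):
    # Split "k1=v1,k2=v2" into a dict (hypothetical helper for this sketch).
    return dict(pair.split("=", 1) for pair in line.split(","))

records = [parse_record(line) for line in lines]

# Keep a record when its own regex field matches its own name field.
matched = [r for r in records if re.search(r["regex"], r["name"])]

for r in matched:
    print(",".join(f"{k}={v}" for k, v in r.items()))
```

The bill record is dropped because i is not in the character class [ou].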

Regex captures

Regex captures of the form \0 through \9 are supported as follows.

Captures have in-function context for sub and gsub. For example, the first \1,\2 pair belongs to the first sub and the second \1,\2 pair belongs to the second sub:

mlr put '$b = sub($a, "(..)_(...)", "\2-\1"); $c = sub($a, "(..)_(.)(..)", ":\1:\2:\3")'

Captures endure for the entirety of a put for the =~ and !=~ operators. For example, here the \1,\2 are set by the =~ operator and are used by both subsequent assignment statements:

mlr put '$a =~ "(..)_(....)"; $b = "left_\1"; $c = "right_\2"'

The captures are not retained across multiple puts. For example, here the \1,\2 won't be expanded from the regex capture:

mlr put '$a =~ "(..)_(....)"' then {... something else ...} then put '$b = "left_\1"; $c = "right_\2"'

Up to nine capture groups are supported: \1 through \9. \0 is the entire matched string, and \15 is treated as \1 followed by an unrelated 5.
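
The capture rules above can be sketched in Python. This is an analogy under stated assumptions, not Miller's implementation: re.sub with count=1 stands in for sub, the field value is made up, and interpolate is a hypothetical helper that expands single-digit references the way the text describes. Its handling of references to groups the regex does not have is a choice made for this sketch only.

```python
import re

a = "ab_cdef"  # hypothetical value of field $a

# sub-like behavior: replace only the first match; the captures are local
# to each call, so \1 and \2 mean different things in the two calls.
b = re.sub(r"(..)_(...)", r"\2-\1", a, count=1)
c = re.sub(r"(..)_(.)(..)", r":\1:\2:\3", a, count=1)

def interpolate(template, match):
    # Expand backslash-digit references (one digit only) from match's groups.
    def expand(ref):
        index = int(ref.group(1))
        # Sketch choice: a reference to a group the regex lacks becomes "".
        if index > match.re.groups:
            return ""
        return match.group(index) or ""
    return re.sub(r"\\([0-9])", expand, template)

# Plays the role of: $a =~ "(..)_(....)"
m = re.search(r"(..)_(....)", a)

left = interpolate(r"left_\1", m)
right = interpolate(r"right_\2", m)
whole = interpolate(r"\0", m)      # the entire match
one_five = interpolate(r"\15", m)  # \1 followed by a literal 5

print(b, c)
print(left, right, whole, one_five)
```

Because only one digit is consumed after the backslash, \15 expands to the first capture followed by the character 5.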

More information

Regular expressions are those supported by the Go regexp package, which follows RE2 syntax except for \C:

go doc regexp/syntax

package syntax // import "regexp/syntax"

Package syntax parses regular expressions into parse trees and compiles
parse trees into programs. Most clients of regular expressions will use the
facilities of package regexp (such as Compile and Match) instead of this
package.


Syntax

The regular expression syntax understood by this package when parsing with
the Perl flag is as follows. Parts of the syntax can be disabled by passing
alternate flags to Parse.

Single characters:

    .              any character, possibly including newline (flag s=true)
    [xyz]          character class
    [^xyz]         negated character class
    \d             Perl character class
    \D             negated Perl character class
    [[:alpha:]]    ASCII character class
    [[:^alpha:]]   negated ASCII character class
    \pN            Unicode character class (one-letter name)
    \p{Greek}      Unicode character class
    \PN            negated Unicode character class (one-letter name)
    \P{Greek}      negated Unicode character class

Composites:

    xy             x followed by y
    x|y            x or y (prefer x)

Repetitions:

    x*             zero or more x, prefer more
    x+             one or more x, prefer more
    x?             zero or one x, prefer one
    x{n,m}         n or n+1 or ... or m x, prefer more
    x{n,}          n or more x, prefer more
    x{n}           exactly n x
    x*?            zero or more x, prefer fewer
    x+?            one or more x, prefer fewer
    x??            zero or one x, prefer zero
    x{n,m}?        n or n+1 or ... or m x, prefer fewer
    x{n,}?         n or more x, prefer fewer
    x{n}?          exactly n x

Implementation restriction: The counting forms x{n,m}, x{n,}, and x{n}
reject forms that create a minimum or maximum repetition count above 1000.
Unlimited repetitions are not subject to this restriction.

Grouping:

    (re)           numbered capturing group (submatch)
    (?P<name>re)   named & numbered capturing group (submatch)
    (?:re)         non-capturing group
    (?flags)       set flags within current group; non-capturing
    (?flags:re)    set flags during re; non-capturing

    Flag syntax is xyz (set) or -xyz (clear) or xy-z (set xy, clear z). The flags are:

    i              case-insensitive (default false)
    m              multi-line mode: ^ and $ match begin/end line in addition to begin/end text (default false)
    s              let . match \n (default false)
    U              ungreedy: swap meaning of x* and x*?, x+ and x+?, etc (default false)

Empty strings:

    ^              at beginning of text or line (flag m=true)
    $              at end of text (like \z not \Z) or line (flag m=true)
    \A             at beginning of text
    \b             at ASCII word boundary (\w on one side and \W, \A, or \z on the other)
    \B             not at ASCII word boundary
    \z             at end of text

Escape sequences:

    \a             bell (== \007)
    \f             form feed (== \014)
    \t             horizontal tab (== \011)
    \n             newline (== \012)
    \r             carriage return (== \015)
    \v             vertical tab character (== \013)
    \*             literal *, for any punctuation character *
    \123           octal character code (up to three digits)
    \x7F           hex character code (exactly two digits)
    \x{10FFFF}     hex character code
    \Q...\E        literal text ... even if ... has punctuation

Character class elements:

    x              single character
    A-Z            character range (inclusive)
    \d             Perl character class
    [:foo:]        ASCII character class foo
    \p{Foo}        Unicode character class Foo
    \pF            Unicode character class F (one-letter name)

Named character classes as character class elements:

    [\d]           digits (== \d)
    [^\d]          not digits (== \D)
    [\D]           not digits (== \D)
    [^\D]          not not digits (== \d)
    [[:name:]]     named ASCII class inside character class (== [:name:])
    [^[:name:]]    named ASCII class inside negated character class (== [:^name:])
    [\p{Name}]     named Unicode property inside character class (== \p{Name})
    [^\p{Name}]    named Unicode property inside negated character class (== \P{Name})

Perl character classes (all ASCII-only):

    \d             digits (== [0-9])
    \D             not digits (== [^0-9])
    \s             whitespace (== [\t\n\f\r ])
    \S             not whitespace (== [^\t\n\f\r ])
    \w             word characters (== [0-9A-Za-z_])
    \W             not word characters (== [^0-9A-Za-z_])

ASCII character classes:

    [[:alnum:]]    alphanumeric (== [0-9A-Za-z])
    [[:alpha:]]    alphabetic (== [A-Za-z])
    [[:ascii:]]    ASCII (== [\x00-\x7F])
    [[:blank:]]    blank (== [\t ])
    [[:cntrl:]]    control (== [\x00-\x1F\x7F])
    [[:digit:]]    digits (== [0-9])
    [[:graph:]]    graphical (== [!-~] == [A-Za-z0-9!"#$%&'()*+,\-./:;<=>?@[\\\]^_`{|}~])
    [[:lower:]]    lower case (== [a-z])
    [[:print:]]    printable (== [ -~] == [ [:graph:]])
    [[:punct:]]    punctuation (== [!-/:-@[-`{-~])
    [[:space:]]    whitespace (== [\t\n\v\f\r ])
    [[:upper:]]    upper case (== [A-Z])
    [[:word:]]     word characters (== [0-9A-Za-z_])
    [[:xdigit:]]   hex digit (== [0-9A-Fa-f])

Unicode character classes are those in unicode.Categories and
unicode.Scripts.

func IsWordChar(r rune) bool
type EmptyOp uint8
    const EmptyBeginLine EmptyOp = 1 << iota ...
    func EmptyOpContext(r1, r2 rune) EmptyOp
type Error struct{ ... }
type ErrorCode string
    const ErrInternalError ErrorCode = "regexp/syntax: internal error" ...
type Flags uint16
    const FoldCase Flags = 1 << iota ...
type Inst struct{ ... }
type InstOp uint8
    const InstAlt InstOp = iota ...
type Op uint8
    const OpNoMatch Op = 1 + iota ...
type Prog struct{ ... }
    func Compile(re *Regexp) (*Prog, error)
type Regexp struct{ ... }
    func Parse(s string, flags Flags) (*Regexp, error)
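
A few of the constructs listed above (greedy versus non-greedy repetition, named groups, and the i and s flags) can be tried out with the short Python sketch below. Python's re module is a different engine from Go's regexp package, so the sketch sticks to syntax the two have in common; the sample text is made up.

```python
import re

text = "<b>bold</b> and <i>italic</i>"  # hypothetical sample text

# x* prefers more: the match runs to the last ">" in the text.
greedy = re.search(r"<.*>", text).group(0)

# x*? prefers fewer: the match stops at the first ">".
lazy = re.search(r"<.*?>", text).group(0)

# (?P<name>re) is both named and numbered.
m = re.search(r"(?P<tag>[a-z]+)>", text)
named = m.group("tag")
numbered = m.group(1)

# (?i) turns on case-insensitive matching.
folded = re.search(r"(?i)BOLD", text) is not None

# By default . does not match a newline; (?s) lets it.
dot_plain = re.search(r"a.b", "a\nb") is not None
dot_s_flag = re.search(r"(?s)a.b", "a\nb") is not None

print(greedy)
print(lazy)
print(named, numbered, folded, dot_plain, dot_s_flag)
```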

One caveat: in strings in "regex position" (for example, the second argument to sub or gsub, or the right-hand side of =~), "\t" means a backslash followed by a t, which is the right thing because the regex engine then interprets the escape itself. In strings in "non-regex position", meaning anywhere else, "\t" becomes the tab character. In other words, if you're familiar with r-strings in Python, all strings in regex position are implicitly r-strings. Generally this is the right thing and should cause little confusion. Note, however, that it means "\t"."\t" in the second argument to sub isn't the same as "\t\t".
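
Since the caveat is phrased in terms of Python r-strings, the following Python sketch shows the distinction it relies on. It illustrates the Python side of the analogy only and makes no claim about how Miller parses its string literals.

```python
import re

raw = r"\t"    # like a string in regex position: a backslash followed by t
cooked = "\t"  # like a string anywhere else: a single tab character

print(len(raw), len(cooked))

# Used as a pattern, the two-character form still matches a tab, because
# the regex engine interprets the \t escape itself.
matches_tab = re.search(raw, "a\tb") is not None
print(matches_tab)
```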