Quick links: Flags Verbs Functions Glossary Release docs

Record-heterogeneity

We think of CSV tables as rectangular: if there are 17 columns in the header then there are 17 columns for every row, else the data have a formatting error.

But heterogeneous data abound -- log-file entries, JSON documents, no-SQL databases such as MongoDB, etc. -- not to mention data-cleaning opportunities we'll look at in this page. Miller offers several ways to handle data heterogeneity.

Terminology, examples, and solutions

Different kinds of heterogeneous data include ragged, irregular, and sparse.

Homogeneous/rectangular data

A homogeneous list of records is one in which all records have the same keys, in the same order. For example, here is a well-formed CSV file:

mlr --csv cat data/het/hom.csv

a,b,c
1,2,3
4,5,6
7,8,9

It has three records (written here using JSON Lines formatting):

mlr --icsv --ojsonl cat data/het/hom.csv

{"a": 1, "b": 2, "c": 3}
{"a": 4, "b": 5, "c": 6}
{"a": 7, "b": 8, "c": 9}

Here every row has the same keys, in the same order: a,b,c.

These are also sometimes called rectangular since if we pretty-print them we get a nice rectangle:

mlr --icsv --opprint cat data/het/hom.csv

Fillable data

A second example has some empty cells which could be filled:

mlr --csv cat data/het/fillable.csv

a,b,c
1,2,3
4,,6
,8,9

mlr --icsv --ojsonl cat data/het/fillable.csv

{"a": 1, "b": 2, "c": 3}
{"a": 4, "b": "", "c": 6}
{"a": "", "b": 8, "c": 9}

This example is still homogeneous, though: every row has the same keys, in the same order: a,b,c. Empty values don't make the data heterogeneous.

Note however that we can use the fill-down verb to make these values non-empty, if we like:

mlr --icsv --opprint fill-empty -v filler data/het/fillable.csv

a      b      c
1      2      3
4      filler 6
filler 8      9

Ragged data

Next let's look at non-well-formed CSV files. For a third example:

cat data/het/ragged.csv

a,b,c
1,2,3
4,5
7,8,9,10

If you mlr csv cat this, you'll get an error message:

mlr --csv cat data/het/ragged.csv

mlr :  mlr: CSV header/data length mismatch 3 != 2 at filename data/het/ragged.csv row 3.

There are two kinds of raggedness here. Since CSVs form records by zipping the keys from the header line together with the values from each data line, the second record has a missing value for key c (which ought to be fillable), while the third record has a value 10 with no key for it.

Using the --allow-ragged-csv-input flag we can fill values in too-short rows, and provide a key (column number starting with 1) for too-long rows:

mlr --icsv --ojson --allow-ragged-csv-input cat data/het/ragged.csv

[
{
  "a": 1,
  "b": 2,
  "c": 3
},
{
  "a": 4,
  "b": 5,
  "c": ""
},
{
  "a": 7,
  "b": 8,
  "c": 9,
  "4": 10
}
]

Irregular data

Here's another situation -- this file has, in some sense, the "same" data as our ragged.csv example above:

cat data/het/irregular.json

{"a": 1, "b": 2, "c": 3}
{"c": 6, "a": 4, "b": 5}
{"b": 8, "c": 9, "a": 7}

For example, on the second record, a is 4, b is 5, c is 6. But this data is heterogeneous because the keys a,b,c aren't in the same order in each record.

This kind of data arises often in practice. One reason is that, while many programming languages (including the Miller DSL) preserve insertion order in maps; others do not. So someone might have written {"a":4,"b":5,"c":6} in the source code, but the data may not have printed that way into a given data file.

We can use the regularize or sort-within-records verb to order the keys:

mlr --jsonl regularize data/het/irregular.json

{"a": 1, "b": 2, "c": 3}
{"a": 4, "b": 5, "c": 6}
{"a": 7, "b": 8, "c": 9}

The regularize verb tries to re-order subsequent rows to look like the first (whatever order that is); the sort-within-records verb simply uses alphabetical order (which is the same in the above example where the first record has keys in the order a,b,c).

Sparse data

Here's another frequently occurring situation -- quite often, systems will log data for items which are present, but won't log data for items which aren't.

mlr --json cat data/het/sparse.json

[
{
  "host": "xy01.east",
  "status": "running",
  "volume": "/dev/sda1"
},
{
  "host": "xy92.west",
  "status": "running"
},
{
  "purpose": "failover",
  "host": "xy55.east",
  "volume": "/dev/sda1",
  "reimaged": true
}
]

This data is called sparse (from the data-storage term).

We can use the unsparsify verb to make sure every record has the same keys:

mlr --json unsparsify data/het/sparse.json

[
{
  "host": "xy01.east",
  "status": "running",
  "volume": "/dev/sda1",
  "purpose": "",
  "reimaged": ""
},
{
  "host": "xy92.west",
  "status": "running",
  "volume": "",
  "purpose": "",
  "reimaged": ""
},
{
  "host": "xy55.east",
  "status": "",
  "volume": "/dev/sda1",
  "purpose": "failover",
  "reimaged": true
}
]

Since this data is now homogeneous (rectangular), it pretty-prints nicely:

mlr --ijson --opprint unsparsify data/het/sparse.json

host      status  volume    purpose  reimaged
xy01.east running /dev/sda1 -        -
xy92.west running -         -        -
xy55.east -       /dev/sda1 failover true

Reading and writing heterogeneous data

In the previous sections we saw different kinds of data heterogeneity, and ways to transform the data to make it homogeneous.

Non-rectangular file formats: JSON, XTAB, NIDX, DKVP

For these formats, record-heterogeneity comes naturally:

cat data/het/sparse.json

{
  "host": "xy01.east",
  "status": "running",
  "volume": "/dev/sda1"
}
{
  "host": "xy92.west",
  "status": "running"
}
{
  "purpose": "failover",
  "host": "xy55.east",
  "volume": "/dev/sda1",
  "reimaged": true
}

mlr --ijson --onidx --ofs ' ' cat data/het/sparse.json

xy01.east running /dev/sda1
xy92.west running
failover xy55.east /dev/sda1 true

mlr --ijson --oxtab cat data/het/sparse.json

host   xy01.east
status running
volume /dev/sda1

host   xy92.west
status running

purpose  failover
host     xy55.east
volume   /dev/sda1
reimaged true

mlr --ijson --odkvp cat data/het/sparse.json

host=xy01.east,status=running,volume=/dev/sda1
host=xy92.west,status=running
purpose=failover,host=xy55.east,volume=/dev/sda1,reimaged=true

Even then, we may wish to put like with like, using the group-like verb:

mlr --ijson --odkvp cat data/het.json

resource=/path/to/file,loadsec=0.45,ok=true
record_count=100,resource=/path/to/file
resource=/path/to/second/file,loadsec=0.32,ok=true
record_count=150,resource=/path/to/second/file
resource=/some/other/path,loadsec=0.97,ok=false

mlr --ijson --odkvp group-like data/het.json

resource=/path/to/file,loadsec=0.45,ok=true
resource=/path/to/second/file,loadsec=0.32,ok=true
resource=/some/other/path,loadsec=0.97,ok=false
record_count=100,resource=/path/to/file
record_count=150,resource=/path/to/second/file

Rectangular file formats: CSV and pretty-print

CSV and pretty-print formats expect rectangular structure. But Miller lets you process non-rectangular using CSV and pretty-print.

Miller simply prints a newline and a new header when there is a schema change -- where by schema we mean simply the list of record keys in the order they are encountered. When there is no schema change, you get CSV per se as a special case. Likewise, Miller reads heterogeneous CSV or pretty-print input the same way. The difference between CSV and CSV-lite is that the former is RFC-4180-compliant, while the latter readily handles heterogeneous data (which is non-compliant). For example:

cat data/het.json

{
  "resource": "/path/to/file",
  "loadsec": 0.45,
  "ok": true
}
{
  "record_count": 100,
  "resource": "/path/to/file"
}
{
  "resource": "/path/to/second/file",
  "loadsec": 0.32,
  "ok": true
}
{
  "record_count": 150,
  "resource": "/path/to/second/file"
}
{
  "resource": "/some/other/path",
  "loadsec": 0.97,
  "ok": false
}

mlr --ijson --opprint cat data/het.json

resource      loadsec ok
/path/to/file 0.45    true

record_count resource
100          /path/to/file

resource             loadsec ok
/path/to/second/file 0.32    true

record_count resource
150          /path/to/second/file

resource         loadsec ok
/some/other/path 0.97    false

mlr --ijson --opprint group-like data/het.json

resource             loadsec ok
/path/to/file        0.45    true
/path/to/second/file 0.32    true
/some/other/path     0.97    false

record_count resource
100          /path/to/file
150          /path/to/second/file

Miller handles explicit header changes as just shown. If your CSV input contains ragged data -- if there are implicit header changes (no intervening blank line and new header line) as seen above -- you can use --allow-ragged-csv-input (or keystroke-saver --ragged).

mlr --csv --ragged cat data/het/ragged.csv

a,b,c
1,2,3
4,5,

a,b,c,4
7,8,9,10

Processing heterogeneous data

Above we saw how to make heterogeneous data homogeneous, and then how to print heterogeneous data. As for other processing, record-heterogeneity is not a problem for Miller.

Miller operates on specified fields and takes the rest along: for example, if you are sorting on the count field then all records in the input stream must have a count field but the other fields can vary, and moreover the sorted-on field name(s) don't need to be in the same position on each line:

cat data/sort-het.dkvp

count=500,color=green
count=600
status=ok,count=250,hours=0.22
status=ok,count=200,hours=3.4
count=300,color=blue
count=100,color=green
count=450

mlr sort -n count data/sort-het.dkvp

count=100,color=green
status=ok,count=200,hours=3.4
status=ok,count=250,hours=0.22
count=300,color=blue
count=450
count=500,color=green
count=600