..
    PLEASE DO NOT EDIT DIRECTLY. EDIT THE .rst.in FILE PLEASE.

Record-heterogeneity
================================================================

We think of CSV tables as rectangular: if there are 17 columns in the header then there are 17 columns for every row, else the data have a formatting error.

But heterogeneous data abound (in today's NoSQL databases, for example), and Miller handles them both for I/O and for processing.

For I/O
----------------------------------------------------------------

CSV and pretty-print
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Miller simply prints a blank line and a new header when there is a schema change, that is, when a record's list of field names differs from that of the previous record. When there is no schema change, you get CSV per se as a special case. Likewise, Miller reads heterogeneous CSV or pretty-print input the same way. The difference between CSV and CSV-lite is that the former is RFC 4180-compliant, while the latter readily handles heterogeneous data (which is non-compliant). For example:

::

    $ cat data/het.dkvp
    resource=/path/to/file,loadsec=0.45,ok=true
    record_count=100,resource=/path/to/file
    resource=/path/to/second/file,loadsec=0.32,ok=true
    record_count=150,resource=/path/to/second/file
    resource=/some/other/path,loadsec=0.97,ok=false

::

    $ mlr --ocsvlite cat data/het.dkvp
    resource,loadsec,ok
    /path/to/file,0.45,true
    
    record_count,resource
    100,/path/to/file
    
    resource,loadsec,ok
    /path/to/second/file,0.32,true
    
    record_count,resource
    150,/path/to/second/file
    
    resource,loadsec,ok
    /some/other/path,0.97,false

::

    $ mlr --opprint cat data/het.dkvp
    resource      loadsec ok
    /path/to/file 0.45    true
    
    record_count resource
    100          /path/to/file
    
    resource             loadsec ok
    /path/to/second/file 0.32    true
    
    record_count resource
    150          /path/to/second/file
    
    resource         loadsec ok
    /some/other/path 0.97    false
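
The schema-change rule above can be modeled in a few lines of Python. This is only a sketch of the output behavior, not Miller's implementation: ``parse_dkvp`` and ``write_csvlite`` are hypothetical helper names, records are modeled as insertion-ordered dicts, and field quoting is ignored.

```python
def parse_dkvp(line):
    """Parse one comma-separated key=value line into an ordered dict."""
    return dict(pair.split("=", 1) for pair in line.split(","))


def write_csvlite(records):
    """Render records as CSV-lite text, starting a new header block
    whenever the list of field names changes."""
    lines = []
    prev_keys = None
    for rec in records:
        keys = list(rec.keys())
        if keys != prev_keys:
            if prev_keys is not None:
                lines.append("")  # blank line between schema blocks
            lines.append(",".join(keys))
            prev_keys = keys
        lines.append(",".join(rec.values()))
    return "\n".join(lines)


records = [parse_dkvp(line) for line in [
    "resource=/path/to/file,loadsec=0.45,ok=true",
    "record_count=100,resource=/path/to/file",
    "resource=/path/to/second/file,loadsec=0.32,ok=true",
]]
print(write_csvlite(records))
```

When every record has the same keys, the loop emits exactly one header, which is the plain-CSV special case mentioned above.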

Miller handles explicit header changes as just shown. If your CSV input contains ragged data -- that is, implicit header changes, where a data line has fewer or more fields than the header -- you can use ``--allow-ragged-csv-input`` (or the keystroke-saver ``--ragged``). For too-short data lines, the missing values are filled with the empty string; for too-long data lines, the missing field names are replaced with positional indices (counting up from 1, not 0), as follows:

::

    $ cat data/ragged.csv
    a,b,c
    1,2,3
    4,5
    6,7,8,9

::

    $ mlr --icsv --oxtab --allow-ragged-csv-input cat data/ragged.csv
    a 1
    b 2
    c 3
    
    a 4
    b 5
    c 
    
    a 6
    b 7
    c 8
    4 9
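
The fill-and-index rule for ragged lines can be sketched as follows. This is an illustrative model under simplifying assumptions, not Miller's reader: ``read_ragged_csv`` is a hypothetical name, and fields are split naively on commas with no support for quoting.

```python
def read_ragged_csv(text):
    """Read CSV text whose data lines may be shorter or longer than the header."""
    lines = text.strip().split("\n")
    header = lines[0].split(",")
    records = []
    for line in lines[1:]:
        values = line.split(",")
        rec = {}
        for i, value in enumerate(values):
            # Past the end of the header, the key is the 1-up position.
            key = header[i] if i < len(header) else str(i + 1)
            rec[key] = value
        # Too-short line: remaining header fields get the empty string.
        for key in header[len(values):]:
            rec[key] = ""
        records.append(rec)
    return records


for rec in read_ragged_csv("a,b,c\n1,2,3\n4,5\n6,7,8,9\n"):
    print(rec)
```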

You may also find Miller's ``group-like`` verb handy: it outputs records in batches having identical field names (see also :doc:`reference`):

::

    $ mlr --ocsvlite group-like data/het.dkvp
    resource,loadsec,ok
    /path/to/file,0.45,true
    /path/to/second/file,0.32,true
    /some/other/path,0.97,false
    
    record_count,resource
    100,/path/to/file
    150,/path/to/second/file

::

    $ mlr --opprint group-like data/het.dkvp
    resource             loadsec ok
    /path/to/file        0.45    true
    /path/to/second/file 0.32    true
    /some/other/path     0.97    false
    
    record_count resource
    100          /path/to/file
    150          /path/to/second/file
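
The batching seen in the ``group-like`` output can be modeled by bucketing records on their list of field names. The sketch below is an assumption about the observable behavior only; ``group_like`` and ``parse_dkvp`` are hypothetical Python helpers, and buckets are emitted in order of first appearance, which matches the output shown above.

```python
def parse_dkvp(line):
    """Parse one comma-separated key=value line into an ordered dict."""
    return dict(pair.split("=", 1) for pair in line.split(","))


def group_like(records):
    """Return records batched by identical field-name lists."""
    groups = {}  # dicts keep insertion order, so batches keep first-seen order
    for rec in records:
        groups.setdefault(tuple(rec.keys()), []).append(rec)
    return [rec for batch in groups.values() for rec in batch]


records = [parse_dkvp(line) for line in [
    "resource=/path/to/file,loadsec=0.45,ok=true",
    "record_count=100,resource=/path/to/file",
    "resource=/path/to/second/file,loadsec=0.32,ok=true",
    "record_count=150,resource=/path/to/second/file",
    "resource=/some/other/path,loadsec=0.97,ok=false",
]]
for rec in group_like(records):
    print(rec)
```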

Key-value-pair, vertical-tabular, and index-numbered formats
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For these formats, record-heterogeneity comes naturally, since each record is written on its own terms with no header shared across records:

::

    $ cat data/het.dkvp
    resource=/path/to/file,loadsec=0.45,ok=true
    record_count=100,resource=/path/to/file
    resource=/path/to/second/file,loadsec=0.32,ok=true
    record_count=150,resource=/path/to/second/file
    resource=/some/other/path,loadsec=0.97,ok=false

::

    $ mlr --onidx --ofs ' ' cat data/het.dkvp
    /path/to/file 0.45 true
    100 /path/to/file
    /path/to/second/file 0.32 true
    150 /path/to/second/file
    /some/other/path 0.97 false

::

    $ mlr --oxtab cat data/het.dkvp
    resource /path/to/file
    loadsec  0.45
    ok       true
    
    record_count 100
    resource     /path/to/file
    
    resource /path/to/second/file
    loadsec  0.32
    ok       true
    
    record_count 150
    resource     /path/to/second/file
    
    resource /some/other/path
    loadsec  0.97
    ok       false

::

    $ mlr --oxtab group-like data/het.dkvp
    resource /path/to/file
    loadsec  0.45
    ok       true
    
    resource /path/to/second/file
    loadsec  0.32
    ok       true
    
    resource /some/other/path
    loadsec  0.97
    ok       false
    
    record_count 100
    resource     /path/to/file
    
    record_count 150
    resource     /path/to/second/file
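
To see why no special handling is needed here, consider a sketch of a vertical-tabular writer. ``write_xtab`` is a hypothetical Python helper, not Miller code; it pads keys to the widest key within each record, as in the XTAB output above.

```python
def write_xtab(records):
    """Render each record as key/value lines, aligned per record."""
    blocks = []
    for rec in records:
        width = max((len(key) for key in rec), default=0)
        blocks.append(
            "\n".join(f"{key.ljust(width)} {value}" for key, value in rec.items())
        )
    # Records are separated by a blank line.
    return "\n\n".join(blocks)


print(write_xtab([
    {"resource": "/path/to/file", "loadsec": "0.45", "ok": "true"},
    {"record_count": "100", "resource": "/path/to/file"},
]))
```

Because every record is formatted independently, a change of field names from one record to the next needs no extra bookkeeping.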

For processing
----------------------------------------------------------------

Miller operates on specified fields and takes the rest along. For example, if you are sorting on the ``count`` field, then all records in the input stream must have a ``count`` field, but the other fields can vary. Moreover, the sorted-on field name(s) don't need to be in the same position on each line:

::

    $ cat data/sort-het.dkvp
    count=500,color=green
    count=600
    status=ok,count=250,hours=0.22
    status=ok,count=200,hours=3.4
    count=300,color=blue
    count=100,color=green
    count=450

::

    $ mlr sort -n count data/sort-het.dkvp
    count=100,color=green
    status=ok,count=200,hours=3.4
    status=ok,count=250,hours=0.22
    count=300,color=blue
    count=450
    count=500,color=green
    count=600
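
The same idea can be sketched in Python: sort on one named field and carry every other field along untouched. This is a model of the behavior above rather than Miller's implementation; ``sort_numeric`` and ``parse_dkvp`` are hypothetical names, and the sketch assumes every record has the sort field, as stated above.

```python
def parse_dkvp(line):
    """Parse one comma-separated key=value line into an ordered dict."""
    return dict(pair.split("=", 1) for pair in line.split(","))


def sort_numeric(records, field):
    """Sort records by the numeric value of one field, ascending.
    Other fields, and the position of the sort field, are left as they are."""
    return sorted(records, key=lambda rec: float(rec[field]))


records = [parse_dkvp(line) for line in [
    "count=500,color=green",
    "count=600",
    "status=ok,count=250,hours=0.22",
    "status=ok,count=200,hours=3.4",
    "count=300,color=blue",
    "count=100,color=green",
    "count=450",
]]
for rec in sort_numeric(records, "count"):
    print(",".join(f"{key}={value}" for key, value in rec.items()))
```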