CSV, with and without headers¶
Headerless CSV on input or output¶
Sometimes we get CSV files which lack a header. For example, data/headerless.csv:
cat data/headerless.csv
John,23,present Fred,34,present Alice,56,missing Carol,45,present
You can use Miller to add a header. The --implicit-csv-header
applies positionally indexed labels:
mlr --csv --implicit-csv-header cat data/headerless.csv
1,2,3 John,23,present Fred,34,present Alice,56,missing Carol,45,present
Following that, you can rename the positionally indexed labels to names with meaning for your context. For example:
mlr --csv --implicit-csv-header label name,age,status data/headerless.csv
name,age,status John,23,present Fred,34,present Alice,56,missing Carol,45,present
Likewise, if you need to produce CSV which is lacking its header, you can pipe Miller's output to the system command sed 1d
, or you can use Miller's --headerless-csv-output
option:
head -5 data/colored-shapes.dkvp | mlr --ocsv cat
color,shape,flag,i,u,v,w,x yellow,triangle,1,56,0.632170,0.988721,0.436498,5.798188 red,square,1,80,0.219668,0.001257,0.792778,2.944117 red,circle,1,84,0.209017,0.290052,0.138103,5.065034 red,square,0,243,0.956274,0.746720,0.775542,7.117831 purple,triangle,0,257,0.435535,0.859129,0.812290,5.753095
head -5 data/colored-shapes.dkvp | mlr --ocsv --headerless-csv-output cat
yellow,triangle,1,56,0.632170,0.988721,0.436498,5.798188 red,square,1,80,0.219668,0.001257,0.792778,2.944117 red,circle,1,84,0.209017,0.290052,0.138103,5.065034 red,square,0,243,0.956274,0.746720,0.775542,7.117831 purple,triangle,0,257,0.435535,0.859129,0.812290,5.753095
Lastly, often we say "CSV" or "TSV" when we have positionally indexed data in columns which are separated by commas or tabs, respectively. In this case it's perhaps simpler to just use NIDX format which was designed for this purpose. (See also File Formats.) For example:
mlr --inidx --ifs comma --oxtab cut -f 1,3 data/headerless.csv
1 John 3 present 1 Fred 3 present 1 Alice 3 missing 1 Carol 3 present
Headerless CSV with duplicate field values¶
Miller's default treatment of CSV data, in the absence of any other adjectives, is always CSV with header. However, lots of folks think of CSV data -- comma-separated values -- as just that, values, with no header.
Here's some sample CSV data which is values-only, i.e. headerless:
cat data/nas.csv
-349801.10097848,4537221.43295653,2,1,NA,NA,NA,NA,NA -338681.59578181,4537221.43295653,14,1,13.1,1,0.978,0.964,0.964 -334975.09404959,4537221.43295653,18,1,13.1,1,NA,NA,NA -332195.21775042,4537221.43295653,21,1,13.1,1,0.978,0.974,0.96 -331268.59231736,4537221.43295653,22,1,13.1,1,0.978,0.978,0.962 -330341.96688431,4537221.43295653,23,1,13.1,1,0.978,0.978,0.962 -326635.46515209,4537221.43295653,27,1,13.1,2,0.978,0.972,0.958
There are clearly nine fields here, but if we try to have Miller parse it as CSV, we see something happened:
mlr --csv cat data/nas.csv
-349801.10097848,4537221.43295653,2,1,NA,NA_2,NA_3,NA_4,NA_5 -338681.59578181,4537221.43295653,14,1,13.1,1,0.978,0.964,0.964 -334975.09404959,4537221.43295653,18,1,13.1,1,NA,NA,NA -332195.21775042,4537221.43295653,21,1,13.1,1,0.978,0.974,0.96 -331268.59231736,4537221.43295653,22,1,13.1,1,0.978,0.978,0.962 -330341.96688431,4537221.43295653,23,1,13.1,1,0.978,0.978,0.962 -326635.46515209,4537221.43295653,27,1,13.1,2,0.978,0.972,0.958
What happened?
Miller is (by central design) a mapping from name to value, rather than integer
position to value as in most tools in the Unix toolkit such as sort
, cut
,
awk
, etc. And its default behavior with repeated column/field names is to
append _2
, _3
, etc to dedupe them. So given input Yea=1,Yea=2
on the
same input line, first Yea=1
is stored, then Yea_2=2
. This is in the
input-parser.
Here, the first data line is being seen as a header line, and the repeated NA
values are being seen as duplicate keys that need to be deduplicated.
One solution is to use --implicit-csv-header
, or its shorter alias --hi
:
mlr --csv --hi cat data/nas.csv
1,2,3,4,5,6,7,8,9 -349801.10097848,4537221.43295653,2,1,NA,NA,NA,NA,NA -338681.59578181,4537221.43295653,14,1,13.1,1,0.978,0.964,0.964 -334975.09404959,4537221.43295653,18,1,13.1,1,NA,NA,NA -332195.21775042,4537221.43295653,21,1,13.1,1,0.978,0.974,0.96 -331268.59231736,4537221.43295653,22,1,13.1,1,0.978,0.978,0.962 -330341.96688431,4537221.43295653,23,1,13.1,1,0.978,0.978,0.962 -326635.46515209,4537221.43295653,27,1,13.1,2,0.978,0.972,0.958
Another solution is to use NIDX format:
mlr --inidx --ifs comma --ocsv cat data/nas.csv
1,2,3,4,5,6,7,8,9 -349801.10097848,4537221.43295653,2,1,NA,NA,NA,NA,NA -338681.59578181,4537221.43295653,14,1,13.1,1,0.978,0.964,0.964 -334975.09404959,4537221.43295653,18,1,13.1,1,NA,NA,NA -332195.21775042,4537221.43295653,21,1,13.1,1,0.978,0.974,0.96 -331268.59231736,4537221.43295653,22,1,13.1,1,0.978,0.978,0.962 -330341.96688431,4537221.43295653,23,1,13.1,1,0.978,0.978,0.962 -326635.46515209,4537221.43295653,27,1,13.1,2,0.978,0.972,0.958
Either way, since there is no explicit header, fields are named 1
through 9
. We can use the
label verb to apply more meaningful namees:
mlr --csv --hi cat then label xsn,ysn,x,y,t,a,e29,e31,e32 data/nas.csv
xsn,ysn,x,y,t,a,e29,e31,e32 -349801.10097848,4537221.43295653,2,1,NA,NA,NA,NA,NA -338681.59578181,4537221.43295653,14,1,13.1,1,0.978,0.964,0.964 -334975.09404959,4537221.43295653,18,1,13.1,1,NA,NA,NA -332195.21775042,4537221.43295653,21,1,13.1,1,0.978,0.974,0.96 -331268.59231736,4537221.43295653,22,1,13.1,1,0.978,0.978,0.962 -330341.96688431,4537221.43295653,23,1,13.1,1,0.978,0.978,0.962 -326635.46515209,4537221.43295653,27,1,13.1,2,0.978,0.972,0.958
mlr --inidx --ifs comma --ocsv cat then label xsn,ysn,x,y,t,a,e29,e31,e32 data/nas.csv
xsn,ysn,x,y,t,a,e29,e31,e32 -349801.10097848,4537221.43295653,2,1,NA,NA,NA,NA,NA -338681.59578181,4537221.43295653,14,1,13.1,1,0.978,0.964,0.964 -334975.09404959,4537221.43295653,18,1,13.1,1,NA,NA,NA -332195.21775042,4537221.43295653,21,1,13.1,1,0.978,0.974,0.96 -331268.59231736,4537221.43295653,22,1,13.1,1,0.978,0.978,0.962 -330341.96688431,4537221.43295653,23,1,13.1,1,0.978,0.978,0.962 -326635.46515209,4537221.43295653,27,1,13.1,2,0.978,0.972,0.958
Regularizing ragged CSV¶
Miller handles RFC-4180-compliant CSV: in particular, it's an error if the number of data fields in a given data line don't match the number of header lines. But in the event that you have a CSV file in which some lines have less than the full number of fields, you can use Miller to pad them out. The trick is to use NIDX format, for which each line stands on its own without respect to a header line.
cat data/ragged.csv
a,b,c 1,2,3 4,5 6,7,8,9
mlr --from data/ragged.csv --fs comma --nidx put ' @maxnf = max(@maxnf, NF); @nf = NF; while(@nf < @maxnf) { @nf += 1; $[@nf] = "" } '
a,b,c 1,2,3 4,5, 6,7,8,9
or, more simply,
mlr --from data/ragged.csv --fs comma --nidx put ' @maxnf = max(@maxnf, NF); while(NF < @maxnf) { $[NF+1] = ""; } '
a,b,c 1,2,3 4,5, 6,7,8,9
See also the record-heterogeneity page.