Skip to content

CSV, with and without headers

Headerless CSV on input or output

Sometimes we get CSV files which lack a header. For example, data/headerless.csv:

cat data/headerless.csv
John,23,present
Fred,34,present
Alice,56,missing
Carol,45,present

You can use Miller to add a header. The --implicit-csv-header applies positionally indexed labels:

mlr --csv --implicit-csv-header cat data/headerless.csv
1,2,3
John,23,present
Fred,34,present
Alice,56,missing
Carol,45,present

Following that, you can rename the positionally indexed labels to names with meaning for your context. For example:

mlr --csv --implicit-csv-header label name,age,status data/headerless.csv
name,age,status
John,23,present
Fred,34,present
Alice,56,missing
Carol,45,present

Likewise, if you need to produce CSV which is lacking its header, you can pipe Miller's output to the system command sed 1d, or you can use Miller's --headerless-csv-output option:

head -5 data/colored-shapes.dkvp | mlr --ocsv cat
color,shape,flag,i,u,v,w,x
yellow,triangle,1,56,0.632170,0.988721,0.436498,5.798188
red,square,1,80,0.219668,0.001257,0.792778,2.944117
red,circle,1,84,0.209017,0.290052,0.138103,5.065034
red,square,0,243,0.956274,0.746720,0.775542,7.117831
purple,triangle,0,257,0.435535,0.859129,0.812290,5.753095
head -5 data/colored-shapes.dkvp | mlr --ocsv --headerless-csv-output cat
yellow,triangle,1,56,0.632170,0.988721,0.436498,5.798188
red,square,1,80,0.219668,0.001257,0.792778,2.944117
red,circle,1,84,0.209017,0.290052,0.138103,5.065034
red,square,0,243,0.956274,0.746720,0.775542,7.117831
purple,triangle,0,257,0.435535,0.859129,0.812290,5.753095

Lastly, often we say "CSV" or "TSV" when we have positionally indexed data in columns which are separated by commas or tabs, respectively. In this case it's perhaps simpler to just use NIDX format which was designed for this purpose. (See also File Formats.) For example:

mlr --inidx --ifs comma --oxtab cut -f 1,3 data/headerless.csv
1 John
3 present

1 Fred
3 present

1 Alice
3 missing

1 Carol
3 present

Headerless CSV with duplicate field values

Miller's default treatment of CSV data, in the absence of any other adjectives, is always CSV with header. However, lots of folks think of CSV data -- comma-separated values -- as just that, values, with no header.

Here's some sample CSV data which is values-only, i.e. headerless:

cat data/nas.csv
-349801.10097848,4537221.43295653,2,1,NA,NA,NA,NA,NA
-338681.59578181,4537221.43295653,14,1,13.1,1,0.978,0.964,0.964
-334975.09404959,4537221.43295653,18,1,13.1,1,NA,NA,NA
-332195.21775042,4537221.43295653,21,1,13.1,1,0.978,0.974,0.96
-331268.59231736,4537221.43295653,22,1,13.1,1,0.978,0.978,0.962
-330341.96688431,4537221.43295653,23,1,13.1,1,0.978,0.978,0.962
-326635.46515209,4537221.43295653,27,1,13.1,2,0.978,0.972,0.958

There are clearly nine fields here, but if we try to have Miller parse it as CSV, we see there are fewer than nine columns:

mlr --csv cat data/nas.csv
-349801.10097848,4537221.43295653,2,1,NA
-338681.59578181,4537221.43295653,14,1,0.964
-334975.09404959,4537221.43295653,18,1,NA
-332195.21775042,4537221.43295653,21,1,0.96
-331268.59231736,4537221.43295653,22,1,0.962
-330341.96688431,4537221.43295653,23,1,0.962
-326635.46515209,4537221.43295653,27,1,0.958

What happened?

Miller is (by central design) a mapping from name to value, rather than integer position to value as in most tools in the Unix toolkit such as sort, cut, awk, etc. So given input Yea=1,Yea=2 on the same input line, first Yea=1 is stored, then updated with Yea=2. This is in the input-parser and the value Yea=1 is unavailable to any further processing.

Here, the first data line is being seen as a header ine, and the repeated NA values are being seen as duplicate keys.

One solution is to use --implicit-csv-header, or its shorter alias --hi:

mlr --csv --hi cat data/nas.csv
1,2,3,4,5,6,7,8,9
-349801.10097848,4537221.43295653,2,1,NA,NA,NA,NA,NA
-338681.59578181,4537221.43295653,14,1,13.1,1,0.978,0.964,0.964
-334975.09404959,4537221.43295653,18,1,13.1,1,NA,NA,NA
-332195.21775042,4537221.43295653,21,1,13.1,1,0.978,0.974,0.96
-331268.59231736,4537221.43295653,22,1,13.1,1,0.978,0.978,0.962
-330341.96688431,4537221.43295653,23,1,13.1,1,0.978,0.978,0.962
-326635.46515209,4537221.43295653,27,1,13.1,2,0.978,0.972,0.958

Another solution is to use NIDX format:

mlr --inidx --ifs comma --ocsv cat data/nas.csv
1,2,3,4,5,6,7,8,9
-349801.10097848,4537221.43295653,2,1,NA,NA,NA,NA,NA
-338681.59578181,4537221.43295653,14,1,13.1,1,0.978,0.964,0.964
-334975.09404959,4537221.43295653,18,1,13.1,1,NA,NA,NA
-332195.21775042,4537221.43295653,21,1,13.1,1,0.978,0.974,0.96
-331268.59231736,4537221.43295653,22,1,13.1,1,0.978,0.978,0.962
-330341.96688431,4537221.43295653,23,1,13.1,1,0.978,0.978,0.962
-326635.46515209,4537221.43295653,27,1,13.1,2,0.978,0.972,0.958

Either way, since there is no explicit header, fields are named 1 through 9. We can use the label verb to apply more meaningful namees:

mlr --csv --hi cat then label xsn,ysn,x,y,t,a,e29,e31,e32 data/nas.csv
xsn,ysn,x,y,t,a,e29,e31,e32
-349801.10097848,4537221.43295653,2,1,NA,NA,NA,NA,NA
-338681.59578181,4537221.43295653,14,1,13.1,1,0.978,0.964,0.964
-334975.09404959,4537221.43295653,18,1,13.1,1,NA,NA,NA
-332195.21775042,4537221.43295653,21,1,13.1,1,0.978,0.974,0.96
-331268.59231736,4537221.43295653,22,1,13.1,1,0.978,0.978,0.962
-330341.96688431,4537221.43295653,23,1,13.1,1,0.978,0.978,0.962
-326635.46515209,4537221.43295653,27,1,13.1,2,0.978,0.972,0.958
mlr --inidx --ifs comma --ocsv cat then label xsn,ysn,x,y,t,a,e29,e31,e32 data/nas.csv
xsn,ysn,x,y,t,a,e29,e31,e32
-349801.10097848,4537221.43295653,2,1,NA,NA,NA,NA,NA
-338681.59578181,4537221.43295653,14,1,13.1,1,0.978,0.964,0.964
-334975.09404959,4537221.43295653,18,1,13.1,1,NA,NA,NA
-332195.21775042,4537221.43295653,21,1,13.1,1,0.978,0.974,0.96
-331268.59231736,4537221.43295653,22,1,13.1,1,0.978,0.978,0.962
-330341.96688431,4537221.43295653,23,1,13.1,1,0.978,0.978,0.962
-326635.46515209,4537221.43295653,27,1,13.1,2,0.978,0.972,0.958

Regularizing ragged CSV

Miller handles RFC-4180-compliant CSV: in particular, it's an error if the number of data fields in a given data line don't match the number of header lines. But in the event that you have a CSV file in which some lines have less than the full number of fields, you can use Miller to pad them out. The trick is to use NIDX format, for which each line stands on its own without respect to a header line.

cat data/ragged.csv
a,b,c
1,2,3
4,5
6,7,8,9
mlr --from data/ragged.csv --fs comma --nidx put '
  @maxnf = max(@maxnf, NF);
  @nf = NF;
  while(@nf < @maxnf) {
    @nf += 1;
    $[@nf] = ""
  }
'
a,b,c
1,2,3
4,5
6,7,8,9

or, more simply,

mlr --from data/ragged.csv --fs comma --nidx put '
  @maxnf = max(@maxnf, NF);
  while(NF < @maxnf) {
    $[NF+1] = "";
  }
'
a,b,c
1,2,3
4,5
6,7,8,9

See also the record-heterogeneity page.