Compressed data¶
As of Miller 6, Miller supports reading GZIP, BZIP2, and
ZLIB formats transparently, and in-process. And (as before Miller 6) you have a
more general --prepipe
option to support other decompression programs.
Automatic detection on input¶
If your files end in .gz
, .bz2
, or .z
then Miller will autodetect by file extension:
file gz-example.csv.gz
gz-example.csv.gz: gzip compressed data, was "gz-example.csv", last modified: Mon Aug 23 02:04:34 2021, from Unix, original size modulo 2^32 429
mlr --csv sort -f color gz-example.csv.gz
color,shape,flag,k,index,quantity,rate purple,triangle,false,5,51,81.2290,8.5910 purple,triangle,false,7,65,80.1405,5.8240 purple,square,false,10,91,72.3735,8.2430 red,square,true,2,15,79.2778,0.0130 red,circle,true,3,16,13.8103,2.9010 red,square,false,4,48,77.5542,7.4670 red,square,false,6,64,77.1991,9.5310 yellow,triangle,true,1,11,43.6498,9.8870 yellow,circle,true,8,73,63.9785,4.2370 yellow,circle,true,9,87,63.5058,8.3350
This will decompress the input data on the fly, while leaving the disk file unmodified. This helps you save disk space, at the cost of some additional runtime CPU usage to decompress the data.
Manual detection on input¶
If the filename doesn't in in .gz
, .bz2
, or .z
then you can use the flags --gzin
, --bz2in
, or --zin
to let Miller know:
mlr --csv --gzin sort -f color myfile.bin # myfile.bin has gzip contents
External decompressors on input¶
Using the --prepipe
flag, you can provide the name of any decompression
program in your $PATH
and Miller will run it on each input file, effectively
piping the standard output of that program to Miller's standard input.
You can, of course, already do without this for single input files, for example:
gunzip < gz-example.csv.gz | mlr --csv sort -f color
color,shape,flag,k,index,quantity,rate purple,triangle,false,5,51,81.2290,8.5910 purple,triangle,false,7,65,80.1405,5.8240 purple,square,false,10,91,72.3735,8.2430 red,square,true,2,15,79.2778,0.0130 red,circle,true,3,16,13.8103,2.9010 red,square,false,4,48,77.5542,7.4670 red,square,false,6,64,77.1991,9.5310 yellow,triangle,true,1,11,43.6498,9.8870 yellow,circle,true,8,73,63.9785,4.2370 yellow,circle,true,9,87,63.5058,8.3350
The benefit of --prepipe
is that Miller will run the specified program once per
file, respecting file boundaries.
The prepipe command can be anything which reads from standard input and produces data acceptable to Miller. Nominally this allows you to use whichever decompression utilities you have installed on your system, on a per-file basis.
If the command has flags, quote them: e.g. mlr --prepipe 'zcat -cf'
.
In your .mlrrc file, --prepipe
and --prepipex
are not
allowed as they could be used for unexpected code execution. You can use
--prepipe-bz2
, --prepipe-gunzip
, and --prepipe-zcat
in .mlrrc
, though.
Note that this feature is quite general and is not limited to decompression
utilities. You can use it to apply per-file filters of your choice: e.g. mlr
--prepipe 'head -n 10' ...
, if you like.
There is a --prepipe
and a --prepipex
:
- If the command normally runs with
nameofprogram < filename.ext
(such asgunzip
orzcat -cf
orxz -cd
) then use--prepipe
. - If the command normally runs with
nameofprogram filename.ext
(such asunzip -qc
) then use--prepipex
.
Lastly, note that if --prepipe
or --prepipex
is specified on the Miller
command line, it replaces any autodetect decisions that might have been made
based on the filename extension. Likewise, --gzin
/--bz2in
/--zin
are ignored if
--prepipe
or --prepipex
is also specified.
Compressed output¶
Everything said so far on this page has to do with compressed input.
For compressed output:
-
Normally Miller output is to stdout, so you can pipe the output:
mlr sort -n quantity foo.csv | gzip > sorted.csv.gz
. -
For
tee
statements, which write output to files rather than stdout, usetee
's redirect syntax:
mlr --from example.csv --csv put -q ' filename = $color.".csv.gz"; tee | "gzip > ".filename, $* '
file red.csv.gz purple.csv.gz yellow.csv.gz
red.csv.gz: gzip compressed data, last modified: Mon Aug 23 02:34:05 2021, from Unix, original size modulo 2^32 185 purple.csv.gz: gzip compressed data, last modified: Mon Aug 23 02:34:05 2021, from Unix, original size modulo 2^32 164 yellow.csv.gz: gzip compressed data, last modified: Mon Aug 23 02:34:05 2021, from Unix, original size modulo 2^32 158
mlr --csv cat yellow.csv.gz
color,shape,flag,k,index,quantity,rate yellow,triangle,true,1,11,43.6498,9.8870 yellow,circle,true,8,73,63.9785,4.2370 yellow,circle,true,9,87,63.5058,8.3350
- Using the in-place flag
-I
, the overwritten file will be compressed when possible. See the page on in-place mode for details.