Skip to content
Quick links:   Flags   Verbs   Functions   Glossary   Release docs

Questions about joins

Why am I not seeing all possible joins occur?

This section describes behavior before Miller 5.1.0. As of 5.1.0, -u is the default.

For example, the right file here has nine records, and the left file should add in the hostname column -- so the join output should also have 9 records:

mlr --icsvlite --opprint cat data/join-u-left.csv
hostname              ipaddr
nadir.east.our.org    10.3.1.18
zenith.west.our.org   10.3.1.27
apoapsis.east.our.org 10.4.5.94
mlr --icsvlite --opprint cat data/join-u-right.csv
ipaddr    timestamp  bytes
10.3.1.27 1448762579 4568
10.3.1.18 1448762578 8729
10.4.5.94 1448762579 17445
10.3.1.27 1448762589 12
10.3.1.18 1448762588 44558
10.4.5.94 1448762589 8899
10.3.1.27 1448762599 0
10.3.1.18 1448762598 73425
10.4.5.94 1448762599 12200
mlr --icsvlite --opprint join -s -j ipaddr -f data/join-u-left.csv data/join-u-right.csv
ipaddr    hostname              timestamp  bytes
10.3.1.27 zenith.west.our.org   1448762579 4568
10.4.5.94 apoapsis.east.our.org 1448762579 17445
10.4.5.94 apoapsis.east.our.org 1448762589 8899
10.4.5.94 apoapsis.east.our.org 1448762599 12200

The issue is that Miller's join, by default (before 5.1.0), took input sorted (lexically ascending) by the sort keys on both the left and right files. This design decision was made intentionally to parallel the Unix/Linux system join command, which has the same semantics. The benefit of this default is that the joiner program can stream through the left and right files, needing to load neither entirely into memory. The drawback, of course, is that is requires sorted input.

The solution (besides pre-sorting the input files on the join keys) is to simply use mlr join -u (which is now the default). This loads the left file entirely into memory (while the right file is still streamed one line at a time) and does all possible joins without requiring sorted input:

mlr --icsvlite --opprint join -u -j ipaddr -f data/join-u-left.csv data/join-u-right.csv
ipaddr    hostname              timestamp  bytes
10.3.1.27 zenith.west.our.org   1448762579 4568
10.3.1.18 nadir.east.our.org    1448762578 8729
10.4.5.94 apoapsis.east.our.org 1448762579 17445
10.3.1.27 zenith.west.our.org   1448762589 12
10.3.1.18 nadir.east.our.org    1448762588 44558
10.4.5.94 apoapsis.east.our.org 1448762589 8899
10.3.1.27 zenith.west.our.org   1448762599 0
10.3.1.18 nadir.east.our.org    1448762598 73425
10.4.5.94 apoapsis.east.our.org 1448762599 12200

General advice is to make sure the left-file is relatively small, e.g. containing name-to-number mappings, while saving large amounts of data for the right file.

How to rectangularize after joins with unpaired?

Suppose you have the following two data files:

id,code
3,0000ff
2,00ff00
4,ff0000
id,color
4,red
2,green

Joining on color the results are as expected:

mlr --csv join -j id -f data/color-codes.csv data/color-names.csv
id,code,color
4,ff0000,red
2,00ff00,green

However, if we ask for left-unpaireds, since there's no color column, we get a row not having the same column names as the other:

mlr --csv join --ul -j id -f data/color-codes.csv data/color-names.csv
id,code,color
4,ff0000,red
2,00ff00,green

id,code
3,0000ff

To fix this, we can use unsparsify:

mlr --csv join --ul -j id -f data/color-codes.csv \
  then unsparsify --fill-with "" \
  data/color-names.csv
id,code,color
4,ff0000,red
2,00ff00,green
3,0000ff,

Thanks to @aborruso for the tip!

See also the record-heterogeneity page.

Doing multiple joins

Suppose we have the following data:

cat multi-join/input.csv
id,task
10,chop
20,puree
20,wash
30,fold
10,bake
20,mix
10,knead
30,clean

And we want to augment the id column with lookups from the following data files:

cat multi-join/name-lookup.csv
id,name
30,Alice
10,Bob
20,Carol
cat multi-join/status-lookup.csv
id,status
30,occupied
10,idle
20,idle

We can run the input file through multiple join commands in a then-chain:

mlr --icsv --opprint join -f multi-join/name-lookup.csv -j id \
  then join -f multi-join/status-lookup.csv -j id \
  multi-join/input.csv
id status   name  task
10 idle     Bob   chop
20 idle     Carol puree
20 idle     Carol wash
30 occupied Alice fold
10 idle     Bob   bake
20 idle     Carol mix
10 idle     Bob   knead
30 occupied Alice clean
Back to top