join

The join command combines lines from two files based on a common field. Both input files are expected to be already sorted by that field.

Default join

By default, join combines two files based on the content of the first field (also referred to as the key). Only the lines with common keys will be part of the output.

The key field will be displayed first in the output (this distinction will come into play if the first field isn't the key). The rest of the line will have the remaining fields from the first and second files, in that order. One or more blanks (space or tab) will be considered as the input field separator and a single space will be used as the output field separator. If present, blank characters at the start of the input lines will be ignored.

# sample sorted input files
$ cat shopping_jan.txt
apple   10
banana  20
soap    3
tshirt  3
$ cat shopping_feb.txt
banana  15
fig     100
pen     2
soap    1

# combine common lines based on the first field
$ join shopping_jan.txt shopping_feb.txt
banana 20 15
soap 3 1

If a field value is present multiple times in the same input file, all possible combinations will be present in the output. As shown below, join will also add a final newline character even if it wasn't present in the input.

$ join <(printf 'a f1_x\na f1_y') <(printf 'a f2_x\na f2_y')
a f1_x f2_x
a f1_x f2_y
a f1_y f2_x
a f1_y f2_y

Note that the collating order used for join should be the same as the one used to sort the input files. Use join -i to ignore case, similar to sort -f usage.
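Here's a minimal sketch of the -i option, assuming GNU join. The file names and contents are made up for illustration and the inputs are already in case-insensitive order.

```shell
# hypothetical sample data, created in a temporary directory
tmp=$(mktemp -d)
printf 'Apple 1\nbanana 2\n' > "$tmp/a.txt"
printf 'apple x\nBanana y\n' > "$tmp/b.txt"

# -i compares the key fields case-insensitively
out_i=$(join -i "$tmp/a.txt" "$tmp/b.txt")
printf '%s\n' "$out_i"
# Apple 1 x
# banana 2 y

rm -r "$tmp"
```

Note that the key shown in the output is taken from the first file.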

If the input files are not sorted, join will produce an error if there are unpairable lines. You can use the --nocheck-order option to ignore this error. However, as per the documentation, this option "is not guaranteed to produce any particular output."
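A safer route than --nocheck-order is to sort both files on the key field first. The sketch below uses made-up file names and data, and assumes that forcing the C locale for both sort and join is acceptable for your input. The -k 1b,1 key specification is the one suggested in the GNU join documentation.

```shell
# hypothetical unsorted sample data
tmp=$(mktemp -d)
printf 'soap 3\napple 10\nbanana 20\n' > "$tmp/jan.txt"
printf 'soap 1\nfig 100\nbanana 15\n' > "$tmp/feb.txt"

# sort both files on the key field, using the same locale for sort and join
LC_ALL=C sort -k 1b,1 "$tmp/jan.txt" > "$tmp/jan_sorted.txt"
LC_ALL=C sort -k 1b,1 "$tmp/feb.txt" > "$tmp/feb_sorted.txt"

out_sorted=$(LC_ALL=C join "$tmp/jan_sorted.txt" "$tmp/feb_sorted.txt")
printf '%s\n' "$out_sorted"
# banana 20 15
# soap 3 1

rm -r "$tmp"
```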

Non-matching lines

By default, only the lines having common keys are part of the output. You can use the -a option to also include the non-matching lines from the input files. Use 1 and 2 as the argument for the first and second file respectively. You'll later see how to fill missing fields with a custom string.

# includes non-matching lines from the first file
$ join -a1 shopping_jan.txt shopping_feb.txt
apple 10
banana 20 15
soap 3 1
tshirt 3

# includes non-matching lines from both the files
$ join -a1 -a2 shopping_jan.txt shopping_feb.txt
apple 10
banana 20 15
fig 100
pen 2
soap 3 1
tshirt 3

If you use -v instead of -a, the output will have only the non-matching lines.

$ join -v2 shopping_jan.txt shopping_feb.txt
fig 100
pen 2

$ join -v1 -v2 shopping_jan.txt shopping_feb.txt
apple 10
fig 100
pen 2
tshirt 3

Change field separator

You can use the -t option to specify a single-byte character as the field separator. The output field separator will be the same as the value used for the -t option. Use \0 to specify NUL as the separator. An empty string will cause the entire input line to be treated as the key. Depending on your shell, you can use ANSI-C quoting to specify escapes like \t instead of a literal tab character.

$ cat marks.csv
ECE,Raj,53
ECE,Joel,72
EEE,Moi,68
CSE,Surya,81
EEE,Raj,88
CSE,Moi,62
EEE,Tia,72
ECE,Om,92
CSE,Amy,67
$ cat dept.txt
CSE
ECE

# get all lines from marks.csv based on the first field keys in dept.txt
$ join -t, <(sort marks.csv) dept.txt
CSE,Amy,67
CSE,Moi,62
CSE,Surya,81
ECE,Joel,72
ECE,Om,92
ECE,Raj,53
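If your shell doesn't support ANSI-C quoting, one possible workaround for a tab separator is to generate the character with printf and pass it via a variable. This is only a sketch, and the file names and data are hypothetical.

```shell
# store a literal tab character in a variable
tab=$(printf '\t')

# hypothetical tab separated sample data
tmp=$(mktemp -d)
printf 'CSE\tAmy\nECE\tRaj\n' > "$tmp/names.tsv"
printf 'CSE\t67\nECE\t53\n' > "$tmp/marks.tsv"

# the tab character is used as both the input and output field separator
out_tab=$(join -t "$tab" "$tmp/names.tsv" "$tmp/marks.tsv")
printf '%s\n' "$out_tab"
# the output has two lines with tab separated fields:
# CSE, Amy, 67 and ECE, Raj, 53

rm -r "$tmp"
```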

Files with headers

Use the --header option to exclude the first line of both the input files from sorting considerations. The header lines will be joined and printed as the first line of the output. Without this option, the join command might still work correctly if unpairable lines aren't found, but it is preferable to use --header when applicable. This option will also help when the --check-order option is active.

$ cat report_1.csv
Name,Maths,Physics
Amy,78,95
Moi,88,75
Raj,67,76
$ cat report_2.csv
Name,Chemistry
Amy,85
Joel,78
Raj,72

$ join --check-order -t, report_1.csv report_2.csv
join: report_1.csv:2: is not sorted: Amy,78,95
$ join --check-order --header -t, report_1.csv report_2.csv
Name,Maths,Physics,Chemistry
Amy,78,95,85
Raj,67,76,72

Change key field

By default, the first field of both the input files is used to combine the lines. You can use the -1 and -2 options, each followed by a field number, to specify a different key field for the first and second file respectively. You can use the -j option if the field number is the same for both the files.

Recall that the key field is the first field in the output. You'll later see how to customize the output field order.

$ cat names.txt
Amy
Raj
Tia

# combine based on the second field of the first file
# and the first field of the second file (default)
$ join -t, -1 2 <(sort -t, -k2,2 marks.csv) names.txt
Amy,CSE,67
Raj,ECE,53
Raj,EEE,88
Tia,EEE,72
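The -j option can be sketched as follows, with made-up files in which the second field is the key for both inputs. Both files are already sorted by that field.

```shell
# hypothetical sample data, sorted by the second field
tmp=$(mktemp -d)
printf '101 Amy\n102 Raj\n103 Tia\n' > "$tmp/ids.txt"
printf 'CSE Amy\nEEE Tia\n' > "$tmp/dept_names.txt"

# -j 2 is equivalent to -1 2 -2 2
out_j=$(join -j 2 "$tmp/ids.txt" "$tmp/dept_names.txt")
printf '%s\n' "$out_j"
# Amy 101 CSE
# Tia 103 EEE

rm -r "$tmp"
```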

Customize output field list

Use the -o option to customize the fields required in the output and their order. This is especially useful when the first field isn't the key. Each output field is specified as the file number followed by a . character and then the field number. You can specify multiple fields separated by a , character. As a special case, you can use 0 to indicate the key field.

# output field order is 1st, 2nd and 3rd fields from the first file
$ join -t, -1 2 -o 1.1,1.2,1.3 <(sort -t, -k2,2 marks.csv) names.txt
CSE,Amy,67
ECE,Raj,53
EEE,Raj,88
EEE,Tia,72

# 1st field from the first file, 2nd field from the second file
# and then 2nd and 3rd fields from the first file
$ join --header -t, -o 1.1,2.2,1.2,1.3 report_1.csv report_2.csv
Name,Chemistry,Maths,Physics
Amy,85,78,95
Raj,72,67,76
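The special 0 field specifier can be illustrated with a small sketch. The file names and data here are hypothetical, with the first file already sorted by its second field.

```shell
# hypothetical sample data; the key is the 2nd field of the first file
tmp=$(mktemp -d)
printf 'CSE,Amy,67\nECE,Raj,53\nEEE,Tia,72\n' > "$tmp/marks_by_name.csv"
printf 'Amy\nTia\n' > "$tmp/names.txt"

# 0 refers to the key field, 1.3 is the 3rd field of the first file
out_o=$(join -t, -1 2 -o 0,1.3 "$tmp/marks_by_name.csv" "$tmp/names.txt")
printf '%s\n' "$out_o"
# Amy,67
# Tia,72

rm -r "$tmp"
```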

Same number of output fields

If you use auto as the argument for the -o option, the first line of both the input files will be used to determine the number of output fields. If the other lines have extra fields, they will be discarded.

$ join <(printf 'a 1 2\nb p q r') <(printf 'a 3 4\nb x y z')
a 1 2 3 4
b p q r x y z

$ join -o auto <(printf 'a 1 2\nb p q r') <(printf 'a 3 4\nb x y z')
a 1 2 3 4
b p q x y

If the other lines have fewer fields, the -e option will determine the string to be used as a filler (empty string is the default).

# the second line has two empty fields
$ join -o auto <(printf 'a 1 2\nb p') <(printf 'a 3 4\nb x')
a 1 2 3 4
b p  x 

$ join -o auto -e '-' <(printf 'a 1 2\nb p') <(printf 'a 3 4\nb x')
a 1 2 3 4
b p - x -

As promised earlier, here are some examples of filling fields for non-matching lines:

$ join -o auto -a1 -e 'NA' shopping_jan.txt shopping_feb.txt
apple 10 NA
banana 20 15
soap 3 1
tshirt 3 NA

$ join -o auto -a1 -a2 -e 'NA' shopping_jan.txt shopping_feb.txt
apple 10 NA
banana 20 15
fig NA 100
pen NA 2
soap 3 1
tshirt 3 NA

Set operations

This section covers whole line set operations you can perform on already sorted input files. Equivalent sort and uniq solutions will also be mentioned as comments (useful for unsorted inputs). Assume that there are no duplicate lines within an input file.

These two sorted input files will be used for the examples to follow:

$ paste colors_1.txt colors_2.txt
Blue    Black
Brown   Blue
Orange  Green
Purple  Orange
Red     Pink
Teal    Red
White   White

Here's how you can get union and symmetric difference results. Recall that -t '' will cause the entire input line content to be considered as keys.

# union
# unsorted input: sort -u colors_1.txt colors_2.txt
$ join -t '' -a1 -a2 colors_1.txt colors_2.txt
Black
Blue
Brown
Green
Orange
Pink
Purple
Red
Teal
White

# symmetric difference
# unsorted input: sort colors_1.txt colors_2.txt | uniq -u
$ join -t '' -v1 -v2 colors_1.txt colors_2.txt
Black
Brown
Green
Pink
Purple
Teal

Here's how you can get intersection and difference results. The equivalent comm solutions for sorted input are also mentioned in the comments.

# intersection, same as: comm -12 colors_1.txt colors_2.txt
# unsorted input: sort colors_1.txt colors_2.txt | uniq -d
$ join -t '' colors_1.txt colors_2.txt
Blue
Orange
Red
White

# difference, same as: comm -13 colors_1.txt colors_2.txt
# unsorted input: sort colors_1.txt colors_1.txt colors_2.txt | uniq -u
$ join -t '' -v2 colors_1.txt colors_2.txt
Black
Green
Pink

# difference, same as: comm -23 colors_1.txt colors_2.txt
# unsorted input: sort colors_1.txt colors_2.txt colors_2.txt | uniq -u
$ join -t '' -v1 colors_1.txt colors_2.txt
Brown
Purple
Teal

As mentioned before, join will display all the combinations if there are duplicate entries. Here's an example to show the differences between sort, comm and join solutions for displaying common lines:

$ paste list_1.txt list_2.txt
apple   cherry
banana  cherry
cherry  mango
cherry  papaya
cherry
cherry

# only one entry per common line
$ sort list_1.txt list_2.txt | uniq -d
cherry

# minimum of 'no. of entries in file1' and 'no. of entries in file2'
$ comm -12 list_1.txt list_2.txt
cherry
cherry

# 'no. of entries in file1' multiplied by 'no. of entries in file2'
$ join -t '' list_1.txt list_2.txt
cherry
cherry
cherry
cherry
cherry
cherry
cherry
cherry

NUL separator

Use the -z option if you want to use the NUL character as the line separator. In this scenario, join will add a final NUL character even if it is not present in the input.

$ join -z <(printf 'a 1\0b x') <(printf 'a 2\0b y') | cat -v
a 1 2^@b x y^@

Alternatives

Here are some alternate commands you can explore if join isn't enough to solve your task. These alternatives do not require input to be sorted.

zet — set operations on one or more input files
Comparing lines between files section from my GNU grep ebook
Two file processing chapter from my GNU awk ebook, has examples for both line and field based comparisons
Two file processing chapter from my Perl one-liners ebook, has examples for both line and field based comparisons
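As a rough illustration of the awk approach for unsorted input, here's a minimal sketch. It is not taken from the ebooks listed above, the file names and data are hypothetical, and it assumes two-field lines with unique keys in the first file.

```shell
# hypothetical unsorted sample data
tmp=$(mktemp -d)
printf 'soap 3\napple 10\nbanana 20\n' > "$tmp/jan.txt"
printf 'fig 100\nsoap 1\nbanana 15\n' > "$tmp/feb.txt"

# first file: save the 2nd field using the 1st field as the key
# second file: print combined fields if the key was seen in the first file
out_awk=$(awk 'NR==FNR{qty[$1]=$2; next} $1 in qty{print $1, qty[$1], $2}' \
    "$tmp/jan.txt" "$tmp/feb.txt")
printf '%s\n' "$out_awk"
# soap 3 1
# banana 20 15

rm -r "$tmp"
```

Unlike join, the output here follows the line order of the second file.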

Exercises

The exercises directory has all the files used in this section.

Assume that the input files are already sorted for these exercises.

1) Use appropriate options to get the expected outputs shown below.

# no output
$ join <(printf 'apple 2\nfig 5') <(printf 'Fig 10\nmango 4')

# expected output 1
##### add your solution here
fig 5 10

# expected output 2
##### add your solution here
apple 2
fig 5 10
mango 4

2) Use the join command to display only the non-matching lines based on the first field.

$ cat j1.txt
apple   2
fig     5
lemon   10
tomato  22
$ cat j2.txt
almond  33
fig     115
mango   20
pista   42

# first field items present in j1.txt but not j2.txt
##### add your solution here
apple 2
lemon 10
tomato 22

# first field items present in j2.txt but not j1.txt
##### add your solution here
almond 33
mango 20
pista 42

3) Filter lines from j1.txt and j2.txt that match the items from s1.txt.

$ cat s1.txt
apple
coffee
fig
honey
mango
pasta
sugar
tea

##### add your solution here
apple 2
fig 115
fig 5
mango 20

4) Join the marks_1.csv and marks_2.csv files to get the expected output shown below.

$ cat marks_1.csv
Name,Biology,Programming
Er,92,77
Ith,100,100
Lin,92,100
Sil,86,98
$ cat marks_2.csv
Name,Maths,Physics,Chemistry
Cy,97,98,95
Ith,100,100,100
Lin,78,83,80

##### add your solution here
Name,Biology,Programming,Maths,Physics,Chemistry
Ith,100,100,100,100,100
Lin,92,100,78,83,80

5) By default, the first field is used to combine the lines. Which options are helpful if you want to change the key field to be used for joining?

6) Join the marks_1.csv and marks_2.csv files to get the expected output with specific fields as shown below.

##### add your solution here
Name,Programming,Maths,Biology
Ith,100,100,100
Lin,100,78,92

7) Join the marks_1.csv and marks_2.csv files to get the expected output shown below. Use 50 as the filler data.

##### add your solution here
Name,Biology,Programming,Maths,Physics,Chemistry
Cy,50,50,97,98,95
Er,92,77,50,50,50
Ith,100,100,100,100,100
Lin,92,100,78,83,80
Sil,86,98,50,50,50

8) When you use the -o auto option, what happens to lines that have more fields than the first lines of the input data?

9) From the input files j3.txt and j4.txt, filter only the lines that are unique, i.e. lines that are not common to these files. Assume that the input files do not have duplicate entries.

$ cat j3.txt
almond
apple pie
cold coffee
honey
mango shake
pasta
sugar
tea
$ cat j4.txt
apple
banana shake
coffee
fig
honey
mango shake
milk
tea
yeast

##### add your solution here
almond
apple
apple pie
banana shake
coffee
cold coffee
fig
milk
pasta
sugar
yeast

10) From the input files j3.txt and j4.txt, filter only the lines that are common to these files.

##### add your solution here
honey
mango shake
tea

CLI text processing with GNU Coreutils