The comm command can be used to compare two files line by line. It’s particularly useful when writing shell scripts. Take for example the following two files:
$ cat file1
a
b
c
$ cat file2
b
c
d
e
You can quickly see which lines are common to the two files and which are present in only one:
$ comm file1 file2
a
		b
		c
	d
	e
The first column lists the lines present only in the first file, the second column those present only in the second file, and the third shows the lines that are identical in both files.
Keeping Things In Order
Before we delve further, it’s important to note that one of comm's restrictions is that the input files must be sorted. That is easily rectified using sort (without any extra options). For example:
$ # First, randomise file2 and save as file3:
$ shuf file2 | tee file3
d
e
b
c
$ comm file1 file3
a
b
c
	d
	e
comm: file 2 is not in sorted order
	b
	c
$ # Now sort file3 before using it:
$ comm file1 <(sort file3)
a
		b
		c
	d
	e
When the input files are not sorted, the output of comm is not defined and it will exit with an error.
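In a script, one way to guard against unsorted input is to sort both files on the fly before handing them to comm. The following is a minimal sketch, assuming bash (for process substitution); comm_sorted is a hypothetical helper name, not something shipped with coreutils, and the demo files simply mirror the ones used above.

```shell
#!/usr/bin/env bash
# comm_sorted FILE1 FILE2 [COMM_OPTIONS...]
# Hypothetical helper (not part of coreutils): sorts both inputs on the
# fly so that comm always receives sorted data.
comm_sorted() {
    local f1=$1 f2=$2
    shift 2
    # Any remaining arguments are passed to comm as options.
    comm "$@" <(sort "$f1") <(sort "$f2")
}

# Demo data mirroring the article's files; file3 is deliberately unsorted.
tmp=$(mktemp -d)
printf '%s\n' a b c > "$tmp/file1"
printf '%s\n' d e b c > "$tmp/file3"

# sort -c only checks the order and exits non-zero if the input is unsorted.
if ! sort -c "$tmp/file3" 2>/dev/null; then
    echo "file3 is not sorted; sorting it on the fly"
fi

# Print the lines common to both files.
comm_sorted "$tmp/file1" "$tmp/file3" -12
```

Sorting unconditionally costs a little extra work on already sorted input, but it keeps the helper simple and guarantees that sort and comm see the same ordering.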
Using comm In Scripts
The columns output by comm are delimited by single TAB characters, so scripts can reasonably easily parse comm's output to glean the information they need. Sometimes you only need what's in one of the columns, though, and nobody wants to reach for cut or even awk without good cause. Thankfully, comm can be told to omit columns from its output entirely.
To display only the lines unique to file1, use -23 to exclude the second and third columns:
$ comm -23 file1 file2
a
To display only the lines unique to file2, use -13 to exclude the first and third columns:
$ comm -13 file1 file2
d
e
And finally, to display only the lines common to both files, include only the third column:
$ comm -12 file1 file2
b
c
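Put together, the three invocations above could be used in a script roughly as follows. This is only a sketch: the variable names and the "removed"/"added" labels are assumptions made for illustration, and the demo files are recreated in a temporary directory so the snippet is self-contained.

```shell
#!/bin/sh
# Recreate the article's sample files (both already sorted).
tmp=$(mktemp -d)
printf '%s\n' a b c > "$tmp/file1"
printf '%s\n' b c d e > "$tmp/file2"

# Capture each column separately by suppressing the other two.
# paste -s joins the resulting lines into a single space-separated line.
only_in_first=$(comm -23 "$tmp/file1" "$tmp/file2" | paste -s -d ' ' -)
only_in_second=$(comm -13 "$tmp/file1" "$tmp/file2" | paste -s -d ' ' -)
common_count=$(comm -12 "$tmp/file1" "$tmp/file2" | wc -l)

# Report only the non-empty differences.
if [ -n "$only_in_first" ]; then
    echo "removed: $only_in_first"
fi
if [ -n "$only_in_second" ]; then
    echo "added: $only_in_second"
fi
echo "unchanged lines: $((common_count))"
```

Because each call asks comm for exactly one column, no TAB parsing is needed and the results can be tested directly with [ -n ... ].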
comm is part of GNU coreutils and should be available out of the box on most Linux systems. More options are available, such as --total to calculate a summary of the number of lines in each column, or --zero-terminated (-z), which is useful when dealing with file names that can contain spaces (together with find -print0 and xargs -0, for example); be sure to check the comm(1) man page as well as the online documentation to get the full picture.
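As a rough illustration of the -z option, the sketch below lists files that exist in one directory but not in another. It assumes bash and the GNU versions of find, sort, comm and xargs; the directory names, file names and the list_files helper are hypothetical and exist only for the demo.

```shell
#!/usr/bin/env bash
# Hypothetical demo directories containing file names with spaces.
tmp=$(mktemp -d)
mkdir "$tmp/dir1" "$tmp/dir2"
touch "$tmp/dir1/a file.txt" "$tmp/dir1/shared.txt"
touch "$tmp/dir2/new file.txt" "$tmp/dir2/shared.txt"

# list_files DIR: print NUL-terminated, sorted paths relative to DIR.
# sort -z is needed so the sort order matches what comm -z expects.
list_files() {
    (cd "$1" && find . -type f -print0 | sort -z)
}

# -z makes comm read and write NUL-terminated items; -23 keeps only the
# entries unique to dir1. xargs -0 turns each item back into one argument,
# and -r (GNU) skips running printf when there is no input at all.
only_in_dir1=$(comm -z -23 <(list_files "$tmp/dir1") <(list_files "$tmp/dir2") |
    xargs -0 -r printf '%s\n')

echo "only in dir1: $only_in_dir1"
```

Keeping every stage NUL-terminated until the final printf means that unusual file names pass through the pipeline intact.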