Finding Lines in One File But Not in the Other

Problem

You have two data files and you need to compare them and find lines that exist in one file but not in the other.

Solution

Sort the files and isolate the data of interest using cut or awk if necessary, and then use comm, diff, grep, or uniq depending on your needs.

comm is designed for just this type of problem:

$ cat left
record_01
record_02.left only
record_03
record_05.differ
record_06
record_07
record_08
record_09
record_10

$ cat right
record_01
record_02
record_04
record_05
record_06.differ
record_07
record_08
record_09.right only
record_10
# Only show lines in the left file
$ comm -23 left right
record_02.left only
record_03
record_05.differ
record_06
record_09

# Only show lines in the right file
$ comm -13 left right
record_02
record_04
record_05
record_06.differ
record_09.right only

# Only show lines common to both files
$ comm -12 left right
record_01
record_07
record_08
record_10

diff will quickly show you all the differences from both files, but its output is not terribly pretty and you may not need to know all the differences. GNU grep’s -y and -w options can be handy for readability, but you can get used to the regular output as well. Some systems (e.g., Solaris) may use sdiff instead of diff-y or have a separate binary such as bdiff to process very large files.

$ diff -y -W 60 left right
record_01                       record_01
record_02.left only          |  record_02
record_03                    |  record_04
record_05.differ             |  record_05
record_06                    |  record_06.differ
record_07                       record_07
record_08                       record_08
record_09                    |  record_09.right only
record_10                       record_10

$ diff -y -W 60 --suppress-common-lines left right
record_02.left only          |  record_02
record_03                    |  record_04
record_05.differ             |  record_05
record_06                    |  record_06.differ
record_09                    |  record_09.right only

$ diff left right
2,5c2,5
< record_02.left only
< record_03
< record_05.differ
< record_06
---
> record_02
> record_04
> record_05
> record_06.differ
8c8
< record_09
---
> record_09.right only

grep can show you when lines exist only in one file and not the other, and you can figure out which file if necessary. But since it’s doing regular expression matches, it will not be able to handle differences within the line unless you edit the file that becomes the pattern file, and it will also get very slow as the file sizes grow.

This example shows all the lines that exist in the file left but not in the file right:

$ grep -vf right left
record_03
record_06
record_09

Note that only “record_03” is really missing; the other two lines are simply different. If you need to detect such variations, you’ll need to use diff. If you need to ignore them, use cut or awk as necessary to isolate the parts you need into temporary files.

uniq -u can show you only lines that are unique in the files, but it will not tell you which file the line came from (if you need to know that, use one of the previous solutions). uniq -d will show you only lines that exist in both files:

$ sort right left | uniq -u
record_02
record_02.left only
record_03
record_04
record_05
record_05.differ
record_06
record_06.differ
record_09
record_09.right only

$ sort right left | uniq -d
record_01
record_07
record_08
record_10

Discussion

comm is your best choice if it’s available and you don’t need the power of diff.

You may need to sort and/or cut or awk into temporary files and work from those if you can’t disrupt the original files.

See Also

  • man cmp

  • man diff

  • man grep

  • man uniq

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.133.111.85