Suppose we have a file /a_long_path_1/foo.doc of size 12345 bytes (for example), and we would like to find all copies of this file in the directories /a_long_path_2 and /a_long_path_3, including all their subdirectories recursively. The base names of the copies may differ from foo (though the extension .doc is likely to stay the same), and the creation/modification dates may differ, but the contents of the duplicates should be identical to those of foo.doc.

If I issue find /a_long_path_2 /a_long_path_3 -size 12345c -iname \*.doc, the list I get is too large to check manually via diff. Automation is needed. Additional info that might make automation hard: some directory names in the output of this find … command contain spaces.

To be clear: I do NOT wish to find all duplicates of all files on the file system, not even as an intermediate step (such a list would be huge anyway). I only want all duplicates of one particular file.

1 Answer

> If I issue find /a_long_path_2 /a_long_path_3 -size 12345c -iname \*.doc, the list I get is too large to check manually via diff. Automation is needed.

Add -exec cmp -s /a_long_path_1/foo.doc {} \; -print:

find /a_long_path_2 /a_long_path_3 \
   -type f \
   -size 12345c \
   -iname \*.doc \
   -exec cmp -s /a_long_path_1/foo.doc {} \; \
   -print

This works because in find, -exec is also a test: it succeeds iff the invoked tool returns exit status 0. cmp -s silently returns exit status 0 iff the two given files are identical. -print is placed after -exec, so it is reached only for files that passed the comparison.

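-iname \*.doc can speed things up, but in general it may make you miss duplicates whose names do not end in .doc. -type f and -size 12345c are safe preliminary tests: an identical copy is necessarily a regular file of the same size, so they cannot exclude a real duplicate.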
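If you need this for more than one reference file, the same idea can be wrapped in a small function. This is only a sketch: the name find_copies is made up for illustration, and it assumes a POSIX shell with wc, tr, find and cmp available. It reads the size from the reference file instead of hard-coding 12345, and it leaves out -iname so that renamed copies with a different extension are not missed.

```shell
#!/bin/sh
# find_copies REFERENCE DIR...
# Hypothetical helper (the name is an assumption, not part of any tool):
# prints every regular file under the given directories whose content
# is byte-for-byte identical to REFERENCE.
find_copies() {
    ref=$1
    shift
    # Size of the reference file in bytes; tr removes the padding
    # that some wc implementations put in front of the number.
    size=$(wc -c < "$ref" | tr -d ' ')
    # -type f and -size are cheap tests that run first;
    # cmp -s is only started for files that survive them.
    find "$@" -type f -size "${size}c" -exec cmp -s "$ref" {} \; -print
}
```

Usage would look like find_copies /a_long_path_1/foo.doc /a_long_path_2 /a_long_path_3. Note that if the reference file itself lives under one of the searched directories, it will be listed too.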

Kamil Maciorowski
  • What's the purpose of the `{}`? – Tom Jan 02 '23 at 01:06
  • @tom find replaces it with the current file name. – Thorbjørn Ravn Andersen Jan 02 '23 at 03:13
  • @Tom: slightly more specifically: `find` will replace `{}` with one or more file names (exactly one if the `-exec` command ends with a `;`, which has to be escaped from the shell; potentially multiple if it ends with `+`). The `find` command here will execute `cmp` once per file that has a size of 12345 bytes and a name ending in ".doc" (case insensitive). Compare `find . -type f -exec md5sum {} +`, which will execute `md5sum` once with potentially multiple files (e.g. `md5sum a.txt b.doc c.jpg`). Conveniently, `find` handles all manner of special characters in file names properly, so you don't have to. – minnmass Jan 02 '23 at 04:29
  • @tom Link added. – Kamil Maciorowski Jan 02 '23 at 04:58
  • Thx! I'm surprised that `{}` works even if the full pathnames of the found files contain spaces. Any hunch why? –  Mar 05 '23 at 18:03
  • @AlbertNash Spaces, tabs and newline characters in pathnames are problematic if you allow a shell to interpret them as entities separating words. [This usually happens if you don't quote right](https://unix.stackexchange.com/q/131766). Newline characters are also problematic because if you treat pathnames as text and parse them with text processing tools, the tools will treat newlines as terminators of separate entries. Here `{}` is handled by `find` itself. While calling `cmp`, `find` expands `{}` to exactly one argument in the array of arguments passed to `cmp`. This is robust. – Kamil Maciorowski Mar 05 '23 at 18:13
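
To see the point from the comments for yourself, here is a small experiment, offered as a sketch rather than anything authoritative. The two function names are invented for this illustration. Each one makes `find` start a tiny `sh` that only prints how many arguments it received, so you can observe how `{}` is expanded with `\;` and with `+`.

```shell
#!/bin/sh
# Both helper names are hypothetical, chosen only for this demo.

# Runs sh once per found file; {} becomes exactly one argument,
# no matter how many spaces the pathname contains.
args_per_call() {
    find "$1" -type f -exec sh -c 'echo "$#"' sh {} \;
}

# Runs sh with as many pathnames as fit into one invocation;
# each pathname is still a single, unsplit argument.
args_batched() {
    find "$1" -type f -exec sh -c 'echo "$#"' sh {} +
}
```

For a directory that contains, say, `dir with spaces/a b.doc` and `dir with spaces/c d.doc`, `args_per_call` prints 1 twice and `args_batched` prints 2. If the spaces were splitting words, the counts would be higher.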