
I have an old backup of documents. In my current Documents directory, a lot of these files exist in different locations with different names. I'm trying to find a way to show which files exist in the backup but not in the Documents directory, preferably something nice and GUI-y so that I can easily get an overview of a lot of documents.

When I search for this question, a lot of people are looking for ways to do the opposite. There are tools like FSlint and DupeGuru, but they show duplicates. There is no invert mode.

Redsandro

3 Answers


If you are willing to use the CLI, the following command should work for you:

diff --brief -r backup/ documents/

This will show you the files that are unique to each folder. If you want, you can also ignore differences in filename case with the --ignore-file-name-case option (an example follows the listing below).

As an example:

ron@ron:~/test$ ls backup/
file1  file2  file3  file4  file5
ron@ron:~/test$ ls documents/
file4  file5  file6  file7  file8
ron@ron:~/test$ diff backup/ documents/
Only in backup/: file1
Only in backup/: file2
Only in backup/: file3
Only in documents/: file6
Only in documents/: file7
Only in documents/: file8
ron@ron:~/test$ diff backup/ documents/ | grep "Only in backup"
Only in backup/: file1
Only in backup/: file2
Only in backup/: file3
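
If some of the file names differ only in case, you can combine the flags above; this should list the same differences while ignoring case in the names (using the folder names from this example):

diff --brief -r --ignore-file-name-case backup/ documents/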

In addition, if you only want a report of which files differ (and not the actual differences), you can use the --brief option, as in:

ron@ron:~/test$ cat backup/file5 
one
ron@ron:~/test$ cat documents/file5
ron@ron:~/test$ diff --brief backup/ documents/
Only in backup/: file1
Only in backup/: file2
Only in backup/: file3
Files backup/file5 and documents/file5 differ
Only in documents/: file6
Only in documents/: file7
Only in documents/: file8

There are several visual diff tools, such as meld, that can do the same thing. You can install meld from the universe repository with:

sudo apt-get install meld

and use its "Directory comparison" option. Select the folders you want to compare; after that you can view them side by side:

(Screenshot: Meld's directory comparison showing the two folders side by side.)
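
You can also start the same comparison directly from the terminal by passing both directories to meld (again using the folder names from the example above):

meld backup/ documents/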

fdupes is an excellent program for finding duplicate files, but it does not list the non-duplicate files, which is what you are looking for. However, we can list the files that are not in the fdupes output using a combination of find and grep.

The following example lists the files that are unique to backup (a variant without the temporary file is shown after the listing).

ron@ron:~$ tree backup/ documents/
backup/
├── crontab
├── dir1
│   └── du.txt
├── lo.txt
├── ls.txt
├── lu.txt
└── notes.txt
documents/
├── du.txt
├── lo-renamed.txt
├── ls.txt
└── lu.txt

1 directory, 10 files
ron@ron:~$ fdupes -r backup/ documents/ > dup.txt
ron@ron:~$ find backup/ -type f | grep -Fxvf dup.txt 
backup/crontab
backup/notes.txt
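
Here the grep options do the work: -F treats each line of dup.txt as a fixed string, -x only accepts whole-line matches, -v inverts the match, and -f reads the patterns from the file. If you prefer to avoid the temporary file, the same pipeline can be written with bash process substitution:

find backup/ -type f | grep -Fxvf <(fdupes -r backup/ documents/)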
Ron
    I love your elaborate response with pictures so you'll definitely get an upvote from me. And +1 for mentioning `meld`, an awesome tool especially for diffing source code. However, these tricks do not help in the case set out in the question; files existing in different locations with different names. There will need to be some hashing and matching involved. – Redsandro May 07 '16 at 11:06
    @Redsandro Please see my updated answer. Does it work for you? – Ron May 12 '16 at 17:06
  • Hi @Ron, thank you for trying to find a solution that works. However, this is still way too much trouble for syncing thousands of files. I ended up deleting all duplicates so I was left with only uniques. I posted the workflow in an answer below. – Redsandro May 14 '16 at 12:33

I had the same problem with a lot of very large files. There are plenty of solutions for finding duplicates but not for the inverted search, and I also did not want to search for content diffs because of the large amount of data.

So I wrote this Python script to search for "isolated-files":

isolated-files.py --source folder1 --target folder2

This will show any files (recursively) within folder2 that are not in folder1 (also recursively). It can also be used over SSH connections and with multiple folders.

see https://github.com/ezzra/isolated-files
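
If you prefer not to install anything extra, the comment thread above is right that matching renamed files means comparing content. One rough way to do that with standard tools is to hash every file once and then print the backup files whose hash never occurs under the documents tree. This is only a sketch of that idea (not ezra's script), assuming GNU coreutils and the backup/ and documents/ folder names from the earlier answers; known-hashes.txt is just a scratch file:

# collect every content hash that already exists under documents/
find documents/ -type f -exec sha256sum {} + | awk '{print $1}' | sort -u > known-hashes.txt
# print backup files whose content hash is not in that list
find backup/ -type f -exec sha256sum {} + | grep -vFf known-hashes.txt | sed 's/^[0-9a-f]\{64\}  //'

Note that hashing reads every file, so this can take a while on very large trees.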

ezra
    I had a similar problem transferring root owned files over ssh when no root account exists. So I had to write a pair of scripts for that. `sudo scpto` on sending machine and `sudo scpti` on host to receive. – WinEunuuchs2Unix Jul 27 '20 at 17:23

I figured that the best workflow to merge old backups with thousands of files, archived under different directories with different names, is to use DupeGuru after all. It looks a lot like the duplicates tab in FSlint, but it has the important extra feature of letting you add sources as a 'reference'.

  1. Add your target directory (e.g. ~/Documents) as a reference.
    • A reference is read-only, and no files will be removed from it.
  2. Add your backup directory as normal.
  3. Find duplicates. Remove all duplicates that are found from the backup.
  4. You are left with only unique files in the backup directory. Use FreeFileSync or Meld to merge these, or merge manually.

If you have multiple old backup directories, it makes sense to merge the newest backup directory like this first, and then use it as a reference to clean its duplicates from the older backups before merging them into the main Documents directory. This saves a lot of work, since you don't have to keep removing unique files that you want to trash instead of merge from the backups.

Remember to make a fresh backup after you've destroyed all old backups in the process. :)

Redsandro