33

I downloaded a lot of images into a directory.
The downloader renamed files which already existed.
I also renamed some of the files manually.

a.jpg
b.jpg
b(2).jpg
hello.jpg      <-- manually renamed `b(3).jpg`
c.jpg
c(2).jpg
world.jpg      <-- manually renamed `d.jpg`
d(2).jpg
d(3).jpg

How can I remove the duplicated ones? The result should be:

a.jpg
b.jpg
c.jpg
world.jpg

Note: the names don't matter. I just want unique files.

kev
  • 12,462
  • 13
  • 59
  • 72

15 Answers

66

fdupes is the tool of your choice. To find all duplicate files (by content, not by name) in the current directory:

fdupes -r .

To manually confirm deletion of duplicated files:

fdupes -r -d .

To automatically delete all copies but the first of each duplicated file (be warned: this actually deletes files, as requested):

fdupes -r -f . | grep -v '^$' | xargs rm -v

I'd recommend manually checking the files before deletion:

fdupes -rf . | grep -v '^$' > files
... # check files
xargs -a files rm -v
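If the file names contain spaces, the xargs calls above split them apart (see the comments below). With GNU xargs you can make the input newline-delimited instead, for example:

fdupes -rf . | grep -v '^$' | xargs -d '\n' rm -v

(or xargs -d '\n' -a files rm -v for the two-step variant). This still assumes no newlines inside file names.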
Jakob
  • 891
  • 7
  • 7
  • Works great, but fails if file names contain spaces. – Daniel Wolf Jun 23 '17 at 12:15
  • 2
    @DanielWolf try with xargs option `-d '\n'` – Jakob Jun 27 '17 at 12:13
  • 8
    Also, newer versions of fdupes have the built-in option to delete all but the first in a list of duplicate files: `fdupes -rdN .` where -r is recursive, -d is delete and -N is no-prompt – Rand May 15 '19 at 22:38
  • Thank you, This is outstanding because can detect more than 2 duplicates and allows you to select which one of the dups you want to preserve (or all of them). – Smeterlink Aug 04 '19 at 16:39
  • xargs: unterminated quote – Alexey Sh. Jan 03 '20 at 05:53
  • [jdupes](https://github.com/jbruchon/jdupes) is a fork of fdupes that is much faster and has more features, including exclusion based on file extension, substring, or modify date, or deduplication by hardlink/symlink/copy-on-write dedupe. Most major Linux distributions have it in their repositories. – Jody Bruchon Jul 10 '20 at 19:57
  • @DanielWolf [fclones](https://github.com/pkolaczk/fclones) can handle spaces, control characters and unicode/non-unicode characters in file paths properly. – Piotr Kolaczkowski May 13 '22 at 16:22
33

bash 4.x

#!/bin/bash
declare -A arr
shopt -s globstar

for file in **; do
  [[ -f "$file" ]] || continue
   
  read cksm _ < <(md5sum "$file")
  if ((arr[$cksm]++)); then 
    echo "rm $file"
  fi
done

This is both recursive and handles any file name. The downside is that it requires bash version 4.x for associative arrays and recursive searching (globstar). Remove the echo if you like the results.

gawk version

gawk '
  {
    cmd="md5sum " q FILENAME q
    cmd | getline cksm
    close(cmd)
    sub(/ .*$/,"",cksm)
    if(a[cksm]++){
      cmd="echo rm " q FILENAME q
      system(cmd)
      close(cmd)
    }
    nextfile
  }' q='"' *

Note that this will still break on files that have double quotes in their name. There is no real way to get around that with awk. Remove the echo if you like the results.
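If file names may also contain double quotes or newlines, a NUL-delimited bash loop (rather than awk) sidesteps the quoting problem entirely. A minimal sketch, assuming bash 4+ and GNU find, that only prints the rm commands:

#!/bin/bash
# Walk the tree NUL-delimited so any file name is safe,
# hash each file via stdin so the name never appears in md5sum's output,
# and print an rm command for every copy after the first.
declare -A seen
while IFS= read -r -d '' file; do
  read -r cksm _ < <(md5sum < "$file")
  if ((seen[$cksm]++)); then
    echo rm -v "$file"    # remove the echo to actually delete
  fi
done < <(find . -type f -print0)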

SiegeX
  • 2,359
  • 3
  • 17
  • 21
  • fine, the bash version worked for me, but in my test, with 2 similar folders, it deleted half of duplicates in one folder, and half in the other. why. i would expect deletion of everyone (duplicated) of one folder. – Ferroao Dec 05 '17 at 17:58
  • @Ferroao Perhaps they were not exact duplicates. If just one bit is off the md5 hash that my script is using to determine duplicity would be completely different. You could add an `echo cksm` just after the line starting with `read` if you want to see each file’s hash. – SiegeX Dec 05 '17 at 20:38
  • no, all "duplicates" (copies) were removed, remaining 1 version, let's say the original. half copies were deleted from one folder, and the other half from the other folder (100% deletion of copies). my 100% is for copies in excess, not of the totality – Ferroao Dec 05 '17 at 22:40
  • @Ferroao I see. In that case it seems when bash does its recursive path expansion via `**`, it orders the list in such a way that the two folders are interleaved rather than all of folder 1 then all of folder 2. The script will always leave the first ‘original’ it hits as it iterates through the list. You can `echo $file` before the `read` line to see if this is true. – SiegeX Dec 06 '17 at 00:49
  • Don't reinvent the wheel. This is a kind of dangerous operation and you really don't want to use a program written by someone while browsing stackexchange during a boring meeting. – Nobody Mar 05 '20 at 11:57
  • @Nobody Perhaps you could enlighten us regarding the danger? – Kevin Whitefoot Jun 08 '21 at 18:39
2

I recommend fclones.

Fclones is a modern duplicate file finder and remover written in Rust, available on most Linux distros and macOS.

Notable features:

  • supports spaces, non-ASCII and control characters in file paths
  • can search multiple directory trees
  • respects .gitignore files
  • safe: lets you inspect the list of duplicates manually before performing any action on them
  • offers plenty of options for filtering / selecting files to remove or preserve
  • very fast

To search for duplicates in the current directory simply run:

fclones group . >dupes.txt

Then you can inspect the dupes.txt file to check if it found the right duplicates (you can also modify that list to your liking).

Finally remove/link/move the duplicate files with one of:

fclones remove <dupes.txt
fclones link <dupes.txt
fclones move target <dupes.txt
fclones dedupe <dupes.txt   # copy-on-write deduplication on some filesystems

Example:

pkolaczk@p5520:~/Temp$ mkdir files
pkolaczk@p5520:~/Temp$ echo foo >files/foo1.txt
pkolaczk@p5520:~/Temp$ echo foo >files/foo2.txt
pkolaczk@p5520:~/Temp$ echo foo >files/foo3.txt

pkolaczk@p5520:~/Temp$ fclones group files >dupes.txt
[2022-05-13 18:48:25.608] fclones:  info: Started grouping
[2022-05-13 18:48:25.613] fclones:  info: Scanned 4 file entries
[2022-05-13 18:48:25.613] fclones:  info: Found 3 (12 B) files matching selection criteria
[2022-05-13 18:48:25.614] fclones:  info: Found 2 (8 B) candidates after grouping by size
[2022-05-13 18:48:25.614] fclones:  info: Found 2 (8 B) candidates after grouping by paths and file identifiers
[2022-05-13 18:48:25.619] fclones:  info: Found 2 (8 B) candidates after grouping by prefix
[2022-05-13 18:48:25.620] fclones:  info: Found 2 (8 B) candidates after grouping by suffix
[2022-05-13 18:48:25.620] fclones:  info: Found 2 (8 B) redundant files

pkolaczk@p5520:~/Temp$ cat dupes.txt
# Report by fclones 0.24.0
# Timestamp: 2022-05-13 18:48:25.621 +0200
# Command: fclones group files
# Base dir: /home/pkolaczk/Temp
# Total: 12 B (12 B) in 3 files in 1 groups
# Redundant: 8 B (8 B) in 2 files
# Missing: 0 B (0 B) in 0 files
6109f093b3fd5eb1060989c990d1226f, 4 B (4 B) * 3:
    /home/pkolaczk/Temp/files/foo1.txt
    /home/pkolaczk/Temp/files/foo2.txt
    /home/pkolaczk/Temp/files/foo3.txt

pkolaczk@p5520:~/Temp$ fclones remove <dupes.txt
[2022-05-13 18:48:41.002] fclones:  info: Started deduplicating
[2022-05-13 18:48:41.003] fclones:  info: Processed 2 files and reclaimed 8 B space

pkolaczk@p5520:~/Temp$ ls files
foo1.txt
2

You can try FSLint. It has both command line and GUI interface.
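For the command-line side, FSlint's duplicate finder is the findup script; on Debian/Ubuntu it is typically installed outside of $PATH (the path below is an assumption, adjust for your distro):

# list duplicate files under a directory (reporting only, no deletion)
/usr/share/fslint/fslint/findup /path/to/images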

Bibhas
  • 2,564
  • 2
  • 17
  • 21
1

A more concise way to remove duplicated files (just one line):

young@ubuntu-16:~/test$ md5sum `find ./ -type f` | sort -k1 | uniq -w32 -d | xargs rm -fv
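This breaks on file names containing spaces, because the backticks word-split the find output. A NUL-delimited sketch (assuming GNU find, xargs and coreutils) that keeps the same behaviour:

find . -type f -print0 | xargs -0 md5sum | sort | uniq -w32 -d | cut -c35- | xargs -d '\n' rm -fv

Here cut -c35- strips the 32-character hash plus the two-character separator, so only the file name reaches rm. Like the original, it removes only one copy per duplicate group per run.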

find_same_size.sh

#!/usr/bin/env bash
#set -x
# This small script finds files that have the same size.
find_same_size(){
    if [[ -z $1 || ! -d $1 ]]; then
        echo "Usage: $0 directory_name"
        exit 1
    else
        dir_name=$1
        echo "current directory is $1"

        for i in $(find "$dir_name" -type f); do
            ls -fl "$i"
        done | awk '{ f=""
            if(NF>9) for(i=9;i<=NF;i++) f=f?f" "$i:$i; else f=$9;
            if(a[$5]){ a[$5]=a[$5]"\n"f; b[$5]++ } else a[$5]=f
        } END{ for(x in b) print a[x] }' | xargs stat -c "%s  %n"   # just list the files
    fi
}

find_same_size "$1"


Then filter its output down to the file names, hash them, and remove the duplicates:

young@ubuntu-16:~/test$ bash find_same_size.sh tttt/ | awk '{ if($1 !~ /^([[:alpha:]])+/) print $2}' | xargs md5sum | uniq -w32 -d | xargs rm -vf
niceguy oh
  • 11
  • 3
1

Here are one-liners (based on the answer from Prashant Lakhera).

Preview:

find . -type f | xargs -I {} md5sum "{}" | sort -k1 | uniq -w32 -d | cut -d" " -f2- | xargs -I {} echo "{}"

Remove:

find . -type f | xargs -I {} md5sum "{}" | sort -k1 | uniq -w32 -d | cut -d" " -f2- | xargs -I {} rm -f "{}"

And here is a slightly more complex version that tries to preserve files that reside deeper in the directory tree and have longer filenames, on the assumption that those files have been sorted manually.

find . -type f | xargs -I {} md5sum "{}" | awk '{print gsub("/","/",$0), length, $0}' | sort -k3,3 -k2,2n -k1,1n | cut -d" " -f3- | uniq -w32 -d | cut -d" " -f2- | xargs -I {} rm -f "{}"

Drawback: if you have more than 2 copies of a file, you have to run the command multiple times.
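A single pass that keeps the first copy found and removes every further one can be had by replacing uniq with awk, which counts each hash itself. A sketch along the same lines (assuming GNU tools), with echo as a safety net:

find . -type f -print0 | xargs -0 md5sum | awk 'seen[$1]++' | cut -c35- | xargs -d '\n' echo rm -f

awk 'seen[$1]++' prints every line whose hash has already appeared, so all copies after the first (in find order) are selected; drop the echo to actually delete them.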

1

How do we test whether two files have the same content?

if diff "$file1" "$file2" > /dev/null; then
    ...

How can we get the list of files in a directory?

files="$( find ${files_dir} -type f )"

We can take any 2 files from that list and check whether their names are different but their content is the same.

#!/bin/bash
# removeDuplicates.sh

files_dir=$1
if [[ -z "$files_dir" ]]; then
    echo "Error: files dir is undefined"
    exit 1
fi

files="$( find ${files_dir} -type f )"
for file1 in $files; do
    for file2 in $files; do
        # echo "checking $file1 and $file2"
        if [[ "$file1" != "$file2" && -e "$file1" && -e "$file2" ]]; then
            if diff "$file1" "$file2" > /dev/null; then
                echo "$file1 and $file2 are duplicates"
                rm -v "$file2"
            fi
        fi
    done
done

For example, suppose we have this directory:

$> ls .tmp -1
all(2).txt
all.txt
file
text
text(2)

So there are only 3 unique files.

Let's run that script:

$> ./removeDuplicates.sh .tmp/
.tmp/text(2) and .tmp/text are duplicates
removed `.tmp/text'
.tmp/all.txt and .tmp/all(2).txt are duplicates
removed `.tmp/all(2).txt'

And we are left with only 3 files.

$> ls .tmp/ -1
all.txt
file
text(2)
1

I wrote this tiny script to delete duplicated files:

https://gist.github.com/crodas/d16a16c2474602ad725b

Basically it uses a temporary file (/tmp/list.txt) to create a map of files and their hashes. Later I use that file and the magic of Unix pipes to do the rest.

The script won't delete anything but will print the commands to delete files.

mfilter.sh ./dir | bash
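The gist is the authoritative version; the general idea looks roughly like this (only a sketch, not the linked script; the list file name and pipeline steps are made up for illustration):

#!/bin/bash
# Build a temporary map of "hash  path", then print (not run) an rm
# command for every path whose hash has already been seen once.
LIST=$(mktemp /tmp/list.XXXXXX)
find "${1:-.}" -type f -print0 | xargs -0 md5sum > "$LIST"
awk 'seen[$1]++' "$LIST" | cut -c35- |
  while IFS= read -r f; do printf 'rm -v %q\n' "$f"; done
rm -f "$LIST"

Piping its output to bash then performs the actual deletion, as in the usage line above.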

Hope it helps

crodas
  • 111
  • 2
0

I found an easier way to perform the same task:

for i in `md5sum * | sort -k1 | uniq -w32 -d | awk '{print $2}'`; do
  rm -f "$i"
done
0

Most, and possibly all, of the remaining answers are terribly inefficient: they compute the checksum of each and every file in the directory to process.

A potentially orders-of-magnitude faster approach is to first get the size of each file, which is almost immediate (ls or stat), and then compute and compare checksums only for the files having a non-unique size, keeping one instance of each set of files that share both their size and checksum.

Note that even though hash collisions can theoretically occur, there are not enough jpeg files on the entire Internet for a hash collision to have a reasonable chance of happening. Two files sharing both their size and checksum are identical for all intents and purposes.

See: How reliable are SHA1 sum and MD5 sums on very large files?
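A rough sketch of that idea in shell (assuming GNU find, awk, xargs and coreutils, and file names without newlines or tabs):

# 1. List size<TAB>path and keep only paths whose size occurs more than once.
# 2. Checksum just those candidates and print every copy after the first.
find . -type f -printf '%s\t%p\n' |
  awk -F'\t' '{ paths[$1] = ($1 in paths) ? paths[$1] "\n" $2 : $2; n[$1]++ }
              END { for (s in n) if (n[s] > 1) print paths[s] }' |
  xargs -d '\n' md5sum |
  awk 'seen[$1]++' |
  cut -c35- |
  xargs -d '\n' echo rm -v    # drop the echo to actually delete

Only files that share their size with at least one other file ever get hashed, which is where the speedup comes from.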

jlliagre
  • 13,899
  • 4
  • 31
  • 48
  • Yes, and you should not take a decision on the checksum only. It denotes a very high probability of content equality, but no more. So, if there is only two files of same size to be compared, it is useless to compute the checksum because in all cases you will need to compare them bit by bit to certify the equality. – Laurent Simon Jan 14 '21 at 16:19
  • @LaurentSimon Yes, that's what my answer is telling. – jlliagre Jan 14 '21 at 16:26
  • @jliagre, no, your answer is saying that a checksum is enough to identify duplicates. If you can't tolerate false positives, you have to compare bit by bit, like Laurent pointed out. – Dominykas Mostauskis Feb 27 '21 at 17:12
  • @DominykasMostauskis You are right, I'm telling a checksum is enough to identify duplicates. Comparing bit by bit can be done but is mostly pointless, unless there are hundred of billions of files in that directory. Answer updated. – jlliagre Feb 28 '21 at 00:01
  • @LaurentSimon Certifying the files are identical by comparing them bit by bit can of course be done but is essentially pointless. An accidental hash collision has certainly much less chance to occur than an accidental bit corruption anyway. – jlliagre Feb 28 '21 at 12:46
  • @jliagre You are perfectly right: "much less chance". But, there is still a chance. Good luck... – Laurent Simon Mar 01 '21 at 18:10
  • @LaurentSimon Yes, and there is a chance for **1, 2, 3, 4** and **5** to be drawn twice in a raw in the Euromillions lottery. I wouldn't hold my breath though... – jlliagre Mar 01 '21 at 20:55
0

This is not what you are asking, but I think someone might find it useful when the checksums are not the same, but the name is similar (with a suffix in parentheses). This script removes files with a "(digit)" suffix if a file with the same name minus that suffix exists.

#! /bin/bash
# Warning: globstar excludes hidden directories.
# Turn on recursive globbing (in this script) or exit if the option is not supported:
shopt -s globstar || exit
for f in **
do
    extension="${f##*.}"
    # get only files with a parentheses suffix
    FILEWITHPAR=$( echo "${f%.*}.$extension" | grep -o -P "(.*\([0-9]\)\..*)" )
    # print file to be possibly deleted
    if [ -z "$FILEWITHPAR" ]; then
        :
    else
        echo "$FILEWITHPAR ident"
        # identify if a similar file without the suffix exists
        FILENOPAR=$(echo "$FILEWITHPAR" | sed -e 's/^\(.*\)([0-9])\(.*\).*/\1\2/')
        echo "$FILENOPAR exists?"
        if [ -f "$FILENOPAR" ]; then
            # delete the file with the suffix in parentheses
            echo "$FILEWITHPAR to be deleted"
            rm -Rf "$FILEWITHPAR"
        else
            echo "no"
        fi
    fi
done
Ferroao
  • 160
  • 12
0

There is a beautiful solution on https://stackoverflow.com/questions/57736996/how-to-remove-duplicate-files-in-linux/57737192#57737192:

md5sum prime-* | awk 'n[$1]++' | cut -d " " -f 3- | xargs rm

Another very clear and nice solution is mentioned on https://unix.stackexchange.com/questions/192701/how-to-remove-duplicate-files-using-bash:

md5sum * | sort -k1 | uniq -w 32 -d
NicolasBourbaki
  • 123
  • 1
  • 1
  • 4
0

Here is an alternate version which runs on a Mac; in this example the filter is set to *.png.

Preview

md5sum *.png | sort | awk 'BEGIN { val=$1} { if ($1 == val) print $2; val=$1}'

Delete

 md5sum *.png | sort | awk 'BEGIN { val=$1} { if ($1 == val) print $2; val=$1}' | xargs -I X rm -f "X"

Examples of creating aliases for csh/tcsh

alias lsdup "md5sum \!:1 | sort | awk 'BEGIN { val="\$"1} { if ( "\$"1 == val ) print "\$"2 ; val="\$"1}'"
alias rmdup "md5sum \!:1 | sort | awk 'BEGIN { val="\$"1} { if ( "\$"1 == val ) print "\$"2 ; val="\$"1}'| xargs -I X rm -fv X"
LanDenLabs
  • 131
  • 2
0

Try cldup: https://github.com/jkzhang2019/cldup

I have almost 1 TB of images collected over ten years. I created this project to remove the duplicate files, and it works well.

With cldup you can create a database of all your files and identify any duplicate file with a single command.

The database can be maintained incrementally, so once the database has been created, you can update it in minutes.

  • btw, It's a bash script. – user1809385 Jun 15 '23 at 01:35
  • 1
    While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. - [From Review](/review/late-answers/1194723) – Rohit Gupta Jun 15 '23 at 02:20
-1

I found a small program that really simplifies this kind of task: fdupes.

Ricky Neff
  • 205
  • 2
  • 3