17

I've inherited a research cluster with ~40TB of data across three filesystems. The data stretches back almost 15 years, and there are most likely plenty of duplicates, as researchers copy each other's data for various reasons and then just hang on to the copies.

I know about de-duping tools like fdupes and rmlint. I'm trying to find one that will work on such a large dataset. I don't care if it takes weeks (or maybe even months) to crawl all the data - I'll probably throttle it anyway to go easy on the filesystems. But I need to find a tool that's either somehow super efficient with RAM, or can store all the intermediate data it needs in files rather than RAM. I'm assuming that my 64GB of RAM will be exhausted if I crawl through all this data as one set.

I'm experimenting with fdupes now on a 900GB tree. It's 25% of the way through, and RAM usage has been slowly creeping up the whole time; it's now at 700MB.

Or, is there a way to direct a process to use disk-mapped RAM so there's much more available and it doesn't use system RAM?

I'm running CentOS 6.

Kev
  • The filesystems are XFS, in case that's relevant. That is, I know it's not an fs with de-duping capabilities like ZFS. – Michael Stauffer Aug 22 '14 at 20:09
  • Why are you worried about RAM in the first place? The OS has its own memory management algorithms, and the fact that RAM usage is "creeping up" does not mean it will eventually eat up all your RAM. I am pretty sure it won't happen. – Art Gertner Aug 23 '14 at 00:23
  • 1
  • I don't know how the dedicated tools work, but you could calculate a hash for each file and log it along with the file path, then sort by hash and deduplicate (a rough Python sketch of this appears below the comments). It should be doable with a simple Python script or maybe even in Bash. RAM usage should be minimal except for the sorting step, but I guess you could use some kind of modified mergesort to keep it reasonably low. – gronostaj Aug 24 '14 at 17:51
  • 1
  • Yes, dedicated tools calculate hashes, but they first do things like look at file size and hash only the start of files, to limit the number of full hashes that need calculating. – Michael Stauffer Aug 25 '14 at 17:10
  • As for RAM, I was worried about slowing down the fileserver - see my comment below on the answer. – Michael Stauffer Aug 25 '14 at 17:10
  • If you're up for some perl scripting, you might find [this article](http://perltricks.com/article/111/2014/8/29/Facing-the-music-with-Perl) by brian d foy about de-duplicating his music collection interesting. – Kenster Sep 03 '14 at 02:17
  • Your question has insufficient information: what is the structure of the data, i.e. what is the data granularity that you have to base decisions on? Are we talking about removing duplicate files, removing duplicate lines in/across ASCII files, removing duplicate records in a DBMS, etc? The answers will be very much determined by that information. Please edit your question and while you're doing that, edit in your other comment answers as well. – Jan Doggen Sep 09 '14 at 13:52
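
A minimal sketch of the hash-and-log idea from the comment above (my own illustration, not code from any of the posters), written for the Python 2.6 that ships with CentOS 6. It hashes every file in full and streams one "md5 &lt;tab&gt; path" line per file, keeping nothing in memory between files, so RAM stays flat no matter how big the tree is:

```python
#!/usr/bin/env python
# Hypothetical sketch: print "md5<TAB>path" for every regular file under
# a root directory. Nothing is kept in memory between files.
import hashlib
import os
import sys

def md5_of(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()

def main(root):
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                print('%s\t%s' % (md5_of(path), path))
            except (IOError, OSError) as e:
                sys.stderr.write('skipping %s: %s\n' % (path, e))

if __name__ == '__main__':
    main(sys.argv[1] if len(sys.argv) > 1 else '.')
```

Sorting the output is the only step that touches a lot of data at once, and GNU sort already does an external merge sort with temporary files, so something like `python hash_tree.py /data | sort > hashes.txt` followed by `uniq -w32 --all-repeated=separate hashes.txt` keeps memory use bounded (the script name `hash_tree.py` is just whatever you call it; the uniq options are the same ones used in the answer below).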

3 Answers

4

Or, is there a way to direct a process to use disk-mapped RAM so there's much more available and it doesn't use system RAM?

Yes, it's called swap space. You probably already have some. If you're worried about running out of RAM, then increasing it is a good place to start. It works automatically, though, so there is no need to do anything special.

I would not worry about fdupes. Try it; it should work without problems.

krowe
  • I was thinking that relying on swap would slow the whole system down - it's a busy fileserver. But maybe not enough to worry about? I could use ulimit to prevent the process from using more than system RAM in any case, I suppose, as a failsafe. But it seems like krowe and smc don't think fdupes would use that much RAM anyway, so I should just give it a try. – Michael Stauffer Aug 25 '14 at 17:09
1

Finding duplicates based on a hash key works well and is very fast. The one-liner below first lists the sizes of all non-empty files, keeps only the sizes that occur more than once, MD5-hashes just the files with those sizes, and finally prints groups of files whose checksums match:

find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate
FiveO
kumar
0

Write a quick app to walk the trees, either pushing (hash, mtime)=>filepath into a dictionary or marking the file for deletion if the entry already exists. The hash will just be an MD5 calculated over the first N bytes. You might do a couple of different passes, with a hash over a small N and then another with a hash over a large N.

You could probably do this in less than twenty or thirty lines of Python (using os.walk()).
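
Taking the answer at its word, here is a rough sketch of that dictionary approach (my own reading, assuming Python 2.6 on CentOS 6), with two deliberate deviations: it only prints candidate duplicates rather than marking them for deletion, and N (the number of bytes hashed) is an arbitrary 64 KiB:

```python
#!/usr/bin/env python
# Rough sketch of the dictionary approach described above: walk a tree and
# key each file by (MD5 of its first N bytes, mtime), printing a candidate
# duplicate whenever a key repeats.
import hashlib
import os
import sys

PARTIAL_BYTES = 64 * 1024  # "N" from the answer; an arbitrary choice, tune per pass

def partial_md5(path, n=PARTIAL_BYTES):
    """MD5 of the first n bytes of a file."""
    h = hashlib.md5()
    with open(path, 'rb') as f:
        h.update(f.read(n))
    return h.hexdigest()

def walk_and_flag(root):
    seen = {}  # (partial hash, mtime) -> first path encountered
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                key = (partial_md5(path), int(os.path.getmtime(path)))
            except (IOError, OSError) as e:
                sys.stderr.write('skipping %s: %s\n' % (path, e))
                continue
            if key in seen:
                # A partial hash only flags a *candidate*; compare the files
                # in full before deleting anything. Also, because mtime is
                # part of the key, identical files with different timestamps
                # will never collide here.
                print('candidate duplicate: %s == %s' % (path, seen[key]))
            else:
                seen[key] = path

if __name__ == '__main__':
    walk_and_flag(sys.argv[1] if len(sys.argv) > 1 else '.')
```

One caveat for the 40TB case: the dictionary holds an entry per file, so memory grows with the number of files rather than the number of bytes; for tens of millions of files that can still run into gigabytes, which is where the log-to-disk-and-sort variant sketched under the comments scales better.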

Dustin Oprea