16

I let a data generation script run too long and now have 200,000+ files, which I need to whittle down to around 1000. From the Linux command line, is there an easy way to delete all but 1000 of these files, where the files retained have no dependence on filename or any other attribute?

Malcolm Regan
  • Did the process that created the files have a characteristic that related each file to the previous one? If so, then selecting randomly would be important to get a representative sample. If the process generated files that are random by nature, you could just delete everything after the first 1000. – fixer1234 Mar 31 '17 at 01:56

4 Answers

24

Delete all but 1000 random files in a directory

Code:

find /path/to/dir -type f -print0 | sort -zR | tail -zn +1001 | xargs -0 rm

Explanation:

  1. List all files in /path/to/dir with find;
    • -print0: use \0 (the null character) as the delimiter, so file paths containing spaces or newlines don't break the pipeline
  2. Shuffle the file list with sort;
    • -z: use \0 (null character) as delimiter, instead of \n (a newline)
    • -R: random order
  3. Strip the first 1000 entries from the randomized list with tail;
    • -z: treat the list as zero-delimited (same as with sort)
    • -n +1001: output entries starting from the 1001st (i.e. omit the first 1000)
  4. Remove the remaining files with xargs -0 rm;
    • -0: zero-delimited, again
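
Before running it for real, a dry-run sketch (assuming GNU xargs for the -r / --no-run-if-empty flag) simply replaces rm with echo, so nothing is deleted:

find /path/to/dir -type f -print0 | sort -zR | tail -zn +1001 | xargs -0r echo

If the printed list looks right, swap echo back for rm.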

Why it's better than quixotic's solution*:

  1. Works with filenames containing spaces/newlines.
  2. Doesn't try to create any directories (which might already exist).
  3. Doesn't move any files, doesn't even touch the 1000 "lucky files" besides listing them with find.
  4. Avoids missing a file in case the output of find doesn't end with \n (newline) for some reason.

* Credit to quixotic for | sort -R | head -1000, which gave me a starting point.

rld.
  • Running on CentOS 6 I was getting errors about invalid options. Luckily I am not concerned with spaces in file paths, so removing those options worked for me: `find . -type f | sort -R | tail -n +1001 | xargs rm` – brad May 31 '18 at 20:09
  • @brad Could you provide the error messages and your version of `find`? I'll try to improve my answer, just need some input to work with. – rld. May 31 '18 at 22:10
  • 4
    `tail: invalid option -- 'z'` the version of tail I have is 8.4 – brad Jun 01 '18 at 13:59
  • I would add --no-run-if-empty to xargs to avoid an error if there are no files (after running it twice, for example) – fraff May 21 '19 at 10:00
  • rm: missing operand – pceccon Oct 27 '21 at 18:07
1

Use a temporary directory, then find all your files, randomize the list with sort, and move the top 1000 of the list into the temporary directory. Delete the rest, then move the files back from the temporary directory.

$ mkdir ../tmp-dir
$ find . -type f | sort -R | head -1000 | xargs -I "I" mv I ../tmp-dir/
$ rm ./*
$ mv ../tmp-dir/* .

If xargs complains about line length, use a smaller number with head and repeat the command as needed (i.e., change -1000 to -500 and run it twice, or change to -200 and run it five times).

It will also fail to handle filenames that include spaces; as @rld's answer shows, you can use find's -print0 argument, the -z arguments to sort and head, and -0 with xargs to ensure proper filename handling.

Finally, if the tmp-dir already exists, you should substitute a directory name that doesn't exist.
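
A null-safe sketch of the same approach (untested; it assumes GNU find, sort, head and xargs, and uses mktemp -d so the temporary directory name can't clash with an existing one):

$ tmp=$(mktemp -d ../keep-XXXXXX)
$ find . -type f -print0 | sort -zR | head -zn 1000 | xargs -0 -I{} mv {} "$tmp"/
$ rm ./*
$ mv "$tmp"/* . && rmdir "$tmp"

As with the commands above, this assumes all the files sit directly in the current directory and none of their names start with a dot.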

quixotic
  • This will fail if any of the filenames listed by `find` include a space. – rld. Mar 08 '17 at 04:47
  • For the temp dir, check out: https://stackoverflow.com/questions/4632028/how-to-create-a-temporary-directory – Erk Dec 31 '21 at 11:14
0

For macOS users, the following command should do the trick.

find . -type f -print0 | tr '\0' '\n' | sort -R | tail -n +1001 | tr '\n' '\0' | xargs -0 rm

tr lets sort and tail, which lack the -z option on macOS, work on a newline-delimited list, and the second tr converts it back to null-delimited for xargs -0. This assumes no filename contains a newline.
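
To sanity-check the result before deleting anything (a suggestion layered on top of the answer, not part of it), the same pipeline can end in wc -l instead of rm, which prints how many files would be removed:

find . -type f -print0 | tr '\0' '\n' | sort -R | tail -n +1001 | wc -l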

-3

The easiest might be to rm -rf the directory, then re-run the data generation script, this time making sure it doesn't run for too long.
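
For example (a rough sketch only; the directory path and script name below are placeholders for whatever produced the data):

$ rm -rf /path/to/dir && mkdir /path/to/dir
$ ./generate_data.sh    # stop it once roughly 1000 files have been written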

Lars Poulsen