
I would like to whittle down a large database of files from the command line to N files, very similar to what this question does. The only difference is that most of my files are in sub-directories, so I was wondering whether there is a quick fix to my problem or whether it would require more in-depth action. Currently, my command looks like this (with (N+1) replaced by the appropriate number):

find . -type f | sort -R | tail -n +(N+1) | xargs rm

I originally thought this would work because find is recursive by nature, and then I tried adding the -r (recursive) flag to rm, because the output indicates that files are being randomly selected but cannot be found for deletion. Any ideas?
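For illustration, here is a minimal sketch (using a hypothetical test directory) that reproduces this symptom: xargs splits its newline-separated input on whitespace, so any path containing a space is broken into pieces that rm cannot find:

# Hypothetical reproduction of the "can't find them to delete" symptom
mkdir -p /tmp/demo/sub && touch "/tmp/demo/sub/my file.txt"
cd /tmp/demo && find . -type f | xargs rm
# rm: cannot remove './sub/my': No such file or directory
# rm: cannot remove 'file.txt': No such file or directory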

EDIT: My new command looks like this:

find . -type f -print0 | sort -R | tail -n +(N+1) | xargs -0 rm

and now I get the error `rm: missing operand`. Also, I am on a CentOS machine, so the -z flag is unavailable to me.
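A minimal sketch of why, assuming GNU tools: without -z, sort and tail treat the entire NUL-separated stream as a single line, so tail -n +2 (or any higher starting point) emits nothing, and xargs then invokes rm with no arguments:

# The three NUL-terminated entries look like one "line" to sort and tail
printf 'a\0b\0c\0' | sort | tail -n +2 | xargs -0 rm
# rm: missing operand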

EDIT #2 This command runs:

find . -type f -print0 | sort -R | tail -n +(N+1) | xargs -0 -r rm

but when I execute find . -type f | wc -l to get the number of files in the directory, the count (which should be N if the command worked correctly) has not changed from the starting amount of files.
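Following the sketch above, the -r (--no-run-if-empty) flag merely stops xargs from invoking rm at all when its input is empty, which silences the error without deleting anything:

printf 'a\0b\0c\0' | sort | tail -n +2 | xargs -0 -r rm
# exits quietly: rm is never invoked, so no files are removed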

Alerra
  • Note on #2: neither plain `sort` nor plain `tail` works with null-terminated strings the way you want. – Kamil Maciorowski Jul 11 '18 at 19:58
  • This is why I wish I could use the `-z` flag, as that is what the internet says to do. Unfortunately CentOS doesn't have this flag. Do you know if there is a way around this? – Alerra Jul 11 '18 at 20:04
  • If you really need `-print0`, you should clearly state it because (in my opinion) this is the only thing that makes your question not a duplicate of the linked one. Then you should also explicitly mention the inability to use `-z` (with `sort`? `tail`? both?). – Kamil Maciorowski Jul 11 '18 at 20:19
  • @KamilMaciorowski I am not really sure whether I actually need `-print0`, but many of the related solutions I have seen use it. In the question I originally linked, I saw solutions both with and without it, so I have tried it both ways. – Alerra Jul 12 '18 at 11:18
  • In what way does your original command fail? Any error message(s)? – Kamil Maciorowski Jul 12 '18 at 11:26
  • The original does not yield any error messages from the command itself, but it does fail: when it tries to delete the files, rm says "No such file or directory" for every single one of the randomly selected files. – Alerra Jul 12 '18 at 11:55

2 Answers


If you need to use find … -print0 and you cannot use -z with sort and/or tail, there is a possible, yet cumbersome workaround (substitute (N+1) as usual):

find . -type f -printf "%i\n" | sort | uniq | sort -R | tail -n +(N+1) |
   while read i; do
      find . -type f -inum "$i" -delete
   done

The dirty trick is that we use inode numbers instead of paths.

The inner find removes all files with the given inode number under the current directory, so if some files are hardlinked to one another, then you will either lose them all or keep them all.
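A quick sketch with hypothetical file names (assuming GNU stat), showing that hardlinked names share one inode, which is what the inner find matches on:

touch file1 && ln file1 file2        # two names, one inode
ls -i file1 file2                    # prints the same inode number twice
find . -inum "$(stat -c %i file1)" -delete   # deletes file1 AND file2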

The preliminary sort | uniq is there to avoid a mishap in which you lose too much because hardlinks produce duplicate inode numbers. You may end up with more than N filenames, but they will point to up to N distinct inodes in total.

In case your find doesn't understand -delete, use -exec rm {} +.
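For instance, with a hypothetical N = 1000 and a find without -delete, the whole pipeline would read:

find . -type f -printf "%i\n" | sort | uniq | sort -R | tail -n +1001 |
   while read i; do
      find . -type f -inum "$i" -exec rm {} +
   done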

Kamil Maciorowski
  • Just tried this (had to modify slightly to run directly from command line and not as a script), and it worked perfectly. Thank You! – Alerra Jul 12 '18 at 13:19

I did this on OS X like so:

find . -type f -print | sort | uniq | sort --random-sort | tail -n +1001 | xargs rm -f

Where my N was 1000, hence tail -n +1001, following the (N+1) convention above. You can then double-check that the right number of files remains with find . -type f | wc -l. See also https://stackoverflow.com/a/20307392/630752
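Note that this variant feeds newline-separated paths to xargs, so it assumes file names without spaces or newlines. A cautious dry run (the same pipeline, but echoing instead of deleting) lets you inspect the candidates first:

find . -type f -print | sort | uniq | sort --random-sort | tail -n +1001 | xargs echo rm -f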

Harry Moreno