12

Is there a way to limit the number of results returned by the find command on a unix system?

We are having performance issues due to an unusually large number of files in some directories.

I'm trying to do something like:

find /some/log -type f -name *.log -exec rm {} ; | limit 5000
blahdiblah
lemotdit

6 Answers

30

You could try something like find [...] |head -[NUMBER]. Once head has printed its requested number of lines and exited, find receives a SIGPIPE the next time it writes, so it doesn't continue its search.
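
For instance, with the paths from the question, a minimal sketch (this only prints the first 5000 matching pathnames; it doesn't delete anything yet):

# head exits after 5000 lines; find stops soon after on the resulting SIGPIPE.
find /some/log -type f -name '*.log' | head -n 5000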

Caveat: find outputs files in the order they appear in the directory structure. Most *NIX file systems do not order directories by entry name. This means the results are given in an unpredictable order. find |sort will put the list in ASCIIbetical order.

Another caveat: It's exceedingly rare to see in the wild, but *NIX filenames can contain newline characters. Many programs get around this by optionally using a NUL byte (\0) as the record separator.

Most *nix text-processing utilities have the option to use a NUL as a record separator instead of a newline. Some examples:

  • grep -z
  • xargs -0
  • find -print0
  • sort -z
  • head -z
  • perl -0

Putting this all together, to safely remove the first 5000 files, in alphabetical order:

find /some/log -type f -name '*.log' -print0 |
sort -z |
head -z -n 5000 |
xargs -0 rm

* Line breaks here are added for clarity. You could execute this all on one line provided you make sure to not delete the | (vertical pipe) separating the commands.

amphetamachine
  • 1
    I should have added that I'm also using the -exec arg. The head works if the command is something like ls, but it does not work in my case since I'm using rm, and that seems to take all the files in one execution. find /some/log -type f -name *.log -exec rm {} ; | head -5000 – lemotdit Feb 22 '10 at 21:11
  • 6
    Instead of using -exec rm, just pipe the results of find to head as suggested, and then pipe the result to xargs and rm. – Paul R Feb 22 '10 at 21:53
  • 1
    Good to know. I was assuming `find` would continue traversing the (potentially huge) file system when all you might want is a sample (my particular use case is to get one file from every directory). – Sridhar Sarnobat Aug 16 '16 at 18:14
  • When using `xargs` in a pipeline, remember that it will always be executed, even if `find` didn't find anything. Thus you should use `-f`. Further, if the file begins with `-`, you should use `--` so the filename isn't interpreted as an argument. – Mike Frysinger Jun 02 '23 at 14:26
  • @MikeFrysinger Those are both good points, but a) the `--` is not needed here since `find /some/log` will prefix filenames with `/some/log`, so there's no possibility of beginning with a `-`. Same as with `find` with no arguments; it'll prefix entries with `./` and b) the `-f` was not included because OP was calling rm(1) without it in their code example. – amphetamachine Aug 21 '23 at 16:12
7

It sounds like you're looking for xargs, but don't know it yet.

find /some/log/dir -type f -name "*.log" | xargs rm
blahdiblah
  • 5
    `-exec rm {} +` would do the same thing without the overhead. though you could add `head` to the pipe chain: `find [...] | head -5000 | xargs rm` – quack quixote Feb 22 '10 at 21:57
  • 11
    Note that just `find ...|xargs` is *dangerous*, as it will do funny/weird/disastrous things if some file name contains funny characters. Always use `find ... -print0 | xargs -0` (GNU extension, I believe). – sleske Feb 22 '10 at 22:58
  • 14
    Instead of `-exec rm...` or `xargs rm` you could use find's `-delete` flag. – Martin Hilton Feb 22 '10 at 23:19
  • The question is "Is there a way to limit the number of results returned by the find command on a unix system?" and your answer does nothing to limit the number of results. It kinda solves the underlying problem because `xargs` by default works more like `-exec … {} +` of `find` (as opposed to `-exec … \;` the OP tried to use), so significantly fewer `rm` processes are started. Still **it's not an answer to the explicit question**. – Kamil Maciorowski Jul 07 '22 at 16:58
  • _"Still it's not an answer to the explicit question."_ While you're correct, it appears to be the actual issue that OP is actually trying to address, and that matters more. – Mike Frysinger Jun 02 '23 at 14:15
1

If you have a very large number of files in your directories, or if pipes do not apply for some reason (for instance because xargs would be limited by the maximum number of arguments allowed by your system), one option is to use the exit status of one -exec command as a filter for the subsequent actions, something like:

rm -f /tmp/count ; find . -type f -exec bash -c 'echo "$(( $(cat /tmp/count 2>/dev/null) + 1 ))" > /tmp/count' \; -exec bash -c 'test "$( cat /tmp/count )" -lt 5000' \; -exec echo "any command instead of echo of this file: {}" \;

The first -exec just increments the counter. The second -exec tests the count: if it is less than 5000, it exits with status 0 and the next action is executed. The third -exec performs the intended action on the file, in this case a simple echo; we could also use -print, -delete, etc. (I would use -delete instead of -exec rm {} \;, for instance.)

This all relies on the fact that find evaluates its actions in sequence, each one running only if the previous one returned 0.

When using the above example, you'd want to make sure /tmp/count is not used by a concurrent process.

[edits following comments from Scott] Thanks a lot Scott for your comments.

Based on them: the number was changed to 5,000 to match the initial thread.

Also: it is absolutely correct that the /tmp/count file will still be written 42,000 times (as many times as there are files being browsed), so "find" will still go through all 42,000 entries, but will only execute the command of interest 5,000 times. So this command does not avoid traversing the whole tree and is just presented as an alternative to the usual pipes. Using a memory-backed temporary directory to host this /tmp/count file would seem appropriate.
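
For instance, one way to do that (a sketch; COUNT_FILE and the /dev/shm location are just illustrative choices, /dev/shm being a typically tmpfs-backed directory on Linux):

# Create a private counter file so concurrent runs don't clash.
COUNT_FILE=$(mktemp -p /dev/shm) || exit 1
echo 0 > "$COUNT_FILE"
find . -type f \
  -exec bash -c 'echo "$(( $(cat "$1") + 1 ))" > "$1"' _ "$COUNT_FILE" \; \
  -exec bash -c 'test "$(cat "$1")" -lt 5000' _ "$COUNT_FILE" \; \
  -exec echo "any command instead of echo of this file: {}" \;
rm -f "$COUNT_FILE"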

And besides those comments, one additional note: pipes would be simpler in most typical cases.

Below are some more reasons why pipes would not apply that easily, though:

  • when file names have spaces in them, the find -exec command should remember to surround the {} with quotes ("{}") to support this case,

  • when the intended command does not allow passing all the file names at once, for instance something like: -exec somespecificprogram -i "{}" -o "{}.myoutput" \;

So this example is essentially posted for those who have faced challenges with pipes and still do not want to go for a more elaborate programming option.

wang
  • I don’t entirely understand the question — I guess that the OP has 42 000 `.log` files that they want to delete, but they want to delete only 5 000 at a time — *because handling all 42 000 at once slows down the system too much.*  This solution will perform the action (e.g., deletion) on only the first *N* files (confusingly, you have written your answer with *N* = 10 instead of the OP’s 5 000), but it will update the ``count`` file 42 000 times. – Scott - Слава Україні Mar 30 '19 at 23:54
  • 1
    First of all, welcome to Super User! We always appreciate contributions from new community members, but you apparently have two Super User accounts: [this one](https://superuser.com/users/1014501/wang) and [this one](https://superuser.com/users/1016275/wang). Please take the time to utilize the following Help Center tutorial and ask the Super User staff to merge your accounts: [I accidentally created two accounts; how do I merge them?](https://superuser.com/help/merging-accounts) – Run5k Apr 03 '19 at 13:05
0

Just |head didn't work for me:

root@static2 [/home/dir]# find . -uid 501 -exec ls -l {} \; | head 2>/dev/null
total 620
-rw-r--r--  1 root   root           55 Sep  8 15:22 08E7384AE2.txt
drwxr-xr-x  3 lamav statlus 4096 Apr 22  2015 1701A_new_email
drwxr-xr-x  3 lamav statlus 4096 Apr 22  2015 1701B_new_email
drwxr-xr-x  3 lamav statlus 4096 May 11  2015 1701C_new_email
drwxr-xr-x  2 lamav statlus 4096 Sep 24 18:58 20150924_test
drwxr-xr-x  3 lamav statlus 4096 Jun  4  2013 23141_welcome_newsletter
drwxr-xr-x  3 lamav statlus 4096 Oct 31  2012 23861_welcome_email
drwxr-xr-x  3 lamav statlus 4096 Sep 19  2013 24176_welco
drwxr-xr-x  3 lamav statlus 4096 Jan 11  2013 24290_convel
find: `ls' terminated by signal 13
find: `ls' terminated by signal 13
find: `ls' terminated by signal 13
find: `ls' terminated by signal 13
find: `ls' terminated by signal 13

(...etc...)

My (definitely not the best) solution:

find . -uid 501 -exec ls -l {} \; 2>/dev/null | head

The disadvantage is that find itself isn't terminated after the required number of lines; it keeps running until ^C or until it finishes the traversal, so ideas are welcome.
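
One idea, borrowing the head-before-the-heavy-command approach from the other answers (a sketch; assumes GNU head -z and xargs -0 -r):

# Limit the pathname list first, then run ls only on those entries;
# once head exits, find gets SIGPIPE on its next write and stops early.
find . -uid 501 -print0 2>/dev/null | head -z -n 10 | xargs -0 -r ls -ld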

Putnik
0
find /some/log -type f -name *.log -exec rm {} ; | limit 5000

Well, the command as quoted will not work, of course (limit isn't even a valid command).

But if you run something similar to the find command above, it's probably a classic problem. You're probably having performance problems because find runs rm once for every file.

You want to use xargs: it can combine several files into one command line, so it will invoke rm a limited number of times for many files at once, which is much faster.
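
A sketch of what that looks like, with the -print0/-0 precaution from the comments above (GNU extensions):

# xargs packs many pathnames into each rm invocation instead of running one rm per file.
find /some/log -type f -name '*.log' -print0 | xargs -0 rm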

sleske
0

Your "performance issues" are probably because find … -exec rm {} \; runs one rm per matching file. find … -exec rm {} + should perform better. If your find supports -delete then find … -delete should perform even better.

But your explicit question is [emphasis mine]:

Is there a way to limit the number of results returned by the find command on a Unix system?

If "returned" means "printed to stdout", then find … -print | head … (which cannot handle arbitrary names well) or find … -print0 | head -z … (which is not portable) is the answer.

Still, you want to do something with the result. Piping to xargs (as in other answers you got) is fully reliable only if you use null-terminated lines: find … -print0 | head -z … | xargs -0 …. This is not portable.
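
Filled in with the question's paths and limit, that non-portable (GNU) variant could look something like:

find /some/log -type f -name '*.log' -print0 | head -z -n 5000 | xargs -0 rm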

The following code is a portable* way to make find process (in this case: remove) at most 5000 regular files with names matching *.log under /some/log:

while :; do echo; done | head -n 4999 \
| find /some/log -type f -name '*.log' -exec sh -c '
   for pathname do
      </dev/tty rm "$pathname" \
      && { read dummy || { kill -s PIPE "$PPID"; exit 0; } }
   done
' find-sh {} +

This is how the code works:

  • find starts sh and passes possibly many pathnames to it as arguments. There may be more than one sh started one after another, the number doesn't matter.

  • sh attempts to rm files one by one in a loop. After a successful remove operation it tries to read exactly one line from its stdin inherited from find.

  • while … | head -n 4999 (which could be yes | head -n 4999, but yes is not portable) generates exactly 4999 lines. Unless we run out of files first, exactly 4999 reads will succeed. The read after the 5000th successful remove operation will be the first read that fails.

  • The failed read occurs exactly after the 5000th successful remove operation. It causes two things:

    • find ($PPID, the parent process of sh) gets SIGPIPE, so it won't start more sh processes;
    • the current sh exits, so it won't process more pathnames.

Notes:

  • To remove 5000 files you need 4999 in the code.

  • I fixed your flawed -name *.log.

  • find-sh is explained here: What is the second sh in sh -c 'some shell code' sh?

  • The solution runs one rm per matching file. It won't perform better than your original code. It's an answer to your question about limiting the number. You asked for it, you got it.

  • The solution may be adapted to any action, not necessarily rm. In another answer of mine it's mv, but in general it can be anything (possibly in the form of a huge script). To just print, use printf.

  • Anything that uses find … -exec foo … {} … or find … | xargs … foo … is prone to a race condition. Between find finding the file and foo doing something, the path to the file may be manipulated, so foo sees a different file than the one tested by find. E.g. if a rogue party removes the file and places a symlink to another file in its place, then foo will possibly work with the wrong file. In case of rm this means removing the malicious symlink, not its target, so not that bad; but if the rogue plants a symlink in place of a subdirectory then rm may actually remove the wrong file. This is especially relevant when running find as root in a directory where others can create and remove files.

    -delete provided by GNU find removes the race condition where someone may be able to make you remove the wrong files by changing a directory to a symlink in-between the time find finds a file and rm removes it (see info -f find -n 'Security Considerations for find' for details). This is how you can limit the number of files deleted by -delete in a GNU system:

    yes | head -n 4999 \
    | find /some/log -type f -name '*.log' -delete \
      \( -exec sh -c 'read dummy' find-sh \; -o -quit \)
    

    The above code runs one sh per deleted file. The code below is somewhat simpler but noisier (the -ok prompt is printed for every file); it runs one true per deleted file.

    yes | head -n 4999 \
    | find /some/log -type f -name '*.log' -delete \( -ok true \; -o -quit \)
    

* AFAIK it's portable.

Kamil Maciorowski