9

Not uncommonly I have to count the number of files in a directory; sometimes this runs into the millions.

Is there a better way than just enumerating and counting them with `find . | wc -l`? Is there some kind of filesystem call you can make on ext3/4 that is less I/O intensive?

phuclv
MattPark
  • You're counting not only files, but directories, too. If you only want to count files, use `find . -type f | wc -l`; if you want to count symbolic links and regular files, use `find . -type f -or -type l | wc -l` – FSMaxB Dec 17 '13 at 11:22
  • A directory is a kind of file, as are devices, symlinks and sockets. Regular files are a subset of files. – Toby Speight Feb 13 '17 at 15:46
  • The example you give suggests that you want a *recursive* count - if not, then you need `find -maxdepth 1`. Note that with your current approach, you'll double-count any name that contains a newline character. – Toby Speight Feb 13 '17 at 15:49
  • https://github.com/ChristopherSchultz/fast-file-count – CMCDragonkai Oct 05 '18 at 07:09

4 Answers

16

Not a fundamental speed-up but at least something :)

find . -printf \\n | wc -l

You really do not need to pass the list of file names; the newlines alone suffice. This variant is about 15% faster on my Ubuntu 12.04.3 when the directories are cached in RAM. In addition, this variant works correctly with file names containing newlines.

Interestingly this variant seems to be a little bit slower than the one above:

find . -printf x | wc -c
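
To see the newline-safety claim in action, here is a small demonstration you can run yourself (the scratch directory /tmp/nl-test is a hypothetical example):

mkdir /tmp/nl-test && cd /tmp/nl-test
touch "$(printf 'bad\nname')"   # one file whose name contains a newline
find . | wc -l                  # prints 3: one line for "." plus two lines for the single file
find . -printf \\n | wc -l      # prints 2: one newline per directory entry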

Special case - but really fast

If the directory is on its own file system you can simply count the inodes:

df -i .

If the number of directories and files outside the counted directory does not change much, you can simply subtract that known number from the current df -i result. This way you will be able to count the files and directories very quickly.
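
A minimal sketch of that subtraction approach (the mount point /data and the parsing of the IUsed column are assumptions for illustration):

# One-time baseline: inodes used on the filesystem before your files arrive
# (/data is a hypothetical mount point; adjust to your setup).
baseline=$(df -i /data | awk 'NR==2 {print $3}')   # field 3 = IUsed

# Later: the difference approximates the files/directories added since then.
now=$(df -i /data | awk 'NR==2 {print $3}')
echo "approximately $((now - baseline)) files/directories added"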

  • "This variant is about 15 % faster..." makes me wonder if there is some kind of handy trick you are using to time these? – Brian Z Dec 17 '13 at 00:31
  • @BrianZ: You can time a command by prepending it with `time`: `time find /usr/src/ -printf \\n | wc -l`. You can clear the caches between runs with `sudo sync && sudo sysctl -w vm.drop_caches=3` – MattPark Dec 17 '13 at 02:42
  • So I saw a consistent 2% increase in speed with either of the first two options without caching. So yeah, that's a pretty cool way of doing it. Counting the inodes is definitely the best if your environment is set up for that. I hadn't considered it. – MattPark Dec 17 '13 at 03:18
  • Is `-printf x` meant to be the same as `-printf '\0'`? I don't see it mentioned in the docs. – CMCDragonkai Oct 05 '18 at 07:36
  • @CMCDragonkai: The action `-printf` works similarly to the `printf()` function in C with the main difference that the `%` directives have a different meaning. The action is invoked for every file found. This means that `-printf x` will print the character `x` for every file found (try it!) and `-printf '\0'` will print the character NULL (ASCII code 0) for every file found. `-printf '\0'` has no special meaning. Both will work the same in the example with `wc -c` in this answer. – pabouk - Ukraine stay strong Oct 05 '18 at 09:14
  • @CMCDragonkai: Now I realize that you probably confused the action with `-print0`. Actions `-print` and `-print0` take no arguments; they just print the file name followed by either `\n` or `\0`. `-print` is equivalent to `-printf '%p\n'`; `-print0` is equivalent to `-printf '%p\0'`. – pabouk - Ukraine stay strong Oct 05 '18 at 09:29
  • Counting hard links doesn't always work. [Many filesystems don't have `.` and `..` as hard links](https://unix.stackexchange.com/q/748182/44425), for example [Btrfs always sets it to 1](https://archive.kernel.org/oldwiki/btrfs.wiki.kernel.org/index.php/Project_ideas.html#Track_link_count_for_directories) – phuclv Jun 11 '23 at 04:59
  • @phuclv Hard links are not mentioned here. Were you thinking of inodes? --- Inodes are not hard links. A hard link is a normal directory entry (exactly like the originally created directory entry) pointing to an inode. --- Note: Because `df -i` counts inodes, it will not count an inode multiple times if it has multiple hard links pointing to it. So you can have more directory entries than inodes. – pabouk - Ukraine stay strong Jun 12 '23 at 14:18
  • Indeed, I was wrong. Counting inodes seems plausible. – phuclv Jun 12 '23 at 17:02
5

I have written ffcnt for exactly that purpose. It retrieves the physical offsets of the directories themselves with the fiemap ioctl and then schedules the directory traversal in multiple sequential passes to reduce random access. Whether you actually get a speedup compared to `find | wc` depends on several factors:

  • filesystem type: filesystems such as ext4 which support the fiemap ioctl will benefit most
  • random access speed: HDDs benefit far more than SSDs
  • directory layout: the higher the number of nested directories, the more optimization potential

(Re)mounting with relatime or even nodiratime may also improve speed (for all methods) when the accesses would otherwise cause metadata updates.
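
For example, a hedged sketch of such a remount (/mnt/data is a hypothetical mount point; pick the filesystem you are counting on):

# Remount an already-mounted filesystem so that directory reads no longer
# trigger atime metadata writes (/mnt/data is a placeholder).
sudo mount -o remount,nodiratime /mnt/data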

the8472
  • That last sentence is a worthwhile tip! I think the link to your program would be improved if you added a summary of how it works. We prefer answers that are complete in themselves, in case anything bad happens to the linked resource (but keep the link as well, of course). – Toby Speight Feb 14 '17 at 09:40
0

Use `fd` instead. It's a fast alternative to `find` that traverses folders in parallel.

$ time find ~ -type f 2>/dev/null | wc -l
  445705
find ~ -type f 2> /dev/null  0.84s user 13.57s system 51% cpu 28.075 total
wc -l  0.03s user 0.02s system 0% cpu 28.074 total

$ time fd -t f -sHI --show-errors . ~ 2>/dev/null | wc -l
  445705
fd -t f -sHI --show-errors . ~ 2> /dev/null  2.66s user 14.81s system 628% cpu 2.780 total
wc -l  0.05s user 0.05s system 3% cpu 2.779 total

As you can see, to match the behavior of `find` you'll need to give `fd` the options `-sHI --show-errors`. By default `fd` skips hidden files/folders and anything matched by .gitignore, and it doesn't print permission errors, so without those options it's even faster than shown here.

It's possible to tune this further by printing only a newline instead of piping the whole path. In `find` you can achieve that with `-printf '\n'`. This isn't currently supported by `fd`, but it's a feature that has been requested.


Note that on Ubuntu, due to a name clash, you'll need to use `fdfind` instead of `fd`. You can simply `alias fd=fdfind` to work around the longer name. Also, the above command obviously won't work correctly for file names containing `\n`; you'll need NUL-separated output (`-0`) and a fix like this:

fd -t f -sHI -0 . ~ | tr '\0\n' '\n\0' | wc -l
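
Alternatively (a variant added here for illustration, not from the original answer), you can count the NUL separators directly instead of swapping characters:

# Keep only the NUL separators emitted by -0, then count the remaining bytes.
fd -t f -sHI -0 . ~ | tr -cd '\0' | wc -c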

Another nice thing about `fd` is that when you run it in an interactive terminal you get nicely colorized output, unlike with `find`.

phuclv
0

Actually, on my system (Arch Linux) this command

   ls -A | wc -l

is faster than all of the above:

   $ time find . | wc -l
   1893

   real    0m0.027s
   user    0m0.004s
   sys     0m0.004s

   $ time find . -printf \\n | wc -l
   1893

   real    0m0.009s
   user    0m0.000s
   sys     0m0.008s

   $ time find . -printf x | wc -c
   1893

   real    0m0.009s
   user    0m0.000s
   sys     0m0.008s

   $ time ls -A | wc -l
   1892

   real    0m0.007s
   user    0m0.000s
   sys     0m0.004s
MariusMatutiae
  • I think the problem with ls, though, is that it often returns something like `/bin/ls: Argument list too long` if you use globbing. But then again it can also operate recursively like find, so maybe that is something to consider; don't use find if it's not needed. – MattPark Dec 17 '13 at 14:09
  • It seems so late (many years) to comment about it, but `ls -A` lists only the files in the current directory, while `find` without the `-maxdepth 1` argument will make a recursive search through all subdirectories. – Luciano Dec 09 '19 at 20:42
  • To get the same output as `find` you'll need `ls -AR`, which I doubt is faster than `find`. `ls -A` is obviously fast because it only opens a single folder and lists all the files in it. – phuclv Jun 12 '23 at 17:43