
I have a folder with a few hundred thousand small files, totalling about 14 GB of data. This is a folder in my ecryptfs encrypted home directory.

Doing a "du -sh folder" takes over 9 minutes. Doing a cp -ral to a non-encrypted location takes an hour and 15 minutes. CPU load during this time is mostly IO-bound (80% wa in top)

Doing a "du -sh encryptedfolder" only takes 15 seconds and a cp -ral to the same location takes but 80 seconds. 'encryptedfolder' is the folder in /home/.ecryptfs/myname/.Private that contains the encrypted files.

I am baffled as to where this performance hit comes from. This folder is backed up nightly via rsync, which now takes more than two hours. Before I switched to ecryptfs I used TrueCrypt, and the backup ran in 12 minutes.

Why is ecryptfs so abysmally slow in this scenario? The du -sh and cp -ral operations do not involve any decryption of file contents, just finding the right filenames. Is there some way to speed this up?

P.S.: This runs on Ubuntu 11.04

Guy

2 Answers


There are a couple of contributing factors here.

  1. Getting a list of all filenames in a directory requires decoding, parsing and decrypting the lower filenames.

  2. The stat() calls from du cause a lookup, which requires allocating an eCryptfs inode, reading part of the lower file metadata, checking to make sure it is an eCryptfs file and then parsing out the unencrypted file size to set the eCryptfs inode's i_size field. Keep in mind that reading the metadata from the lower filesystem involves reading a page into the lower filesystem's page cache.

Because of the design of eCryptfs, it has some unfortunate overhead when dealing with a large number of files. I'm sure there are improvements and enhancements to be made within the constraints of that design, but optimising this part of the code has not previously been a focus of mine.
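
One way to see this overhead from user space (a rough sketch; the paths are placeholders) is to compare the syscall profile of du on the eCryptfs mount against the same run on the lower directory:

    # strace -c summarises the number of syscalls made and the time spent in them.
    # On the eCryptfs mount, the per-file stat and directory-read calls are far
    # more expensive, because each one triggers the lower-file lookup and
    # metadata read described above.
    strace -c du -sh ~/folder
    strace -c du -sh /home/.ecryptfs/myname/.Private/encryptedfolder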

tyhicks
  • Okay, this is a bit disappointing. But I'll just move the folder out of my home directory and onto a TrueCrypt-secured disk. – Guy Mar 05 '12 at 23:11

The simple answer is that it isn't. The performance hit isn't from ecryptfs being slow; it's from the need to allocate a very large number of inodes and perform disk maintenance to put all of the metadata associated with the files on disk, one file at a time.

If the folder is backed up nightly, you might find it more useful to first "tar" the entire directory, compress the resulting file and then encrypt that (don't encrypt and then compress, because compression is ineffective on encrypted data). This way you'll have a backup that is markedly smaller, and much faster to create and move around.
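
For example, something along these lines (only a sketch; the paths, the archive name and the use of gpg for the encryption step are assumptions about your setup):

    # Archive and compress first, then encrypt the single resulting file.
    # Compressing after encryption would gain almost nothing, since encrypted
    # data looks random to the compressor.
    tar -czf - /path/to/folder | gpg --symmetric --output backup.tar.gz.gpg

    # Restoring is the reverse:
    gpg --decrypt backup.tar.gz.gpg | tar -xzf -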

SecurityMatt
  • (1) For du -sh there is no need to put anything on disk; it is purely reading. (2) Your suggestion with tar is not faster, because it would still have to go through ecryptfs (as that's where the folder is stored). Also, I'd prefer to do incremental backups rather than get a new huge file every day. – Guy Mar 05 '12 at 21:42
  • Reads still need to go through the inode structure (which involves a lot of indirection, each step requiring disk access and a reset of the encryption vectors). Your point about tar is valid, but if you want incremental backups you should probably be using SVN or GIT instead of taking a local copy – SecurityMatt Mar 05 '12 at 22:36
  • Thanks for the clarification. However, I do not accept the premise that svn/git would be a replacement for backups. Ever. In this case I am not even sure it would work at all, or make things significantly faster (it would still need to go through all the files to see what has changed). – Guy Mar 05 '12 at 23:10
  • To justify my point: http://blog.codekills.net/2009/12/08/using-git-for-backup-is-asking-for-pain / http://ewout.name/2011/10/do-not-store-database-backups-in-git/ – Guy Mar 06 '12 at 08:37
  • This is the first time you've mentioned databases. You're right that you shouldn't back a database up in GIT, but for normal files which change infrequently (such as source-code) it's still a good choice. For database backups you'd probably be better off taking incremental diffs of the database, compressing (and maybe encrypting) the result and storing those off site. Remember that a backup onsite isn't a backup at all when your site burns to the ground. – SecurityMatt Mar 06 '12 at 10:48
  • I do not have a database; this is just an example link. The other one does not talk about databases but lists other issues. And of course I do have off-site backups :) Git (or similar) just is not intended for anything but source code. In this particular case, it also would not help the performance - after all, doing e.g. a "git commit ." would still require git to scan the whole structure - no difference in that regard. – Guy Mar 06 '12 at 12:51
  • Git and SVN are intended for dealing with large numbers of small files that don't change very often. They don't really care about whether those files contain source code or anything else. Also GIT and SVN don't scan the whole structure for changes - that's their whole point. – SecurityMatt Mar 06 '12 at 15:43
  • How would git or svn know which file was added and which one was modified without checking every file, i.e. the modification timestamp? This has the exact same overhead as accessing the filesize metadata with du (except maybe being able to ignore some directories, but in my use case it would have to go into at least the main big folders). And as outlined in the first link (http://blog.codekills.net/2009/12/08/using-git-for-backup-is-asking-for-pain), neither git nor svn backs up metadata except timestamps and access bits, which makes them less than ideal for generic backups. – Guy Mar 06 '12 at 20:51
  • I don't know how it does it, but when I update code at my company, which has literally millions of source files containing billions of lines of source code, it detects changes instantly and certainly isn't reading in all of the 500GB of files to see if each one has changed. Look, I don't want to be unkind, but it seems to me that you've already decided I'm wrong, so I'm not sure there's really any point in continuing this discussion. – SecurityMatt Mar 06 '12 at 23:51
  • Sorry, you are right: I am not convinced you are right, mostly because you only state that it is so but do not give any arguments why. Regarding git file access, cf. questions on Stack Overflow, e.g. http://stackoverflow.com/questions/4075528/what-algorithm-git-uses-to-detect-the-changes-on-your-working-tree which is the same as what du -sh does (are you on ecryptfs?). And backups using git do not back up all metadata (e.g. ownership), which makes it unsuitable for this purpose. Also, deleting older backups (daily backups from 5 years ago?) etc. is not easily possible. – Guy Mar 07 '12 at 06:47