
We have a custom board based on the BeagleBone Black (BBB). It has 256MB of RAM and 4GB of eMMC. We are using Linux 3.12, and the eMMC holds ext4 partitions.

I am writing a script that runs periodically and checks for file system errors; if a partition is not mounted, the script tries to correct the errors using `e2fsck`.
Initially I was using `e2fsck -n /dev/mmcblk0pN` (N is the partition number) to check the partition for errors.
However, that command started giving wrong results when the partition was mounted and files were being created on it.

So I need an alternative way to check for file system errors.
One option is to run `tune2fs -l` on the partition and check the Filesystem state field.

I am not sure whether this field is reliable for detecting file system errors, or what values it can take. I have seen the values clean, clean with errors, and not clean, but the man page does not give more information.

So, is `tune2fs -l /dev/mmcblk0pN | grep "Filesystem state" | grep "error"` a reliable way to detect file system errors? Is there a better option for checking a partition for errors?

Any suggestions, pointers, or information?

ART

1 Answer


"Tune2fs -l" will tell you if the kernel has noticed file system corruption issues while it is running. For example, if you ask ext4 to delete a file, and ext4 discovers that some of the blocks in that file were already marked as deallocated, that means that the allocation bitmap is corrupt. Note that the allocaiton bitmap was already corrupt at the time when ext4 discovered it. In fact, it could have been corrupt for days or weeks, and if you had been writing new files, it's possible that ext4 might have allocated blocks for new files that were in used for older files, and the user may have lost data as a result.

The only way to say reliably whether a file system is consistent or has some amount of corruption is to run e2fsck on it. Doing this requires either that the file system be unmounted or that you create a read-only snapshot. (If you are using LVM, you can create a read-only snapshot, check the snapshot, and then if the file system is found to be corrupt, you can either reboot the system and let e2fsck fix the file system, or send e-mail to the system administrator to schedule downtime to fix the file system.)
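
The LVM route can be sketched roughly as follows. This is an illustration under assumptions, not a tested procedure: the function name `check_lv_snapshot`, the `-fscheck` suffix, and the 100M default snapshot size are invented here, and the sketch assumes the snapshot is disposable, so e2fsck is allowed to replay the journal and repair the copy. The live volume is never touched.

```shell
#!/bin/sh
# Sketch only: check an LVM-backed ext4 volume through a throwaway snapshot.
# check_lv_snapshot, the "-fscheck" suffix and the 100M default are
# assumptions made for this example.
check_lv_snapshot() {
    vg=$1
    lv=$2
    snap_size=${3:-100M}
    snap="${lv}-fscheck"

    # Create a snapshot of the (possibly mounted) logical volume.
    # 8 mirrors e2fsck's "operational error" status.
    lvcreate -s -L "$snap_size" -n "$snap" "$vg/$lv" >/dev/null || return 8

    # Check the snapshot, never the live volume. Keep e2fsck's exit status.
    rc=0
    e2fsck -fy "/dev/$vg/$snap" || rc=$?

    # Always drop the snapshot, whatever e2fsck reported.
    lvremove -f "$vg/$snap" >/dev/null

    return "$rc"
}

# Usage (volume group and volume names are placeholders):
#   check_lv_snapshot vg0 data 100M || echo "schedule a reboot and fsck"
```

A non-zero status here only says that the snapshot needed repair or could not be checked; fixing the real file system still has to happen while it is unmounted.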

All of this being said, if the file system has become corrupt, the most common cause is a hardware issue. It's possible that it could be because of a kernel bug, although I do periodically run regression tests on the stable kernels, not just on upstream, and we haven't had a fs corruption problem in a very long time. It is possible that there might be a memory corruption bug in a device driver, and either (a) the device driver isn't upstream, and the hardware vendor didn't do proper quality control, or (b) the bug was fixed upstream, and even pushed to the latest stable kernel, but the device kernel wasn't taking updates from the stable kernel series.

Note that if you are looking to see if the file system was found to be corrupt because the kernel tripped over something blatantly wrong, you don't have to just scrape dmesg or /var/log/messages. You can also try reading the file /sys/fs/ext4//first_error_time. If that file contains a non-zero value, that will tell you the time (using the Unix epoch) that a file system corruption was detected by the kernel. The errors_count file in that directory will tell you how many file system corruptions have been detected (but that can just be the system tripping over the same problem over and over again). Also of interest: if you want to test how your system handles file system errors being detected by the kernel, you can try writing a string to the trigger_fs_error file, e.g., `echo "test error" > /sys/fs/ext4/sda1/trigger_fs_error`
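
A periodic monitor could poll those two files along these lines. This is a minimal sketch: the function name `ext4_error_summary`, its output format, and its return codes are invented for the example, and the directory in the usage comment is a placeholder. Whether the files exist depends on the kernel version, so a missing file is reported as "unknown" (status 2) rather than as "no errors".

```shell
#!/bin/sh
# Hypothetical helper: summarise the ext4 error counters found in one
# per-device sysfs directory.
# Returns 0 if no error has been recorded, 1 if the kernel has recorded at
# least one error, 2 if the files are not available.
ext4_error_summary() {
    dir=$1
    [ -r "$dir/first_error_time" ] && [ -r "$dir/errors_count" ] || return 2

    first=$(cat "$dir/first_error_time")
    count=$(cat "$dir/errors_count")
    echo "first_error_time=$first errors_count=$count"

    # A zero timestamp means the kernel has not recorded any error.
    if [ "$first" = "0" ]; then
        return 0
    fi
    return 1
}

# Usage (the directory name is a placeholder for the real device):
#   ext4_error_summary /sys/fs/ext4/mmcblk0p2 || echo "needs attention"
```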

Finally, please take a look at the errors behaviour knob which you can set in tune2fs. It may be, if you want to really make sure that more damage isn't done after a file system corruption issue has been detected, that you want to configure the file system to remount itself read-only when a problem is found --- or maybe just force a reboot, so that e2fsck can be run during the boot sequence to fix a problem before (even more) user data gets corrupted or lost.
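
For reference, the knob is set with `tune2fs -e`, which accepts `continue`, `remount-ro`, or `panic`. The wrapper below is only a sketch: the name `set_errors_behaviour` and its return code for a bad argument are invented here, and the device in the usage comment is a placeholder. Note that `panic` by itself only panics the kernel; turning that into a reboot relies on a panic timeout being configured (for example through the `panic=` kernel parameter).

```shell
#!/bin/sh
# Hypothetical wrapper around `tune2fs -e` that rejects unknown modes
# before touching the device.
set_errors_behaviour() {
    dev=$1
    mode=$2
    case "$mode" in
        continue|remount-ro|panic) ;;
        *)
            echo "unknown errors behaviour: $mode" >&2
            return 2
            ;;
    esac
    tune2fs -e "$mode" "$dev"
}

# Usage (the device name is a placeholder):
#   set_errors_behaviour /dev/mmcblk0p2 remount-ro
```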

Theodore Ts'o
  • Thank you very much for the detailed reply, Theodore. Currently we are using BusyBox SysVinit, which does not run an `fsck` check at boot by itself, and we don't have a `filesystem_check` daemon as `systemd` does. So the only thing we can do is write a script and run it at startup and periodically during normal operation. At boot, the script would check all the ext4 partitions, correct errors using `e2fsck -y` (or `e2fsck -p`), and reboot if required. During normal operation it would do the same check but only report errors, not correct them. – ART Oct 24 '16 at 06:03
  • I have two questions. 1) How can we reliably check for file system errors on mounted and unmounted partitions? As you pointed out, the partition should be unmounted, and in my own tests `e2fsck` did not give correct results while the partition was mounted and being written to. 2) Can unmounted partitions develop file system errors at all? (Hardware bad blocks could be the culprit, but I cannot think of any other reason.) – ART Oct 24 '16 at 14:17
  • E2fsck can not give a reliable result when checking a mounted file system. The one exception is if you are using LVM and you create a read-only snapshot. That read-only snapshot can be checked reliably (but you can't fix the running file system). All of this I described above. – Theodore Ts'o Oct 25 '16 at 05:14
  • Unmounted file systems can have errors due to (a) hardware problems, (b) kernel bugs, (c) unclean shutdowns if the flash storage is not power failure certified. Not all flash devices are guaranteed to Do The Right Thing if they receive a power fail. – Theodore Ts'o Oct 25 '16 at 05:15
  • Thank you for the reply, Theodore. In that case, how does anyone monitor file system errors on mounted partitions and take action based on them? Can they not be monitored at all? We don't want a file system error to reach the point where the partition is remounted read-only and the user cannot perform any operation on it. Is there a specific certification I should look for in the eMMC datasheet that tells me whether it is power-failure certified? – ART Oct 25 '16 at 05:59
  • If you are a handset manufacturer, and are buying eMMC devices hundreds of thousands at a time, you can usually ask your supplier these questions, and the really serious vendors will do their own power failure testing. Or, you can not use removable batteries and design your device to shut down cleanly when power starts getting low. Perfection can't be done on the cheap. It means using ECC memory to prevent cosmic rays from corrupting memory before disk buffers can be written, etc. What is your use case? Is it life/mission-critical? Avionics? Self-driving car? – Theodore Ts'o Oct 25 '16 at 13:08
  • On SSDs, it's on the data sheets, and it's the difference between consumer grade SSDs that might go for low hundreds of dollars, and enterprise grade SSDs that might go for almost a thousand dollars or more. I don't know how many eMMC devices are power fail certified. There is an unfortunate tendency called "race to the bottom", where manufacturers try to shave fractions of a penny off their bill of materials (BOM) costs, which tends to result in quality not necessarily being the highest priority. Yes, I'm comparing retail prices with wholesale prices, but.... – Theodore Ts'o Oct 25 '16 at 13:13
  • Also note that most handset manufacturers cheap out on software engineering costs, so they aren't keeping up with the latest bug fixes from the stable kernel series. So to be fair it's not just a hardware issue. That's why I'm asking what your use case is; it's unusual for handset manufacturers to care. Unfortunately, for far too many of them, as long as they don't catch fire and blow up on airplanes, it's considered a win.... (and it's not just their fault; most people aren't willing to pay extra for quality, so we get what we deserve) – Theodore Ts'o Oct 25 '16 at 13:19
  • If you are using LVM, here's how you can check a mounted file system. It requires making a read-only snapshot, and most embedded / handset systems do *not* use LVM. (They could, if they cared, I suppose....) https://git.kernel.org/cgit/fs/ext2/e2fsprogs.git/tree/contrib/e2croncheck – Theodore Ts'o Oct 25 '16 at 13:24
  • Thank you for the reply. We are NOT in mission-critical, avionics, or self-driving-car territory; ours is a non-life-safety-critical embedded application. However, we are seeing these file system errors on our test devices. They seem to stem from a fragmentation issue (or at least that is what we understand from the kernel messages). I have posted about it in multiple forums, e.g. the BBB forum: https://groups.google.com/forum/#!topic/beagleboard/L7piqfHiyO8 . However, we have not found the root cause or an appropriate solution (except setting min_free_kbytes to 16M). – ART Oct 25 '16 at 16:12
  • So we are trying to find a way to recover from these file system errors if they happen. I have also posted a similar question here: http://superuser.com/questions/1138443/how-to-reduce-unmovable-and-and-unevictable-pages-in-linux – ART Oct 25 '16 at 16:12
  • It looks like your main problem is that you're trying to use a system with only 256 MiB of RAM, and the eMMC driver is trying to allocate an order 3 chunk of memory (8 contiguous 4k pages), and is failing. This is causing I/O errors, which are corrupting the file system. You may simply be doing more than your system can support, or it may be that some kernel module or driver you have loaded has a memory leak. – Theodore Ts'o Oct 26 '16 at 03:56
  • I will also check for a kernel memory leak and post an update. – ART Oct 26 '16 at 12:29
  • It seems that we have a memory leak in the edma driver (Linux 3.12) for the am335x. We are trying to get help from TI. Meanwhile, I have one question: is `e2fsck -y` a destructive operation? For my script I am planning to use `e2fsck -p`, as the manual says it safely fixes issues without human intervention. What kind of error would need human intervention? Is it when e2fsck thinks the operation would be destructive? – ART Dec 07 '16 at 14:14
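
For the boot-time script discussed in the comments above, one piece that can be sketched safely is how to act on the `e2fsck` exit status. According to the e2fsck man page the status is a bit mask (1: errors corrected, 2: errors corrected and the system should be rebooted, 4: errors left uncorrected, 8 and above: operational, usage, or other failures), and with `-p` a problem that may need the administrator's attention makes e2fsck stop with 4 set in the status. The function name `describe_fsck_status` and the keywords it prints are invented for this example, and the device in the usage comment is a placeholder.

```shell
#!/bin/sh
# Hypothetical helper: map an e2fsck exit status to one action keyword.
describe_fsck_status() {
    rc=$1
    if [ "$rc" -eq 0 ]; then
        echo "clean"            # no errors found
    elif [ "$rc" -ge 8 ]; then
        echo "fsck-failed"      # operational/usage error, cancelled, etc.
    elif [ $((rc & 4)) -ne 0 ]; then
        echo "uncorrected"      # errors remain; needs manual attention
    elif [ $((rc & 2)) -ne 0 ]; then
        echo "reboot"           # errors corrected, reboot recommended
    else
        echo "corrected"        # errors corrected, no reboot needed
    fi
}

# Usage on an unmounted partition (the device name is a placeholder):
#   rc=0
#   e2fsck -p /dev/mmcblk0p2 || rc=$?
#   [ "$(describe_fsck_status "$rc")" = "reboot" ] && reboot
```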