
I recently upgraded my server from Debian squeeze i386 to wheezy amd64 by reinstalling and reconfiguring it. Additionally, I wanted to be able to run virtual guests, so I installed Xen as well.

I then ran into the problem that, from time to time, the OOM killer destroyed multiple processes on my Dom0. I restarted and disabled several services (apache2, mysql, postgresql, ...). Now it seems that no processes are killed anymore (I cannot be sure, since it does not happen regularly but rather stochastically). BUT: if I put high load on the machine (access to an encrypted filesystem), the OOM killer is triggered.
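
To give an idea of what "high load" means here: a large sequential write to the encrypted filesystem is roughly the kind of access involved (the mount point below is only an example, not my real path):

```
# Example only: /mnt/crypt stands in for my dm-crypt/LUKS mount point
dd if=/dev/zero of=/mnt/crypt/oom-test bs=1M count=2048 conv=fsync
```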

Unfortunately, the system is no longer usable after the problem occurs, so I cannot log in via ssh to investigate. Physical investigation via the console also hangs most of the time.
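
Since I cannot get at the logs after the crash, my plan for the next occurrence is to forward the kernel log to another machine with netconsole (just a sketch; the target IP 192.168.1.10 is an example):

```
# On the Dom0: send kernel messages over UDP to a log host (example IP)
modprobe netconsole netconsole=@/eth0,@192.168.1.10/
# On the log host: listen for the messages (netconsole's default target port is 6666);
# depending on the netcat variant this is either "nc -u -l -p 6666" or "nc -u -l 6666"
nc -u -l -p 6666
```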

I have an atop daemon sampling every minute, so I can see the memory and swap consumption right before the crash: the RAM is 1 GB (880 MB) in total (statically allocated to Dom0, no ballooning), of which approx. 440 MB are cache. A few MB are buffers and around 20 MB are free. The swap is 25 GiB in total and completely free.
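
This is how I replay the atop samples from just before a crash (the log path is the usual Debian location, older packages may use /var/log/atop.log; date and start time are examples):

```
# Replay the atop raw log of the day in question, starting a few
# minutes before the crash
atop -r /var/log/atop/atop_20130705 -b 10:55
# inside atop: 'm' switches to the memory view, 't' advances one sample
```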

What I do not understand: why does the kernel not drop some of the cache if more RAM is needed? It is only cache, so the worst that could happen is a performance hit while the system stays stable; instead, the system crashes. Also, why are unneeded memory pages of other programs not moved to swap? There should be more than enough space there to do whatever is needed.
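
For reference, these are the knobs and counters I would look at to understand that behaviour (all standard Linux sysctls and proc files, nothing specific to my setup):

```
# Swap/overcommit tuning currently in effect on the Dom0
sysctl vm.swappiness vm.min_free_kbytes vm.overcommit_memory
# Confirm the kernel actually sees the 25 GiB swap as active
cat /proc/swaps
# Memory and commit accounting at a glance
grep -E -i 'memfree|cached|swap|commit' /proc/meminfo
```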

I sometimes saw a message on the console that a task (jbod/raid5 or something similar) was blocked for more than 120 seconds. I am not sure whether this is the cause or a consequence of the OOM problem.
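
That 120-second figure comes from the kernel's hung-task watchdog; this is what I would capture the next time the message shows up (standard commands):

```
# Threshold for the "blocked for more than N seconds" warning
cat /proc/sys/kernel/hung_task_timeout_secs    # usually 120
# Full stack trace that accompanies the warning
dmesg | grep -A 20 'blocked for more than'
```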

Now my questions are:

  • Could it be a Xen issue?
  • Could it be a hardware issue? RAM or HD?
  • What can I do to avoid future crashes?

Edit: I just tried to reproduce the error. It did crash, but this time (I do not know exactly whether there were other errors in the other situations) the program that hung was xenwatch, so it was not a program accessing the HD.
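
In case it helps, this is how I tried to gather more context on the xenwatch hang (assuming the xl toolstack; with the older xend toolstack it would be `xm dmesg`):

```
# Hypervisor log - may contain Xen-side messages that never reach the Dom0 syslog
xl dmesg
# Dom0 kernel messages around the hang
dmesg | grep -B 5 -A 20 xenwatch
```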

Christian Wolf
  • JBOD and RAID5 are both related to storage. Could it possibly be that the storage is dying (either controller or disk(s)), the system detects this and thus won't use the swap even though it's supposed to be available? Storage timing out is *never* a good sign. – user Jul 05 '13 at 11:17
  • In total there are 4 GB, and I want to keep the possibility to upgrade. – Christian Wolf Jul 05 '13 at 11:38
  • @MichaelKjörling I have 3 soft-RAID devices: root, a PVS (both on 3 disks) and a backup on 2 disks. The swap is NOT under LVM but a physical partition on the 3 main disks. I had an issue with one of the two backup disks, so I guessed it was this one causing the problems. Is there a possibility to tell whether the storage died? – Christian Wolf Jul 05 '13 at 11:42
  • Check the S.M.A.R.T. info for your drives `sudo smartctl -a /dev/sdX` – terdon Jul 05 '13 at 13:31
  • @terdon Looking for what? Any errors? There are some, but only on disks that are not part of any RAID, only the backup (see edit in main question). – Christian Wolf Jul 05 '13 at 14:00
  • Just as a way of seeing whether any of the disks that serve your swap are dying, which might stop the swap from being used. – terdon Jul 05 '13 at 14:03
  • @terdon All disks that have swap partitions report no errors, except for one in the past (the checks I ran are sketched below the comments). – Christian Wolf Jul 05 '13 at 14:07
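
For completeness, these are the health checks referred to in the comments above (the md device and disk names are generic examples, not necessarily my exact devices):

```
# Software RAID status
cat /proc/mdstat
mdadm --detail /dev/md0      # example device name
# SMART health and error log of one of the swap-carrying disks
smartctl -a /dev/sda         # example device name
```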

0 Answers