CentOS server not using swap properly and getting OOM

Question

Recently I've been having some serious memory issues with my server. Just the other day, my server became completely unresponsive, and oom-killer started killing services at random (httpd, php, etc). I couldn't even SSH into my server, but I was able to PING it.

I did look at the kernel messages log, but there wasn't any clear indication as to what was causing the memory problem - all I could see was all the oom-killer messages.

sar -r command:

03/15/2012
12:00:01 AM kbmemfree kbmemused %memused kbbuffers kbcached kbswpfree kbswpused %swpused kbswpcad
12:10:01 AM 2881812 582380 16.81 26652 250192 4192944 0 0.00 0
12:20:01 AM 2883600 580592 16.76 27104 250196 4192944 0 0.00 0
12:30:01 AM 2878576 585616 16.90 27656 250320 4192944 0 0.00 0
12:40:01 AM 2851856 612336 17.68 28312 271540 4192944 0 0.00 0
12:50:01 AM 2843560 620632 17.92 28968 274468 4192944 0 0.00 0
01:00:01 AM 2843892 620300 17.91 29440 274644 4192944 0 0.00 0
01:10:01 AM 22868 3441324 99.34 60764 2947884 4192936 8 0.00 8
01:20:01 AM 13836 3450356 99.60 62064 2882544 4192844 100 0.00 92
01:30:03 AM 14024 3450168 99.60 7820 3040976 4192844 100 0.00 0
01:40:01 AM 18600 3445592 99.46 18720 3039152 4192844 100 0.00 0
01:50:01 AM 25352 3438840 99.27 20048 3034584 4192844 100 0.00 0
02:00:01 AM 22572 3441620 99.35 20872 3036896 4192844 100 0.00 0
02:10:01 AM 21408 3442784 99.38 21776 3038236 4192844 100 0.00 0
02:20:01 AM 23240 3440952 99.33 23168 3032372 4192844 100 0.00 0
02:30:01 AM 72392 3391800 97.91 25100 2981488 4192844 100 0.00 0
02:40:01 AM 70876 3393316 97.95 25824 2981756 4192844 100 0.00 0
02:50:01 AM 74200 3389992 97.86 26464 2981860 4192844 100 0.00 0
03:00:01 AM 64980 3399212 98.12 32616 2982240 4192844 100 0.00 0
03:10:01 AM 63704 3400488 98.16 33564 2984268 4192844 100 0.00 0
03:20:01 AM 59564 3404628 98.28 34592 2988936 4192844 100 0.00 0
03:30:01 AM 53972 3410220 98.44 35740 2992484 4192844 100 0.00 0
03:40:01 AM 89120 3375072 97.43 36472 2956088 4192844 100 0.00 0
03:50:01 AM 88788 3375404 97.44 36920 2956324 4192844 100 0.00 0
04:00:01 AM 78540 3385652 97.73 37740 2964452 4192844 100 0.00 0
04:10:01 AM 21720 3442472 99.37 106636 2892836 4192844 100 0.00 0
04:20:01 AM 22796 3441396 99.34 107172 2890796 4192844 100 0.00 0
04:30:01 AM 30604 3433588 99.12 107812 2884644 4192844 100 0.00 0
04:40:01 AM 32744 3431448 99.05 108568 2875944 4192844 100 0.00 0

Here is top sorted by swapped size:

top - 14:32:01 up 15:37, 1 user, load average: 0.10, 0.10, 0.04
Tasks: 110 total, 3 running, 107 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.5%us, 0.3%sy, 0.0%ni, 98.4%id, 0.7%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 3464192k total, 2663384k used, 800808k free, 140796k buffers
Swap: 4192944k total, 100k used, 4192844k free, 2073748k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ SWAP COMMAND
1975 mysql 15 0 222m 43m 4652 S 0.0 1.3 0:11.82 178m mysqld
1859 named 22 0 161m 5228 1948 S 0.0 0.2 0:00.04 156m named
2144 root 18 0 143m 47m 1072 S 0.0 1.4 0:00.00 95m spamd
2119 root 15 0 143m 49m 2628 S 0.0 1.5 0:01.17 94m spamd
2161 root 15 0 93372 1280 936 S 0.0 0.0 0:00.01 89m pure-ftpd
2163 root 18 0 91016 976 804 S 0.0 0.0 0:00.01 87m pure-authd
20035 root 15 0 91800 3096 2408 S 0.0 0.1 0:00.00 86m sshd
19432 root 15 0 92232 3656 2900 R 0.0 0.1 0:00.00 86m sshd
2377 root 19 0 93268 14m 1940 S 0.0 0.4 0:00.00 76m cpdavd
2380 root 15 0 87824 11m 1520 S 0.0 0.3 0:00.07 74m cpsrvd-ssl
3115 root 15 0 74832 1168 584 S 0.0 0.0 0:00.05 71m crond
18548 root 18 0 73624 3036 236 S 0.0 0.1 0:00.00 68m httpd
19713 nobody 18 0 73760 4460 1584 S 0.0 0.1 0:00.00 67m httpd
19712 nobody 15 0 73760 4484 1584 S 0.0 0.1 0:00.00 67m httpd
19709 nobody 18 0 73624 4460 1584 S 0.0 0.1 0:00.00 67m httpd
19508 nobody 15 0 73760 4600 1680 S 0.0 0.1 0:00.00 67m httpd
19162 nobody 15 0 73756 4640 1708 S 0.0 0.1 0:00.01 67m httpd
19154 nobody 15 0 73756 4656 1728 S 0.0 0.1 0:00.00 67m httpd
19157 nobody 15 0 73756 4696 1740 S 0.0 0.1 0:00.01 67m httpd
19327 nobody 15 0 73756 4700 1740 S 0.0 0.1 0:00.01 67m httpd
19163 nobody 15 0 73756 4768 1836 S 0.0 0.1 0:00.00 67m httpd
19164 nobody 15 0 73756 4788 1856 S 0.0 0.1 0:00.00 67m httpd
2145 root 18 0 73624 5740 2940 S 0.0 0.2 0:00.60 66m httpd
1911 root 20 0 65952 1276 1044 S 0.0 0.0 0:00.01 63m mysqld_safe

For some reason, it says that it's only using 100k SWAP, but that doesn't make any sense. Isn't VIRT the amount of SWAP being used by each process?

* Update *

Here is some more information on the file systems:

# df -T
Filesystem Type 1K-blocks Used Available Use% Mounted on
/dev/md2 ext3 468924192 17215692 427504176 4% /
/dev/md1 ext3 2030672 58788 1867068 4% /tmp
/dev/md0 ext3 101018 13414 82388 15% /boot
tmpfs tmpfs 1732096 0 1732096 0% /dev/shm

* Update 2 *

Here is the free -m that I managed to run when the server was in this OOM state, yesterday:

             total       used       free     shared    buffers     cached
Mem:          3383       3372         10          0          0          6
-/+ buffers/cache:       3365         17
Swap:         4094       4094          0

Not that I know of - is there anything that you'd like me to check for in the dmesg buffer?
– xil3, Commented Mar 15, 2012 at 18:54

Wil Cooley · Accepted Answer · 2012-03-15 20:08:42Z

I usually sort by memory ("M" in top) to troubleshoot these kinds of things--that shows you the amount of real memory that each process is using (and touching frequently enough to keep it off the least-recently-used queue for being swapped).
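If top is awkward to capture during an incident, the same per-process view can be pulled from /proc. This is only a rough sketch, assuming Python 3 and a Linux /proc file system; the function names are made up for illustration. The VmSwap line is only present on kernels that report it, and the sketch treats a missing value as 0.

```python
import os


def parse_status(text):
    """Return (name, rss_kib, swap_kib) from the text of /proc/<pid>/status."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value.strip()

    def kib(key):
        # Values look like "44032 kB". Kernel threads have no Vm* lines,
        # and older kernels have no VmSwap line; both cases yield 0 here.
        value = fields.get(key)
        return int(value.split()[0]) if value else 0

    return fields.get("Name", "?"), kib("VmRSS"), kib("VmSwap")


def top_by_rss(proc_root="/proc", limit=10):
    """List the processes with the largest resident set, biggest first."""
    if not os.path.isdir(proc_root):
        return []
    rows = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        path = os.path.join(proc_root, entry, "status")
        try:
            with open(path, errors="replace") as handle:
                name, rss, swap = parse_status(handle.read())
        except OSError:
            continue  # the process exited while we were scanning
        rows.append((rss, swap, int(entry), name))
    rows.sort(reverse=True)
    return rows[:limit]


if __name__ == "__main__":
    for rss, swap, pid, name in top_by_rss():
        print(f"{pid:>7} {name:<16} RSS {rss:>9} KiB  swap {swap:>9} KiB")
```

Sorting on VmRSS corresponds to the RES column in top.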

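VIRT = RES + SWAP

So no, VIRT is not the amount of swap a process is using. In this version of top the SWAP column is just VIRT minus RES: it covers everything in the process's address space that isn't resident (mapped files, memory that was allocated but never touched), not only pages that were written to the swap device. Your own numbers show it: for mysqld, 222m - 43m is about the 178m listed under SWAP. The real swap usage is the "100k used" on the Swap: line in the header.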
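The relationship can be checked against a few rows of the top output in the question. A small sketch, assuming Python 3; `to_kib` is a helper invented here, and the rows are copied by hand from the question.

```python
def to_kib(field):
    """Convert a top size field such as '222m' or '4652' to KiB."""
    units = {"k": 1, "m": 1024, "g": 1024 * 1024}
    suffix = field[-1].lower()
    if suffix in units:
        return float(field[:-1]) * units[suffix]
    return float(field)  # bare numbers in top are already KiB


# (command, VIRT, RES, SWAP) copied from the top output in the question
rows = [
    ("mysqld", "222m", "43m", "178m"),
    ("named", "161m", "5228", "156m"),
    ("pure-ftpd", "93372", "1280", "89m"),
    ("sshd", "91800", "3096", "86m"),
]

for command, virt, res, swap in rows:
    derived_mib = (to_kib(virt) - to_kib(res)) / 1024
    print(f"{command:<10} VIRT-RES = {derived_mib:6.1f} MiB, SWAP column = {swap}")
```

Each derived value lands within about 1 MiB of the SWAP column, which is what you would expect from top rounding the displayed sizes.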

Another thing to check is whether /tmp is a tmpfs file system and if something is writing a lot of data there.

I am actually a little confused by what I'm seeing. Is this sar output over the interval when your outage occurred or just the default output? And the top output is from a totally different time, 14:32?

Also, it's not really using swap at the time you took these stats because it doesn't need to: nearly 3G of your memory is currently being used as disk cache ("kbcached"), and only kbmemused - kbbuffers - kbcached = 446936 KiB (about 436 MiB) [at 04:40:01] is in use by actual processes.
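Buffers and page cache are both reclaimable, so a rough figure for memory held by processes subtracts both from kbmemused. The sketch below, assuming Python 3, applies that to three rows copied from the sar output in the question; `process_kib` is a name invented for the example.

```python
# (kbmemused, kbbuffers, kbcached) in KiB, copied from the sar -r rows
samples = {
    "01:00:01 AM": (620300, 29440, 274644),
    "01:10:01 AM": (3441324, 60764, 2947884),
    "04:40:01 AM": (3431448, 108568, 2875944),
}


def process_kib(used, buffers, cached):
    """Memory not attributable to buffers or page cache."""
    return used - buffers - cached


for stamp, (used, buffers, cached) in samples.items():
    kib = process_kib(used, buffers, cached)
    print(f"{stamp}: {kib} KiB ({kib / 1024:.0f} MiB) outside buffers/cache")
```

Between 01:00 and 01:10 kbmemused jumps by roughly 2.7 GiB, but almost all of that growth is kbcached; the part outside buffers and cache only moves from about 309 MiB to about 423 MiB.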

Because no process image is using much memory itself and yet the oom-killer started, I would guess that something began performing a lot of file I/O and dirtying pages faster than they could be written to disk. I'm not really sure that should trigger the oom-killer, though.

None of these dirty pages would go to swap, because it's about as easy to write the content of the file itself out as it is to write the data to swap.

The obvious guess is that mysqld was doing this, although I would suspect that it would open its files with O_DIRECT, which suggests to the kernel to minimize effects on the cache (with the premise that the DB server is doing its own caching).

Update

Based on your free output from update #2, the answer to the question in your title is that it's using swap just fine; something just used all of it. The other data you provided is normal for a system that has recently booted.

Update 2

I mentioned mysql above, but I would be surprised if that's the culprit, honestly. I would suspect spamd, the cPanel processes or web applications running within Apache first.

I have also been assuming that you're running a reasonably current distro without any tweaking of system tunables and that you're current on security patches. There was a BIND exploit in the last few months that resulted in a DoS, but I cannot recall if the exploit triggered memory exhaustion or something else. I have also read of cPanel exploits recently, but I don't know how current those were.

Just updated the question with the file system types - /tmp is ext3
– xil3, Commented Mar 15, 2012 at 19:03
Hmm, isn't /dev/shm used for shared memory? Kinda confused as to why that is allocated 1732096 and /tmp is allocated 2030672. Is it considering the combination of those 2 as the SWAP?
– xil3, Commented Mar 15, 2012 at 19:17
The amount of memory a tmpfs file system can use is set at mount time. It usually defaults to 1/2 of physical RAM, which matches your /dev/shm: 3464192k total memory / 2 = 1732096k. /tmp doesn't count for anything since it's ext3. tmpfs doesn't count toward system-wide swap usage unless pages are actually used for its contents.
– Wil Cooley, Commented Mar 15, 2012 at 19:27
I ran the sar and top just now, at the time of writing this question. Unfortunately, I don't have any data from when the outages occurred. So you think the culprit might be mysql? Do you think it could be related to a PHP script that is constantly running a large SQL query? And can you recommend any way to try and track down exactly what is going on?
– xil3, Commented Mar 15, 2012 at 19:39
I also set up a memory logger to write details to a file every 10 minutes, so if it does occur again, I'm hoping it will capture some useful information.
– xil3, Commented Mar 15, 2012 at 19:40
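
For a periodic memory logger like the one mentioned in the last comment, something along these lines might do. It is a sketch under assumptions, not the logger xil3 actually used: it assumes Python 3 and a Linux /proc/meminfo, and the field selection and function names are choices made for this example. Dirty and Writeback are included because of the dirty-page theory in the answer.

```python
import os
import time

# Fields worth tracking for this problem; the selection is an assumption.
FIELDS = ("MemFree", "Buffers", "Cached", "SwapFree", "Dirty", "Writeback")


def parse_meminfo(text):
    """Map each /proc/meminfo key to its numeric value (kB for most keys)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            values[key] = int(parts[0])
    return values


def format_snapshot(meminfo, timestamp):
    """Build one log line: a timestamp followed by name=value pairs."""
    pairs = [f"{name}={meminfo.get(name, 0)}kB" for name in FIELDS]
    return timestamp + " " + " ".join(pairs)


def snapshot(meminfo_path="/proc/meminfo"):
    """Read the current memory counters and return them as one log line."""
    with open(meminfo_path) as handle:
        meminfo = parse_meminfo(handle.read())
    return format_snapshot(meminfo, time.strftime("%Y-%m-%d %H:%M:%S"))


if __name__ == "__main__":
    if os.path.exists("/proc/meminfo"):
        print(snapshot())
```

Run from cron with its output appended to a file, this leaves one line per sample, so the growth leading up to the next OOM event can be read off afterwards.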
