I usually sort by memory ("M" in top) to troubleshoot these kinds of things--that shows you the amount of real memory that each process is using (and touching frequently enough to keep it off the least-recently-used queue for being swapped).

VIRT = RES + SWAP

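If you'd rather get the same view non-interactively, something along these lines works (a rough sketch; the exact flags depend on your procps version):

    ps aux --sort=-rss | head -n 15     # processes sorted by resident memory, largest first
    top -b -n 1 -o %MEM | head -n 20    # same idea in batch mode, if your top supports -o
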
Another thing to check is whether /tmp is a tmpfs file system and if something is writing a lot of data there.

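A quick way to check both of those:

    df -h /tmp                               # "tmpfs" in the Filesystem column means /tmp lives in RAM/swap
    mount | grep /tmp                        # shows the filesystem type for /tmp
    du -sk /tmp/* 2>/dev/null | sort -n | tail   # what's actually accumulating there, biggest last
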
I am actually a little confused by what I'm seeing. Is this sar output from the interval when your outage occurred, or just the default output? And is the top output from a totally different time, 14:32?

Also, it's not really using swap at the time you took these stats because it doesn't need to--nearly 3 GB of your memory is currently being used as disk cache ("kbcached"), and you only have kbmemused - (kbcached + kbbuffers) = 664072 KiB (648 MiB) [at 04:40:01] in use by actual processes.

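If you want the same arithmetic without doing it by hand, the "-/+ buffers/cache" line in free already nets out buffers and cache (layout varies between procps versions; newer free prints an "available" column instead):

    free -m        # the "-/+ buffers/cache" used column is memory actually held by processes
    sar -r 1 5     # a few fresh samples of the same kbmemused/kbbuffers/kbcached counters
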
Because no process image is using much memory itself and yet the oom-killer kicked in, I would guess that something started performing a lot of file I/O and dirtying pages faster than they could be written out to disk. I'm not really sure that should trigger the oom-killer on its own, though.

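You can watch for that pattern directly, and the oom-killer always logs what it shot and why (a rough sketch; the log location assumes a syslog-style setup):

    watch -n 5 "grep -E '^(Dirty|Writeback):' /proc/meminfo"   # are dirty pages piling up?
    vmstat 5                                                   # watch bo (blocks out) and si/so over time
    dmesg | grep -i -E 'oom|killed process'                    # what the oom-killer actually did
    grep -i oom /var/log/messages                              # same, if dmesg has already rotated
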
None of those dirty pages would go to swap: swap is only for anonymous memory, and it's about as easy to write the contents of a dirty file-backed page back to the file itself as it is to write the data to swap.

The obvious guess is that mysqld was doing this, although I would expect it to open its data files with O_DIRECT, which hints to the kernel to minimize the effect on the page cache (on the premise that the DB server does its own caching).

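Whether mysqld actually uses O_DIRECT depends on how InnoDB is configured--it is not the default everywhere. Assuming InnoDB and a stock my.cnf location, you can check with:

    mysql -e "SHOW VARIABLES LIKE 'innodb_flush_method'"   # O_DIRECT here means data files bypass the page cache
    grep -i flush_method /etc/my.cnf                       # or look for it in the config directly
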
Update

Based on your free output from update #2, the answer to the question in your topic is that it's using swap just fine; something simply used all of it. The other data you provided is normal for a system that has recently booted.

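If your kernel is new enough to report per-process swap (VmSwap shows up in /proc/<pid>/status on roughly 2.6.34 and later), you can see who is actually holding it; the PID is in the path of each matching line:

    grep VmSwap /proc/[0-9]*/status 2>/dev/null | sort -t: -k3 -n | tail   # biggest swap users last
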
Update 2

I mentioned mysql above, but I would be surprised if that's the culprit, honestly. I would suspect spamd, the cPanel processes, or web applications running within Apache first.

I have also been assuming that you're running a reasonably current distro without any tweaking of system tunables, and that you're current on security patches. There was a BIND exploit in the last few months that resulted in a DoS, but I cannot recall whether the exploit triggered memory exhaustion or something else. I have also read of cPanel exploits recently, but I don't know how current those were.
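
If this is a cPanel box it's almost certainly CentOS/RHEL, so (assuming yum) a quick sanity check on patch level is:

    rpm -qa --last | head        # most recently installed/updated packages
    yum check-update             # anything outstanding in the configured repos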
