High Load, low CPU, Memory and Disk IO - Highend Server

Question

This issue has been bugging me for the past several days; I have spent over 40 hours investigating it intensively.

We run Asterisk 1.4.42, which I understand is old. However, it is the last really stable Asterisk version that works with our upstream providers for fax, so upgrading is not an option.

Now the issue is, we have the following spec server:

Dell PowerEdge 1950

Quad Core Xeon E5420, 2.5 GHz

8 GB ECC RAM

4 x 73 GB SAS 10k RPM HDs

Dell PERC 5 RAID controller in RAID 10

CentOS 5.9 x64

Disk formatting: ext3

Now the problem is, we are seeing very high server load at 100 concurrent calls in Asterisk, and I cannot figure out why. I have another server of similar spec (Quad core2duo, RAID 1, 2 x 250 GB 7,200 RPM HDs and 8 GB non-ECC RAM) that is handling 200+ concurrent calls at a server load of about 0.3.

I am really at my wits' end with this and cannot figure it out.

I have attached screen shots of top and iotop results

The screen shots show low CPU usage, Low Memory usage and 0% wait time on Disk IO

top - http://chostwales.com/images/hosted/Super-load.jpg

iotop - http://chostwales.com/images/hosted/HighDISKIO.jpg

Any help/ideas would be really really appreciated on this.

To clarify, this is 100 concurrent calls with approximately 1 new call every second. (As mentioned above, I have servers of much lower spec doing 10 new calls every second, and the load is hardly budging.)

To clarify:

No call recording/monitoring
Transcoding is about 30% of the calls. (However, from my understanding this would show up as CPU usage.)
We are NOT running any PRIs

cat /proc/interrupts shows (No system utilisation currently)

[root@IS-21418 ~]# cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
  0:    7855099          0          0          0    IO-APIC-edge  timer
  1:          3          0          0          0    IO-APIC-edge  i8042
  8:          1          0          0          0    IO-APIC-edge  rtc
  9:          0          0          0          0   IO-APIC-level  acpi
 12:          4          0          0          0    IO-APIC-edge  i8042
 66:         24          0          0          0   IO-APIC-level  ehci_hcd:usb1, uhci_hcd:usb2, uhci_hcd:usb4
 74:         34     106102          0          0   IO-APIC-level  uhci_hcd:usb3, uhci_hcd:usb5
 82:       4143      50727          0          0   IO-APIC-level  megasas
 90:     123985          0          0          0         PCI-MSI  eth0
NMI:        435        195        209        215
LOC:    7852754    7851976    7852615    7851820
ERR:          0
MIS:          0

[root@IS-21418 ~]# vmstat 1 20
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd    free   buff  cache   si   so    bi    bo   in    cs us sy id wa st
 1  0      0 7318888  23108 296540    0    0   125    61 1169  2581  2  3 93  1  0
 0  0      0 7318708  23124 296524    0    0     8   280 9704 20440  7  6 87  0  0
 0  0      0 7318820  23140 296768    0    0   128   280 9144 19752  2  5 93  0  0
 0  0      0 7318820  23180 296728    0    0     0  1620 8162 16012  2  2 97  0  0
 0  0      0 7318940  23208 296760    0    0    12   392 9729 22355  3  5 92  0  0
 0  0      0 7318544  23216 296752    0    0     0   100 9679 20152  2  2 96  0  0
 0  0      0 7317852  23232 296836    0    0     8   332 9753 21294  8  9 84  0  0
 0  0      0 7317720  23240 296828    0    0     4   160 9702 22166  3  3 95  0  0
 0  0      0 7317612  23248 296908    0    0     0   192 9643 20168  1  4 95  0  0
 0  0      0 7317340  23256 296900    0    0     0   112 9043 19541  2  2 96  0  0
 0  0      0 7315860  23264 296944    0    0     4   156 9025 21814  3  4 92  0  0
 0  0      0 7315624  23288 297176    0    0   140   504 9221 19047  6  6 87  1  0
 0  0      0 7314872  23296 297140    0    0     4   112 9499 21123  3  8 89  0  0
 3  0      0 7314492  23344 297092    0    0     4  1784 9725 24151  5  6 88  0  0
 1  0      0 7314796  23352 297192    0    0     0   176 9624 22662  4  7 89  0  0
 3  0      0 7314556  23368 297176    0    0     4   220 9789 23502  5  6 88  0  0
 2  0      0 7313820  23384 297196    0    0     4   348 9531 23117 14 13 74  0  0
 1  0      0 7313468  23432 297148    0    0    12   504 9852 25504  6 11 83  0  0
 2  0      0 7313104  23440 297268    0    0     4   112 9610 26564  6  7 88  0  0
 0  0      0 7312364  23464 297244    0    0   128   356 9608 23673  5  8 87  0  0

Dmesg Link is below

Kind Regards

What is the actual problem? You report high load, which could be a useful clue to figure out what's causing your problem, but what is the actual problem? Is performance poor? — David Schwartz
– David Schwartz, Commented Jul 17, 2013 at 20:05
This is the problem: we don't know what's causing it. Obviously it gets much, much worse when we start calling. When idle, the server load can still be around 0.09 - 0.2. — TheMightY
– TheMightY, Commented Jul 17, 2013 at 20:08
@tomtom: I'll have you know the 1950 was the height of Dell's 9th generation product line! It was so good that after that, they changed the naming convention! Just because they're on the 13th generation now... — Satanicpuppy
– Satanicpuppy, Commented Jul 17, 2013 at 20:19
Maybe a hardware issue? Which protocol is used (SIP, IAX, ISDN, BRI, other)? How are your interrupts distributed (cat /proc/interrupts)? What's in your logs: kern.log and maybe asterisk.log... — F. Hauri - Give Up GitHub
– F. Hauri - Give Up GitHub, Commented Jul 17, 2013 at 20:41

Matt W · Accepted Answer · 2013-07-17 20:19:26Z

Things like this vary a lot. For instance, are you recording calls? If so, are you using Monitor or MixMonitor? Monitor is processed in the same thread as the call, MixMonitor in its own thread. And if you are recording, you probably have a solid disk hit. I mitigate some of this by turning off atime updates (the noatime mount option) in /etc/fstab.

One thing you can do to get an idea of what is going on in your system is to run vmstat. A simple vmstat 1 20 (20 samples at one-second intervals) will give you output to look at, and you can see what is eating at the CPU.
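
To make that output easier to read at a glance, here is a minimal Python sketch that averages a few vmstat columns. The sample rows are copied from the vmstat 1 20 output in the question; the names parse_vmstat and column_means are made up for this illustration, and the parsing assumes the classic 17-column layout shown there.

```python
# Sample rows copied from the `vmstat 1 20` output posted in the question.
VMSTAT_SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd    free   buff  cache   si   so    bi    bo   in    cs us sy id wa st
 0  0      0 7318708  23124 296524    0    0     8   280 9704 20440  7  6 87  0  0
 0  0      0 7318820  23140 296768    0    0   128   280 9144 19752  2  5 93  0  0
 0  0      0 7318820  23180 296728    0    0     0  1620 8162 16012  2  2 97  0  0
 2  0      0 7313820  23384 297196    0    0     4   348 9531 23117 14 13 74  0  0
"""


def parse_vmstat(text):
    """Return one dict per sample row, keyed by vmstat's column names."""
    lines = [line.split() for line in text.strip().splitlines()]
    # The column-name row is the one whose first two fields are "r" and "b".
    header_idx = next(i for i, f in enumerate(lines) if f[:2] == ["r", "b"])
    names = lines[header_idx]
    rows = []
    for fields in lines[header_idx + 1:]:
        # Keep only purely numeric rows that match the header width.
        if len(fields) == len(names) and all(f.isdigit() for f in fields):
            rows.append(dict(zip(names, map(int, fields))))
    return rows


def column_means(rows, columns):
    """Average the selected columns over all sample rows."""
    return {c: sum(r[c] for r in rows) / len(rows) for c in columns}


rows = parse_vmstat(VMSTAT_SAMPLE)
means = column_means(rows, ["r", "b", "in", "cs", "us", "sy", "id", "wa"])
for name, value in means.items():
    print(f"{name:>2}: {value:.1f}")
```

In these samples the CPU is mostly idle and wa is essentially zero, while the in (interrupts per second) and cs (context switches per second) columns are the ones that stand out, so those are the columns worth comparing against the healthy server.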

Another thing that you can do with Asterisk is remove modules you don't need by adding "noload =>" lines to modules.conf. Often, there are a lot. You'll just have to take some time to learn which modules you do and don't use, since all of them are autoloaded during startup.
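
As an illustration of what those lines look like, the following Python sketch prints a modules.conf fragment from a list of module names. The list itself is an assumption: chan_skinny.so, chan_mgcp.so and pbx_dundi.so are only examples of modules a SIP-only box might not need, and noload_lines is a helper invented for this sketch. Check which modules your own dialplan and channels depend on before disabling anything.

```python
# Hypothetical list: modules this particular box is ASSUMED not to need.
UNUSED_MODULES = ["chan_skinny.so", "chan_mgcp.so", "pbx_dundi.so"]


def noload_lines(modules):
    """Format one 'noload => <module>' line per module, de-duplicated, in input order."""
    seen = set()
    lines = []
    for module in modules:
        if module not in seen:
            seen.add(module)
            lines.append(f"noload => {module}")
    return lines


# Print a fragment suitable for pasting into modules.conf.
print("[modules]")
print("autoload=yes")
print("\n".join(noload_lines(UNUSED_MODULES)))
```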

One more thing to consider is transcoding. If you're accepting calls using the G.729A codec and your softphones/deskphones use G.711u, you're going to take a performance hit, as Asterisk has to transcode between those codecs and can't just perform packet-to-packet bridging.

To clarify: - No Call Recording/Monitoring - Transcoding is about 30% of the calls. (However this would be CPU from understanding) — TheMightY
– TheMightY, Commented Jul 17, 2013 at 20:30
Interrupts look kind of high. Are you running any PRIs in this system? I've run into issues before where I've started turning off serial and USB ports in the BIOS to cut them down. PRIs cause this to go a lot higher. — Matt W
– Matt W, Commented Jul 17, 2013 at 21:41
I have a very similar 1950 setup as yours and I can reach 200 calls easy without issue. The difference is I'm running 1.8. This leads me to believe that it may be Asterisk 1.4. While I understand you need it for faxing purposes, maybe it's time to split those operations onto different servers? That or see if there is some port available for the new code. — Matt W
– Matt W, Commented Jul 18, 2013 at 16:51
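
Following up on the interrupt questions in the comments, here is a rough Python sketch for checking how each interrupt source is spread across the CPUs. The sample rows are copied from the /proc/interrupts output in the question; parse_interrupts and busiest_cpu_share are names invented for this sketch. Keep in mind that the counters are cumulative since boot and that the asker noted the box was not under load when the sample was taken.

```python
# Rows copied from the /proc/interrupts output posted in the question.
INTERRUPTS_SAMPLE = """\
           CPU0       CPU1       CPU2       CPU3
  0:    7855099          0          0          0    IO-APIC-edge  timer
 74:         34     106102          0          0   IO-APIC-level  uhci_hcd:usb3, uhci_hcd:usb5
 82:       4143      50727          0          0   IO-APIC-level  megasas
 90:     123985          0          0          0         PCI-MSI  eth0
LOC:    7852754    7851976    7852615    7851820
"""


def parse_interrupts(text):
    """Return (cpu_names, {irq_label: (per_cpu_counts, description)})."""
    lines = text.strip("\n").splitlines()
    cpus = lines[0].split()
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = []
        for field in fields[:len(cpus)]:
            if not field.isdigit():
                break
            counts.append(int(field))
        # Rows such as ERR/MIS carry a single counter; skip them here.
        if len(counts) != len(cpus):
            continue
        table[label.strip()] = (counts, " ".join(fields[len(cpus):]))
    return cpus, table


def busiest_cpu_share(counts):
    """Fraction of this source's interrupts handled by its busiest CPU."""
    total = sum(counts)
    return max(counts) / total if total else 0.0


cpus, table = parse_interrupts(INTERRUPTS_SAMPLE)
for irq, (counts, description) in table.items():
    share = busiest_cpu_share(counts)
    print(f"{irq:>4} {description or '-':<45} busiest CPU handles {share:.0%}")
```

In the posted sample, every eth0 interrupt (IRQ 90) was counted on CPU0. Whether that matters for the load problem is not established in this thread, but it is an easy thing to compare against the server that handles 200+ calls without trouble.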

Stefan · Accepted Answer · 2013-07-17 20:02:42Z

I found Munin helpful for identifying bottlenecks. You can easily spot a limit when one graph does not scale like the others.

Thanks for this. I should have mentioned we use eLuna, and its graphs back up what we're seeing in the SSH sessions. — TheMightY
– TheMightY, Commented Jul 17, 2013 at 20:08
