High load average, low CPU usage - why?

Question

We're seeing huge performance problems on a web application and we're trying to find the bottleneck. I am not a sysadmin so there is some stuff I don't quite get. Some basic investigation shows the CPU to be idle, lots of memory to be available, no swapping, no I/O, but a high average load.

The software stack on this server looks like this:

Solaris 10
Java 1.6
WebLogic 10.3.5 (8 domains)

The applications running on this server talk with an Oracle database on a different server.

This server has 32GB of RAM and 10 CPUs (I think).

Running prstat -Z gives something like this:

 PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP 3836 ducm0101 2119M 2074M cpu348 58 0 8:41:56 0.5% java/225 24196 ducm0101 1974M 1910M sleep 59 0 4:04:33 0.4% java/209 6765 ducm0102 1580M 1513M cpu330 1 0 1:21:48 0.1% java/291 16922 ducm0102 2115M 1961M sleep 58 0 6:37:08 0.0% java/193 18048 root 3048K 2440K sleep 59 0 0:06:02 0.0% sa_comm/4 26619 ducm0101 2588M 2368M sleep 59 0 8:21:17 0.0% java/231 19904 ducm0104 1713M 1390M sleep 59 0 1:15:29 0.0% java/151 27809 ducm0102 1547M 1426M sleep 59 0 0:38:19 0.0% java/186 2409 root 15M 11M sleep 59 0 0:00:00 0.0% pkgserv/3 27204 root 58M 54M sleep 59 0 9:11:38 0.0% stat_daemon/1 27256 root 12M 8312K sleep 59 0 7:16:40 0.0% kux_vmstat/1 29367 root 297M 286M sleep 59 0 11:02:13 0.0% dsmc/2 22128 root 13M 6768K sleep 59 0 0:10:51 0.0% sendmail/1 22133 smmsp 13M 1144K sleep 59 0 0:01:22 0.0% sendmail/1 22003 root 5896K 240K sleep 59 0 0:00:01 0.0% automountd/2 22074 root 4776K 1992K sleep 59 0 0:00:19 0.0% sshd/1 22005 root 6184K 2728K sleep 59 0 0:00:31 0.0% automountd/2 27201 root 6248K 344K sleep 59 0 0:00:01 0.0% mount_stat/1 20964 root 2912K 160K sleep 59 0 0:00:01 0.0% ttymon/1 20947 root 1784K 864K sleep 59 0 0:02:22 0.0% utmpd/1 20900 root 3048K 608K sleep 59 0 0:00:03 0.0% ttymon/1 20979 root 77M 18M sleep 59 0 0:14:13 0.0% inetd/4 20849 daemon 2856K 864K sleep 59 0 0:00:03 0.0% lockd/2 17794 root 80M 1232K sleep 59 0 0:06:19 0.0% svc.startd/12 17645 root 3080K 728K sleep 59 0 0:00:12 0.0% init/1 17849 root 13M 6800K sleep 59 0 0:13:04 0.0% svc.configd/15 20213 root 84M 81M sleep 59 0 0:47:17 0.0% nscd/46 20871 root 2568K 600K sleep 59 0 0:00:04 0.0% sac/1 3683 ducm0101 1904K 1640K sleep 56 0 0:00:00 0.0% startWebLogic.s/1 23937 ducm0101 1904K 1640K sleep 59 0 0:00:00 0.0% startWebLogic.s/1 20766 daemon 5328K 1536K sleep 59 0 0:00:36 0.0% nfsmapid/3 20141 daemon 5968K 3520K sleep 59 0 0:01:14 0.0% kcfd/4 20093 ducm0101 2000K 376K sleep 59 0 0:00:01 0.0% pfksh/1 20797 daemon 3256K 240K sleep 59 0 0:00:01 0.0% statd/1 6181 root 4864K 2872K sleep 59 0 0:01:34 0.0% syslogd/17 7220 ducm0104 1268M 1101M sleep 59 0 0:36:35 0.0% java/138 27597 ducm0102 1904K 1640K sleep 59 0 0:00:00 0.0% startWebLogic.s/1 27867 root 37M 4568K sleep 59 0 0:13:56 0.0% kcawd/7 12685 ducm0101 4080K 208K sleep 59 0 0:00:01 0.0% vncconfig/1 ZONEID NPROC SWAP RSS MEMORY TIME CPU ZONE 42 135 22G 19G 59% 87:27:59 1.2% dsuniucm01 Total: 135 processes, 3167 lwps, load averages: 54.48, 62.50, 63.11

I understand that CPU is mostly idle, but the load average is high, which is quite strange to me. Memory doesn't seem to be a problem.

Running vmstat 15 gives something like this:

 kthr memory page disk faults cpu r b w swap free re mf pi po fr de sr s0 s1 s4 sd in sy cs us sy id 0 0 0 32531400 105702272 317 1052 126 0 0 0 0 13 13 -0 8 9602 107680 10964 1 1 98 0 0 0 15053368 95930224 411 2323 0 0 0 0 0 0 0 0 0 23207 47679 29958 3 2 95 0 0 0 14498568 95801960 3072 3583 0 2 2 0 0 3 3 0 21 22648 66367 28587 4 4 92 0 0 0 14343008 95656752 3080 2857 0 0 0 0 0 3 3 0 18 22338 44374 29085 3 4 94 0 0 0 14646016 95485472 1726 3306 0 0 0 0 0 0 0 0 0 24702 47499 33034 3 3 94

I understand that the CPU is mostly idle, no processes are waiting in the queue to be executed, little swapping is happening.

Running iostat 15 gives this:

 tty sd0 sd1 sd4 ssd0 cpu tin tout kps tps serv kps tps serv kps tps serv kps tps serv us sy wt id 0 676 324 13 8 322 13 8 0 0 0 159 8 0 1 1 0 98 1 1385 0 0 0 0 0 0 0 0 0 0 0 0 3 4 0 94 0 584 89 6 24 89 6 25 0 0 0 332 19 0 2 1 0 97 0 296 0 0 0 0 0 0 0 0 0 0 0 0 2 2 0 97 1 1290 43 5 24 43 5 22 0 0 0 297 20 1 3 3 0 94

Running netstat -i 15 gives the following:

 input aggr26 output input (Total) output packets errs packets errs colls packets errs packets errs colls 1500233798 0 1489316495 0 0 3608008314 0 3586173708 0 0 10646 0 10234 0 0 26206 0 25382 0 0 11227 0 10670 0 0 28562 0 27448 0 0 10353 0 9998 0 0 29117 0 28418 0 0 11443 0 12003 0 0 30385 0 31494 0 0

What am I missing?

I'm not at home with Solaris, so I'll defer to someone else for this, but I'd start looking at your web server config. Perhaps something is artificially gating performance in such a way as to leave lots of threads in the run queue. (Not sure what that could be or even if it's possible, though). Kudos for a well written question, though. — SmallClanger
– SmallClanger, Commented Feb 29, 2012 at 23:27
10 CPUs (I think) is possibly the issue. You should know more precisely what hardware you are running before investigating further. Use psrinfo -v to display the actual number of CPUs. — jlliagre
– jlliagre, Commented Mar 1, 2012 at 9:04
I've never heard of this command, but when running it it looks like there's around 250 virtual processors. Does that even make sense? In that case a load average of 50 would be insignificant? — Spiff
– Spiff, Commented Mar 1, 2012 at 13:43
I think this can also happen when your disk is full. I had this today with 1% free space on / and load kept increasing until over 19.00 with no visible reason. Making some space free solved the problem (shortly after it came down); can also be a coincidence though. — nh2
– nh2, Commented Mar 19, 2016 at 19:48

Spiff · Accepted Answer · 2012-03-02 15:15:48Z

With some further investigation, it appears that the performance problem is mostly due to a high number of network calls between two systems (Oracle SSXA and UCM). The calls are quick but plenty and serialized, hence the low CPU usage (mostly waiting for I/O), the high load average (many calls waiting to be processed) and especially the long response times (by accumulation of small response times).

Thanks for your insight on this problem!

how did you confirm and figured this out? We are seeing same issue and would like to check if we have the same problem — hobgoblin
– hobgoblin, Commented Nov 2, 2018 at 20:31

webtoe · Accepted Answer · 2012-03-01 00:36:04Z

When you say 'High Load average' I assume you mean that prstat shows for 'load average' at the bottom of the output figures of

Total: 135 processes, 3167 lwps, load averages: 54.48, 62.50, 63.11

These numbers, look similar to the ones that top provides and probably mean the average queue size of running process. This isn't the percentage of processor time being used but how many 'things' are harassing the CPU for time to run. Admittedly, these do look quite high but this all depends on the app that you are running; the processes may not actually be doing much once they get their slot. See here for a nice explanation regarding top.

I'm not familiar with WebLogic but I have noticed that, generally, with Apache Tomcat many Java threads can be spawned simultaneously for what appears as not many requests. It could be this that is causing those high average load numbers. Make sure that you are using connection pooling where appropriate to connect to the backend and consider upping the number of idle threads that are available to your app to handle connections (not sure how you do this on WebLogic; Tomcat has a per connector thread pool or a general executor thread pool). If you don't do this then brand new threads may be being spawned to process requests.

As to performance, you need to nail down what part of your app is suffering. Is it the processing that is happening in the WebLogic/Java side of things, the database access, DNS lookups (if they're being done for some reason...), network issues or something on the OS.

99% of the time it will be your code and how it talks to the database that is holding things up. Then it will be configuration of the web app. Past this point you will be working on squeezing the last milliseconds out of your app or looking at providing higher concurrency with the same hardware. For this finer grained performance tuning you need metrics.

For Java I'd suggest installing Java Melody. It can provide a lot of info regarding what your program is doing and help narrow down where it is spending time. I've only used it with Tomcat but should work fine with any Java EE container/servlet thingy.

There are a number of ways you can tune Java, so take a look at their performance guidelines (I'm sure you probably have) and make sure you're setting the correct Heap Size etc. suitable for your program. Java Melody can help you track down the size of Java's heap you're consuming as well as how hard the garbage collector is working/how often it is interrupting your program to clear objects.

I hope that has been helpful. If you provide any more information, I may be able to update this answer and hone it more towards your needs.

Thanks for your answer, if my rep were high enough I would upvote it. From my experience code or SQL queries are usually the culprit. I did a few profiling runs and couldn't find any hot spot, which is why I started looking at more fundamental factors. I will investigate some more and update the question as I find more. — Spiff
– Spiff, Commented Mar 1, 2012 at 14:02
I would also check the output of 'mpstat 1 5' to view the per-processor statistics and look at the "csw" and "syscl" columns. From your vmstat above it looks like you are doing quite a lot of system calls and context switches, which would seem to validate webtoe's suspicion that you have a lot of threads (Solaris calls them LWPs- LightWeight Processes) constantly harassing the CPU. None of them are doing very much when they are running but many are spending time waiting to run, hence the high load averages. — eirescot
– eirescot, Commented Mar 1, 2012 at 22:43

rogerdpack · Accepted Answer · 2018-01-25 16:03:31Z

As a side note, load average also includes things waiting for disk activity (i.e. harassing the disk) as well as those waiting for cpu, it's a sum of both...so you might have problems in one or the other.

See http://en.wikipedia.org/wiki/Load_(computing) "Linux also includes [in its load average] processes in uninterruptible sleep states (usually waiting for disk activity)"

As a side note, the particular problem I ran into was that I had high load average, but also lots of idle cpu and low disk usage.

It appears that, at least in my case, sometimes threads/processes waiting for I/O show up in the load average, but do not cause an increase in the "await" column. But they're still I/O bound.

You can tell that this is the case with the following code, if you run it in jruby (just does 100 threads with lots of I/O each):

100.times { Thread.new { loop { File.open('big', 'w') do |f| f.seek 10_000_000_000; f.puts 'a'; end}}}

Which gives a top output like this:

top - 17:45:32 up 38 days, 2:13, 3 users, load average: 95.18, 50.29, 23.83 Tasks: 181 total, 1 running, 180 sleeping, 0 stopped, 0 zombie Cpu(s): 3.5%us, 11.3%sy, 0.0%ni, 85.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 32940904k total, 23239012k used, 9701892k free, 983644k buffers Swap: 34989560k total, 0k used, 34989560k free, 5268548k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 31866 packrd 18 0 19.9g 12g 11m S 117.0 41.3 4:43.85 java 912 root 11 -5 0 0 0 S 2.0 0.0 1:40.46 kjournald

So you can see that it has lots of idle cpu, 0.0%wa, but a very high load average.

iostat similarly shows the disk as basically idle:

avg-cpu: %user %nice %system %iowait %steal %idle 9.62 0.00 8.75 0.00 0.00 81.62 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util sda 0.00 49.00 0.00 6.40 0.00 221.60 69.25 0.01 0.81 0.66 0.42 sda1 0.00 49.00 0.00 6.40 0.00 221.60 69.25 0.01 0.81 0.66 0.42 sda2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

see also http://linuxgazette.net/141/misc/lg/tracking_load_average_issues.html

As a further side note, this also seems to imply that (at least in this case--running CentOS) the load average includes each thread separately in the total.

"load average also includes things waiting for disk activity" on Linux, while this question was originally about Solaris, which appears to only include running and runnable (i.e. waiting for CPU) tasks in load average. One Linux version of this question is this. — Nickolay
– Nickolay, Commented Sep 12, 2018 at 11:49
In my case CPU usage is extremely low and no IO wait too. How to identify processes/threads that is I/O bound without await but contributing to system load? — goodfella
– goodfella, Commented May 22 at 7:34

BinaryMisfit · Accepted Answer · 2013-01-23 18:29:49Z

Had the same problem today. After some research and diagnoses I realised that my small VPS was running out of disk.

In shell/prompt (Linux/Unix)type

df -h

to see the disk free on your machine. If you are running out of disk that can be the problem/issue.

were you swapping then, I presume, so that was causing it?

rogerdpack
– rogerdpack

2013-07-19 16:33:13 +00:00
Commented Jul 19, 2013 at 16:33 — rogerdpack
– rogerdpack, Commented Jul 19, 2013 at 16:33

StackzOfZtuff · Accepted Answer · 2022-10-04 22:08:54Z

Try nmon

Another useful tool that will help in this situation is nmon.

It includes a variety of ways to view the same data presented by the other tools, in one little package.

If this is content that cannot be cached I would recommend placing multiple servers behind a load balancer such as haproxy in tcp mode to distribute the load.

cachonfinga · Accepted Answer · 2016-06-23 14:20:22Z

Just to add to this, some Solaris specific tools which haven't been mentioned which are useful in debugging such issues are "intrstat", "mpstat" and "lockstat". Having experienced a similar issue before on a host running some heavy ETL loads mpstat revealed a high amount of interrupts dealing with lots of I/O which hinted at the issue.

At the time, on a T4-4 with mpstat we saw vcpus handing in excess of 30000 interrupts over short monitoring cycle, after which performance started to suffer. In this case the only workaround was to throw more CPU at it however, work was subsequently carried out to improve the code.

Brendan Gregg has written a lot about performance, particularly around I/O over the years and is worth a search should you want to find out more.

Stack Exchange Network

High load average, low CPU usage - why?

6 Answers 6

Try nmon

You must log in to answer this question.

Linked

Hot Network Questions

High load average, low CPU usage - why?

6 Answers 6

Try nmon

You must log in to answer this question.

Linked

Related

Hot Network Questions