
I'm facing high load spikes on a Linux server (Ubuntu 18.04, 16 cores, 8 GB RAM): it is a web server running Apache 2.4, php7.2-fpm and memcached, with no database services (those are provided by a different server).

These spikes are very sudden and last only a few seconds; this is what top shows:

At first I thought it was caused by abnormal CPU usage but, as shown in the picture, no process is making heavy use of the CPU and RAM usage is relatively low (25%). Instead the problem seems to be wa (I/O wait), which is very high!

I've tried to investigate whether it is caused by a storage I/O problem using iotop, but it seems that when these spikes happen there is no increase in I/O reads or writes.

These spikes happen many times a day and, after some monitoring, I noticed that they happen at specific times. I checked whether any processes are scheduled to run at those times but haven't found anything.

I know that high load can also be caused by high bandwidth usage, but how can I monitor this?

I'm not very experienced with iostat or sar; are they "the right way" to troubleshoot my problem?
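
For reference, this is the kind of sampling I was thinking of trying (a sketch, assuming the sysstat package is installed):

 # Extended per-device I/O statistics every second (await, %util, queue size)
 iostat -x 1
 # Per-interface network throughput every second (rxkB/s, txkB/s)
 sar -n DEV 1
 # CPU breakdown including %iowait, every second
 sar -u 1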

As Matthew Ife suggested, I've tried ps -ALo pid,tid,comm,wchan during a high-load moment and it gives some interesting output:

 35227 35227 kworker/12:2   worker_thread
 53730 53730 kworker/7:0    worker_thread
 57306 57306 php-fpm7.2     poll_schedule_timeout
 57348 57348 php-fpm7.2     poll_schedule_timeout
 57988 57988 php-fpm7.2     poll_schedule_timeout
 58251 58251 php-fpm7.2     poll_schedule_timeout
 60181 60181 kworker/5:2    worker_thread
 62158 62158 kworker/2:1    worker_thread
 62169 62169 php-fpm7.2     poll_schedule_timeout
 65001 65001 php-fpm7.2     poll_schedule_timeout
 69262 69262 php-fpm7.2     poll_schedule_timeout
 69647 69647 kworker/6:1    worker_thread
 72110 72110 php-fpm7.2     call_rwsem_down_write_failed
 72638 72638 php-fpm7.2     skb_wait_for_more_packets
 72845 72845 php-fpm7.2     call_rwsem_down_write_failed
 72848 72848 php-fpm7.2     call_rwsem_down_write_failed
 72850 72850 php-fpm7.2     skb_wait_for_more_packets
 72892 72892 php-fpm7.2     skb_wait_for_more_packets
 72909 72909 kworker/u256:2 worker_thread
 72940 72940 kworker/u256:4 get_write_access
 73353 73353 top            poll_schedule_timeout
 73367 73367 php-fpm7.2     locks_lock_inode_wait
 73659 73659 php-fpm7.2     call_rwsem_down_write_failed
 73950 73950 php-fpm7.2     skb_wait_for_more_packets
 73953 73953 php-fpm7.2     call_rwsem_down_write_failed
 74259 74259 php-fpm7.2     skb_wait_for_more_packets
 74345 74345 php-fpm7.2     skb_wait_for_more_packets
 74436 74436 kworker/13:1   worker_thread
 74481 74481 kworker/u256:1 get_write_access
 74519 74519 php-fpm7.2     skb_wait_for_more_packets
 74522 74522 php-fpm7.2     skb_wait_for_more_packets
 74576 74576 php-fpm7.2     call_rwsem_down_write_failed
 74578 74578 php-fpm7.2     skb_wait_for_more_packets
 74603 74603 php-fpm7.2     locks_lock_inode_wait
 74849 74849 php-fpm7.2     skb_wait_for_more_packets
 75085 75085 php-fpm7.2     skb_wait_for_more_packets
 75088 75088 php-fpm7.2     call_rwsem_down_write_failed
 75100 75100 php-fpm7.2     call_rwsem_down_write_failed
 75171 75171 php-fpm7.2     skb_wait_for_more_packets
 75283 75283 kworker/2:2    worker_thread
 ....

Does this mean that there's some problem with the filesystem or storage?

  • IO wait does not cause load - until the system is in a really bad state. Commented Feb 27, 2024 at 9:26
  • IOWait doesn't always directly mean I/O; sometimes it reflects other "kernel thread is busy" scenarios. While you see it in this state, try running ps -ALo pid,tid,comm,wchan and provide the output. If you're using NFS it may relate to that. Commented Feb 27, 2024 at 9:31
  • Do you have any NFS mounts? Commented Feb 27, 2024 at 9:31
  • High 5min LA and high wait indicate that there were a lot of processes in an uninterruptible state. Run top and add a filter (press o) with S=D to see these stuck processes; a non-interactive equivalent is sketched after these comments. Commented Feb 27, 2024 at 9:44
  • @AlexD: no NFS mounts. Commented Feb 27, 2024 at 10:21
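
A non-interactive way to get the same picture (a sketch; D in the first ps column marks uninterruptible sleep):

 # List tasks currently in uninterruptible sleep (state D) together with their kernel wait channel
 ps -eo state,pid,comm,wchan:32 | grep '^D'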

1 Answer


A few things pop out to me:

 kworker/u256:1 get_write_access
 php-fpm7.2     call_rwsem_down_write_failed
 php-fpm7.2     locks_lock_inode_wait

What this is telling you is that PHP is doing some file locking using fcntl or flock.
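
As a rough illustration of the kind of blocking this implies (a sketch using the util-linux flock command rather than PHP; /tmp/example.lock is just a placeholder path), the second invocation sleeps in the kernel until the first releases the lock, and its wait channel typically shows up as locks_lock_inode_wait, just like the php-fpm entries above:

 # Holder: take an exclusive lock on the file and keep it for 60 seconds
 flock --exclusive /tmp/example.lock -c 'sleep 60' &
 # Contender: blocks until the lock above is released; while blocked,
 # ps -o wchan shows the file-lock wait path for this process
 flock --exclusive /tmp/example.lock -c 'echo "lock acquired"'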

If you catch it again you can try doing something like this to print out the entire kernel stack for the stuck process(es). It often yields more information pertaining to the stuck filesystem in the backtrace.

for p in $(ps -ho pid,wchan $(pgrep php-fpm7.2) | grep -E 'locks_lock_inode_wait|call_rwsem_down_write_failed' | awk '{print $1}'); do
  echo $p
  cat /proc/$p/stack
  echo
done

Also, when this happens, run the lslocks command to get a view of what files are being locked and where.
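
For example (a sketch; the grep pattern assumes the worker processes show up as php-fpm, adjust it to your pool name):

 # Show current file locks; keep the header row, then filter for php-fpm
 lslocks | head -n 1
 lslocks | grep php-fpm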

There isn't sufficient information in this post as of yet to determine what is happening or why this is broken. It's possible you've somehow managed to open a bajillion locks or something bizarre.

It's also possibly related to the filesystem itself having a bug with file locking. My bet, however, is that it's some remote filesystem or even some FUSE filesystem that is mega slow.
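
A quick way to check that hypothesis (a sketch; the pattern just matches common network and FUSE filesystem types):

 # Flag any mounted filesystem that is network- or FUSE-backed
 mount | grep -Ei 'nfs|cifs|smb|fuse'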

Alternatively you've hit upon a kernel bug to do with locking -- this is the type of thing that (typically) gets caught, so you might just find that simply updating to the latest kernel and rebooting fixes the problem and squashes the bug.
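
On Ubuntu 18.04 that would look roughly like this (a sketch; linux-generic is assumed to be the kernel metapackage in use, check which flavour you actually run):

 # Check the running kernel, then upgrade to the latest packaged kernel and reboot
 uname -r
 sudo apt-get update && sudo apt-get install --only-upgrade linux-generic
 sudo reboot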

  • thanks, I'll try these checks hoping to find how to solve the problem Commented Feb 28, 2024 at 9:19
  • Agree with looking into /proc/$pid/stack but I think pgrep nginx should be pgrep php-fpm7.2 Commented Feb 28, 2024 at 9:37
  • Yes, I don't run fpm so I used nginx to prototype :). Corrected it now. Commented Feb 28, 2024 at 11:11
