So I think my server might be suffering a Denial of Service attack.
We were notified by Pingdom (website monitoring) that our website was unavailable starting around 3AM. Earlier today we started checking the Apache error logs and saw a whole bunch of this error:
AH00485: scoreboard is full, not at MaxRequestWorkers
We also saw that our PHP-FPM process pool frequently needed to spawn more servers:
[pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 8 children
We tried increasing MaxRequestWorkers in the Apache conf and some other remedies, but these would not rid us of the scoreboard error in the Apache error log. So, against my better judgement, I followed the advice in this thread and set MinSpareThreads and MaxSpareThreads equal to MaxRequestWorkers. These changes appear to have removed the scoreboard error.
I also greatly increased MaxRequestWorkers because we have a lot of RAM that evidently isn't being utilized. Our server has 8 cores and, despite these really high config values, doesn't seem to be using much of its RAM at all:
    $ free -h
                  total        used        free      shared  buff/cache   available
    Mem:           7.8G        1.8G        2.0G         38M        4.0G        5.8G
    Swap:            0B          0B          0B

I'm pretty nervous about these high values for MaxRequestWorkers in the apache conf and pm.max_children in php-fpm configuration.
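One rough way to sanity-check pm.max_children against the memory shown above is to divide the usable memory by the size of a single PHP-FPM child. This is only a sketch of that common rule of thumb: the per-child size below is a hypothetical placeholder (it is not something I measured), the 20% reserve is an arbitrary assumption, and `max_children_for` is a made-up helper name.

```python
def max_children_for(available_mib, per_child_mib, reserve_percent=20):
    """Rough upper bound on PHP-FPM children that fit in memory.

    reserve_percent is held back for Apache, the OS and page cache.
    """
    if per_child_mib <= 0:
        raise ValueError("per_child_mib must be positive")
    usable_mib = available_mib * (100 - reserve_percent) // 100
    return usable_mib // per_child_mib


# The 5.8G "available" figure from the free -h output, expressed in MiB.
AVAILABLE_MIB = int(5.8 * 1024)

# HYPOTHETICAL per-child resident size in MiB; replace with a measured value.
ASSUMED_CHILD_MIB = 40

print(max_children_for(AVAILABLE_MIB, ASSUMED_CHILD_MIB))
```

The result is only as good as the per-child figure, so it would need to be replaced with the real resident size of the php-fpm workers (as reported by ps or top) before comparing it with pm.max_children.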
Here's the basic config in mpm_event.conf
    <IfModule mpm_event_module>
        StartServers             2
        MinSpareThreads          800
        MaxSpareThreads          800
        ThreadLimit              64
        ThreadsPerChild          25
        ServerLimit              800
        MaxRequestWorkers        800
        MaxConnectionsPerChild   0
    </IfModule>

Here are some settings in a php-fpm conf file:
    pm.max_children = 256
    pm.start_servers = 64
    pm.min_spare_servers = 64
    pm.max_spare_servers = 128

Here's some basic server info:
    Server version: Apache/2.4.18 (Ubuntu)
    Server built:   2019-10-08T13:31:25
    Server's Module Magic Number: 20120211:52
    Server loaded:  APR 1.5.2, APR-UTIL 1.5.4
    Compiled using: APR 1.5.2, APR-UTIL 1.5.4
    Architecture:   64-bit
    Server MPM:     event
      threaded:     yes (fixed thread count)
        forked:     yes (variable process count)

And here's some of the data from the apache server-status output:
    Server Version: Apache/2.4.18 (Ubuntu) OpenSSL/1.0.2g
    Server MPM: event
    Server Built: 2019-10-08T13:31:25
    Current Time: Friday, 10-Jan-2020 22:58:55 CST
    Restart Time: Friday, 10-Jan-2020 22:26:32 CST
    Parent Server Config. Generation: 1
    Parent Server MPM Generation: 0
    Server uptime: 32 minutes 22 seconds
    Server load: 4.69 5.06 5.12
    Total accesses: 78434 - Total Traffic: 1.5 GB
    CPU Usage: u2970.53 s5037.34 cu0 cs0 - 412% CPU load
    40.4 requests/sec - 0.8 MB/second - 19.7 kB/request
    797 requests currently being processed, 3 idle workers

    PID    Connections          Threads        Async connections
           total   accepting    busy   idle    writing   keep-alive   closing
    6124   28      yes          25     0       0         0            3
    6125   27      yes          25     0       0         0            2
    6182   30      yes          25     0       0         1            4
    6210   28      yes          25     0       0         0            3
    6211   29      yes          25     0       0         0            5
    6266   28      yes          25     0       0         2            1
    6267   25      yes          25     0       0         0            1
    6269   28      no           24     1       0         1            3
    6276   28      yes          25     0       0         0            3
    6378   28      yes          25     0       0         0            3
    6379   31      no           24     1       0         4            3
    6380   27      yes          25     0       0         0            3
    6384   26      yes          25     0       0         0            2
    6397   28      yes          25     0       0         2            1
    6405   27      yes          25     0       0         0            2
    6414   26      yes          25     0       0         1            0
    6423   27      no           24     1       0         1            1
    6602   27      yes          25     0       0         0            3
    6603   28      yes          25     0       0         0            4
    6604   26      yes          25     0       0         0            1
    6617   30      yes          25     0       0         0            5
    6646   26      yes          25     0       0         0            2
    6676   27      yes          25     0       0         0            2
    6694   30      yes          25     0       0         0            5
    6705   28      yes          25     0       0         0            3
    6730   29      yes          25     0       0         0            4
    6765   29      yes          25     0       0         0            4
    6781   27      yes          25     0       0         0            2
    6805   28      yes          25     0       0         0            4
    6836   28      yes          25     0       0         0            3
    6858   27      yes          25     0       0         0            3
    6859   27      no           25     0       0         1            1
    Sum    888                  797    3       0         13           86

The worker mode part is the most disconcerting. Almost every single one is in read mode:
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR RRRRRRR_RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR _RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRWRRRRRRRRRRRRRRRRRRRRRRRRRRRR RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR_RRRRRRRRRRRRRRRRRRRRRRRRRRRR RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR And at the end there's this:
    SSL/TLS Session Cache Status:
    cache type: SHMCB, shared memory: 512000 bytes, current entries: 2176
    subcaches: 32, indexes per subcache: 88
    time left on oldest entries' objects: avg: 220 seconds, (range: 197...243)
    index usage: 77%, cache usage: 99%
    total entries stored since starting: 60122
    total entries replaced since starting: 0
    total entries expired since starting: 0
    total (pre-expiry) entries scrolled out of the cache: 57946
    total retrieves since starting: 3405 hit, 59594 miss
    total removes since starting: 0 hit, 0 miss

And netstat shows some 3000+ connections to port 80 and port 443:
    $ netstat -n | egrep ":80|443" | wc -l
    3715

What the heck is going on? The server has been running fine for months with much more modest configuration settings. Something seems to have abruptly changed last night around 3AM.
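A single total doesn't say much about who is connecting or in what TCP state, and a loose pattern such as ":80|443" also matches things like local port 8080 or a remote port that happens to contain 443. A minimal sketch of a stricter breakdown, assuming the usual Linux `netstat -n` column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State); `summarize_netstat` is a made-up helper name and the sample lines use documentation-only IP ranges, not real traffic:

```python
from collections import Counter

WEB_PORTS = {"80", "443"}


def summarize_netstat(lines):
    """Count TCP connections to local ports 80/443 by state and by remote host."""
    states = Counter()
    remotes = Counter()
    for line in lines:
        fields = line.split()
        # Skip headers, UDP/unix sockets and anything without a State column.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, foreign, state = fields[3], fields[4], fields[5]
        # Compare the local port exactly rather than by substring.
        if local.rsplit(":", 1)[-1] not in WEB_PORTS:
            continue
        states[state] += 1
        remotes[foreign.rsplit(":", 1)[0]] += 1
    return states, remotes


# HYPOTHETICAL sample input, for illustration only.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 10.0.0.5:443            203.0.113.7:51234       ESTABLISHED
tcp        0      0 10.0.0.5:80             203.0.113.7:51240       ESTABLISHED
tcp        0      0 10.0.0.5:8080           198.51.100.9:40000      ESTABLISHED
tcp        0      0 10.0.0.5:443            198.51.100.9:44300      TIME_WAIT
tcp        0      0 10.0.0.5:22             198.51.100.9:44322      ESTABLISHED
"""

states, remotes = summarize_netstat(SAMPLE.splitlines())
print(states.most_common())
print(remotes.most_common(10))
```

Feeding it the real `netstat -n` output (for example via `sys.stdin`) would show whether the connections are spread across many clients or concentrated on a handful of addresses, which is one of the things worth knowing before calling it a DoS.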
Any guidance would be much appreciated. I searched here first and found this other thread but it's a different version of apache running in prefork mode instead of event like mine. I also don't understand how the little bit of information in that thread led to a SlowLoris diagnosis.
EDIT: It would appear I have to phrase my questions more precisely:
1) How can I restore my server's responsiveness? Clearly, the apache workers getting stuck in R mode is a symptom of some problem.
2) Is there some reliable series of steps I can take to more specifically identify the actual problem?
3) Is there any way to confirm that the machine is under a DoS attack?
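As a small starting point for question 2, the scoreboard string from server-status can be tallied per state instead of eyeballed, so the numbers can be compared between snapshots. A minimal sketch, assuming the standard scoreboard key printed on the mod_status page; `tally_scoreboard` is a made-up helper name:

```python
from collections import Counter

# Scoreboard key as printed on the mod_status page.
SCOREBOARD_KEY = {
    "_": "Waiting for Connection",
    "S": "Starting up",
    "R": "Reading Request",
    "W": "Sending Reply",
    "K": "Keepalive (read)",
    "D": "DNS Lookup",
    "C": "Closing connection",
    "L": "Logging",
    "G": "Gracefully finishing",
    "I": "Idle cleanup of worker",
    ".": "Open slot with no current process",
}


def tally_scoreboard(scoreboard):
    """Return {state description: count}, ignoring whitespace and unknown characters."""
    counts = Counter(ch for ch in scoreboard if ch in SCOREBOARD_KEY)
    return {SCOREBOARD_KEY[ch]: n for ch, n in counts.most_common()}


# One 64-slot row shaped like those in the scoreboard above: a single idle worker.
row = "RRRRRRR_" + "R" * 56
for state, count in tally_scoreboard(row).items():
    print(f"{count:4d}  {state}")
```

Taking a tally every few seconds would show whether the workers in "Reading Request" ever drain, or whether they stay pinned there until a timeout frees them.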
egrep ":80|443" should be egrep ":80|:443"