In brief: the current Elastic Beanstalk log rotation configuration appears to be broken, causing service downtime in the form of 504 Gateway Timeout responses. Let's take a look.
Reproduction
We create the simplest possible Python WSGI application.
application.py
import time

def application(environ, start_response):
    # somewhat realistic response duration
    time.sleep(0.5)
    status = '200 OK'
    response_headers = [('Content-type', 'text/plain')]
    start_response(status, response_headers)
    return [b'Hello world!\n']
Zip it into application.zip. Then create an Elastic Beanstalk Python application and environment, and upload the archive. Make sure you use a key pair that you possess. Leave the other settings at their defaults. Wait until it's done (several minutes).
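(If you'd rather script the packaging step than zip by hand, a minimal sketch with Python's standard zipfile module could look like the following; it simply assumes application.py sits in the current directory.)

import zipfile

# build application.zip with application.py at the archive root, which is where
# Elastic Beanstalk's default WSGIPath expects to find it
with zipfile.ZipFile('application.zip', 'w', zipfile.ZIP_DEFLATED) as archive:
    archive.write('application.py')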
ssh into the underlying EC2 instance (the instance identifier is in EB's log) and type the following (this is httpd's logrotate post-action, see below):
sudo /sbin/service httpd reload
Then on your machine run:
siege -v -b -c 10 -t 10S http://your-test-eb.you-aws-region.elasticbeanstalk.com/
And while it runs, repeat the reload command a couple of times.
You should see something like the following:
** SIEGE 3.0.8
** Preparing 10 concurrent users for battle.
The server is now under siege...
HTTP/1.1 200 0.63 secs: 13 bytes ==> GET /
HTTP/1.1 200 0.65 secs: 13 bytes ==> GET /
HTTP/1.1 200 0.64 secs: 13 bytes ==> GET /
HTTP/1.1 200 0.60 secs: 13 bytes ==> GET /
...
Here's what happens when you run reload.
HTTP/1.1 504 0.06 secs: 0 bytes ==> GET /
HTTP/1.1 504 0.07 secs: 0 bytes ==> GET /
HTTP/1.1 504 0.08 secs: 0 bytes ==> GET /
HTTP/1.1 504 0.10 secs: 0 bytes ==> GET /
HTTP/1.1 504 0.11 secs: 0 bytes ==> GET /
HTTP/1.1 504 0.66 secs: 0 bytes ==> GET /
HTTP/1.1 504 0.19 secs: 0 bytes ==> GET /
HTTP/1.1 504 0.20 secs: 0 bytes ==> GET /
HTTP/1.1 504 0.09 secs: 0 bytes ==> GET /
Then it recovers.
HTTP/1.1 200 1.25 secs: 13 bytes ==> GET /
HTTP/1.1 200 1.24 secs: 13 bytes ==> GET /
HTTP/1.1 200 1.26 secs: 13 bytes ==> GET /
...
Lifting the server siege.. done.

Transactions: 75 hits
Availability: 81.52 %
Elapsed time: 9.40 secs
Data transferred: 0.00 MB
Response time: 1.21 secs
Transaction rate: 7.98 trans/sec
Throughput: 0.00 MB/sec
Concurrency: 9.68
Successful transactions: 75
Failed transactions: 17
Longest transaction: 4.27
Shortest transaction: 0.06
Note that ELB doesn't seem to have any effect on the problem; the same can be reproduced with two SSH sessions to the underlying EC2 instance and ab (the Amazon AMI doesn't have siege):
ab -v 4 -c 10 -t 10 http://your-test-eb.you-aws-region.elasticbeanstalk.com/
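If neither siege nor ab is handy where you're testing, a rough stand-in can be scripted in Python. The sketch below is only illustrative: the URL, worker count and duration are placeholders mirroring the siege invocation above, and it simply tallies response status codes so the burst of 504s becomes visible.

import collections
import threading
import time
import urllib.error
import urllib.request

URL = 'http://your-test-eb.you-aws-region.elasticbeanstalk.com/'
DURATION = 10   # seconds, mirrors siege's -t 10S
WORKERS = 10    # concurrent clients, mirrors siege's -c 10

counts = collections.Counter()
lock = threading.Lock()

def worker(deadline):
    # keep issuing GETs until the deadline and record each status code
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(URL, timeout=30) as response:
                status = response.status
        except urllib.error.HTTPError as error:
            status = error.code          # 4xx/5xx responses raise HTTPError
        except OSError:
            status = 'connection error'
        with lock:
            counts[status] += 1

deadline = time.time() + DURATION
threads = [threading.Thread(target=worker, args=(deadline,)) for _ in range(WORKERS)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print(dict(counts))   # per-status-code tally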
Cause
/etc/cron.hourly/cron.logrotate.elasticbeanstalk.httpd.conf
#!/bin/sh
test -x /usr/sbin/logrotate || exit 0
/usr/sbin/logrotate /etc/logrotate.elasticbeanstalk.hourly/logrotate.elasticbeanstalk.httpd.conf
/etc/logrotate.elasticbeanstalk.hourly/logrotate.elasticbeanstalk.httpd.conf
/var/log/httpd/* {
  size 10M
  missingok
  notifempty
  rotate 5
  sharedscripts
  compress
  dateext
  dateformat -%s
  create
  postrotate
    /sbin/service httpd reload > /dev/null 2>/dev/null || true
  endscript
  olddir /var/log/httpd/rotated
}
Notice the postrotate script. /sbin/service is just a System V wrapper for the scripts in /etc/init.d/. Its man page says:

service runs a System V init script in as predictable environment as possible, removing most environment variables and with current working directory set to /.
Note that reload is not a standard Apache maintenance command; it's the distro's downstream addition. Let's look at the init script, /etc/init.d/httpd. The relevant part follows:
reload() {
    echo -n $"Reloading $prog: "
    check13 || exit 1
    killproc -p ${pidfile} $httpd -HUP
    RETVAL=$?
    echo
}
As you can see, it sends the HUP signal to Apache, which is interpreted as "Restart Now":

Sending the HUP or restart signal to the parent causes it to kill off its children like in TERM, but the parent doesn't exit. It re-reads its configuration files, and re-opens any log files. Then it spawns a new set of children and continues serving hits.
The TERM explains the 504s pretty well. What should probably have been used instead is a graceful restart, which also re-opens the logs but doesn't terminate requests that are being served:

The USR1 or graceful signal causes the parent process to advise the children to exit after their current request (or to exit immediately if they're not serving anything). The parent re-reads its configuration files and re-opens its log files. As each child dies off the parent replaces it with a child from the new generation of the configuration, which begins serving new requests immediately.

...

The code was written to both minimize the time in which the server is unable to serve new requests (they will be queued up by the operating system, so they're not lost in any event) and to respect your tuning parameters.
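To see the difference between the two signals for yourself, you can signal the Apache parent process directly on the instance. The following is only a sketch: the pidfile path is an assumption based on the stock Amazon Linux httpd layout (the same ${pidfile} the init script above uses), so adjust it if your setup differs.

import os
import signal
import sys

# assumed stock Amazon Linux location of Apache's pidfile; adjust if needed
PIDFILE = '/var/run/httpd/httpd.pid'

def restart_httpd(graceful=True):
    # mimic "service httpd graceful" (USR1) vs. "service httpd reload" (HUP)
    with open(PIDFILE) as handle:
        parent_pid = int(handle.read().strip())
    # USR1 lets children finish their current request; HUP kills them off immediately
    os.kill(parent_pid, signal.SIGUSR1 if graceful else signal.SIGHUP)

if __name__ == '__main__':
    # run as root on the instance; pass --hard to send HUP instead of USR1
    restart_httpd(graceful='--hard' not in sys.argv[1:])

Running this with the default (USR1) while siege is hammering the instance should not produce 504s, while --hard (HUP) reproduces the burst shown above.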
Workaround
It's possible to use .ebextensions to replace /etc/logrotate.elasticbeanstalk.hourly/logrotate.elasticbeanstalk.httpd.conf. In the application's root directory create .ebextensions/10_logs.config with the following contents (it basically replaces "reload" with "graceful"):
files:
  "/etc/logrotate.elasticbeanstalk.hourly/logrotate.elasticbeanstalk.httpd.conf":
    mode: "000644"
    owner: root
    group: root
    content: |
      /var/log/httpd/* {
        size 10M
        missingok
        notifempty
        rotate 5
        sharedscripts
        compress
        dateext
        dateformat -%s
        create
        postrotate
          /sbin/service httpd graceful > /dev/null 2>/dev/null || true
        endscript
        olddir /var/log/httpd/rotated
      }
Then re-deploy your Elastic Beanstalk environment. Note, however, that with sub-second intervals between consecutive graceful restarts I was able to (sporadically) produce 503 Service Unavailable; this doesn't apply to log rotation, though, since with evenly spaced graceful restarts there were no errors.