So normally you can get PHP-FPM's status by querying the /fpmstatus endpoint (or whatever pm.status_path is configured to). This displays info about FPM's state, its worker processes, etc. FPM allows a maximum number of child processes (e.g. 48 in our case). Under normal circumstances, requests are processed in a fraction of a second, so usually no more than about a third of the children are Running at any given time; the rest are Idle. This means there's always at least one child available to handle the status check.
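
For reference, a status request and (abridged) response look roughly like this; the pool name and numbers are illustrative, and the path depends on pm.status_path:

    $ curl http://localhost/fpmstatus
    pool:                 www
    process manager:      dynamic
    active processes:     14
    idle processes:       34
    total processes:      48
    max children reached: 0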

However, if we experience a sudden influx of traffic, each FPM instance can suddenly find all 48 of its children occupied indefinitely: so many requests are incoming that the instant one request completes, another is already running in its place.

Here's the problem: when this happens, it becomes impossible to ask FPM for its status, because the status check itself needs a free child in order to run. You'd think we could just scale up our pods (yes, we're using Kubernetes), but we use FPM's process utilization to trigger scale-up... and if we can't find out what the process utilization is, then we can't automatically detect when it's too high.

Generally speaking, no monitoring system should depend on the very mechanism it's monitoring, because if that mechanism becomes unavailable, the monitoring fails along with it. FPM's design does not account for this; there appears to be no out-of-band mechanism for getting FPM's status.

I realize we could use proxies like examining the process table, but the mere existence of 48 child processes does not mean they're actually being used. Hopefully there's some simple answer I've missed; otherwise I'm going to have to look into getting an out-of-band status mechanism added to FPM itself.

1 Answer

Pool configuration pm.status_listen

The address on which to accept FastCGI status request. This creates a new invisible pool that can handle requests independently. This is useful if the main pool is busy with long running requests because it is still possible to get the FPM status page before finishing the long running requests. The syntax is the same as for listen directive. Default value: none.

So: have FPM listen on a separate socket for status requests, and configure the web server to route only the status URL to that socket.
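
As a sketch (socket paths are illustrative; pm.status_listen requires PHP-FPM 8.0+), the pool and nginx configuration could look something like this:

    ; FPM pool configuration
    [www]
    listen = /run/php/php-fpm.sock
    pm = static
    pm.max_children = 48
    pm.status_path = /fpmstatus
    ; dedicated socket served by a hidden pool, so the status page
    ; answers even when all 48 regular children are busy
    pm.status_listen = /run/php/php-fpm-status.sock

    # nginx: route only the status URL to the dedicated socket
    location = /fpmstatus {
        allow 127.0.0.1;          # keep the status page private
        deny all;
        include fastcgi_params;   # SCRIPT_NAME becomes /fpmstatus here
        fastcgi_pass unix:/run/php/php-fpm-status.sock;
    }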

Status is still served through the PHP engine, though. Whether that is a problem depends on the type of loads or faults this load balancer needs to deal with. Let's imagine increasingly difficult load tests:

  • High request rate from your favorite HTTP load generator. The requests might be finished fast enough that 48 processes aren't kept busy for long. So depending on performance, this may or may not cause some scale-out events, and it isn't really testing a serious condition.

  • Low and slow requests. Something like an implementation of Slowloris, which keeps connections open and simply never closes them. If these go through PHP and aren't otherwise mitigated, they keep worker processes busy indefinitely. However, these are exactly the long-running requests pm.status_listen is meant to work around.

  • Things go completely unresponsive. Severe problems in I/O, memory reclaim, CPU contention, or a deadlock on some resource. This is not easy to simulate in testing, so perhaps send SIGSTOP to all PHP processes inside the container (a sketch follows this list). There will be no response to FPM status; indeed, the status counters cannot even be updated, because the processes are hung.
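
To try that last scenario, something like the following could work, assuming pkill exists in the image; the deployment and container names are placeholders:

    # suspend every php-fpm process in the container; status requests now hang
    kubectl exec deploy/php-app -c php-fpm -- pkill -STOP -f php-fpm
    # ...observe the monitoring going blind, then resume...
    kubectl exec deploy/php-app -c php-fpm -- pkill -CONT -f php-fpm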

If the load balancer needs to do something even when status cannot be collected, then status cannot be the only metric it makes decisions on. It's quite possible you don't consider making this more robust worth the effort, but if you do, read on.

Counting PHP processes is not my favorite alternative. It is likely more expensive than an FPM status request, while being of questionable utility when pm = static or the process count is otherwise at its maximum.
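
For illustration, the cheap version of that check is a bare process count, which shows the weakness: under pm = static it reports the same number whether the pool is idle or saturated:

    # counts worker processes by command line, not busy workers;
    # with pm = static and pm.max_children = 48 this always prints 48
    pgrep -c -f 'php-fpm: pool'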

CPU utilization is a metric to consider, even if it does not tell the full story of the workload. It's built into, for example, the Kubernetes HorizontalPodAutoscaler, so you can at least provide relief if, say, 80% CPU is exceeded.
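
A minimal manifest for that (names and thresholds are placeholders):

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: php-app
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: php-app
      minReplicas: 2
      maxReplicas: 20
      metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 80   # scale out above 80% average CPU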

Check whether you can scale on the number of connections per pod or similar. This implies collecting fancier metrics about total connections, queued requests, or network traffic, but it would not depend on a PHP process in the application being responsive.
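
As a sketch only: if your metrics pipeline (e.g. prometheus-adapter) exposes a per-pod connection gauge, the HPA above could add a Pods metric. The metric name here is hypothetical and depends entirely on your exporter:

    # additional entry under spec.metrics of the HPA above
    - type: Pods
      pods:
        metric:
          name: nginx_connections_active   # hypothetical exporter metric
        target:
          type: AverageValue
          averageValue: "40"               # aim for ~40 connections per pod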

  • pm.status_listen solves our immediate problem, thanks for pointing it out! Commented May 27 at 16:43
