
We have nginx as a reverse proxy, load balancing across two application servers. The application servers were defined in an upstream block like so:

    upstream app_backends {
        server 1.1.1.1:8080 max_fails=1 fail_timeout=120s;
        server 1.1.1.2:8080 max_fails=1 fail_timeout=120s;
    }

We had a significant outage where a client was sending requests with a very large cookie header, which the uwsgi application choked on, closing the connection early. This caused nginx to mark a failure on that backend and immediately retry the request on the second backend, which choked in exactly the same way. nginx then marked both backends as down and responded to every request, from all clients, with 502s for the next two minutes.
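If I understand nginx's retry logic correctly, this cascade comes from the *_next_upstream behaviour: an attempt that counts towards max_fails is also an attempt that gets retried on the next server in the group. Assuming the backends are reached via uwsgi_pass with nothing overridden (proxy_pass behaves the same way via proxy_next_upstream), the effective configuration is roughly:

    location / {
        uwsgi_pass app_backends;

        # Documented default: a connection error or timeout both counts as a
        # failure against the server (for max_fails) and causes the request
        # to be retried on the next server in the upstream group.
        uwsgi_next_upstream error timeout;
    }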

Once we understood the problem, we fixed it easily by setting max_fails=0. The client in question, with the large cookie header, still got 502s, but all other clients could continue to use the application without issue. Of course, this also means nginx is no longer offering any protection against failures in our backends.
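For reference, the workaround amounted to nothing more than this (same placeholder addresses as above):

    upstream app_backends {
        # max_fails=0 disables the accounting of failed attempts, so a
        # backend is never marked unavailable and fail_timeout has no effect
        server 1.1.1.1:8080 max_fails=0;
        server 1.1.1.2:8080 max_fails=0;
    }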

We actually have this same configuration across a number of different applications, and I'm trying to understand what is the safest general configuration for our setup.

The default values in nginx for these two settings are max_fails=1 and fail_timeout=10s. Our problem was obviously exacerbated by the fact that we had fail_timeout=120s, but even at the default 10s, our application would still have been taken down completely for 10 seconds at a time whenever this particular client with the large cookie header made a request.

It seems like a bad pattern in general that a single fault in response to a request, which may be a special-case request like ours was, takes a whole backend offline, especially when we have no idea whether the same error will apply to all backends equally, as it did in this case.

What I'm asking is: would it generally be safer for our setup to use max_fails=0 for all our apps, rather than the actual nginx default of max_fails=1 fail_timeout=10s? And if so, is this potentially an argument for nginx to change that default?

  • This is usually solved by using proper health checks, rather than random request failures, to remove a server from the backend pool. Commented Feb 19, 2024 at 16:39
  • In addition, nginx has a great manual and documentation, as well as a great blog that explains how to do this properly. With your current setup it's a 50/50 chance that a client will hit the inactive instance, imho. Commented Feb 19, 2024 at 20:33
  • I agree, these are horrible defaults for nginx and extremely easy to exploit for a DDoS. It's pretty lame that they require the very expensive commercial version in order to do proper health checks. I'd been having occasional errors from nginx when restarting some of my servers with max_fails=0, I think because maybe it gives up once it tries two that fail. But it's still better than the alternative imo. So far this is the biggest failing I've found in nginx. Commented Mar 20, 2024 at 23:20
  • @djdomi Amazingly, even though that documentation says "Note that if there is only a single server in a group, the fail_timeout and max_fails parameters are ignored and the server is never marked unavailable" which sounds very sensible, it would seem that if there are two servers in the group, it will happily take them both offline. Commented Apr 5, 2024 at 12:14
  • @AlexD I assume, from djdomi's documentation link, that what you're referring to is "Active health checks". I agree, this would solve the issue, but they're only available in nginx Plus. So for standard nginx, which can't do proper health checks, I am increasingly convinced that max_fails=0 is a much more sensible default. Commented Apr 5, 2024 at 12:17
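For completeness, the active health checks mentioned in the comments above are an nginx Plus feature. Going by the documentation (untested here, since we don't run Plus, and the probe thresholds are just placeholders), the configuration would look roughly like this, shown with proxy_pass for simplicity:

    upstream app_backends {
        zone app_backends 64k;    # shared memory zone, required for health checks
        server 1.1.1.1:8080;
        server 1.1.1.2:8080;
    }

    server {
        listen 80;

        location / {
            proxy_pass http://app_backends;
            # probe each server every 5 seconds; mark it down after 3 failed
            # probes and healthy again after 2 successful ones
            health_check interval=5 fails=3 passes=2;
        }
    }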
