
I have an App Service that runs two nodes, and unlike the other apps I have, it somehow eventually ends up with only one running. To even get two running again, I had to scale up and back down, which for some reason "triggers" both nodes to run. You can see in the graph below that everything looks fine until around 6pm, when it looks like one node dies and another spins up. It happens again around 7:10, and by 7:30, only one node is servicing requests.

Trying to diagnose this is maddening. I have sticky sessions on for SignalR (backplane through Redis), but I know from other apps that it shouldn't matter. The logs show the new containers starting up, but I can't find anything that tells me why the previous ones die. There is another app on this App Service plan that distributes the request load consistently, so I don't think it's the Azure infrastructure. I think it's my app, but I can't find the right logging to help.

So the question is, how do I find the reason for one node going bad?

EDIT: I'm less convinced that it's necessarily my app. I can drill into the health check UI and restart the specific instance not getting traffic, and there's no change.

[Graph: request counts by node]

  • Do you have any scaling rules configured, or is it just set to be 2 instances 100% of the time? Commented Feb 8, 2024 at 8:19
  • I did not read the manual. I answered below. Commented Feb 9, 2024 at 4:08
  • App Service also has built-in auto-healing features that restart instances when certain conditions are met, such as excessive memory usage or unresponsiveness. While this is designed to improve availability, misconfigured auto-healing settings could cause instances to restart more frequently than desired. Commented Feb 9, 2024 at 4:25

1 Answer


This was the result of me not reading the manual. In short, here's what happened:

  • The health check was being redirected, because I had middleware that redirects any request not on the canonical intended domain. Health check, however, hits the *.azurewebsites.net domain.
  • Health check deems anything other than a 2xx response unhealthy, including my 301s.
  • There are limits to how many instances the system will replace when "unhealthy" per day or hour or some other condition (the documentation is unclear). And obviously if you get down to one instance, it's not going to tear that one down.
  • I only figured this out because one of the diagnostic tools showed a flat line of 3xx requests, which had to be mechanical. Then I remembered the canonical redirects, and the health check was the only thing I could think of with that rhythm.
  • I added an endpoint that would not redirect, just for health check to ping (sketched below), and both instances appear healthy and all is right with the world.
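
Roughly, the shape of the bug and the fix looks like this in ASP.NET Core minimal hosting. This is a sketch, not my actual code: the canonical host `example.com` and the probe path `/healthz` are placeholders.

```csharp
// Program.cs — illustrative sketch only. "example.com" and "/healthz"
// are placeholder names, not the real app's values.
var builder = WebApplication.CreateBuilder(args);
builder.Services.AddHealthChecks();

var app = builder.Build();

// The canonical-domain middleware that caused the trouble: it 301s
// every request that isn't on the intended host. App Service's health
// check probes the *.azurewebsites.net host, so it always got a 301,
// and anything other than a 2xx counts as unhealthy.
app.Use(async (context, next) =>
{
    var isCanonicalHost = context.Request.Host.Host
        .Equals("example.com", StringComparison.OrdinalIgnoreCase);

    // The fix: exempt the health check path from the redirect so the
    // probe gets a 200 no matter which host it arrives on.
    if (!isCanonicalHost && !context.Request.Path.StartsWithSegments("/healthz"))
    {
        context.Response.Redirect(
            $"https://example.com{context.Request.Path}{context.Request.QueryString}",
            permanent: true); // 301 — this is what was flagged unhealthy
        return;
    }

    await next();
});

// The endpoint the Health check feature pings.
app.MapHealthChecks("/healthz");

app.Run();
```

With that in place, point the portal's Health check path at `/healthz`; the probe gets a 200 on the azurewebsites.net host and both instances stay in rotation.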

I wrote a detailed blog post about it, if you're really bored.
