I have an App Service that runs two nodes, and unlike my other apps, it eventually ends up with only one of them running. To even get two running again, I had to scale up and back down, which for some reason "triggers" both nodes to run. You can see in the graph below that everything looks fine until around 6pm, when one node appears to die and another spins up. It happens again around 7:10, and by 7:30, only one node is servicing requests.
Trying to diagnose this is maddening. I have sticky sessions turned on for SignalR (backplane through Redis), but I know from other apps that that shouldn't matter. The logs show the new containers starting up, but I can't find anything that tells me why the previous ones die. Another app on the same App Service plan distributes its request load consistently, so I don't think it's the Azure infrastructure. I think it's my app, but I can't find the right logging to help.
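
One thing I'm considering adding so I can at least see which instance answers each request: a small diagnostic endpoint that returns the worker's instance ID (App Service sets `WEBSITE_INSTANCE_ID` on every worker). This is just a standalone sketch, not my actual stack; the `/whoami` route, the stdlib server, and the `PORT` fallback are placeholders:

```python
# Minimal sketch (Python stdlib only) of a diagnostic endpoint that reports which
# App Service instance served the request. WEBSITE_INSTANCE_ID is set by App Service
# on every worker; the /whoami route and port handling are my own placeholders.
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


class WhoAmIHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/whoami":
            self.send_error(404)
            return
        body = json.dumps({
            # Unique ID of the App Service worker handling this request.
            "instance_id": os.environ.get("WEBSITE_INSTANCE_ID", "unknown"),
            # Hostname inside the sandbox/container, useful for matching container logs.
            "hostname": os.environ.get("COMPUTERNAME") or os.environ.get("HOSTNAME", "unknown"),
        }).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Linux App Service typically passes the listening port via the PORT env var.
    HTTPServer(("0.0.0.0", int(os.environ.get("PORT", 8000))), WhoAmIHandler).serve_forever()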
So the question is, how do I find the reason for one node going bad?
EDIT: I'm now less convinced that it's my app. I can drill into the health check UI and restart the specific instance that isn't getting traffic, and nothing changes.
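
For what it's worth, here's a rough way I could confirm from the outside whether both instances are taking traffic: poll the diagnostic endpoint sketched above and tally the instance IDs. The URL is a placeholder, and it deliberately doesn't keep cookies so the ARRAffinity cookie from sticky sessions can't pin every request to one instance:

```python
# Minimal sketch: repeatedly hit the /whoami endpoint above (the route name is my own)
# and count which instance answers, to see whether both instances actually get traffic.
import json
import time
import urllib.request
from collections import Counter

URL = "https://<your-app>.azurewebsites.net/whoami"  # placeholder, replace with the real app URL

counts = Counter()
for _ in range(50):
    # Each request is made without a cookie jar, so sticky sessions (ARRAffinity)
    # can't force every request onto the same instance.
    with urllib.request.urlopen(URL, timeout=10) as resp:
        counts[json.loads(resp.read())["instance_id"]] += 1
    time.sleep(1)

# e.g. Counter({'abc123...': 50}) would confirm only one node is servicing requests.
print(counts)
```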
