I manage some stateless services that work over TCP, and we are working to ensure reliability and recoverability in case of a regional outage. I don't understand well enough the lifecycle of a TCP connection, how DNS resolution fits into it, and how a client library would typically implement both. I want to understand how these components behave in a few failure modes.
Components of the infrastructure:
- Service A is deployed in two different regions, Region X and Region Y.
- Let's say Service A in Region X has IP address IP_A_X, and Service A in Region Y has IP_A_Y.
- Geographically distributed DNS servers, also in Region X and Y.
- A Geo-aware DNS name that resolves to IP_A_X for requests coming from Region X, and to IP_A_Y for requests coming from Region Y.
The Geo-aware DNS infrastructure health-checks Service A in each region:
- If Service A goes down in Region X, then it will fail health checks, and our Geo-aware DNS servers will start returning IP_A_Y for all DNS requests.
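To make the intended DNS behaviour concrete, here is a toy Python model of it. This is only a sketch of the behaviour described above, not our actual DNS setup: the `GeoDns` class, its method names, and the placeholder addresses (taken from the documentation ranges) are all made up for illustration.

```python
from dataclasses import dataclass, field

# Placeholder addresses standing in for IP_A_X and IP_A_Y (assumed values).
IP_A_X = "192.0.2.10"
IP_A_Y = "198.51.100.10"


@dataclass
class GeoDns:
    """Toy model of a geo-aware, health-checked DNS name (hypothetical)."""

    records: dict = field(default_factory=lambda: {"X": IP_A_X, "Y": IP_A_Y})
    healthy: dict = field(default_factory=lambda: {"X": True, "Y": True})

    def mark(self, region, is_healthy):
        # Result of the latest health check for Service A in that region.
        self.healthy[region] = is_healthy

    def resolve(self, client_region):
        # Prefer the client's own region while it passes health checks.
        if self.healthy.get(client_region):
            return self.records[client_region]
        # Otherwise fall back to any region that is still healthy.
        for region, ok in self.healthy.items():
            if ok:
                return self.records[region]
        raise LookupError("no healthy region for Service A")


dns = GeoDns()
before = dns.resolve("X")   # Region X is healthy: IP_A_X
dns.mark("X", False)        # Service A fails health checks in Region X
after = dns.resolve("X")    # clients in Region X now get IP_A_Y
```

Note that this only models what the DNS servers answer. It says nothing about when, or whether, a client asks again, which is what my questions below are about.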
I have a scenario and a few questions about a failover from Region X to Region Y.
Consider this simple scenario:
- Client W starts in Region X and opens a persistent connection to IP_A_X.
- Service A dies in Region X while the connection above remains live.
- Soon enough, our Geo-aware DNS notices the death of Service A in region X and starts returning IP_A_Y to all requests.
My questions:
- Will Client W ever re-resolve the DNS name for Service A? Is this client-library dependent? Or does this happen at the OS-level?
- What is the timing diagram / sequence of steps between the application, the client library, and the OS for Client W to re-resolve DNS?
- If Client W just keeps retrying connections to IP_A_X, what is the usual way to trigger re-resolution of the DNS name?
For simplicity, let's assume a recent Linux as the OS and, if needed, a specific client library / language of your choice.
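To illustrate what I mean by "re-resolving", here is a minimal Python sketch of the client behaviour I have in mind: the name is looked up again before every connection attempt instead of reusing the first IP. I am not claiming any real client library works this way. The function name `connect_with_reresolve` and its parameters are hypothetical, and the `resolver` argument exists only so the lookup can be swapped out. It also does not cover how the client notices that the existing connection is dead in the first place.

```python
import socket
import time


def connect_with_reresolve(host, port, attempts=5, delay=1.0, timeout=3.0,
                           resolver=socket.getaddrinfo):
    """Connect to host:port, resolving the name again before every attempt."""
    last_error = None
    for _ in range(attempts):
        try:
            # A fresh lookup on each attempt. Whether it actually reaches the
            # DNS servers depends on caches in between and on the record TTL.
            infos = resolver(host, port, type=socket.SOCK_STREAM)
        except socket.gaierror as exc:
            last_error = exc
            infos = []
        for family, socktype, proto, _canonname, sockaddr in infos:
            sock = socket.socket(family, socktype, proto)
            sock.settimeout(timeout)
            try:
                sock.connect(sockaddr)
                return sock  # caller owns the connected socket
            except OSError as exc:
                last_error = exc
                sock.close()
        time.sleep(delay)  # back off before resolving and trying again
    raise ConnectionError(f"could not connect to {host}:{port}") from last_error
```

The alternative I am worried about is a client that calls the resolver once, keeps the resulting address, and retries IP_A_X forever.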