DNS re-resolution and failover of TCP connections

Question

I manage some stateless services that work over TCP, and we are working to ensure reliability and recoverability in case of a regional outage. I find that I lack enough understanding about the lifecycle of a TCP connection, DNS resolution for the connection, and the way a client library would implement this. I want to understand how these components behave in a few failure modes.

Components of the infrastructure:

Service A, deployed in two different regions: Region X and Region Y.
Let's say Service A in Region X has IP address IP_A_X, and Service A in Region Y has IP address IP_A_Y.
Geographically distributed DNS servers, also in Regions X and Y.
A Geo-aware DNS name that resolves to IP_A_X for requests coming from Region X, and to IP_A_Y for requests coming from Region Y.

The Geo-aware DNS infrastructure health-checks Service A in each region:

If Service A goes down in Region X, then it will fail health checks, and our Geo-aware DNS servers will start returning IP_A_Y for all DNS requests.

I want to understand a couple of scenarios and questions about a failover from Region X to Y.

Consider this simple scenario:

Client W starts in region X and opens a persistent connection to IP_A_X
Service A dies in region X while the connection above remains live
Soon enough, our Geo-aware DNS notices the death of Service A in region X and starts returning IP_A_Y to all requests.

My questions:

Will Client W ever re-resolve the DNS name for Service A? Is this client-library dependent? Or does this happen at the OS-level?
What would be the time diagram/series of steps between the application, the client library and the OS for Client W to re-resolve DNS?
If Client W just keeps retrying connections to IP_A_X, what would be a usual way to trigger the re-resolution of DNS?

For simplicity, let's assume a recent Linux as the OS and, if needed, a specific client library / language of your choice.

It may help to review some of the vendor solutions for this. AWS has a Traffic Manager, as does F5 (Global Traffic Manager). techdocs.f5.com/kb/en-us/products/big-ip_gtm/manuals/product/… docs.aws.amazon.com/Route53/latest/DeveloperGuide/…
– Greg Askew, Commented Jul 9, 2024 at 5:51

Steffen Ullrich · Accepted Answer · 2024-07-08 20:42:47Z

The case that "Service A dies in region X while the connection above (to service A) remains live" can only happen if the connection is idle, i.e. the client initiates no data transfer and expects no data from the server. Otherwise, a properly implemented client will realize that the connection is broken, either because a transmission fails or because expected data from the server never arrives. With TCP keep-alive, the client can detect a broken connection even when it is idle.
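To illustrate the keep-alive point, here is a minimal sketch assuming a Linux client written in Python. The helper name `enable_keepalive` and the timing values are made up for illustration; the socket options themselves are the standard ones exposed by Python's `socket` module on Linux.

```python
import socket


def enable_keepalive(sock, idle=30, interval=10, count=3):
    """Enable TCP keep-alive probes on an already created socket.

    idle, interval and count are illustrative values, not recommendations.
    """
    # Portable switch: ask the kernel to probe the peer when the connection is idle.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Linux-specific tuning; guarded because not every platform exposes these names.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of idleness before the first probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Seconds between consecutive probes.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        # Unanswered probes after which the kernel declares the connection dead.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    return sock
```

With these example values, a peer that vanishes silently would be noticed after roughly idle + interval * count = 60 seconds, at which point the next read or write on the socket fails with an error. The Linux defaults are far longer (`tcp_keepalive_time` is 7200 seconds), so a client that relies on keep-alive for failover usually has to set these explicitly.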

The client will only try another connection once it has realized that the existing connection is broken. Whether a reconnection is attempted in the first place (instead of, for example, throwing an error), and whether that reconnection involves a new DNS lookup, depends on the client implementation. But I expect the majority of clients to start with a fresh DNS lookup on reconnect.
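As a rough sketch of what "a fresh DNS lookup on reconnect" can look like, assuming a Python client on Linux: an established TCP socket is tied to an IP address and knows nothing about the name it came from, so re-resolution only happens if the application or client library resolves the name again. The function names `connect_fresh` and `send_with_reconnect` are hypothetical, and the single request/reply exchange stands in for whatever protocol the real client speaks.

```python
import socket
import time


def connect_fresh(host, port, timeout=5.0):
    """Resolve host right now and try each returned address in order."""
    last_error = None
    # getaddrinfo goes through the system resolver on every call; Python keeps no
    # cache of its own, so any caching comes from the resolver setup beneath it.
    for family, socktype, proto, _, addr in socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    ):
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(timeout)
        try:
            sock.connect(addr)
            return sock
        except OSError as exc:
            last_error = exc
            sock.close()
    raise last_error or OSError("no addresses returned for %s" % host)


def send_with_reconnect(host, port, payload, attempts=3, backoff=1.0):
    """Send payload and read one reply, reconnecting by name after any failure."""
    sock = None
    for attempt in range(attempts):
        try:
            if sock is None:
                # Connecting by name (not by a remembered IP) is what picks up
                # a changed DNS answer after failover.
                sock = connect_fresh(host, port)
            sock.sendall(payload)
            reply = sock.recv(4096)
            if not reply:
                raise ConnectionError("peer closed the connection")
            return sock, reply
        except OSError:
            # Covers refused/reset connections, timeouts and resolver errors.
            if sock is not None:
                sock.close()
            sock = None
            time.sleep(backoff * (attempt + 1))
    raise ConnectionError("giving up on %s:%s after %d attempts" % (host, port, attempts))
```

The design choice that matters here is that the broken socket is discarded and the name, not the previously resolved address, is used for the next attempt. How quickly the new answer shows up still depends on the record's TTL and on any caching resolver between the client and the Geo-aware DNS.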

I appreciate your response, however it doesn't go to the level of detail I'm looking for. What you described is very close to what I already understand.
– Pablo, Commented Jul 9, 2024 at 14:11
@Pablo: Given that the behavior is client dependent (1), the sub-questions (2) and (3) cannot be answered. As for explicitly re-triggering DNS resolution: impossible or client dependent. Some clients will retry DNS if reconnecting to the resolved IP fails. As for "What you described is very close to what I already understand.": to avoid getting what you already know, it would have been helpful to add what you already know to your question instead of just stating questions.
– Steffen Ullrich, Commented Jul 9, 2024 at 14:38
