I manage some stateless services that work over TCP, and we are working to ensure reliability and recoverability in case of a regional outage. I find that I lack enough understanding about the lifecycle of a TCP connection, DNS resolution for the connection, and the way a client library would implement this. I want to understand how these components behave in a few failure modes.

Components of the infrastructure:

  • Service A deployed in two different regions. Region X and Y.
  • Let's say Service A in Region X has IP address IP_A_X, and Service A in Region Y has IP_A_Y.
  • Geographically distributed DNS servers, also in Region X and Y.
  • A Geo-aware DNS name that resolves IP_A_X for requests coming from region X, and IP_A_Y for region Y.

The Geo-aware DNS infrastructure health-checks Service A in each region:

  • If Service A goes down in Region X, then it will fail health checks, and our Geo-aware DNS servers will start returning IP_A_Y for all DNS requests.

I want to understand a couple of scenarios and questions about a failover from Region X to Y.


Consider this simple scenario:

  1. Client W starts in region X and opens a persistent connection to IP_A_X
  2. Service A dies in region X while the connection above remains live
  3. Soon enough, our Geo-aware DNS notices the death of Service A in region X and starts returning IP_A_Y to all requests.

My questions:

  1. Will Client W ever re-resolve the DNS name for Service A? Is this client-library dependent? Or does this happen at the OS-level?
  2. What will be the time diagram/series of steps between the application, the client library and the OS for Client W to re-resolve DNS?
  3. If Client W just continues to retry connections to IP_A_X, what would be a usual way to trigger the re-resolution of DNS?

For simplicity, let's assume a recent Linux as the OS and, if needed, a specific client library / language of your choice.

1 Answer

The case where "Service A dies in region X while the connection above (to service A) remains live" can only happen if the connection is idle, i.e. the client initiates no data transfer and expects no data from the server. Otherwise a properly implemented client will realize that the connection is broken, either because a transmission fails or because expected data from the server never arrives. With TCP keep-alive the client can even detect that an idle connection is broken.
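As a rough illustration of the keep-alive point, here is a minimal sketch of how a client on Linux might enable it on a socket. The helper name `enable_keepalive` and the timing values are my own assumptions for the example, not anything from the question. The `TCP_KEEP*` options are the Linux per-socket ones, so they are guarded in case the platform lacks them.

```python
import socket


def enable_keepalive(sock, idle_s=30, interval_s=10, probes=3):
    """Enable TCP keep-alive so a dead peer is noticed on an idle connection.

    The timing values are illustrative assumptions, not recommendations.
    """
    # Ask the kernel to send keep-alive probes on this connection.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Linux per-socket tuning; without these the system-wide defaults apply.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of idleness before the first probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
        # Seconds between unanswered probes.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
        # Number of unanswered probes before the connection is dropped.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock
```

With these example values the kernel would give up on a silent peer after roughly `idle_s + interval_s * probes` seconds, and the next read or write on the socket would then fail with an error the client can act on.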

The client will only try another connection once it has realized that the existing connection is broken. Whether a reconnection is attempted in the first place (instead of, for example, throwing an error), and whether that reconnection involves a new DNS lookup, depends on the client implementation. But I expect the majority of clients to start with a fresh DNS lookup on reconnect.
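To make the "fresh DNS lookup on reconnect" behavior concrete, here is a minimal sketch of a client that resolves the name again on every attempt instead of reusing the old IP. The names `connect_fresh` and `reconnect`, the retry counts and the delays are hypothetical; a real client library may behave differently. Note that `socket.getaddrinfo` goes through the OS resolver, so a local caching resolver may still return the old address until the record's TTL expires.

```python
import socket
import time


def connect_fresh(host, port, timeout=5.0, resolve=socket.getaddrinfo):
    """Resolve `host` anew and try each returned address in order."""
    last_error = None
    # A new lookup happens on every call; no IP is remembered between calls.
    for family, socktype, proto, _canon, sockaddr in resolve(
        host, port, type=socket.SOCK_STREAM
    ):
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(timeout)
        try:
            sock.connect(sockaddr)
            return sock
        except OSError as exc:
            last_error = exc
            sock.close()
    raise last_error or OSError(f"no usable address for {host}")


def reconnect(host, port, attempts=5, delay_s=1.0, connect=connect_fresh):
    """Retry the connection, re-resolving the name on every attempt."""
    for attempt in range(attempts):
        try:
            return connect(host, port)
        except OSError:
            if attempt == attempts - 1:
                raise
            time.sleep(delay_s)
```

The application would call `reconnect(...)` with the geo-aware DNS name once a read, write or keep-alive failure shows that the old connection is dead. Once DNS starts returning IP_A_Y and any cached answer has expired, the next attempt lands in Region Y.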

  • I appreciate your response, however it doesn't go to the level of detail I'm looking for. What you described is very close to what I already understand. Commented Jul 9, 2024 at 14:11
  • @Pablo: Given that the behavior is client dependent (1), the sub-questions (2) and (3) cannot be answered in general. As for explicitly re-triggering DNS resolution: it is either impossible or client dependent. Some clients will retry DNS if reconnecting to the resolved IP fails. As for "What you described is very close to what I already understand": to avoid being told what you already know, it would have been helpful to add what you already know to your question instead of just stating questions. Commented Jul 9, 2024 at 14:38
