fix: GET request responses not reaching requesting peer (flaky test_put_contract_three_hop_returns_response) #2241

@sanity

Problem

GET request responses are being sent but not reaching the requesting peer, causing timeouts. This manifests as flaky CI failures in test_put_contract_three_hop_returns_response and production GET request timeouts.

Symptoms

  1. CI flakiness: test_put_contract_three_hop_returns_response fails intermittently with "Timeout waiting for get response"
  2. Production issues: GET requests time out even when the contract exists on the network

What the logs show

From CI failure analysis:

    21:42:00.826 - peer-c sends RequestGet to peer-a (via routing)
    21:42:00.828 - peer-a receives RequestGet
    21:42:00.829 - peer-a sends ReturnGet back to peer-c via explicit address (NAT routing)
    21:42:00.829 - "Message successfully sent to peer connection via explicit address"
    ... 45 seconds pass ...
    21:42:45.827 - "Attempt 2/3 to GET from peer C" (timeout, retry)
    ... same pattern repeats ...
    21:43:32.829 - "Attempt 3/3 to GET from peer C"
    ... final timeout, test fails ...

The response is logged as "successfully sent" but never arrives at peer-c.

Network topology in test

gateway <---> peer-a (has contract) <---> peer-c (requesting) 
  • peer-c initiates GET
  • Request routes through to peer-a (contract location)
  • peer-a finds contract, sends ReturnGet
  • ReturnGet never reaches peer-c

Hypothesis

The issue appears to be in the return path. Possibilities:

  1. Connection lookup mismatch: The connection used to send the response may not be the same connection peer-c is listening on
  2. NAT routing address confusion: The target_addr used for "explicit address" routing may be stale or incorrect
  3. Message serialization/delivery: The message is queued but not actually delivered
  4. Channel closure: The receiving channel on peer-c may be closed or not being polled
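
Possibilities 3 and 4 both rest on the gap between "queued" and "delivered". The following is a minimal sketch of that gap using only the Rust standard library, not freenet code: `Response`, `send_response`, and `drain` are hypothetical names, and `std::sync::mpsc` stands in for whatever channel the real code uses. It shows that a send can report success while the receiver has not yet polled, and that the send fails only once the receiving end is dropped.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical stand-in for a ReturnGet response; not the real freenet type.
#[derive(Debug)]
struct Response(u32);

/// Queue a response and report whether the send call itself succeeded.
/// `true` only means "accepted by the channel", not "handled by the peer".
fn send_response(tx: &Sender<Response>, id: u32) -> bool {
    tx.send(Response(id)).is_ok()
}

/// Return the ids of all responses the receiver sees when it finally polls.
fn drain(rx: &Receiver<Response>) -> Vec<u32> {
    rx.try_iter().map(|r| r.0).collect()
}

fn main() {
    let (tx, rx) = channel();

    // The sender can log success as soon as the message is queued.
    let sent = send_response(&tx, 1);
    println!("send reported success: {sent}");

    // The message is only observed once the receiver polls the channel.
    println!("ids seen after polling: {:?}", drain(&rx));

    // After the receiving end is dropped, the send itself reports failure.
    drop(rx);
    println!("send after close succeeded: {}", send_response(&tx, 2));
}
```

If the real receive loop on peer-c stalls rather than closes, the sender side would look exactly like the CI log: a success message and no visible error.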

Key code paths to investigate

  • handle_notification_msg in p2p_protoc - handles routing of responses
  • send_to_peer_connection - the "successfully sent" log comes from here
  • Connection management - how connections are looked up by address
  • The conn_bridge_rx channel handling for outbound messages
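
To make the address-lookup concern concrete, here is a hedged sketch of how a stale entry could produce a "successful" send to the wrong connection. Everything in it is assumed for illustration: `ConnTable`, its methods, the `u64` connection ids, and the loopback addresses are hypothetical and do not reflect the actual freenet data structures.

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

/// Hypothetical connection table keyed by the peer's observed address.
struct ConnTable {
    by_addr: HashMap<SocketAddr, u64>,
}

impl ConnTable {
    fn new() -> Self {
        ConnTable { by_addr: HashMap::new() }
    }

    /// Register (or replace) the connection id seen for an address.
    fn register(&mut self, addr: SocketAddr, conn_id: u64) {
        self.by_addr.insert(addr, conn_id);
    }

    /// Look up the connection for an explicit target address.
    fn lookup(&self, addr: &SocketAddr) -> Option<u64> {
        self.by_addr.get(addr).copied()
    }
}

fn main() {
    let mut table = ConnTable::new();
    let old_addr: SocketAddr = "127.0.0.1:40001".parse().unwrap();
    let new_addr: SocketAddr = "127.0.0.1:40002".parse().unwrap();

    // The requester is first seen on one NAT-mapped port...
    table.register(old_addr, 1);
    // ...then reappears on another, and the old entry is never removed.
    table.register(new_addr, 2);

    // A responder still holding the stale address finds a connection,
    // so the send path has no reason to report an error.
    println!("stale lookup: {:?}", table.lookup(&old_addr));
    println!("live lookup:  {:?}", table.lookup(&new_addr));
}
```

When investigating, it may help to log both the address used for the lookup and the identity of the connection it resolved to, on the sending and receiving sides, so a mismatch like this becomes visible.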

Reproduction

Run the test multiple times:

    for i in {1..10}; do
      cargo test -p freenet test_put_contract_three_hop_returns_response -- --nocapture 2>&1 | tail -5
    done

Fails ~50% of the time in CI.

Impact

  • Blocks release 0.1.44 (PR build: release 0.1.44 #2240)
  • Affects production reliability of GET operations
  • Related to overall network message delivery reliability

Related

[AI-assisted - Claude]

Metadata

Assignees

No one assigned

    Labels

    A-networking (Area: Networking, ring protocol, peer discovery)
    E-hard (Experience needed to fix/implement: Hard / a lot)
    P-critical (Critical priority)
    S-blocked (Status: Blocked by external dependency or other issue)
    T-bug (Type: Something is broken)

    Type

    No type

    Projects

    Status

    Done

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests