-
- Notifications
You must be signed in to change notification settings - Fork 110
fix(transport): handle duplicate intro packets during gateway handshake #2293
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| @claude Please do a skeptical review of this PR |
| Claude finished @sanity's task in 1m 53s —— View job Skeptical PR Review
Critical Issues1. Magic Number Approach is Fragile 🔴The fix relies on checking
Better approach: Actually decrypt the packet and check if it's a valid RSA intro (similar to the 2. Test Coverage Doesn't Actually Validate The Fix 🟡The test
What's needed: A test that:
3. Continue on Decryption Failure is Too Permissive 🟡The new code continues waiting on any decryption failure: Err(_) => { tracing::debug!(..., "Failed to decrypt packet with inbound key, continuing to wait"); continue; }This means if the peer sends a completely malformed packet, or if there's a genuine key mismatch, the gateway will silently wait until timeout instead of failing fast. The original code failed immediately on decryption errors, which provided clearer feedback. Concern: This could mask real bugs (e.g., crypto key derivation issues) by turning hard errors into timeouts. 4. Root Cause May Not Be Addressed 🟡The PR description says duplicate intro packets get "routed to the // Line 550-557: Check if gateway connection already in progress if ongoing_gw_connections.contains_key(&remote_addr) { tracing::debug!( "Gateway connection already in progress, ignoring duplicate packet" ); continue; }There's already a guard that should prevent duplicate intros from being routed to Questions:
5. The 5-Second Deadline Behavior Changed 🟡Original code:
New code:
This means legitimate timeout scenarios (peer died, network issue) now take the full 5 seconds instead of failing fast. While this might be acceptable, it's a behavior change that should be documented. Minor Issues6. Inconsistent Error MessagesThe new debug log says "likely duplicate RSA intro" but doesn't actually verify it's an RSA intro—it just checks the size. More accurate wording: "Ignoring 256-byte packet (potential duplicate RSA intro or coincidentally sized packet)" 7. Missing Metrics/ObservabilityIf duplicate intros are a common occurrence (which the PR implies), there should be a counter or metric tracking how often this happens. This would help:
Questions for Author
Recommended ChangesOption A (Quick Fix - Improve Current Approach):
Option B (Better Fix - Address Root Cause):
I'd lean toward Option B as the more robust solution, but Option A might be acceptable if the root cause investigation is too complex. VerdictThe fix is a reasonable workaround but feels like it's treating a symptom rather than the disease. The magic number approach is fragile, and the lack of specific test coverage is concerning. I'd recommend either:
Without addressing these concerns, this fix might mask the real issue and create subtle bugs down the line (e.g., legitimate 256-byte packets being dropped). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR fixes a bug where duplicate RSA intro packets during NAT traversal cause "invalid symmetric key" errors when a peer reconnects with a new identity. The gateway now gracefully handles these duplicate packets by filtering 256-byte packets (RSA intro size) and continuing to wait for the actual symmetric ACK response.
Key changes:
- Modified the gateway handshake to loop with a deadline instead of single timeout
- Added logic to ignore 256-byte packets (duplicate RSA intros) during ACK wait
- Changed error handling to continue on decryption failures rather than immediately failing
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| match timeout_result { | ||
| Ok(Ok(packet)) => { | ||
| // Issue #2292: Check if this is a duplicate RSA intro packet (256 bytes). | ||
| // The peer sends multiple intro packets for reliability during NAT traversal. | ||
| // After we detect a new identity and create a new gateway_connection, | ||
| // these duplicate intros get routed here instead of creating another connection. | ||
| // We must ignore them and continue waiting for the actual ACK response. | ||
| if packet.data().len() == RSA_INTRO_PACKET_SIZE { | ||
| tracing::debug!( | ||
| peer_addr = %remote_addr, | ||
| direction = "inbound", | ||
| packet_len = packet.data().len(), | ||
| "Ignoring 256-byte packet during gateway handshake (likely duplicate RSA intro)" | ||
| ); | ||
| continue; | ||
| } | ||
| | ||
| // Try to decrypt as symmetric ACK response | ||
| match packet.try_decrypt_sym(&inbound_key) { | ||
| Ok(_) => { | ||
| // Successfully decrypted - this is the ACK we're waiting for | ||
| break; | ||
| } | ||
| Err(_) => { | ||
| tracing::debug!( | ||
| peer_addr = %remote_addr, | ||
| direction = "inbound", | ||
| packet_len = packet.data().len(), | ||
| "Failed to decrypt packet with inbound key, continuing to wait" | ||
| ); | ||
| // Continue waiting for a valid ACK | ||
| continue; | ||
| } | ||
| } | ||
| } |
Copilot AI Dec 15, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Continuing on decryption failures changes the error handling behavior from the original implementation. The original code would immediately fail on the first decryption error, which helped detect protocol mismatches or corruption early.
With this change, legitimate protocol errors (e.g., corrupted packets, protocol version mismatches) will be silently ignored until the 5-second timeout expires. This could make debugging connection issues more difficult.
Consider adding a counter to limit the number of decryption failures allowed (e.g., max 3-5 failures) before returning an error. This would maintain the fix for duplicate RSA intro packets while still catching legitimate protocol errors quickly.
| match timeout_result { | |
| Ok(Ok(packet)) => { | |
| // Issue #2292: Check if this is a duplicate RSA intro packet (256 bytes). | |
| // The peer sends multiple intro packets for reliability during NAT traversal. | |
| // After we detect a new identity and create a new gateway_connection, | |
| // these duplicate intros get routed here instead of creating another connection. | |
| // We must ignore them and continue waiting for the actual ACK response. | |
| if packet.data().len() == RSA_INTRO_PACKET_SIZE { | |
| tracing::debug!( | |
| peer_addr = %remote_addr, | |
| direction = "inbound", | |
| packet_len = packet.data().len(), | |
| "Ignoring 256-byte packet during gateway handshake (likely duplicate RSA intro)" | |
| ); | |
| continue; | |
| } | |
| // Try to decrypt as symmetric ACK response | |
| match packet.try_decrypt_sym(&inbound_key) { | |
| Ok(_) => { | |
| // Successfully decrypted - this is the ACK we're waiting for | |
| break; | |
| } | |
| Err(_) => { | |
| tracing::debug!( | |
| peer_addr = %remote_addr, | |
| direction = "inbound", | |
| packet_len = packet.data().len(), | |
| "Failed to decrypt packet with inbound key, continuing to wait" | |
| ); | |
| // Continue waiting for a valid ACK | |
| continue; | |
| } | |
| } | |
| } | |
| // Add a counter for decryption failures | |
| let mut decryption_failures = 0; | |
| const MAX_DECRYPTION_FAILURES: u32 = 3; | |
| loop { | |
| match timeout_result { | |
| Ok(Ok(packet)) => { | |
| // Issue #2292: Check if this is a duplicate RSA intro packet (256 bytes). | |
| // The peer sends multiple intro packets for reliability during NAT traversal. | |
| // After we detect a new identity and create a new gateway_connection, | |
| // these duplicate intros get routed here instead of creating another connection. | |
| // We must ignore them and continue waiting for the actual ACK response. | |
| if packet.data().len() == RSA_INTRO_PACKET_SIZE { | |
| tracing::debug!( | |
| peer_addr = %remote_addr, | |
| direction = "inbound", | |
| packet_len = packet.data().len(), | |
| "Ignoring 256-byte packet during gateway handshake (likely duplicate RSA intro)" | |
| ); | |
| continue; | |
| } | |
| // Try to decrypt as symmetric ACK response | |
| match packet.try_decrypt_sym(&inbound_key) { | |
| Ok(_) => { | |
| // Successfully decrypted - this is the ACK we're waiting for | |
| break; | |
| } | |
| Err(_) => { | |
| decryption_failures += 1; | |
| tracing::debug!( | |
| peer_addr = %remote_addr, | |
| direction = "inbound", | |
| packet_len = packet.data().len(), | |
| decryption_failures, | |
| "Failed to decrypt packet with inbound key, continuing to wait" | |
| ); | |
| if decryption_failures >= MAX_DECRYPTION_FAILURES { | |
| return Err(TransportError::ConnectionEstablishmentFailure { | |
| cause: format!( | |
| "failed to decrypt {} packets with inbound key (max allowed: {})", | |
| decryption_failures, MAX_DECRYPTION_FAILURES | |
| ).into(), | |
| }); | |
| } | |
| // Continue waiting for a valid ACK | |
| continue; | |
| } | |
| } | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This concern is now addressed by the revised approach.
The fix has been moved to the routing layer (line 504), and the original handler code has been restored. The handler still fails fast on decryption errors - no behavior change.
Duplicate 256-byte RSA intro packets are filtered before they reach the handler, so this scenario no longer occurs.
[AI-assisted - Claude]
| return Err(TransportError::ConnectionEstablishmentFailure { | ||
| cause: "connection timed out waiting for response".into(), | ||
| }); | ||
| } | ||
| | ||
| let timeout_result = | ||
| tokio::time::timeout(remaining, next_inbound.recv_async()).await; | ||
| | ||
| match timeout_result { | ||
| Ok(Ok(packet)) => { | ||
| // Issue #2292: Check if this is a duplicate RSA intro packet (256 bytes). | ||
| // The peer sends multiple intro packets for reliability during NAT traversal. | ||
| // After we detect a new identity and create a new gateway_connection, | ||
| // these duplicate intros get routed here instead of creating another connection. | ||
| // We must ignore them and continue waiting for the actual ACK response. | ||
| if packet.data().len() == RSA_INTRO_PACKET_SIZE { | ||
| tracing::debug!( | ||
| peer_addr = %remote_addr, | ||
| direction = "inbound", | ||
| packet_len = packet.data().len(), | ||
| "Ignoring 256-byte packet during gateway handshake (likely duplicate RSA intro)" | ||
| ); | ||
| continue; | ||
| } | ||
| | ||
| // Try to decrypt as symmetric ACK response | ||
| match packet.try_decrypt_sym(&inbound_key) { | ||
| Ok(_) => { | ||
| // Successfully decrypted - this is the ACK we're waiting for | ||
| break; | ||
| } | ||
| Err(_) => { | ||
| tracing::debug!( | ||
| peer_addr = %remote_addr, | ||
| direction = "inbound", | ||
| packet_len = packet.data().len(), | ||
| "Failed to decrypt packet with inbound key, continuing to wait" | ||
| ); | ||
| // Continue waiting for a valid ACK | ||
| continue; | ||
| } | ||
| } | ||
| } | ||
| Ok(Err(_)) => { | ||
| return Err(TransportError::ConnectionEstablishmentFailure { | ||
| cause: "connection closed".into(), | ||
| }); | ||
| } | ||
| Err(_) => { | ||
| return Err(TransportError::ConnectionEstablishmentFailure { | ||
| cause: "connection timed out waiting for response".into(), | ||
| }); |
Copilot AI Dec 15, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error messages at lines 905 and 955 are identical: "connection timed out waiting for response". However, they represent different scenarios:
- Line 905: The deadline has been reached before attempting to receive a packet
- Line 955: A timeout occurred while waiting for a packet on the channel
Consider making these error messages more specific to aid debugging. For example:
- Line 905: "connection deadline reached waiting for ACK response"
- Line 955: "timeout receiving packet from channel"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This concern is now moot with the revised approach.
The fix has been moved to the routing layer, and the original handler code (with the single timeout) has been restored. The error messages remain unchanged from before this PR.
[AI-assisted - Claude]
d3e9693 to 20a3016 Compare Revised Fix Based on Review FeedbackThank you for the thorough skeptical review! You raised excellent points. I've revised the approach based on your feedback. What ChangedOriginal approach (now discarded):
Revised approach (Option B from your review):
Why This Is Better
Addressing Your Questions
The guard at line 550 prevents creating new gateway connections. But packets for existing
Exactly right - that's what this revised fix does! The 256-byte Filter is Safe BecauseDuring gateway handshake, we're waiting for a symmetric ACK response. Symmetric ACK packets:
So any 256-byte packet arriving at [AI-assisted - Claude] |
20a3016 to 1ab1f81 Compare Found the Real Root Cause!Your skepticism was warranted. Filtering duplicate intros alone wouldn't explain why the problem persisted until gateway restart. The Smoking Gun: Line 675if let TransportError::ConnectionEstablishmentFailure { cause } = error { cause.starts_with("remote is using a different protocol version"); // Does NOTHING! outdated_peer.insert(remote_addr, Instant::now()); // Always runs! }This is a standalone boolean expression that's discarded. It should be: if cause.starts_with("remote is using a different protocol version") { outdated_peer.insert(remote_addr, Instant::now()); }Full Bug Chain
The duplicate intro fix is still valuable (prevents the initial error), but this Both fixes are now included in the PR. [AI-assisted - Claude] |
| @claude Please re-review as there have been additional changes since your last review |
| Claude finished @sanity's task in 2m 0s —— View job Re-reviewing PR Changes
Review SummaryThis revision is significantly better than the initial approach. Both fixes are now minimal, targeted, and address real root causes. The author correctly identified and fixed two distinct bugs that combined to cause the reconnection failure. Fix 1: Duplicate RSA Intro Packet Filtering (Line 504) ✅What changed: if let Some(inbound_packet_sender) = ongoing_gw_connections.get(&remote_addr) { // NEW: Filter 256-byte RSA intro packets before routing to handler if size == RSA_INTRO_PACKET_SIZE { tracing::debug!(..., "Ignoring duplicate RSA intro packet during gateway handshake"); continue; } // Route to handler... }Why this works:
Correctness analysis:
Previous review concern addressed:
✅ Fixed. The revised approach filters at the routing layer (line 504), not in the consumption handler. Minor concern:
A more robust approach would be to actually try RSA decryption to confirm it's an intro packet, but the complexity isn't warranted given the low false-positive risk. Fix 2: Conditional |
Two bugs caused peers to be unable to reconnect after restarting with a new identity: 1. **Duplicate RSA intro packets misrouted (line 504)**: When a peer restarts, the gateway detects the new identity and creates a new gateway_connection. But the peer sends multiple RSA intro packets for NAT traversal reliability. These duplicate intros were routed to the new handler, which expected symmetric ACK responses, causing "invalid symmetric key" errors. Fix: Filter 256-byte packets at the routing layer - during gateway handshake, we're waiting for symmetric ACKs which are not 256 bytes. 2. **Buggy outdated_peer blocking (line 675)**: When ANY gateway connection failed, the peer was added to outdated_peer and blocked for 10 minutes. The code was: cause.starts_with("remote is using a different protocol version"); outdated_peer.insert(remote_addr, Instant::now()); The starts_with() evaluated to a boolean that was discarded - the insert always ran. This should only block protocol version mismatches. Fix: Wrap the insert in an if statement. Together these bugs meant: first reconnection attempt failed (bug 1), then all subsequent attempts were blocked for 10 minutes (bug 2). This explains why the issue only resolved after gateway restart (which clears the outdated_peer map). Fixes #2292 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> 1ab1f81 to 8ca6cd8 Compare | Thanks for the thorough re-review! I've addressed the minor nit:
Added clarifying comment: // Check if we already have a gateway connection in progress. // Note: This guard prevents CREATING duplicate connections. Packets for // EXISTING ongoing connections are routed above at line 504, where we // filter duplicate RSA intro packets (issue #2292).Ready for merge. [AI-assisted - Claude] |
| if let TransportError::ConnectionEstablishmentFailure { cause } = error { | ||
| cause.starts_with("remote is using a different protocol version"); | ||
| outdated_peer.insert(remote_addr, Instant::now()); | ||
| if cause.starts_with("remote is using a different protocol version") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this stuff is very britle, I think we need to move to better typed error catching so we can distinguish properly
Replace fragile string matching with proper typed error matching for protocol version mismatches: 1. Change `gateway_connection` to return `TransportError::ProtocolVersionMismatch` instead of `ConnectionEstablishmentFailure` with a string cause 2. Change `outdated_peer` check to use `matches!()` on the typed error instead of `cause.starts_with("remote is using a different protocol version")` This addresses Nacho's feedback on PR #2293 about the brittleness of string-based error detection. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Problem
When a peer restarts with a new identity, the gateway correctly detects this (#2277 fix), but the peer becomes unable to reconnect until the gateway is restarted. Two bugs cause this:
Bug 1: Duplicate RSA intro packets misrouted (line 504)
When the gateway detects a new identity, it creates a new
gateway_connection. But the peer sends multiple RSA intro packets for NAT traversal reliability. These duplicate intros were routed to the new handler, which expected symmetric ACK responses:is_new_identitydetected → newgateway_connectioncreatedtry_decrypt_sym()on RSA packet → fails with "invalid symmetric key"Bug 2: Buggy
outdated_peerblocking (line 675) - ROOT CAUSEWhen ANY gateway connection failed, the peer was blocked for 10 minutes:
The
cause.starts_with(...)evaluated to a boolean that was discarded - the insert always ran. This was intended to only block peers with incompatible protocol versions.Why gateway restart fixed it
The
outdated_peermap is a local variable - gateway restart clears it.This Solution
Fix 1: Filter duplicate intros at routing layer (line 504)
Fix 2: Only block protocol version mismatches (line 675)
Testing
connection_handlertests pass (14/14)gateway_handles_peer_reconnection_same_addr_new_identitytest validates the peer restart scenarioFixes
Closes #2292
[AI-assisted - Claude]