Overview
This case study details the investigation and resolution of a routing issue that broke the high availability of a site-to-site VPN connection between AWS-hosted infrastructure and an on-premises data center. The VPN was established to facilitate secure integration with a third-party API service.
Background
As part of a feature implementation project, my team needed to integrate with a third-party API service that required communication over a VPN tunnel. The third party maintained an on-premises environment, while our infrastructure was hosted on Amazon Web Services (AWS). The integration was to be achieved through a site-to-site VPN connection, ensuring a secure and private communication channel between both networks.
The service provider supplied the necessary VPN configuration details, and I was responsible for setting up the AWS side of the connection. The configuration involved creating a Customer Gateway (CGW), a Virtual Private Gateway (VGW), and establishing the site-to-site VPN connection using the parameters provided. Additionally, I updated the route tables to ensure all traffic destined for the on-prem network was directed through the VPN tunnel.
To ensure a high availability setup, I configured two VPN tunnels. The secondary tunnel was intended to take over automatically in the event of a failure or maintenance on the primary tunnel.
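The setup described above can be sketched as follows. This is a minimal illustration, not the actual configuration: it only builds the request dictionaries that would be passed to boto3's EC2 calls (create_customer_gateway, create_vpn_connection, create_route), and every IP address, ASN, ID, and pre-shared key shown is a placeholder.

```python
def customer_gateway_params(public_ip: str, bgp_asn: int = 65000) -> dict:
    """Parameters for ec2.create_customer_gateway (the on-prem VPN endpoint)."""
    return {"Type": "ipsec.1", "PublicIp": public_ip, "BgpAsn": bgp_asn}

def vpn_connection_params(cgw_id: str, vgw_id: str) -> dict:
    """Parameters for ec2.create_vpn_connection with two tunnels for HA.

    StaticRoutesOnly=True reflects a static (non-BGP) setup; the two
    TunnelOptions entries correspond to the primary and secondary tunnels.
    """
    return {
        "Type": "ipsec.1",
        "CustomerGatewayId": cgw_id,
        "VpnGatewayId": vgw_id,
        "Options": {
            "StaticRoutesOnly": True,
            "TunnelOptions": [
                {"PreSharedKey": "PLACEHOLDER_KEY_1"},  # primary tunnel
                {"PreSharedKey": "PLACEHOLDER_KEY_2"},  # secondary tunnel
            ],
        },
    }

def onprem_route_params(route_table_id: str, onprem_cidr: str, vgw_id: str) -> dict:
    """Parameters for ec2.create_route: send on-prem-bound traffic to the VGW."""
    return {
        "RouteTableId": route_table_id,
        "DestinationCidrBlock": onprem_cidr,
        "GatewayId": vgw_id,
    }
```

In a real deployment these dictionaries would be unpacked into the corresponding boto3 EC2 client calls after the CGW and VGW have been created and the VGW attached to the VPC.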
Implementation and Monitoring
After configuring the tunnels and setting the pre-shared keys, both tunnels came up successfully. Connectivity was validated by initiating a telnet connection to the on-premises server’s private IP and port, which confirmed successful communication.
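The telnet check above can be scripted for repeatable validation. The sketch below is an assumed equivalent, not the exact check used: a completed TCP handshake to the on-prem host and port proves bidirectional reachability through the tunnel.

```python
import socket

def tcp_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Mirrors a manual `telnet <ip> <port>` test: success requires the SYN to
    reach the peer AND the SYN-ACK to make it back, so it exercises both
    directions of the path.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Running this check through each tunnel individually (rather than only through whichever tunnel currently carries traffic) would have surfaced the failover problem before production.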
To maintain visibility and proactive alerting, I enabled Amazon CloudWatch metrics for both VPN tunnels and configured alarms to notify the team whenever either tunnel went down. With the setup complete and all tests passed, the system was moved into production.
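A tunnel-down alarm like the one described can be expressed as parameters for CloudWatch's put_metric_alarm. This is a hedged sketch: the VPN ID, tunnel outside IP, and SNS topic ARN are placeholders, and the thresholds are illustrative rather than the values actually deployed. TunnelState in the AWS/VPN namespace is 1 when a tunnel is up and 0 when it is down, so alarming on "TunnelState < 1" fires whenever the tunnel drops.

```python
def tunnel_state_alarm_params(vpn_id: str, tunnel_ip: str, sns_topic_arn: str) -> dict:
    """Parameters for cloudwatch.put_metric_alarm on a single tunnel's state."""
    return {
        "AlarmName": f"vpn-tunnel-down-{tunnel_ip}",
        "Namespace": "AWS/VPN",
        "MetricName": "TunnelState",  # 1 = up, 0 = down
        "Dimensions": [
            {"Name": "VpnId", "Value": vpn_id},
            {"Name": "TunnelIpAddress", "Value": tunnel_ip},
        ],
        "Statistic": "Maximum",
        "Period": 60,              # evaluate every minute
        "EvaluationPeriods": 3,    # require 3 consecutive breaches
        "Threshold": 1,
        "ComparisonOperator": "LessThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }
```

One such alarm per tunnel (two in total) gives independent notification for the primary and secondary paths.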
Incident
Some time after deployment, the primary VPN tunnel (Tunnel 1) was taken down for routine maintenance. Shortly after, our monitoring system triggered an alert indicating Tunnel 1 was down. However, despite having Tunnel 2 configured for redundancy, transaction failures began to occur.
Initial troubleshooting revealed that although Tunnel 2 reported an “up” state, communication through it was failing. It became evident that, until that point, all network traffic had been routed exclusively through Tunnel 1, and failover was not functioning as expected.
To restore services and minimize downtime, I reactivated Tunnel 1. This restored connectivity and resumed transactions, but it confirmed that the failover mechanism was not working correctly.
Root Cause Analysis
Following service restoration, I initiated a root cause analysis (RCA) session with the third-party service provider. Preliminary checks confirmed that both VPN tunnels were established and active. To gather more insight, I enabled VPN logging and reviewed the CloudWatch log group, as well as the TunnelDataIn and TunnelDataOut metrics for both tunnels.
The metrics revealed a key pattern:
Tunnel 1: Active inbound and outbound data flow.
Tunnel 2: Outbound data during failover attempts, but no inbound responses.
This indicated that traffic from our side was reaching the on-premises network through Tunnel 2, but the responses were not being routed back correctly.
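The diagnostic pattern above can be captured in a small classifier over TunnelDataIn/TunnelDataOut samples. This is an illustrative sketch of the reasoning, not a tool used during the incident: a tunnel that sends bytes but receives none is exactly the asymmetric, broken-return-path signature seen on Tunnel 2.

```python
def diagnose_tunnel(data_in_bytes: list[float], data_out_bytes: list[float]) -> str:
    """Classify a tunnel's traffic from CloudWatch TunnelDataIn/TunnelDataOut samples.

    "healthy"     -> bytes flowing in both directions
    "asymmetric"  -> outbound traffic with no inbound responses
                     (the return path is broken, as on Tunnel 2)
    "idle"        -> no traffic either way
    """
    sent = sum(data_out_bytes) > 0
    received = sum(data_in_bytes) > 0
    if sent and received:
        return "healthy"
    if sent:
        return "asymmetric"
    if received:
        return "asymmetric"  # inbound-only is the mirror-image failure
    return "idle"
```

Applied to the incident metrics, Tunnel 1's samples classify as healthy while Tunnel 2's outbound-only samples classify as asymmetric, matching the conclusion drawn from the CloudWatch graphs.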
Upon further investigation with the provider, we identified the root cause: the issue was route-based. The on-premises (customer gateway) side was correctly receiving traffic from our secondary tunnel, but an incorrect internal routing setup sent the return traffic to the wrong internal server. This misconfiguration prevented successful bidirectional communication during failover events.
Resolution
Once the routing issue on the on-premises side was corrected, we simulated another maintenance event by intentionally bringing down Tunnel 1. This time failover occurred seamlessly: traffic transitioned to the secondary tunnel without service interruption, and monitoring metrics confirmed continuous bidirectional data flow through it.
Outcome
After the corrective action, VPN failover worked as intended, significantly improving service reliability. The resolution reduced downtime incidents and cut the associated business impact by 45%, ensuring a stable and redundant communication channel between AWS and the on-premises infrastructure.
Key Takeaways
High Availability Validation: Regular failover testing is essential even when VPN tunnels report as active.
Comprehensive Monitoring: CloudWatch metrics provided critical visibility that aided in identifying asymmetric traffic flow.
Collaborative Troubleshooting: Effective communication between AWS and on-prem network teams was key to resolving the issue quickly.
Route Verification: Route-based VPNs require careful verification on both ends to ensure symmetrical routing.
All rights reserved. This article was written by Feyisayo Lasisi and is protected by copyright.