I have been struggling with the VPN setup between on-prem and GCP for more than a week. I am completely out of ideas at this point, and would love to get some help from network specialists.
Goal
The end goal is simple: to get a VM instance on GCP to seamlessly talk to a VM on-prem - but with 2 routers in play.
The setup is something like below:
    GCP_VM                                             OP_VM
    10.0.0.25                                      10.100.0.200
    |                                                    |
    |                                           (DC Router Gateway)
    |                                               10.100.0.80
    |                                                    |
    └-- HA_VPN (AS65001) <==========> Router (AS65002) --┘

    Public IP:     xx.xx.xx.xx            yy.yy.yy.yy
    Advertise:     10.0.0.0/24 BGP        10.100.0.0/24 BGP
    VPN IP Range:  169.254.0.1/30         169.254.0.2 (as Peer)
    Private IP:    NA                     10.100.0.50

The complication is that Router is not directly connected to OP_VM. This is the on-prem setup, which we have no control over. OP_VM gets its IP 10.100.0.200 from some other router, and our Router is placed on the same LAN. We only get a single rack in the data centre, and need to reach OP_VM, which is hosted by another party (in some other rack). Our rack is associated with 10.100.0.50.
With this, I want to get the following to work:

    me@GCP_VM:10.0.0.25:~$ ping 10.100.0.200

Current Status
With the above setup, VPN and BGP seem healthy from the logs on both sides.
From GCP_VM, I can ping 10.100.0.50 (Router) successfully.
    me@GCP_VM:10.0.0.25:~$ ping 10.100.0.50
    PING 10.100.0.50 (10.100.0.50) 56(84) bytes of data.
    64 bytes from 10.100.0.50: icmp_seq=1 ttl=254 time=24.9 ms
    ...

Also, from Router, I could confirm I can ping 10.100.0.200 (OP_VM).
    # With the Router setup of something like
    #
    #     ip route 10.100.0.0/24 gateway 10.100.0.80
    root@Router:10.100.0.50:~$ ping 10.100.0.200
    ping 10.100.0.200
    received from 10.100.0.200: icmp_seq=0 ttl=63 time=0.583ms
    received from 10.100.0.200: icmp_seq=1 ttl=63 time=0.571ms
    2 packets transmitted, 2 packets received, 0.0% packet loss
    round-trip min/avg/max = 0.571/0.577/0.583 ms

From GCP_VM, though, ping to 10.100.0.200 (OP_VM) goes missing.
    # With the Router setup of something like
    #
    #     ip route 10.100.0.0/24 gateway 10.100.0.80
    me@GCP_VM:10.0.0.25:~$ ping 10.100.0.200
    PING 10.100.0.200 (10.100.0.200) 56(84) bytes of data.
    ^C
    --- 10.100.0.200 ping statistics ---
    4 packets transmitted, 0 received, 100% packet loss, time 3051ms

I'm probably misunderstanding the gateway setup, but changing the route like below gives me a different result:
    # With the Router setup of something like
    #
    #     ip route 10.100.0.0/24 gateway 10.100.0.50
    #                                             ~~ <- Router itself
    me@GCP_VM:10.0.0.25:~$ ping 10.100.0.200
    PING 10.100.0.200 (10.100.0.200) 56(84) bytes of data.
    From 169.254.0.2 icmp_seq=7 Destination Host Unreachable
    From 169.254.0.2 icmp_seq=6 Destination Host Unreachable
    From 169.254.0.2 icmp_seq=5 Destination Host Unreachable
    From 169.254.0.2 icmp_seq=4 Destination Host Unreachable
    From 169.254.0.2 icmp_seq=3 Destination Host Unreachable
    From 169.254.0.2 icmp_seq=2 Destination Host Unreachable
    From 169.254.0.2 icmp_seq=1 Destination Host Unreachable
    ^C
    --- 10.100.0.200 ping statistics ---
    9 packets transmitted, 0 received, +7 errors, 100% packet loss, time 8141ms
    pipe 7

With this gateway setup, Router can no longer ping OP_VM. This at least suggests to me that the VPN is established and the IP range is advertised correctly. But it does not look right from an actual networking point of view.
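To make the two hops explicit, here is a minimal Python sketch of how each on-prem host would pick its next hop. It assumes a /24 mask on every 10.100.0.x host and that OP_VM's default gateway is the DC router at 10.100.0.80; neither is confirmed for OP_VM, and `next_hop` is a hypothetical helper, not anything the Yamaha router actually runs.

```python
import ipaddress

# Assumption: every on-prem host uses the /24 mask of the advertised range.
LAN = ipaddress.ip_network("10.100.0.0/24")


def next_hop(dst: str, on_link: ipaddress.IPv4Network, default_gw: str) -> str:
    """Return the address a host hands the packet to for destination dst.

    On-link destinations are reached directly (resolved via ARP);
    everything else is sent to the host's default gateway.
    """
    return dst if ipaddress.ip_address(dst) in on_link else default_gw


# Router (10.100.0.50) forwarding the echo request to OP_VM: delivered directly.
print(next_hop("10.100.0.200", LAN, default_gw="10.100.0.80"))  # 10.100.0.200

# OP_VM replying to GCP_VM: off-link, so the reply goes to OP_VM's gateway
# (assumed to be 10.100.0.80), not back to Router at 10.100.0.50.
print(next_hop("10.0.0.25", LAN, default_gw="10.100.0.80"))  # 10.100.0.80
```

If those assumptions hold, the reply from OP_VM would leave via 10.100.0.80, which may have no route back to 10.0.0.0/24. That would be consistent with the echo request going missing without any error, though I have not verified it.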
Questions
I don't think there is much more to be done on the GCP side, and the issue seems to be purely on the on-prem side.
Are there any setup issues or concerns that may cause misbehaviour of VPN, BGP, ARP, etc.? What would cause a case where routes seem to be shared, but the destinations cannot actually be reached?
Other Notes
- I have confirmed the ARP table on Router includes 10.100.0.200
- I can see the routes propagated in GCP
- I have tested with GCP VPC's Firewall setup to allow 169.254.0.0/30 and 10.100.0.0/24
- I will need access from GKE in the end, but I have confirmed GKE is getting the same exact behaviour as GCP_VM
- Router is from Yamaha
- Tried TCPdump (packetdump in Yamaha routers), but did not see 10.0.0.25 in the log
- TCPdump did show the trace of 10.0.0.25 when I ran nmap -Pn 10.100.0.200 from GCP_VM, but with single line like this:
    2019/12/21 16:35:40: LAN1 OUT:IP TCP 10.100.0.227:50516 > 10.103.24.1:80

Update (24th Dec)
I ran tcpdump for a simple ping between GCP_VM and Router.
From GCP_VM to Router (logs from GCP_VM)
    $ ping 10.100.0.50 > /dev/null &
    $ sudo tcpdump -i eth0 | grep 10.100
    ...
    18:49:18.696178 IP GCP_VM.(snip) > 10.100.0.50: ICMP echo request, id 32396, seq 0, length 64
    18:49:18.700395 IP 10.100.0.50 > GCP_VM.(snip): ICMP echo reply, id 32396, seq 0, length 64

From Router to GCP_VM (logs from GCP_VM)
    # ping from Router, with `ping 10.0.0.25`
    $ sudo tcpdump -i eth0 | grep 169.254
    ...
    18:40:18.554555 IP 169.254.0.2 > GCP_VM.(snip): ICMP echo request, id 3369, seq 0, length 72
    18:40:18.554586 IP GCP_VM.(snip) > 169.254.0.2: ICMP echo reply, id 3369, seq 0, length 72

Although tcpdump shows the reply is being sent here, it is never received by Router.
Also, ping to 169.254.0.2 from GCP_VM gets no reply.
    $ ping 169.254.0.2 > /dev/null &
    $ sudo tcpdump -i eth0 | grep 169.254
    ...
    18:59:07.113101 IP GCP_VM.(snip) > 169.254.0.2: ICMP echo request, id 32531, seq 0, length 64
    18:59:08.137103 IP GCP_VM.(snip) > 169.254.0.2: ICMP echo request, id 32531, seq 1, length 64
    ...

Update (27th Dec)
Ping from the Router was successful after setting its source address to 10.100.0.50, as it was trying to use 169.254.0.2 by default.
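One plausible reading of why the source address mattered, sketched in Python: 169.254.0.2 is a link-local tunnel address that is not covered by either advertised prefix, whereas 10.100.0.50 sits inside 10.100.0.0/24. The prefixes come from the setup diagram; `describe_source` is a hypothetical helper for illustration only, and the conclusion that GCP drops replies to the link-local address is my assumption, not something I confirmed.

```python
import ipaddress

# Prefixes advertised over BGP, as listed in the setup diagram.
ADVERTISED = [
    ipaddress.ip_network("10.0.0.0/24"),
    ipaddress.ip_network("10.100.0.0/24"),
]


def describe_source(addr: str) -> str:
    """Classify a candidate ping source address against the advertised routes."""
    ip = ipaddress.ip_address(addr)
    if ip.is_link_local:
        # 169.254.0.0/16: used here only for the tunnel/BGP peering.
        return "link-local, not advertised"
    if any(ip in net for net in ADVERTISED):
        return "inside an advertised prefix"
    return "not advertised"


for src in ("169.254.0.2", "10.100.0.50"):
    print(src, "->", describe_source(src))
```

Under that reading, echo replies addressed to 169.254.0.2 have no advertised route to follow, while replies to 10.100.0.50 do.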
The ping still doesn't reach OP_VM, and I'm still working through a NAT configuration issue to make sure the translation is applied correctly.
Update (31st Dec)
The connection has finally been set up. I'll summarise the steps taken in a separate answer to declutter the question.