I have been struggling with the VPN setup between on-prem and GCP for more than a week. I am completely out of ideas at this point, and would love to get some help from network specialists.

Goal

The end goal is simple: to get a VM instance on GCP to seamlessly talk to a VM on-prem - but with 2 routers in play.
The setup is something like below:

                 GCP_VM                                                OP_VM
                10.0.0.25                                          10.100.0.200
                    |                                                    |
                    |                                           (DC Router Gateway)
                    |                                               10.100.0.80
                    |                                                    |
                    └-- HA_VPN (AS65001) <==========> Router (AS65002) --┘
Public IP:              xx.xx.xx.xx                   yy.yy.yy.yy
Advertise:              10.0.0.0/24 BGP               10.100.0.0/24 BGP
VPN IP Range:           169.254.0.1/30                169.254.0.2 (as Peer)
Private IP:             NA                            10.100.0.50

The complication is that Router is not directly connected to OP_VM. This is an on-prem setup we have no control over. OP_VM gets its IP 10.100.0.200 from some other router, and our Router is placed on the same LAN. We only get a single rack in the data centre, and need to reach OP_VM, which is hosted by another party (in some other rack). Our rack is associated with 10.100.0.50.

With this, I want the following to work:

me@GCP_VM:10.0.0.25:~$ ping 10.100.0.200 

Current Status

With the above setup, VPN and BGP seem healthy from the logs on both sides.

From GCP_VM, I can ping 10.100.0.50 (Router) successfully.

me@GCP_VM:10.0.0.25:~$ ping 10.100.0.50
PING 10.100.0.50 (10.100.0.50) 56(84) bytes of data.
64 bytes from 10.100.0.50: icmp_seq=1 ttl=254 time=24.9 ms
...

Also, from Router, I could confirm that I can ping 10.100.0.200 (OP_VM).

# With the Router setup of something like
#
#   ip route 10.100.0.0/24 gateway 10.100.0.80
root@Router:10.100.0.50:~$ ping 10.100.0.200
ping 10.100.0.200
received from 10.100.0.200: icmp_seq=0 ttl=63 time=0.583ms
received from 10.100.0.200: icmp_seq=1 ttl=63 time=0.571ms
2 packets transmitted, 2 packets received, 0.0% packet loss
round-trip min/avg/max = 0.571/0.577/0.583 ms

From GCP_VM, though, ping to 10.100.0.200 (OP_VM) goes missing.

# With the Router setup of something like
#
#   ip route 10.100.0.0/24 gateway 10.100.0.80
me@GCP_VM:10.0.0.25:~$ ping 10.100.0.200
PING 10.100.0.200 (10.100.0.200) 56(84) bytes of data.
^C
--- 10.100.0.200 ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 3051ms

I'm probably misunderstanding the gateway setup, but changing the route like below gives me a different result:

# With the Router setup of something like
#
#   ip route 10.100.0.0/24 gateway 10.100.0.50
#                                           ~~ <- Router itself
me@GCP_VM:10.0.0.25:~$ ping 10.100.0.200
PING 10.100.0.200 (10.100.0.200) 56(84) bytes of data.
From 169.254.0.2 icmp_seq=7 Destination Host Unreachable
From 169.254.0.2 icmp_seq=6 Destination Host Unreachable
From 169.254.0.2 icmp_seq=5 Destination Host Unreachable
From 169.254.0.2 icmp_seq=4 Destination Host Unreachable
From 169.254.0.2 icmp_seq=3 Destination Host Unreachable
From 169.254.0.2 icmp_seq=2 Destination Host Unreachable
From 169.254.0.2 icmp_seq=1 Destination Host Unreachable
^C
--- 10.100.0.200 ping statistics ---
9 packets transmitted, 0 received, +7 errors, 100% packet loss, time 8141ms
pipe 7

With this gateway setup, Router can no longer ping OP_VM. The "Destination Host Unreachable" replies from 169.254.0.2 at least suggest to me that the VPN is established and the routes are advertised correctly, but the setup does not look right from an actual networking point of view.

Questions

I don't think there is much more to be done on the GCP side, and the issue seems to be purely on the on-prem side.

Are there any setup issues or concerns that may cause misbehaviour of VPN, BGP, ARP, etc.? What would cause a case where routes seem to be shared, but the destinations cannot actually be reached?


Other Notes

  • I have confirmed the ARP table on Router includes 10.100.0.200
  • I can see the routes propagated in GCP
  • I have tested with GCP VPC's Firewall setup to allow 169.254.0.0/30 and 10.100.0.0/24
  • I will need access from GKE in the end, but I have confirmed GKE is getting the same exact behaviour as GCP_VM
  • Router is from Yamaha
  • Tried TCPdump (packetdump in Yamaha routers), but did not see 10.0.0.25 in the log
  • TCPdump did show the trace of 10.0.0.25 when I ran nmap -Pn 10.100.0.200 from GCP_VM, but with single line like this:
2019/12/21 16:35:40: LAN1 OUT:IP TCP 10.100.0.227:50516 > 10.103.24.1:80 

Update (24th Dec)

I have done tcpdump for simple ping between GCP_VM and Router.

From GCP_VM to Router (logs from GCP_VM)

$ ping 10.100.0.50 > /dev/null &
$ sudo tcpdump -i eth0 | grep 10.100
...
18:49:18.696178 IP GCP_VM.(snip) > 10.100.0.50: ICMP echo request, id 32396, seq 0, length 64
18:49:18.700395 IP 10.100.0.50 > GCP_VM.(snip): ICMP echo reply, id 32396, seq 0, length 64

From Router to GCP_VM (logs from GCP_VM)

# ping from Router, with `ping 10.0.0.25`
$ sudo tcpdump -i eth0 | grep 169.254
...
18:40:18.554555 IP 169.254.0.2 > GCP_VM.(snip): ICMP echo request, id 3369, seq 0, length 72
18:40:18.554586 IP GCP_VM.(snip) > 169.254.0.2: ICMP echo reply, id 3369, seq 0, length 72

Although tcpdump shows the reply is being sent here, it is never received by Router.
Also, ping to 169.254.0.2 from GCP_VM gets no reply.

$ ping 169.254.0.2 > /dev/null &
$ sudo tcpdump -i eth0 | grep 169.254
...
18:59:07.113101 IP GCP_VM.(snip) > 169.254.0.2: ICMP echo request, id 32531, seq 0, length 64
18:59:08.137103 IP GCP_VM.(snip) > 169.254.0.2: ICMP echo request, id 32531, seq 1, length 64
...

Update (27th Dec)

Ping from the Router was successful after setting its source address to 10.100.0.50, as it was trying to use 169.254.0.2 by default.
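
A rough way to see why the source address matters, sketched in Python. The route list is an assumption (only the two /24 networks from the diagram, with the 169.254.0.0/30 link range absent), and `reply_is_routable` is a hypothetical helper for illustration, not a GCP API:

```python
import ipaddress

# Assumed prefixes the VPC can route replies to (hypothetical, taken from the diagram).
VPC_ROUTES = ["10.0.0.0/24", "10.100.0.0/24"]


def reply_is_routable(source_ip, routes=VPC_ROUTES):
    """Return True if a reply addressed to source_ip matches any known prefix."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in ipaddress.ip_network(prefix) for prefix in routes)


# Compare the Router's default source address with its LAN address.
for src in ("169.254.0.2", "10.100.0.50"):
    print(src, "->", "routable" if reply_is_routable(src) else "no route back")
```

Under that assumption, a ping sourced from 169.254.0.2 gets a reply that has nowhere to go, while one sourced from 10.100.0.50 falls inside the advertised 10.100.0.0/24.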

The ping from GCP_VM still doesn't reach OP_VM, and I'm still working through a NAT configuration issue to get the address translation right.

Update (31st Dec)

The connection has finally been set up. I'll summarise the steps taken in a separate answer to declutter the question.

  • Please add your solution as you mentioned in your last update. Commented Jan 13, 2020 at 22:15
  • I have added further clarification and solution I had to put in place in an answer below. Commented Jan 15, 2020 at 13:06

2 Answers

It looks like a routing problem on-prem. I think OP_VM doesn't have a route to 10.0.0.0/24, so it sends its replies to its default gateway, DC Router Gateway (10.100.0.80), where they are dropped because DC Router Gateway also has no route to 10.0.0.0/24 (your BGP peering is on Router, not on the gateway).

To solve it, you should set a static route on OP_VM to 10.0.0.0/24 via Router, and keep DC Router Gateway as the default gateway.

You also have to remove the route ip route 10.100.0.0/24 gateway 10.100.0.50 from Router, because network 10.100.0.0/24 is directly connected to it.
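
To illustrate the reasoning, here is a minimal Python sketch of a longest-prefix-match lookup. The routing tables are assumptions about what OP_VM might hold (its real table is not shown anywhere in this thread), and `next_hop` is a hypothetical helper, not a command on any of the devices:

```python
import ipaddress


def next_hop(routing_table, destination):
    """Return the next hop chosen by longest-prefix match, or None if nothing matches."""
    dest = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(prefix), hop)
        for prefix, hop in routing_table
        if dest in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None
    # The most specific (longest) prefix wins.
    return max(matches, key=lambda match: match[0].prefixlen)[1]


# Assumed OP_VM table before the fix: its own LAN plus a default route (hypothetical).
op_vm_before = [
    ("10.100.0.0/24", "on-link"),
    ("0.0.0.0/0", "10.100.0.80"),  # DC Router Gateway
]
# Same table with the suggested static route via Router added.
op_vm_after = op_vm_before + [("10.0.0.0/24", "10.100.0.50")]

print(next_hop(op_vm_before, "10.0.0.25"))  # reply goes to the default gateway
print(next_hop(op_vm_after, "10.0.0.25"))   # reply goes to Router
```

With only the default route, a reply to 10.0.0.25 is handed to 10.100.0.80; with the static route in place, the more specific /24 sends it to 10.100.0.50 instead.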

EDIT

From GCP_VM, I can ping 10.100.0.50 (Router) successfully.

At this point it looks like you have properly configured peering between Router and HA_VPN.

To be on the right path, you should be able to ping GCP_VM and OP_VM from Router, and also Router from OP_VM.

With the Router setup of something like

 ip route 10.100.0.0/24 gateway 10.100.0.80 

With the Router setup of something like

 ip route 10.100.0.0/24 gateway 10.100.0.80 

You don't need these routes, because Router is directly connected to subnet 10.100.0.0/24 and has the IP 10.100.0.50 in it.

From GCP_VM, though, ping to 10.100.0.200 (OP_VM) goes missing.

This is expected: as I mentioned above, OP_VM and DC Router Gateway don't have a route to 10.0.0.0/24, so they can't reply. You have to set a static route on OP_VM to 10.0.0.0/24 via Router, and keep DC Router Gateway as the default gateway.

EDIT 2

OP_VM sends its replies to DC Router Gateway because it doesn't have a route to 10.0.0.0/24 and tries to reach it via its default gateway. At DC Router Gateway they are dropped, because there is no route there either.

You should add a static route to 10.0.0.0/24 via Router, either on OP_VM or on DC Router Gateway, to solve it.

  • Thanks for the insight, I'm suspecting that scenario as well. 10.0.0.0/24 is advertised to the Router via BGP, and that should be sufficient for the routing. However, I am seeing the ping from Router to 10.0.0.25 failing, although the other way around works. There seems to be something missing in the Router, or potentially GCP FW setup. Commented Dec 24, 2019 at 1:58
  • The firewall at GCP should accept the on-prem network 10.100.0.0/24 that you advertise via BGP (it must be in the routing table at Router and, after peering, at HA_VPN), and vice versa, the on-prem firewall should accept the cloud network 10.0.0.0/24 (it must be in the routing table of HA_VPN and, after peering, at Router). You should be able to ping everything in the cloud from Router and vice versa. What's the default gateway of your Router? Commented Dec 24, 2019 at 6:05
  • FW is set on GCP to allow the traffic to/from 10.100.0.0/24, and on-prem has the same for 10.0.0.0/24. I have done the tcpdump on both ends, and it does look like the ICMP is going through both routes and passing the FW correctly. I have updated the original question with some more details. But now it looks like the route from GCP to 169.254.0.2 isn't going through. At this point, I'm not sure if this has to do with the main issue of GCP_VM not being able to reach 10.100.0.200 though... Commented Dec 24, 2019 at 19:07
  • If the firewalls are configured properly, check the routing tables at Router and HA_VPN. Why did you decide to use link-local IP addresses for peering? Commented Dec 24, 2019 at 21:18
  • That may well be the case. Neither OP_VM nor DC Router Gateway is under my control, so I will need a third-party vendor to look into them. While I wait for help from that end, I will see if NAT is going to help. Commented Dec 24, 2019 at 21:54

After much testing and debugging, I have resolved the connection issue between GCP and on-prem. Below are the steps taken, along with the considerations made while pinpointing the problem.

Analyse Traffic from Both Directions

What I had been missing was a breakdown of the traffic in both directions. That means tracing how each packet travels between source and destination, which gives a clear view of where the root cause could lie.

Packet from GCP to on-prem

  1. Packet sent from GCP_VM (10.0.0.25) to HA_VPN (169.254.0.1/30)
  2. Packet sent from GCP_VM (10.0.0.25) to Router (AS65002) (10.100.0.50)
  3. Packet sent from GCP_VM (10.0.0.25) to DC Router Gateway (10.100.0.80)
  4. Packet sent from GCP_VM (10.0.0.25) to OP_VM (10.100.0.200)
  5. Packet sent from Router (AS65002) (10.100.0.50) to OP_VM (10.100.0.200)

Packets from on-prem to GCP (return route)

  6. Packet sent from OP_VM (10.100.0.200) to Router (AS65002) (10.100.0.50)
  7. Packet sent from OP_VM (10.100.0.200) to GCP_VM (10.0.0.25)
  8. Packet sent from Router (AS65002) (10.100.0.50) to GCP_VM (10.0.0.25)

Given the above checkpoints, the status of each was as follows:

  1. This was not tested, as the Cloud VPN endpoint (169.254.0.1/30) was not part of the routes in the VPC
  2. I could confirm the ping hitting Router (AS65002) (10.100.0.50), and also the response returned (this means the corresponding #8 is also confirmed)
  3. I could NOT confirm the ping hitting DC Router Gateway (10.100.0.80), as the ping did not receive a response
  4. I could NOT confirm the ping hitting OP_VM (10.100.0.200), as the ping did not receive a response
  5. I could confirm the ping hitting OP_VM (10.100.0.200), and also the response returned (this means the corresponding #6 is also confirmed)
  6. As mentioned, #5 confirmed this traffic as well
  7. No traffic matches this case
  8. As mentioned, #2 confirmed this traffic as well

The diagram below describes the situation:

       GCP_VM        HA_VPN (AS65001)    Router (AS65002)    (DC Router Gateway)    OP_VM
       10.0.0.25                         10.100.0.50         10.100.0.80            10.100.0.200
1.     NA
2.     +--------------------------------> OK (response returned)
3.     +----------------------------------------------------x NG?
4.     +---------------------------------------------------------------------------x NG?
5.                                       +-----------------------------------------> OK (response returned)
6.                                    OK <-----------------------------------------+
7.     No matching traffic
8.  OK <--------------------------------+

This narrows any traffic initiated from GCP_VM down to two possible issues:

  • Possibility A. Packet is not reaching OP_VM (10.100.0.200)
  • Possibility B. Packet is reaching OP_VM (10.100.0.200), but response is not getting back to Router (AS65002) (10.100.0.50)

I could confirm with #5 and #6 above that the packet does reach OP_VM (10.100.0.200) when initiated at Router (AS65002) (10.100.0.50). That means Possibility A is unlikely. The forward routing itself is working as it should.

This meant there was a high chance that the ping response was being lost and never making it back to Router (AS65002) (10.100.0.50). In my specific case, Possibility B was indeed the root cause of the problem.

I could confirm this by creating a mock network mimicking the same setup as above, and using Wireshark to listen at each point. It showed that the diagram below is what was actually happening.

       GCP_VM        HA_VPN (AS65001)    Router (AS65002)    (DC Router Gateway)    OP_VM
       10.0.0.25                         10.100.0.50         10.100.0.80            10.100.0.200
1.     NA
2.     +--------------------------------> OK (response returned)
3.     +----------------------------------------------------> OK
4.     +---------------------------------------------------------------------------> OK
5.                                       +-----------------------------------------> OK (response returned)
6.                                    OK <-----------------------------------------+
7.     Where should I send packet to??             LOST ---------------------------+
8.  OK <--------------------------------+

Solution

In my case, when the traffic is initiated at GCP_VM, the source IP is set to 10.0.0.25. This meant that when OP_VM tried to send the response back, it had no route to 10.0.0.25, and the packet was lost.

I had to add a static NAT entry at Router (AS65002) to map source IP of 10.0.0.25 to 10.100.0.50 when the packet leaves Router (AS65002), so that the OP_VM can properly route the traffic back to Router (AS65002). After the response is received, NAT takes effect again, and Router (AS65002) then replaces 10.100.0.50 with 10.0.0.25, and sends packets back to GCP_VM.
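
The translation described above can be sketched as a toy model in Python. This is not the Yamaha configuration (which is not shown here); `Packet` and `SourceNat` are hypothetical names, and a real NAT would also track protocol details such as ICMP IDs or ports rather than just the remote address:

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str


class SourceNat:
    """Toy source NAT: rewrite the source on the way out, restore it on the way back."""

    def __init__(self, nat_ip):
        self.nat_ip = nat_ip
        self.sessions = {}  # remote host -> original source address

    def outbound(self, pkt):
        # Remember who originally sent the packet, then masquerade as the router.
        self.sessions[pkt.dst] = pkt.src
        return replace(pkt, src=self.nat_ip)

    def inbound(self, pkt):
        # Only rewrite replies addressed to the NAT IP from a known remote host.
        original = self.sessions.get(pkt.src)
        if pkt.dst != self.nat_ip or original is None:
            return pkt
        return replace(pkt, dst=original)


nat = SourceNat("10.100.0.50")  # Router's LAN address
request = nat.outbound(Packet(src="10.0.0.25", dst="10.100.0.200"))
reply = Packet(src=request.dst, dst=request.src)  # OP_VM answers what it saw
restored = nat.inbound(reply)
print(request)
print(restored)
```

From OP_VM's point of view the request comes from 10.100.0.50 on its own LAN, so it needs no route to 10.0.0.0/24; the router restores 10.0.0.25 as the destination before forwarding the reply over the VPN.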
