I have a VPC with two Compute Engine VM instances in it. One of them, vpn-server, acts as a VPN gateway for a cluster of on-premises machines. The other, test-instance, carries the instance tag route-through-vpn, which matches a custom route that sends any traffic destined for 10.10.0.0/19 through vpn-server.
There is also an App Engine instance that has the route-through-vpn instance tag. The web app running on it can connect directly to our on-premises cluster.
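For context, the route looks roughly like this (a sketch, not the exact command I ran; the route name, network, zone, and priority below are placeholders):

```bash
# Custom static route: send traffic for 10.10.0.0/19 through vpn-server,
# but only for instances carrying the route-through-vpn network tag.
# Route name, network, zone, and priority are placeholders.
gcloud compute routes create on-prem-via-vpn \
    --network=default \
    --destination-range=10.10.0.0/19 \
    --next-hop-instance=vpn-server \
    --next-hop-instance-zone=us-central1-a \
    --tags=route-through-vpn \
    --priority=1000
```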
This setup has worked just fine for over a year. Then yesterday, a small number of IP addresses suddenly stopped working.
By "stopped working" I mean this:
- It is still possible to SSH into the non-working IP addresses if you're logged into the vpn-server.
- But traffic originating from test-instance cannot reach these IPs.
One of the failing IPs is 10.10.0.8. One IP that still works is 10.10.0.47. As far as I can tell, all of these addresses fall within the route's destination range 10.10.0.0/19.
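Just to rule out a netmask mistake on my end, here's a quick check that both the failing and the working address sit inside 10.10.0.0/19 (which covers 10.10.0.0 through 10.10.31.255):

```bash
# Sanity check: are both addresses inside the route's destination range?
python3 -c "
import ipaddress
net = ipaddress.ip_network('10.10.0.0/19')
for ip in ('10.10.0.8', '10.10.0.47'):
    print(ip, ipaddress.ip_address(ip) in net)
"
# Prints:
# 10.10.0.8 True
# 10.10.0.47 True
```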
To debug, I logged into the vpn-server and the test-instance and tried sending ICMP packets from test-instance to various IP addresses in the cluster. I also ran tcpdump on the vpn-server so I could see the traffic as it passed through.
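Roughly, the test looked like this (the interface name and filter below are approximate, not the exact commands):

```bash
# On test-instance: ping one failing and one working on-prem address.
ping -c 4 10.10.0.8    # a failing IP
ping -c 4 10.10.0.47   # a working IP

# On vpn-server: watch for the forwarded ICMP traffic.
# The interface is assumed to be eth0; on newer images it may be ens4.
sudo tcpdump -ni eth0 icmp and net 10.10.0.0/19
```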
For the IP addresses that are still working, I saw the ICMP packets in the tcpdump output, as expected. But for the IP addresses that are no longer working, I saw nothing in tcpdump, which suggests that Google Cloud's routing layer is not even sending the traffic to my vpn-server.
To test further, I shut down one of the on-premises machines whose traffic was being routed properly and tried pinging it. The ICMP echo requests appeared in the tcpdump output with no replies, exactly as expected.
Google Cloud routes don't expose many options, and I can't find any diagnostic information that would help me investigate further, so at this point it comes down to somebody happening to know why this would occur.
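In case it's useful, here is roughly what I know how to inspect from the CLI (a sketch; the zone is a placeholder and the format fields may need adjusting):

```bash
# List routes matching the on-prem range, with their tags, priority, and next hop.
gcloud compute routes list \
    --filter="destRange:10.10.0.0/19" \
    --format="table(name, destRange, priority, tags.list(), nextHopInstance)"

# Confirm test-instance still carries the tag the route matches on.
gcloud compute instances describe test-instance \
    --zone=us-central1-a \
    --format="value(tags.items)"
```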
Has anybody solved a problem like this, or does anyone have an idea what might be causing it?