I have a VPC with two Compute Engine VM instances in it. One of them, vpn-server, acts as a VPN gateway for a cluster of on-premises machines. The other, test-instance, carries the instance tag route-through-vpn, which matches a custom route that sends any traffic destined for 10.10.0.0/19 through vpn-server.
There is also an App Engine instance that has the route-through-vpn instance tag. The web app running on it can connect directly to our on-premises cluster.
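For context, the route looks roughly like this (a sketch, not the exact command I ran; the route name, network, zone, and priority below are placeholders):

```bash
# Custom static route: send traffic for 10.10.0.0/19 through vpn-server,
# but only for instances carrying the route-through-vpn network tag.
# Route name, network, zone, and priority are placeholders.
gcloud compute routes create on-prem-via-vpn \
    --network=default \
    --destination-range=10.10.0.0/19 \
    --next-hop-instance=vpn-server \
    --next-hop-instance-zone=us-central1-a \
    --tags=route-through-vpn \
    --priority=1000
```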
This setup has worked just fine for over a year. Then yesterday, a small number of IP addresses suddenly stopped working.
By "stopped working" I mean this:
- It is still possible to SSH into the non-working IP addresses if you're logged into the vpn-server.
- But traffic originating from test-instance cannot reach these IPs.
One of the failing IPs is 10.10.0.8. One IP that still works is 10.10.0.47. As far as I can tell, all of these addresses fall within the route's destination range 10.10.0.0/19.
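Just to rule out a netmask mistake on my end, here's a quick check that both the failing and the working address sit inside 10.10.0.0/19 (which covers 10.10.0.0 through 10.10.31.255):

```bash
# Sanity check: are both addresses inside the route's destination range?
python3 -c "
import ipaddress
net = ipaddress.ip_network('10.10.0.0/19')
for ip in ('10.10.0.8', '10.10.0.47'):
    print(ip, ipaddress.ip_address(ip) in net)
"
# Prints:
# 10.10.0.8 True
# 10.10.0.47 True
```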
To debug, I logged into the vpn-server and the test-instance and tried sending ICMP packets from test-instance to various IP addresses in the cluster. I also ran tcpdump on the vpn-server so I could see the traffic as it passed through.
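Roughly, the test looked like this (the interface name and filter below are approximate, not the exact commands):

```bash
# On test-instance: ping one failing and one working on-prem address.
ping -c 4 10.10.0.8    # a failing IP
ping -c 4 10.10.0.47   # a working IP

# On vpn-server: watch for the forwarded ICMP traffic.
# The interface is assumed to be eth0; on newer images it may be ens4.
sudo tcpdump -ni eth0 icmp and net 10.10.0.0/19
```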
For the IP addresses that are still working, I saw the ICMP packets in the tcpdump output, as expected. But for the IP addresses that are no longer working, I saw nothing in tcpdump, which suggests that Google Cloud's routing layer is not even sending the traffic to my vpn-server.
To test further, I shut down one of the on-premises machines whose traffic was being routed properly and tried pinging it. The ICMP echo requests appeared in the tcpdump output with no replies, exactly as expected.
Google Cloud routes don't expose many options, and I can't find any diagnostic information that would help me investigate further, so at this point it comes down to somebody happening to know why this would occur.
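In case it's useful, here is roughly what I know how to inspect from the CLI (a sketch; the zone is a placeholder and the format fields may need adjusting):

```bash
# List routes matching the on-prem range, with their tags, priority, and next hop.
gcloud compute routes list \
    --filter="destRange:10.10.0.0/19" \
    --format="table(name, destRange, priority, tags.list(), nextHopInstance)"

# Confirm test-instance still carries the tag the route matches on.
gcloud compute instances describe test-instance \
    --zone=us-central1-a \
    --format="value(tags.items)"
```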
Has anybody solved a problem like this, or does anyone have an idea what might be causing it?