Terraform GWLB NAT Gateway - Outbound Traffic from Private Subnet Fails/Hangs Despite Healthy Targets

Hello everyone,

I’m building a custom, highly-available NAT solution in AWS using a Gateway Load Balancer (GWLB) and an EC2 Auto Scaling Group for the NAT appliances. My goal is to provide outbound internet access for instances located in a private subnet.

The Problem: Everything appears to be configured correctly, yet outbound traffic from the private instance fails. Commands like curl google.com or ping 8.8.8.8 hang and eventually time out.

Architecture Overview: The traffic flow is designed as follows: Private Instance (in Private Subnet) → Private Route Table → GWLB Endpoint → GWLB → NAT Instance (in Public Subnet) → Public Route Table → IGW → Internet

What I’ve Verified and Debugged:

  1. GWLB Target Group: The target group is correctly associated with the GWLB. All registered NAT instances are passing health checks and are in a Healthy state. I have at least one healthy target in each Availability Zone where my workload instance resides.

  2. NAT Instance Itself: I can SSH directly into the NAT appliance instances. From within the NAT instance, I can successfully run curl google.com. This confirms the instance itself has proper internet connectivity.

  3. NAT Instance Configuration: The user_data script runs successfully on boot. I have verified on the NAT instances that:

    • net.ipv4.ip_forward is set to 1.

    • The geneve0 virtual interface is created and is UP.

    • An iptables -t nat -A POSTROUTING -o <primary_interface> -j MASQUERADE rule exists and is active.

  4. Routing Tables: I believe my routing is configured correctly to handle both ingress and egress traffic symmetrically (Edge Routing).

    • Private Route Table (private-rt): Has a default route 0.0.0.0/0 pointing to the GWLB VPC Endpoint (vpce-...). This is associated with the private subnet.

    • Public Route Table (public-rt): Has two routes:

      1. 0.0.0.0/0 pointing to the Internet Gateway (igw-...).

      2. [private_subnet_cidr] (e.g., 10.20.0.0/24) pointing back to the GWLB VPC Endpoint (vpce-...) to handle the return traffic. This route table is associated with the subnets for the NAT appliances and the GWLB Endpoint.

  5. Security Groups & NACLs: Security Groups on the NAT appliance allow all traffic from within the VPC. I am using the default NACLs which allow all traffic.
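To make the routing in item 4 concrete, here is a minimal sketch that models the two route tables as longest-prefix-match lookups. It is only an illustration: the helper name lookup, the table names PRIVATE_RT and PUBLIC_RT, the target labels, and the 10.20.0.0/16 VPC CIDR used for the local route are all assumptions, and 10.20.0.0/24 is the example private subnet CIDR from above.

```python
import ipaddress


def lookup(route_table, dst):
    """Return the target of the most specific route that contains dst, or None."""
    addr = ipaddress.ip_address(dst)
    matches = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in route_table.items()
        if addr in ipaddress.ip_network(cidr)
    ]
    if not matches:
        return None
    # The longest prefix wins, as in a VPC route table.
    return max(matches, key=lambda m: m[0].prefixlen)[1]


# Hypothetical tables mirroring item 4. The VPC CIDR for the local route is assumed.
PRIVATE_RT = {
    "10.20.0.0/16": "local",
    "0.0.0.0/0": "gwlb-endpoint",
}
PUBLIC_RT = {
    "10.20.0.0/16": "local",
    "10.20.0.0/24": "gwlb-endpoint",  # return route for the private subnet
    "0.0.0.0/0": "igw",
}

print(lookup(PRIVATE_RT, "8.8.8.8"))    # gwlb-endpoint (outbound from the private instance)
print(lookup(PUBLIC_RT, "8.8.8.8"))     # igw (outbound from the NAT instance)
print(lookup(PUBLIC_RT, "10.20.0.15"))  # gwlb-endpoint (return traffic to the private subnet)
```

This covers route selection only. It does not model GENEVE encapsulation, the NAT translation on the appliance, or security group filtering, so a clean result here does not rule out a drop at any of those points.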

Despite all of the above, the traffic from the private instance does not complete its round trip.

My Question: Given that the targets are healthy, the NAT instances themselves are functional, and the routing appears to be correct, what subtle configuration might I be missing? Is there a known issue or a specific way to further debug where the return traffic is being dropped?

Link to the repo: https://github.com/taha2samy/try
