Hello everyone,
I’m building a custom, highly-available NAT solution in AWS using a Gateway Load Balancer (GWLB) and an EC2 Auto Scaling Group for the NAT appliances. My goal is to provide outbound internet access for instances located in a private subnet.
The Problem: Everything appears to be configured correctly, yet outbound traffic from the private instance fails. Commands like `curl google.com` or `ping 8.8.8.8` hang and eventually time out.
Architecture Overview: The traffic flow is designed as follows: Private Instance (in Private Subnet) → Private Route Table → GWLB Endpoint → GWLB → NAT Instance (in Public Subnet) → Public Route Table → IGW → Internet
What I’ve Verified and Debugged:
- GWLB Target Group: The target group is correctly associated with the GWLB. All registered NAT instances are passing health checks and are in a `Healthy` state. I have at least one healthy target in each Availability Zone where my workload instance resides.
- NAT Instance Itself: I can SSH directly into the NAT appliance instances. From within the NAT instance, I can successfully run `curl google.com`. This confirms the instance itself has proper internet connectivity.
- NAT Instance Configuration: The `user_data` script runs successfully on boot. I have verified on the NAT instances that:
  - `net.ipv4.ip_forward` is set to `1`.
  - The `geneve0` virtual interface is created and is `UP`.
  - An `iptables -t nat -A POSTROUTING -o <primary_interface> -j MASQUERADE` rule exists and is active.
- Routing Tables: I believe my routing is configured correctly to handle both ingress and egress traffic symmetrically (Edge Routing).
  - Private Route Table (`private-rt`): Has a default route `0.0.0.0/0` pointing to the GWLB VPC Endpoint (`vpce-...`). This is associated with the private subnet.
  - Public Route Table (`public-rt`): Has two routes:
    - `0.0.0.0/0` pointing to the Internet Gateway (`igw-...`).
    - `[private_subnet_cidr]` (e.g., `10.20.0.0/24`) pointing back to the GWLB VPC Endpoint (`vpce-...`) to handle the return traffic.

    This route table is associated with the subnets for the NAT appliances and the GWLB Endpoint.
- Security Groups & NACLs: Security Groups on the NAT appliance allow all traffic from within the VPC. I am using the default NACLs which allow all traffic.
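To make the routing described above concrete, here is a minimal Python sketch that models the two route tables and resolves a destination by picking the most specific matching prefix, which is how VPC route tables choose a route. The VPC CIDR `10.20.0.0/16`, the implicit `local` entries, and the target labels are assumptions for illustration; only `10.20.0.0/24` and the two default routes come from the description above. It is a paper model and does not query AWS.

```python
import ipaddress

# Route tables as described in the post. Target names are placeholder labels,
# and the VPC CIDR (10.20.0.0/16) with its "local" route is an assumption.
ROUTE_TABLES = {
    "private-rt": [
        ("10.20.0.0/16", "local"),
        ("0.0.0.0/0", "gwlb-endpoint"),
    ],
    "public-rt": [
        ("10.20.0.0/16", "local"),
        ("10.20.0.0/24", "gwlb-endpoint"),  # return path to the private subnet
        ("0.0.0.0/0", "igw"),
    ],
}


def lookup(table_name, dest_ip):
    """Return the target of the most specific route that matches dest_ip."""
    dest = ipaddress.ip_address(dest_ip)
    best = None  # (network, target) of the longest matching prefix so far
    for cidr, target in ROUTE_TABLES[table_name]:
        net = ipaddress.ip_network(cidr)
        if dest in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, target)
    return best[1] if best else None


if __name__ == "__main__":
    # Outbound: private instance to the internet.
    print("private-rt, 8.8.8.8    ->", lookup("private-rt", "8.8.8.8"))
    # Outbound: NAT instance to the internet.
    print("public-rt,  8.8.8.8    ->", lookup("public-rt", "8.8.8.8"))
    # Return: traffic in the public subnets addressed to the private instance.
    print("public-rt,  10.20.0.15 ->", lookup("public-rt", "10.20.0.15"))
```

Walking each hop of the intended flow through `lookup` is a quick way to confirm which route table a packet consults at each step and where it is handed next.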
Despite all of the above, the traffic from the private instance does not complete its round trip.
My Question: Given that the targets are healthy, the NAT instances themselves are functional, and the routing appears to be correct, what subtle configuration might I be missing? Is there a known issue or a specific way to further debug where the return traffic is being dropped?
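The NAT instance checks listed earlier can be repeated with a small script. The following is a rough Python sketch, not a definitive diagnostic: the helper names are hypothetical, the interface name `geneve0` is taken from the list above, and it assumes a Linux host where `/proc/sys/net/ipv4/ip_forward`, `/sys/class/net/<iface>/flags`, and the `iptables` binary are available and the script runs as root.

```python
import subprocess
from pathlib import Path

IFF_UP = 0x1  # Linux interface flag bit for "administratively up"


def ip_forward_enabled(text):
    """True if the content of /proc/sys/net/ipv4/ip_forward is '1'."""
    return text.strip() == "1"


def iface_is_up(flags_text):
    """True if the hex flags from /sys/class/net/<iface>/flags have IFF_UP set."""
    return bool(int(flags_text.strip(), 16) & IFF_UP)


def has_masquerade(rules_text):
    """True if `iptables -t nat -S POSTROUTING` output has a MASQUERADE rule."""
    for line in rules_text.splitlines():
        parts = line.split()
        if parts[:2] == ["-A", "POSTROUTING"] and "MASQUERADE" in parts:
            return True
    return False


def check_nat_instance(iface="geneve0"):
    """Run the three checks on a live NAT instance (iptables needs root)."""
    rules = subprocess.run(
        ["iptables", "-t", "nat", "-S", "POSTROUTING"],
        capture_output=True, text=True, check=True,
    ).stdout
    return {
        "ip_forward": ip_forward_enabled(
            Path("/proc/sys/net/ipv4/ip_forward").read_text()),
        "iface_up": iface_is_up(
            Path(f"/sys/class/net/{iface}/flags").read_text()),
        "masquerade": has_masquerade(rules),
    }
```

Calling `print(check_nat_instance())` on a NAT instance reports the three values together; any `False` points back at the corresponding item in the list.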
Link to the repo: https://github.com/taha2samy/try