Troubleshooting unreliable ENI destroy (AWS EKS)

Dear forum users,

I read that the forum is the better place to ask for help, since I’m not sure whether my issue is a bug in Terraform or whether I’m doing something wrong.

I’m using Terraform to deploy an AWS EKS cluster, an AWS Application Load Balancer, several Helm releases, several Kubernetes resources, S3 buckets, Route53 records, and more. Everything is created in one terraform apply run and cleaned up in one terraform destroy run. However, since I recently refactored the Terraform declarations and upgraded the Terraform modules and providers, I’ve been experiencing an issue.

The EKS cluster has two worker nodes most of the time, and each EC2 instance has three associated elastic network interfaces (ENIs). On destroy, between zero and two of these ENIs are left behind (always ones with a private IP), dangling with status available. This prevents Terraform from destroying the security group attached to those ENIs, so after 10 minutes the operation fails with the error DependencyViolation for Security Group. If I manually delete the ENI in the AWS Console and run terraform destroy again, the security group is destroyed successfully.
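To spot the leftover ENIs without clicking through the console, something like the following sketch might help. It assumes the JSON printed by aws ec2 describe-network-interfaces (top-level NetworkInterfaces list, with Status, Groups and NetworkInterfaceId per entry) has already been parsed into a dict. The function name dangling_enis and the example security group ID are made up for illustration and are not part of my Terraform project.

```python
from typing import Any, Dict, List


def dangling_enis(response: Dict[str, Any], security_group_id: str) -> List[str]:
    """Return IDs of ENIs that are detached but still reference the security group.

    `response` is assumed to be the parsed JSON output of
    `aws ec2 describe-network-interfaces`.
    """
    leftover = []
    for eni in response.get("NetworkInterfaces", []):
        # Each ENI lists its security groups as {"GroupId": ..., "GroupName": ...}.
        group_ids = {group["GroupId"] for group in eni.get("Groups", [])}
        # Status "available" means the ENI is not attached to any instance.
        if eni.get("Status") == "available" and security_group_id in group_ids:
            leftover.append(eni["NetworkInterfaceId"])
    return leftover
```

For example, after saving the CLI output with --output json to a file, load it with json.load and call dangling_enis(data, "sg-0123example") with the ID of the security group that fails to destroy. Any ID it returns is a candidate for the manual cleanup described above.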

I also started building a minimal reproduction in this repository: GitHub - TjeuKayim/test-tf-destroy-eks-auth: Reproduce Terraform bug
However, while I was narrowing it down, the failure rate dropped to roughly 5%, which makes it hard to trace the root cause. At least these Terraform declarations are very close to the ones in the closed-source project that I’m working on.

The latest 1.15 release candidate didn’t solve the problem either, as I experienced two more failures today.

This was the original issue I opened: Unreliable ENI destroy · Issue #1267 · terraform-aws-modules/terraform-aws-eks

My guess is that the order of operations during destroy is causing the issue, but I couldn’t find evidence of that in the trace logs, partly because I’m not sure how to interpret everything in them.

This is a link to the trace logs. I logged one failed run and one successful run for comparison.

What would you recommend as next steps for tracking down the root cause?

Is there any news on this issue?

Did you ever work this out? It seems intermittent.