Troubleshooting unreliable ENI destroy (AWS EKS)

TjeuKayim · April 2, 2021, 5:48pm

Dear forum users,

I’ve read that the forum is the better place to ask for help, as I’m not 100% sure whether my issue is a bug in Terraform or something I’m doing wrong.

I’m using Terraform to deploy an AWS EKS cluster, an AWS Application Load Balancer, several Helm releases, several Kubernetes resources, S3 buckets, Route53 records, and more. Everything is created in one terraform apply run and cleaned up in one terraform destroy run. However, since I recently refactored the Terraform declarations and upgraded Terraform modules and providers, I’ve been experiencing an issue.

The EKS cluster has two worker nodes most of the time, and each EC2 instance has three associated elastic network interfaces (ENIs). When destroying, zero to two of these ENIs are left behind (always ones with a private IP), dangling with status available. This prevents Terraform from destroying the security group attached to those ENIs, so after 10 minutes the operation fails with the error DependencyViolation for the security group. If I manually delete the ENI in the AWS Console and run terraform destroy again, the security group is destroyed successfully.
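To make the symptom easier to check after a failed destroy, here is a minimal sketch that lists the ENIs still blocking a security group. It assumes the input is the JSON printed by `aws ec2 describe-network-interfaces`; the function name `dangling_enis` and all IDs in the sample data are hypothetical and only there for illustration.

```python
import json


def dangling_enis(describe_output, security_group_id):
    """Return IDs of ENIs that are unattached (status "available")
    and still reference the given security group."""
    result = []
    for eni in describe_output.get("NetworkInterfaces", []):
        group_ids = {group["GroupId"] for group in eni.get("Groups", [])}
        if eni.get("Status") == "available" and security_group_id in group_ids:
            result.append(eni["NetworkInterfaceId"])
    return result


# Hypothetical sample shaped like the describe-network-interfaces output.
SAMPLE = json.loads("""
{
  "NetworkInterfaces": [
    {"NetworkInterfaceId": "eni-aaa", "Status": "available",
     "Groups": [{"GroupId": "sg-123", "GroupName": "workers"}]},
    {"NetworkInterfaceId": "eni-bbb", "Status": "in-use",
     "Groups": [{"GroupId": "sg-123", "GroupName": "workers"}]},
    {"NetworkInterfaceId": "eni-ccc", "Status": "available",
     "Groups": [{"GroupId": "sg-999", "GroupName": "other"}]}
  ]
}
""")

if __name__ == "__main__":
    for eni_id in dangling_enis(SAMPLE, "sg-123"):
        print(eni_id)
```

In practice the sample would be replaced by the saved CLI output, and the security group ID by the one named in the DependencyViolation error.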

I also started building a minimal reproduction in this repository: TjeuKayim/test-tf-destroy-eks-auth (GitHub)
But while narrowing it down, the failure rate decreased to about 5%, which makes it hard to trace the root cause. At least these Terraform declarations are very close to the ones in the closed-source project that I’m working on.

The latest 0.15 release candidate didn’t solve the problem either, as I experienced two more failures today.

This was the original issue I opened: Unreliable ENI destroy · Issue #1267 · terraform-aws-modules/terraform-aws-eks

My guess is that the order of operations while destroying is causing issues, but I couldn’t find evidence of that in the trace logs, as I’m not sure how to interpret everything.
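One way to test the ordering guess is to extract the sequence of "Destroying..." lines from the console log of a failed run and of a successful run, then look for the first position where they differ. The sketch below assumes the console log format shown in the excerpts in this post; the function names are hypothetical. Because Terraform destroys independent resources in parallel, some differences in order are expected and are not necessarily meaningful.

```python
import re

# Matches console lines such as:
#   helm_release.cluster_autoscaler: Destroying... [id=cluster-autoscaler]
DESTROYING = re.compile(r"^(.+?): Destroying\.\.\.")


def destroy_order(log_text):
    """Return resource addresses in the order their destroy started."""
    order = []
    for line in log_text.splitlines():
        match = DESTROYING.match(line.strip())
        if match:
            order.append(match.group(1))
    return order


def first_divergence(failed, ok):
    """Return (index, failed_address, ok_address) for the first position
    where the two sequences differ, or None if they agree over the
    length of the shorter one."""
    for index, (a, b) in enumerate(zip(failed, ok)):
        if a != b:
            return index, a, b
    return None


if __name__ == "__main__":
    # Lines taken from the destroy-fail-console.log excerpt in this post.
    sample = """\
kubernetes_api_service.metrics_server: Destroying... [id=v1beta1.metrics.k8s.io]
helm_release.cluster_autoscaler: Destroying... [id=cluster-autoscaler]
kubernetes_api_service.metrics_server: Destruction complete after 0s
"""
    print(destroy_order(sample))
```

Checking the relative position of the worker nodes, the security group, and the Kubernetes resources in both sequences is probably more useful than a full diff.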

Here is a link to the trace logs. I logged one failed run and one successful run for comparison.

https://gist.github.com/TjeuKayim/2b4632236f91f53f9c92a8c7c3486a78

destroy-fail-console.log

kubernetes_api_service.metrics_server: Destroying... [id=v1beta1.metrics.k8s.io]
kubernetes_cluster_role.aggregated_metrics_reader: Destroying... [id=system:aggregated-metrics-reader]
helm_release.cluster_autoscaler: Destroying... [id=cluster-autoscaler]
module.stream.kubernetes_horizontal_pod_autoscaler.hpa: Destroying... [id=default/stream]
module.blob_storage.kubernetes_horizontal_pod_autoscaler.hpa: Destroying... [id=default/blob]
module.db_rest.kubernetes_horizontal_pod_autoscaler.hpa: Destroying... [id=default/db-rest]
kubernetes_horizontal_pod_autoscaler.pulsar: Destroying... [id=pulsar/pulsar-broker]
kubernetes_cluster_role.aggregated_metrics_reader: Destruction complete after 0s
kubernetes_api_service.metrics_server: Destruction complete after 0s
kubernetes_service.metrics_server: Destroying... [id=kube-system/metrics-server]

This file has been truncated.

destroy-fail-trace.log

2021/03/09 09:28:29 [INFO] Terraform version: 0.13.5  
2021/03/09 09:28:29 [INFO] Go runtime version: go1.13.8
2021/03/09 09:28:29 [INFO] CLI args: []string{"/home/{user}/.local/bin/terraform-13", "destroy"}
2021/03/09 09:28:29 [DEBUG] Attempting to open CLI config file: /home/{user}/.terraformrc
2021/03/09 09:28:29 Loading CLI configuration from /home/{user}/.terraformrc
2021/03/09 09:28:29 [DEBUG] checking for credentials in "/home/{user}/.terraform.d/plugins"
2021/03/09 09:28:29 [DEBUG] Explicit provider installation configuration is set
2021/03/09 09:28:29 [TRACE] Selected provider installation method cliconfig.ProviderInstallationDirect with includes [] and excludes []
2021/03/09 09:28:29 [INFO] CLI command args: []string{"destroy"}
2021/03/09 09:28:29 [TRACE] Meta.Backend: built configuration for "s3" backend with hash value 893988343

This file has been truncated.

destroy-ok-console.log

There are more than three files in the gist.

What would you recommend as next steps for finding the root cause?

vovantamvn · January 17, 2022, 1:36am

Is there any news on this issue?

jimmyraywv · October 10, 2022, 11:02pm

Did you ever work this out? It seems intermittent.
