I’m having a problem where cluster updates are in progress, so I’ve run into an issue where the cluster is “Updating/Modifying”, and when Terraform tries to create the fargate_profile it errors because the status of the cluster isn’t “Ready”.
To work around this I’m using var.dataplane_wait_duration and increased it to an arbitrary 500s, which seems to get me past the issue (only upon creation though).
The reason I did this is that I already have timeouts set on the fargate_profile (20m create and 20m delete). Trimmed down, my config looks roughly like this (most of the cluster config is elided):
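```hcl
# Roughly my setup; this assumes the terraform-aws-modules/eks module,
# since that's where var.dataplane_wait_duration comes from.
module "eks" {
  source = "terraform-aws-modules/eks/aws"

  # ...cluster config elided...

  # arbitrary bump so fargate profile creation waits for the cluster to settle
  dataplane_wait_duration = "500s"

  fargate_profiles = {
    karpenter = {
      selectors = [
        { namespace = "karpenter" }
      ]
      timeouts = {
        create = "20m"
        delete = "20m"
      }
    }
  }
}
```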
The cluster is ready in just under 8m, so the dataplane_wait_duration of 500s does work for creation (but I would think I shouldn’t need it, because create is already set to 20m).
However, this doesn’t help with destroy; I run into the same problem where the cluster performs some “Updates” as it destroys pieces of itself, and I end up getting:
│ Error: deleting EKS Fargate Profile (dev-use1-100:karpenter): operation error EKS: DeleteFargateProfile, https response error StatusCode: 409, RequestID: REDACT, api error ResourceInUseException: Cannot Delete Fargate Profile karpenter because cluster dev-use1-100 currently has update REDACT in progress
So my first question is: why don’t the timeouts set on the fargate profile work for creation and deletion, rather than needing an arbitrary dataplane_wait_duration? If I understand how the timeouts work, the provider should keep retrying during that 20m window, but it seems that the second it gets the “cluster has update in progress” response it just stops retrying.
My second question is: how can I add a delay before the deletion of the fargate_profile to get the same wait on destroy? The only workaround I can think of is a time_sleep with destroy_duration, sketched below, but I’d rather the timeouts just work.
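To be clear about what I have in mind (untested; time_sleep comes from the hashicorp/time provider, the resource name here is made up, and I’m not sure this puts the wait where I actually need it, since the “updates” seem to be triggered by the module’s own resources being destroyed in parallel):

```hcl
# Untested sketch: this resource depends on module.eks, so on destroy
# Terraform tears it down first and waits destroy_duration before it
# starts deleting the module's resources (including the fargate profile).
resource "time_sleep" "pre_destroy_wait" {
  destroy_duration = "500s"

  depends_on = [module.eks]
}
```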
For context, those timeouts are passed through to the aws_eks_fargate_profile resource; simplified, I believe the module ends up managing something like this (variable names are illustrative, not the module’s actual internals):
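```hcl
# Simplified sketch of the underlying resource the module manages.
resource "aws_eks_fargate_profile" "karpenter" {
  cluster_name           = var.cluster_name
  fargate_profile_name   = "karpenter"
  pod_execution_role_arn = var.pod_execution_role_arn
  subnet_ids             = var.subnet_ids

  selector {
    namespace = "karpenter"
  }

  # the 20m values from above end up here
  timeouts {
    create = "20m"
    delete = "20m"
  }
}
```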
For what it’s worth, I’ve actually seen this same behavior with the fluxcd/flux provider and the flux_bootstrap_git resource.
I’ve watched the Creating happen for flux_bootstrap_git: it does its install and then a healthcheck. However, if the healthcheck doesn’t return within a certain period of time (pods Pending longer than some healthcheck timeout), the resource doesn’t appear to retry that logic and will never return Completed, even though the Terraform apply still shows “Still creating…”; once the timeout is reached it just fails.
So for a timeline of events, let’s say this happens:

1. TF apply
2. Creating flux_bootstrap_git (TF showing “Still creating…”)
3. Pods Pending (TF still showing “Still creating…”)
4. Healthcheck fails (TF still showing “Still creating…”)
5. Pods Schedule (TF still showing “Still creating…”)
6. Pods Running (TF still showing “Still creating…”)
7. TF fails
Now this might not be exactly the same issue, but the failing “healthcheck” seems similar to the cluster “update in progress” case: the resource doesn’t keep retrying, even though the Create/Delete timeouts haven’t been exceeded yet.