I’m having a problem where cluster updates are in progress, so I’ve run into an issue where the cluster is “Updating/Modifying”, and when Terraform tries to create the fargate_profile it errors because the status of the cluster isn’t “Ready”.
To work around this I’m using var.dataplane_wait_duration and increased it to an arbitrary 500s, which seems to get me past the issue (only upon creation though).
The reason I did this is that I already have timeouts set on the fargate_profile (20m create and 20m delete). Trimmed down, my config looks roughly like this (most of the cluster config is elided):
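```hcl
# Roughly my setup; this assumes the terraform-aws-modules/eks module,
# since that's where var.dataplane_wait_duration comes from.
module "eks" {
  source = "terraform-aws-modules/eks/aws"

  # ...cluster config elided...

  # arbitrary bump so fargate profile creation waits for the cluster to settle
  dataplane_wait_duration = "500s"

  fargate_profiles = {
    karpenter = {
      selectors = [
        { namespace = "karpenter" }
      ]
      timeouts = {
        create = "20m"
        delete = "20m"
      }
    }
  }
}
```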
The cluster is ready in just under 8m, so the dataplane_wait_duration of 500s does work for creation (but I would think I shouldn’t need it, because create is already set to 20m).
However, this doesn’t help with destroy; I run into the same problem where the cluster performs some “Updates” as it destroys pieces of itself, and I end up getting:
│ Error: deleting EKS Fargate Profile (dev-use1-100:karpenter): operation error EKS: DeleteFargateProfile, https response error StatusCode: 409, RequestID: REDACT, api error ResourceInUseException: Cannot Delete Fargate Profile karpenter because cluster dev-use1-100 currently has update REDACT in progress
So my first question is: why don’t the timeouts set on the fargate profile work for creation and deletion, rather than needing an arbitrary dataplane_wait_duration? If I understand how the timeouts work, the provider should keep retrying during that 20m window, but it seems that the second it gets the “cluster has update in progress” response it just stops retrying.
My second question is: how can I add a delay before the deletion of the fargate_profile to get the same wait on destroy? The only workaround I can think of is a time_sleep with destroy_duration, sketched below, but I’d rather the timeouts just work.
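To be clear about what I have in mind (untested; time_sleep comes from the hashicorp/time provider, the resource name here is made up, and I’m not sure this puts the wait where I actually need it, since the “updates” seem to be triggered by the module’s own resources being destroyed in parallel):

```hcl
# Untested sketch: this resource depends on module.eks, so on destroy
# Terraform tears it down first and waits destroy_duration before it
# starts deleting the module's resources (including the fargate profile).
resource "time_sleep" "pre_destroy_wait" {
  destroy_duration = "500s"

  depends_on = [module.eks]
}
```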
For context, those timeouts are passed through to the aws_eks_fargate_profile resource; simplified, I believe the module ends up managing something like this (variable names are illustrative, not the module’s actual internals):
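```hcl
# Simplified sketch of the underlying resource the module manages.
resource "aws_eks_fargate_profile" "karpenter" {
  cluster_name           = var.cluster_name
  fargate_profile_name   = "karpenter"
  pod_execution_role_arn = var.pod_execution_role_arn
  subnet_ids             = var.subnet_ids

  selector {
    namespace = "karpenter"
  }

  # the 20m values from above end up here
  timeouts {
    create = "20m"
    delete = "20m"
  }
}
```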
For what it’s worth, I’ve actually seen this same behavior with the fluxcd/flux provider and the flux_bootstrap_git resource.
I’ve watched the Creating happen for flux_bootstrap_git: it does its install and then a healthcheck. However, if the healthcheck doesn’t return within a certain period of time (pods Pending longer than some healthcheck timeout), the resource doesn’t appear to retry that logic and will never return Completed, even though the Terraform apply still shows “Still creating…”; once the timeout is reached it just fails.
So for a timeline of events, let’s say this happens:

1. TF apply
2. Creating flux_bootstrap_git (TF showing “Still creating…”)
3. Pods Pending (TF still showing “Still creating…”)
4. Healthcheck fails (TF still showing “Still creating…”)
5. Pods Schedule (TF still showing “Still creating…”)
6. Pods Running (TF still showing “Still creating…”)
7. TF fails
Now this might not be exactly the same issue, but the failing “healthcheck” seems similar to the cluster “update in progress” case: the resource doesn’t keep retrying, even though the Create/Delete timeouts haven’t been exceeded yet.