tf plan gets stuck on Refreshing state

Hello,

I have a Terraform config which uses the following modules:

  source      = "terraform-aws-modules/iam/aws//modules/iam-assumable-role-with-oidc"
  version     = "5.39.0"

  source   = "terraform-aws-modules/eks-pod-identity/aws"
  version  = "1.1.0"

and the aws_eks_pod_identity_association resource.
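For context, here is a minimal sketch of how these pieces are wired together. The variable names `create` and `additional_policy_arns` appear in my debug logs further down; the other input names, output names, and `var.*` structures shown here are hypothetical placeholders, not my exact config:

```hcl
module "iam_assumable_role" {
  source   = "terraform-aws-modules/iam/aws//modules/iam-assumable-role-with-oidc"
  version  = "5.39.0"
  for_each = var.oidc_roles          # hypothetical variable

  create_role = true
  role_name   = each.key
}

module "pod_identity" {
  source   = "terraform-aws-modules/eks-pod-identity/aws"
  version  = "1.1.0"
  for_each = var.pod_identities      # hypothetical variable

  create                 = true
  additional_policy_arns = each.value.policy_arns
}

resource "aws_eks_pod_identity_association" "pod-identity-eks-region" {
  for_each = var.pod_identities

  cluster_name    = var.cluster_name
  namespace       = each.value.namespace
  service_account = each.value.service_account
  role_arn        = module.pod_identity[each.key].iam_role_arn  # hypothetical output name
}
```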

There are two environments: stg and perf.

The state of perf is bigger:

# stg
tf state list | wc -l
5380

# perf
tf state list | wc -l
7959

I changed the Terraform config in a way that causes recreation of the aws_eks_pod_identity_association resources. In stg this worked well.

But in perf it gets stuck on:

...
module.iam_assumable_role["foo"].aws_iam_role_policy_attachment.custom[2]: Refreshing state... [id=xxx]

Each time it is a different role and attachment, seemingly at random.
I’m not sure whether it is due to rate limiting or some other issue.
There are no errors; it just hangs.

Running tf plan in the perf env without changes refreshes the state fine, showing Refreshing state... and finishing with No changes. Your infrastructure matches the configuration. But when I try to plan my new changes, it hangs.

I tried running TF_LOG=debug tf plan, which hangs for a few minutes on something like this:

2024-12-06T02:18:02.104+0900 [DEBUG] ReferenceTransformer: "module.pod_identity[\"xxx\"].aws_iam_role_policy_attachment.this[\"arn:aws:iam::123:policy/my_policy\"]" references: [module.pod_identity.var.additional_policy_arns (expand) module.pod_identity.var.create (expand) module.pod_identity.aws_iam_role.this (expand)]

and then loops infinitely over DestroyEdgeTransformer2 log lines.

I also tried TF_LOG=trace tf plan, which ran for 10+ hours and looped over DestroyEdgeTransformer and DestroyEdgeTransformer2, which look to me like internal graph-building steps.

If I run tf plan -target=module.iam_assumable_role on my changes, then it shows the desired changes:

Plan: 3404 to add, 0 to change, 3404 to destroy.

What might it be? How can I debug this issue?

Hi @b10s,

Are you using the latest version of Terraform?

Checking dependencies when replacing resources is unfortunately a computationally expensive task, so DestroyEdgeTransformer tends to be slow in complex configs, but you seem to have some edge case which is causing terraform to traverse an immense number of dependencies. Are you trying to use depends_on for an entire module by any chance?
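(For anyone following along: the pattern being asked about is a `depends_on` on a whole module block, which makes every resource inside the module depend on everything listed. A sketch with hypothetical module names:)

```hcl
# Anti-pattern being asked about: depends_on applied to an entire module.
# Every resource inside module.pod_identity now depends on every resource
# in module.network, which can multiply graph edges dramatically.
module "pod_identity" {
  source = "./modules/pod-identity"

  depends_on = [module.network]
}
```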

Hi @jbardin ,

My version is

> tf version
Terraform v1.7.4
on darwin_arm64
+ provider registry.terraform.io/alekc/kubectl v2.1.3
+ provider registry.terraform.io/hashicorp/aws v5.40.0

Your version of Terraform is out of date! The latest version
is 1.10.0. You can update by downloading from https://www.terraform.io/downloads.html

There is no depends_on in my config:

> grep -rni depends .
./.terraform/terraform.tfstate:71:            "depends_on": []

update: I’ve installed the latest Terraform (v1.10.1 was released while I was writing this message :slight_smile:) but the issue still persists; it stops on Refreshing state....

update2: I also tried with latest version of providers and modules:

> tf version
Terraform v1.10.1
on darwin_arm64
+ provider registry.terraform.io/alekc/kubectl v2.1.3
+ provider registry.terraform.io/hashicorp/aws v5.80.0

  source   = "terraform-aws-modules/eks-pod-identity/aws"
  version  = "1.7.0"

  source      = "terraform-aws-modules/iam/aws//modules/iam-assumable-role-with-oidc"
  version     = "5.48.0"

Still the same deadlock. Could it be that these two modules should not be used together? On the other hand, it works well in the other environment.

update3: could it be that a state of 7,000+ resources is too big for the AWS provider and it silently hangs? Or should the AWS provider emit an error when it can’t refresh state?

If Terraform is emitting logs there’s no deadlock yet, but it might be taking exponential time to process. Not having depends_on is a good sign, since you’re not adding excessive dependencies where they’re not needed, but it’s still possible to build pathological configurations. Take a more extreme example, where half your instances somehow depended on every one of the other half: that sets a lower bound of 12,250,000 edges (roughly 3,500 × 3,500) we need to track and traverse to determine dependencies.

The provider doesn’t really care how many resources there are; it’s only concerned with the few (usually 10 or fewer) being called by Terraform at any one time. The usual limitation there is rate limits, but then you would be seeing logs about provider calls waiting, not graph transformations.
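(Side note for readers: if API rate limiting were the suspect, the AWS provider’s retry behavior can be tuned in the provider block. A sketch; argument values here are illustrative, not a recommendation:)

```hcl
provider "aws" {
  region = "ap-northeast-1"

  # Raise the retry ceiling if API throttling is suspected; throttled
  # calls show up as retry/wait messages in provider debug logs.
  max_retries = 25
  retry_mode  = "adaptive"
}
```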

There are many more performance improvements in v1.10 than in v1.7; however, I think you are stuck in a notoriously slow section which is not often exercised. Checking dependencies for replacement actions which span modules and have dependencies from multiple providers adds up to a lot of work that needs to be done for each of the 3404 instances.

Just to verify the situation, when using TF_LOG=trace the process is blocked in DestroyEdgeTransformer but continues to output log lines? Can you give an example of what those lines are?

@jbardin got it. Thank you!

Indeed, with TF_LOG=trace the tf plan is always showing something, whereas without TF_LOG=trace it simply hangs on something like module.iam_assumable_role["foo"].aws_iam_role_policy_attachment.custom[5]: Refreshing state... [id=bar-123]

With trace, after about 10 minutes of running tf plan I see something like:

2024-12-06T23:09:53.529+0900 [DEBUG] DestroyEdgeTransformer2: module.pod_identity["foo"].aws_iam_role_policy_attachment.this["arn:aws:iam::123:policy/my-policy1"] has stored dependency of aws_eks_pod_identity_association.pod-identity-eks-region["bar"] (destroy)

2024-12-06T23:09:53.655+0900 [DEBUG] DestroyEdgeTransformer2: module.pod_identity["baz"].aws_iam_role_policy_attachment.this["arn:aws:iam::123:policy/my-policy2"] has stored dependency of aws_eks_pod_identity_association.pod-identity-eks-region["asd"] (destroy)

2024-12-06T23:09:53.779+0900 [DEBUG] DestroyEdgeTransformer2: module.pod_identity["qwe"].aws_iam_role_policy_attachment.this["arn:aws:iam::123:policy/my-policy3"] has stored dependency of aws_eks_pod_identity_association.pod-identity-eks-region["zxc"] (destroy)

Thanks for the update!

That confirms my suspicion, and I was actually able to make a reproduction case based on the realization that you have a lot of resource replacements which may have dependencies from different providers.

The fact that Terraform allows providers to depend on managed resources can inherently cause cycles with certain combinations of actions. This is an architectural issue from the original design which we cannot really change, but it is not a recommended configuration, so it’s not often seen at any real scale. The only method we currently have to work around these inter-provider dependencies is to check for cycles as we add each dependency, which is quite expensive, but again not often needed.

I’m going to look into mitigating this further in Terraform, maybe there’s some more cases where we can skip it. In the meantime, if the changes are constrained to one module, using -target is a good workaround.

If you want to try and avoid the problem entirely, when a provider needs to depend on a managed resource, that is usually a sign you need to break up the configuration into multiple separate configs and apply them in stages. Dependency issues aside, the usual problem with this type of configuration is that if there is a change which causes the provider’s configuration to become unknown during the plan you may not be able to proceed without more use of -target or manual changes outside of Terraform.
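(A sketch of that staged approach for readers: the first config creates whatever the provider would have depended on, and the second config reads it via a terraform_remote_state data source instead. Backend, bucket, and output names below are hypothetical:)

```hcl
# Stage 2 config: consumes outputs of the stage 1 config that created
# the upstream infrastructure, instead of configuring the provider
# from a managed resource in the same config.
data "terraform_remote_state" "cluster" {
  backend = "s3"

  config = {
    bucket = "my-tf-state"                # hypothetical bucket
    key    = "cluster/terraform.tfstate"  # hypothetical key
    region = "ap-northeast-1"
  }
}

provider "aws" {
  region = data.terraform_remote_state.cluster.outputs.region
}
```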

@jbardin, may I ask a question: when you mention different providers, which ones do you mean?

I checked my config; it seems we use only one AWS provider:

> tf version
Terraform v1.10.1
on darwin_arm64
+ provider registry.terraform.io/alekc/kubectl v2.1.3
+ provider registry.terraform.io/hashicorp/aws v5.40.0

The kubectl provider appears here but is not in use; it’s just a leftover from copy-pasting other code.

Also, to better understand the terminology: what do you mean by managed resources in the phrase Terraform allows providers to depend on managed resources? Sorry for such a question. Do you mean resources which are not defined in tf code?

Oh, then that may yet be a slightly different case! The fact that you are stuck seeing loads of DestroyEdgeTransformer2 calls does narrow it down enough to try and optimize that section more though. The slowdown is usually triggered by seeing multiple providers, but maybe there’s just plain too many edges being checked – would it be possible to share your configuration to see what else it may be doing?

What we refer to as a “managed resource” is any normal resource in the configuration which will be managed by Terraform. This is opposed to a “data resource” which is prefixed with data and only read by Terraform but managed elsewhere. (And now we have a third “ephemeral resource” mode, prefixed with ephemeral)
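(In config terms, the three modes look like this; a hypothetical illustration, and the ephemeral resource type name is from memory and needs a provider version that ships it:)

```hcl
# Managed resource: created, updated, and destroyed by Terraform.
resource "aws_iam_role" "example" {
  name               = "example"
  assume_role_policy = data.aws_iam_policy_document.assume.json
}

# Data resource: only read by Terraform; managed elsewhere.
data "aws_iam_policy_document" "assume" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["pods.eks.amazonaws.com"]
    }
  }
}

# Ephemeral resource (Terraform v1.10+): opened during a run,
# never persisted to state or plan files.
ephemeral "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "example"
}
```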

Yup, I just shared my code via email.

Thank you for clarifying the terms.

Thanks @b10s,

I’m not able to build up a complete understanding of the config in the short time I have, and it’s missing a few module directories too (data-module stuff), but something I forgot to mention is that multiple aliases of the same provider type can cause the same problem. So I think the issue is being triggered by the resources using the default aws provider and the aws.aws-apne1 provider used in the module. I’m not sure yet whether your changes caused the addition of the extra dependencies, or whether the dependencies were always there and the changes are just causing so many replacements that it only now shows up as a problem.

I don’t see any obvious mistakes that jump out; I think it’s just a situation where you have a complex config which is hit hard by a particular edge case. If the use of the default provider could be combined with the region-specific instance used by the module, so that the bulk of the dependencies come from the same provider instance, you could avoid the problem.

This did lead me to creating a good example to work with though, so thank you! We’ll have to invent some new ways to avoid these checks, and/or work on streamlining the stored dependencies even more.

@jbardin thank you for checking and suggesting.

Indeed, I tried it and can confirm that removing the aliases to the same provider with different regions helped, even though the second alias was not used anywhere. Now my plan can finish!

I changed:

provider "aws" {
  alias  = "aws-apne1"
  region = "ap-northeast-1"
}

provider "aws" {
  alias  = "aws-apne3"
  region = "ap-northeast-3"
}

to

provider "aws" {
  region = "ap-northeast-1"
}

and used the default aws provider reference throughout.
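For anyone hitting the same thing: if a module was being handed the aliased provider explicitly, the providers meta-argument is where that mapping changes. A before/after sketch with a hypothetical module name (the two blocks are alternatives, not meant to coexist):

```hcl
# Before: the module's resources ran under a separate provider instance.
module "pod_identity" {
  source = "./modules/pod-identity"

  providers = {
    aws = aws.aws-apne1
  }
}

# After: with only the default provider defined, the mapping can be
# dropped and the module inherits the same provider instance as the
# rest of the config, so dependencies stay within one provider.
module "pod_identity" {
  source = "./modules/pod-identity"
}
```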

Interesting case. I wonder why there is such a difference and how Terraform represents it internally, but that is probably too deep to dig into here.

I will also try splitting this code into two modules and see whether that helps as well.

Yes, as far as Terraform is concerned during evaluation, aliased providers are just as independent as entirely different providers: they have different configurations and run in different processes.

If it turns out that you can run the dependent portions of the config with the same provider instance, that would be the best solution, and splitting up the config would not be necessary in that case. In the long term we can work on speeding up the dependency resolution across providers to prevent this kind of performance trap.

