Need to run terraform plan/apply twice to effect all changes

I’ve got a scenario in which changing an input requires running plan and apply twice. The first run creates some resources and the second uses those new resources to modify some others. I’m having a heckuva time reducing this to a small example—my configuration is rather complex!

On changing an input, I expect a new IAM policy to be created and its ARN added to some roles’ managed_policy_arns, all in one apply. Instead, the first apply creates the policy and only a second apply adds the ARN to the roles.

$ terraform graph
...
"[root] module.aws.module.prod.aws_iam_role.user (expand)" -> "[root] module.aws.module.prod.local.policy_arns_by_user (expand)"
"[root] module.aws.module.prod.local.policy_arns_by_user (expand)" -> "[root] module.aws.module.prod.local.all_managed_pols (expand)"
"[root] module.aws.module.prod.local.all_managed_pols (expand)" -> "[root] module.aws.module.prod.local.custom_policy_arns (expand)"
"[root] module.aws.module.prod.local.custom_policy_arns (expand)" -> "[root] module.aws.module.prod.aws_iam_policy.create (expand)"
...

These dependencies look correct. But when I make a change that results in a new aws_iam_policy.create element, and then plan:

$ terraform plan -out=tfplan
...
# module.aws.module.prod.aws_iam_policy.create["deployer"] will be created
... (but I would expect a change to aws_iam_role.user here)
Plan: 1 to add, 0 to change, 0 to destroy.

$ terraform graph -plan=tfplan
...
"[root] module.aws.module.prod.local.policy_arns_by_user (expand)" -> "[root] module.aws.module.prod.local.all_managed_pols (expand)"
"[root] module.aws.module.prod.local.all_managed_pols (expand)" -> "[root] module.aws.module.prod.local.custom_policy_arns (expand)"
"[root] module.aws.module.prod.local.custom_policy_arns (expand)" -> "[root] module.aws.module.prod.aws_iam_policy.create[\"deployer\"]"
...

I don’t see module.aws.module.prod.aws_iam_role.user anywhere in the plan graph, and indeed the plan does not include updating the iam role.

I’ve got all the locals bubbled up as outputs, and they all correctly show up in this first plan:

~ aws_iam_policy_create = [      // <- this is keys(aws_iam_policy.create)
    + "deployer",
  ...
~ custom_policy_arns    = {
    + deployer           = (known after apply)
  ...
~ all_managed_pols      = {
    ~ deployer           = [
        + (known after apply),
      ]
    ...
  }
+ policy_arns_by_user   = {
    + "user1"            = (known after apply)
  ...
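(For completeness: the outputs are just the locals passed straight through, bubbled up one module level at a time. At the innermost level they look something like this sketch:)

output "aws_iam_policy_create" {
  value = keys(aws_iam_policy.create)
}

output "custom_policy_arns" {
  value = local.custom_policy_arns
}

output "all_managed_pols" {
  value = local.all_managed_pols
}

output "policy_arns_by_user" {
  value = local.policy_arns_by_user
}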

After applying, if I immediately plan again:

$ terraform plan -out=tfplan
# module.aws.module.prod.aws_iam_role.user["user1"] will be updated in-place
Plan: 0 to add, 1 to change, 0 to destroy.

$ terraform graph -plan=tfplan
"[root] module.aws.module.prod.aws_iam_role.user[\"user1\"]" -> "[root] module.aws.module.prod.local.policy_arns_by_user (expand)"
"[root] module.aws.module.prod.local.policy_arns_by_user (expand)" -> "[root] module.aws.module.prod.local.all_managed_pols (expand)"
"[root] module.aws.module.prod.local.all_managed_pols (expand)" -> "[root] module.aws.module.prod.local.custom_policy_arns (expand)"
"[root] module.aws.module.prod.local.custom_policy_arns (expand)" -> "[root] module.aws.module.prod.aws_iam_policy.create (expand)"

There are no data sources in this chain, e.g. ones that look up resources created in the first apply. The first apply creates a policy and puts its ARN into a local, which is referenced in the role resource. But that dependency is getting missed.

I have no other policy attachment resources defined (per the warnings on the aws_iam_role docs page). I am using inline_policy and managed_policy_arns only, so this resource has exclusive management of them.

resource "aws_iam_policy" "create" {
  for_each = local.create_custom_policies
  ...
}

locals {
  custom_policy_arns = {
    for handle, iam_policy in aws_iam_policy.create :
    handle => iam_policy.arn
  }

  // some unexplained complexity here, but I'm confident it works because
  // the output above shows "(known after apply)"
  all_managed_pols = {
    for duty, conf in var.duties :
    duty => concat(conf.managed-policies, [
      for handle in keys(conf.custom-policies) :
      local.custom_policy_arns[handle]
    ])
  }

  policy_arns_by_user = {
    for username, user in local.all_account_users_who_need_roles :
    username => distinct(flatten([
      for duty in user.duties :
      lookup(local.all_managed_pols, duty, [])
    ]))
  }
}

resource "aws_iam_role" "user" {
  for_each            = local.all_account_users_who_need_roles
  managed_policy_arns = local.policy_arns_by_user[each.key]
  ...
}

Phew! If anyone made it this far, I greatly appreciate your reading.

As a possible workaround, I’ve just tried moving to aws_iam_role_policy_attachment resources instead of the managed_policy_arns argument on aws_iam_role. To clarify, I removed the managed_policy_arns argument from the role resource entirely.

When I planned, I was surprised to see adds (correct) but no changes to the existing roles. I would expect the aws_iam_role resources to show the managed_policy_arns being removed.

Plan: 1472 to add, 0 to change, 0 to destroy.

This may be a pointer to the underlying issue: the managed_policy_arns argument on the aws_iam_role resource doesn’t see all dependent changes?

I suspect this move will take two applies as well. I’ll update when I can test that.
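For reference, the attachment-based variant looks roughly like this (the flattening local is illustrative, not my exact code):

locals {
  // One entry per (user, policy ARN) pair. Keying on the list index keeps
  // the for_each keys known at plan time even while an ARN itself is
  // still (known after apply).
  role_policy_pairs = merge([
    for username, arns in local.policy_arns_by_user : {
      for idx, arn in arns :
      "${username}-${idx}" => { username = username, arn = arn }
    }
  ]...)
}

resource "aws_iam_role_policy_attachment" "user" {
  for_each   = local.role_policy_pairs
  role       = aws_iam_role.user[each.value.username].name
  policy_arn = each.value.arn
}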

Hi @lordbyron,

Unfortunately I’m having a little trouble following the details because of all of the redactions, so I haven’t been able to come to any specific conclusions so far, but I do have some general observations that I hope will suggest some things to try next.

There are two typical reasons why this sort of thing might happen.

  1. The configuration effectively modifies its own desired state while being applied, in a way that Terraform can’t deduce: for example, reading some object using a data resource while also changing that same object, with no dependency relationship between the two for Terraform to notice the possibility (sketched below).
  2. The provider is buggy in a way that causes a similar effect, such as returning an incomplete or otherwise-incorrect result from its apply step, which then gets contradicted when it refreshes the object in the next plan.
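A contrived sketch of the first situation, with hypothetical names:

// The data source reads the same role that the resource manages, but
// nothing in the configuration connects the two, so Terraform derives
// the desired state from a read that can happen before the role is
// updated; the next plan then contradicts this one.
data "aws_iam_role" "self" {
  name = "example-role"
}

resource "aws_iam_role" "example" {
  name               = "example-role" // same object the data source reads
  assume_role_policy = data.aws_iam_role.self.assume_role_policy
}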

It sounds like you already ruled out the first of these by noticing that there are no data resources, so my next thought would be a potentially buggy provider. However, I can’t think of any hashicorp/aws bugs I know about that would cause this particular malfunction. It also seems like you’re just mapping two resources together by doing some acrobatics with their instance keys, so the outcome shouldn’t be affected by changes to the attributes of the individual resource instances, which are what the provider controls.

You’ve only shown a tiny fragment of the plan output, so it’s hard to guess what might be going on here. If you see a section saying “Objects changed outside of Terraform” then that would be good to know, since it might be a specific indicator of problem 2.

These features of the AWS provider are quite old and so are, I think, still implemented using the legacy plugin SDK, which has some quirks of its own that might be in play here. It might be interesting to do your first terraform plan and terraform apply with the environment variable TF_LOG=trace set to get Terraform’s internal logs, and to search the output for the word “tolerate”, where I suspect you’ll find some lines saying something like “hashicorp/aws did something odd but we’re tolerating it because it’s using the legacy SDK”.
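For example, one way to capture and search the logs (setting TF_LOG_PATH is an alternative to the stderr redirect):

$ TF_LOG=trace terraform plan -out=tfplan 2>trace.log
$ grep -i tolerat trace.log   # matches both "tolerate" and "tolerating"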

That tolerance is there because the SDK typically does some odd things, so the mere presence of that warning is not a cause for concern, but the specific details it reports might give some insight about what’s going on if you’re able to share them here. (It’s hard for me to know ahead of time what might be useful vs. un-useful examples, so unfortunately I can’t do any better than “share it with me and I’ll see if anything sticks out as weird”. :confounded: )

Incredible response, thank you! I wasn’t initially able to reproduce it with a smaller test file, but I’ve just figured it out! The file is attached, with a comment marking the relevant line.

Additional updates:

You were right, there is a warning in the trace:

2023-02-17T16:29:29.799-0800 [WARN]  Provider "registry.terraform.io/hashicorp/aws" produced an invalid plan for aws_iam_role.user["user1"], but we are tolerating it because it is using the legacy plugin SDK.
    The following problems may be the cause of any confusing errors from downstream operations:
      - .force_detach_policies: planned value cty.False for a non-computed attribute

though it’s not obvious to me if that is the culprit.

I also found, after testing with aws_iam_role_policy_attachment instead of managed_policy_arns on the role resource, that the state gets a little confused about its managed_policy_arns attribute:

  1. Removing the managed_policy_arns while adding the aws_iam_role_policy_attachment resulted in no changes on the role resource.
  2. After applying and fetching state, the managed_policy_arns attribute is still populated, even though it is no longer configured on the role resource (see the command sketch after this list). There are no additional changes on subsequent plans.
  3. If I create a new user in this setup, the managed_policy_arns is empty (as expected), even though the existing users still have some.
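On point 2, this is visible with something like (the resource address matches my config; yours will differ):

$ terraform state show 'module.aws.module.prod.aws_iam_role.user["user1"]'

where managed_policy_arns still shows up populated even though the argument is gone from the configuration.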

Anyway, I don’t want to muddle this with a second issue if it’s not related! I hope the attached example makes it pretty clear; I put a “THIS IS IT” comment in there to indicate the line that should be changed.

  1. Plan and apply as is
  2. Uncomment the marked line and plan, see the issue I’m describing
  3. Apply that and plan again, see further pending changes

Thank you, Martin! You are a legend.
main.tf.txt (3.3 KB)

Hi @lordbyron,

Indeed, that “tolerating” warning seems like one of the expected ones. It results from the fact that Terraform v0.11 didn’t have a concept of null, so the SDK represented the absence of an argument by setting it to a “zero value” of the designated type, which for a boolean is false, as we see here. (cty.False is Terraform’s internal representation of false.) So this is just the old SDK used by this resource type swapping in false where Terraform Core expected to see null; in practice that’s harmless, just a weird historical quirk.
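For illustration, the warning corresponds to nothing more than leaving that argument unset (a hypothetical minimal example):

resource "aws_iam_role" "user" {
  name = "example"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "ec2.amazonaws.com" }
    }]
  })

  // force_detach_policies is deliberately not set: Terraform Core expects
  // null for an unset attribute, but the legacy SDK fills in the boolean
  // zero value false (cty.False), which is what the tolerated warning
  // above is reporting.
}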

I feel a bit unsure from the rest of your comment whether you’re saying that you’ve solved the problem or whether you’d like some more suggestions. I did review the file you attached but nothing jumps out at me as obviously wrong, so I wanted to check with you, before I spend more time on it, whether there’s still a question here!

Unfortunately I don’t have regular access to an AWS account where I’m permitted to create IAM objects (I don’t typically work on the AWS provider, so I don’t have the dev environment access that the provider development team would), so I won’t be able to actually exercise the configuration you shared. But if you’d be willing to share Terraform’s output for each step you tried, including the planned changes that you’re wondering about, I will do my best to interpret what Terraform is suggesting and see whether I can guess what in the configuration might cause that, or whether there’s just a provider bug here to report.

Thanks!

Thanks for this additional info, and apologies for any lack of clarity. I have not solved the problem; my workaround is simply to run terraform twice. I believe using the separate attachment resource would also work, but it adds complexity that I don’t want to deal with at the moment.

So yes, I do believe this remains an open bug in the AWS provider, but I’m not blocked. Well, it’s not entirely obvious to me that it is in the provider, since that plan graph seems to be missing a dependency, but I’m no expert.

If you suggest I open a ticket anywhere, please let me know. Or if you want to route it / close it / prioritize it / not prioritize it, I can leave that to your discretion. I believe the example code I attached makes it easy to reproduce, though not if you can’t create IAM resources! I am also available to debug live, to produce an even smaller code example, or to do whatever else I can to assist.

I think this has to do with eventual consistency. I recently ran into a similar situation when creating an IAM role and assuming that role immediately.
I had to implement a delay using Terraform’s time_sleep resource, roughly as sketched below.
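A sketch of that workaround, using the role resource from the config above (time_sleep comes from the hashicorp/time provider; the duration is illustrative):

resource "time_sleep" "wait_for_role" {
  depends_on      = [aws_iam_role.user]
  create_duration = "30s" // give IAM a moment to propagate
}

// Anything that uses the new roles right away then waits on the sleep:
resource "aws_iam_instance_profile" "user" {
  for_each = aws_iam_role.user
  name     = each.key
  role     = each.value.name

  depends_on = [time_sleep.wait_for_role]
}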

Check this video and attached resources https://www.youtube.com/watch?v=E7dWUJD57BU&t=32s

I hope this helps.