Handle perms' eventual consistency/propagation time

Hi!

I have two resources:

  • google_project_iam_member that grants a certain role to a Service Account
  • google_workbench_instance (think: a Jupyter notebook) where the role is necessary to read an init script for the notebook

The problem is that sometimes the init script cannot be fetched (access denied), as if the role grant hadn’t propagated yet. According to the docs on access change propagation, it can take from minutes to hours for an access change to fully propagate.

We thought that adding a depends_on relationship would solve the issue. However, it doesn’t, probably because the GCP provider treats the resource as created as soon as the creation API call returns successfully. It doesn’t account for the propagation time.

One possible solution is to add a null_resource that waits for several minutes. This seems hacky, however, since you can never be sure how long to wait in general, and waiting too long degrades the UX of our Terraform configs.
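For reference, here is a minimal sketch of that fixed-delay workaround. It uses time_sleep from the hashicorp/time provider instead of a null_resource with a sleep command; the effect is the same. The role, variable names, instance name, location, and the 120s duration are all placeholders I picked for illustration, not values from our real config.

```hcl
terraform {
  required_providers {
    google = { source = "hashicorp/google" }
    time   = { source = "hashicorp/time" }
  }
}

# Hypothetical inputs, named for this example only.
variable "project_id" {
  type = string
}

variable "notebook_sa_email" {
  type = string
}

# Grants the Service Account the role it needs to read the init script.
# roles/storage.objectViewer is an assumed stand-in for "a certain role".
resource "google_project_iam_member" "script_reader" {
  project = var.project_id
  role    = "roles/storage.objectViewer"
  member  = "serviceAccount:${var.notebook_sa_email}"
}

# Fixed pause after the grant. The duration is a guess; nothing here
# verifies that the access change has actually propagated.
resource "time_sleep" "wait_for_iam" {
  depends_on      = [google_project_iam_member.script_reader]
  create_duration = "120s"
}

# The instance is created only after the pause has elapsed.
resource "google_workbench_instance" "notebook" {
  name     = "example-notebook"
  location = "us-central1-a"

  gce_setup {
    service_accounts {
      email = var.notebook_sa_email
    }
  }

  depends_on = [time_sleep.wait_for_iam]
}
```

The weakness is visible in the config itself: every apply that creates the grant pays the full 120 seconds, and the instance can still fail if propagation takes longer.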

What I think would be optimal is to be able to change google_project_iam_member’s behavior (e.g. through a flag) so that its creation counts as complete only once the access change has fully propagated. It could query the appropriate API to check whether the role is available.
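Until something like that exists in the provider, the check can be roughly approximated in the config by polling instead of sleeping. The sketch below is only an assumption of how that might look: it needs Terraform 1.4+ for terraform_data, a gcloud CLI on the machine running Terraform, and a caller that is allowed to impersonate the Service Account. The variable names, the role, and the retry numbers (30 attempts, 10 seconds apart) are hypothetical.

```hcl
# Hypothetical inputs, named for this example only.
variable "project_id" {
  type = string
}

variable "notebook_sa_email" {
  type = string
}

variable "init_script_uri" {
  type        = string
  description = "gs:// URI of the init script the notebook must read"
}

# roles/storage.objectViewer is an assumed stand-in for "a certain role".
resource "google_project_iam_member" "script_reader" {
  project = var.project_id
  role    = "roles/storage.objectViewer"
  member  = "serviceAccount:${var.notebook_sa_email}"
}

# Polls until the Service Account can see the init script, or gives up
# after roughly five minutes and fails the apply.
resource "terraform_data" "wait_for_access" {
  depends_on = [google_project_iam_member.script_reader]

  provisioner "local-exec" {
    command = <<-EOT
      for i in $(seq 1 30); do
        if gcloud storage objects describe "${var.init_script_uri}" \
             --impersonate-service-account="${var.notebook_sa_email}" \
             >/dev/null 2>&1; then
          exit 0
        fi
        sleep 10
      done
      echo "access did not become available in time" >&2
      exit 1
    EOT
  }
}
```

The notebook resource would then use depends_on = [terraform_data.wait_for_access]. Even so, one successful read is not proof that the change has propagated everywhere, which is why I’d still prefer the provider to handle this.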

What’s the right way to approach this problem? I think the problem is common enough that it deserves a clean solution.
Any hints greatly appreciated!

PS: There’s a similar topic but without any meaningful solution:

There’s some traction in a related GitHub issue: “Handle perms’ eventual consistency/propagation time” (Issue #22521, hashicorp/terraform-provider-google).

From Terraform’s perspective, a resource should be fully usable once the provider’s ApplyResourceChange method returns. A resource could even have a configuration attribute which tells the provider to wait for full consistency when it’s needed, or return early when it’s not.

Attempting to deal with any eventual consistency problem with an arbitrary delay never fully solves the problem because there is always the case where the delay is insufficient and the operation still fails. There needs to be some sort of API call made to verify that the resource is complete, and the provider is the only one with the knowledge to do that.