I’m having resources disappear from tfstate (Terraform v1.1.7, state stored on GCS), and I believe the issue is fixed by “Backport of Fix race in state dependency encoding into v1.1” (hashicorp/terraform PR #30959 on GitHub), which is planned to be released in v1.1.10.
This issue is causing me serious problems with my Terraform automation. Any estimate of when it will be released?
After the v1.2.0 release, the v1.2.x series is our focus for patch releases and v1.1 will only see new releases in the event of a high-urgency situation such as a security advisory. There are no currently-planned routine v1.1.x patch releases.
I would therefore suggest upgrading to the latest v1.2.x release in order to benefit from the change you mentioned. The v1.2.x series is still bound by our v1.0 compatibility promises and so upgrading to the latest v1.2 should typically not be any harder than upgrading to a new v1.1 release. If you do encounter any problems that the v1.2 upgrade guide doesn’t address, please do open a GitHub issue about it and we will investigate it as a bug.
Hi @apparentlymart ,
It’s a bit of a mixed message for the change to have been backported into the 1.1 release branch and listed in the changelog file (CHANGELOG.md on the v1.1 branch of hashicorp/terraform) if there are potentially no plans to ever release it.
If that’s the case, maybe you could add a banner to the top of the 1.1 changelog to set expectations appropriately?
Indeed, it’s unfortunate that the timing worked out that way. We do typically stop backporting in time for the next minor release to “take over” as the primary vehicle for bug fixes, but since our development process is asynchronous I guess you could say there was a bit of a process race condition here.
Changing the format of the changelog on the branch would likely interfere if we did subsequently need to make a v1.1.10 release for one of the exceptional reasons I alluded to, since the release process does automatic (and not very smart) tweaking of the changelog. However, we can have a look into what might make sense there, either within the bounds of what the release tools expect or with some improvements to the release tools. I think it would be nice to be clear on all of the older major/minor release changelogs about what their current development status is, since indeed currently the changelogs only really talk about what’s already happened and not what is likely (or less likely) to happen in the future.
I am usually conservative about adopting the newest major/minor versions, and I wait until the pace at which patch versions are released slows down (this goes for any product, not just Terraform). Version 1.2 is not quite there yet. On top of that, reverting to older versions is problematic with Terraform, as the state may become incompatible once a minor version is reverted.
This particular bug fix is already merged to the 1.1 branch; it would really help me if you could release it.
My deployment is divided into tens of separate states (to keep execution times manageable), so backing up and then manually manipulating each one of them if I needed to revert would be a tedious and risky task.
As a side note: this issue appeared on our deployments when we tried to move Google BigQuery from the US-EAST1 region to US (multi-region), using the following resource types:
The Terraform version is 1.1.7; the Google provider version was initially 4.14.0, then upgraded to 4.22.0, but the problem remained (though with better error messages).
We migrated from v0.12.31 to v1.1.7 at the end of March and did not experience any issues till about 2 weeks ago, when we moved our BQ as described above, so the root cause may also be a misbehaving plugin on an edge case…
The resource that disappears most often is google_bigquery_dataset, so I currently add resource imports for the datasets in my automation code before invoking terraform apply.
That’s not the way Terraform should work…
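For reference, the import workaround described above can be sketched roughly like this. The project and dataset names are hypothetical placeholders (not from the original post), and each command is printed via echo as a dry run:

```shell
# Hypothetical project and dataset names -- replace with your own.
project="my-gcp-project"
datasets="raw_events analytics"

for ds in $datasets; do
  # Printed as a dry run; remove "echo" to actually perform the import.
  # google_bigquery_dataset accepts projects/{project}/datasets/{dataset} IDs.
  echo terraform import "google_bigquery_dataset.${ds}" \
    "projects/${project}/datasets/${ds}"
done
```

Note that terraform import fails if the address is already tracked in the state, so automation along these lines would typically consult terraform state list first and import only the missing addresses.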
Of course I do respect your chosen policy on when to upgrade, but I do want to note that we typically produce patch releases of the current primary release at least every two weeks and so that is typically as slow as v1.2 releases will get. It is true that we did only wait one week after v1.2.0 before releasing v1.2.1, but so far it looks like we are ready to return to the usual two week release schedule now, of course keeping in mind that I cannot predict the future.
There will be another v1.1 release only in the event that there is a significant enough advisory or other such incident to prompt deviating from the usual release schedule.
For what it’s worth, my understanding of the issue you referred to doesn’t match the behavior you described here. While in principle data races can cause various kinds of corruption, the main effect we heard about and were able to reproduce is the recorded dependencies in the state being incorrect, and that’s quite a different result than an object vanishing from the state entirely.
An object “disappearing” is most commonly caused by the provider seeing a “not found” error during refreshing and concluding that the object no longer exists in the remote system. I can’t say for certain that this is what is happening in your case, but I’ve seen it arise before in two situations where that conclusion was incorrect:
- The remote system is only eventually consistent and so an object temporarily reads as not existing for a small period after it was reported as created.
- The credentials used to make the request don’t have sufficient permissions to see the object, and the remote system is designed to return “not found” in that case rather than an explicit permissions-related error, so the provider cannot distinguish the two.
This does seem most likely to be a provider quirk to me, but I would not rule out it being a core bug that we’ve not seen before. I’d suggest reporting it in the Google provider GitHub repository for now, and the provider team can pass it on to the Core team if it seems like it isn’t a provider issue.
(Not sure if this is the correct thread to continue this discussion)
According to your description, if a provider sees “not found” once, for whatever reason (say, a glitch), then from that point on the resource requires manual intervention, as it does not get imported back automatically. This is really not a robust design. It would be much better to mark the resource as inaccessible in the state, causing the current run to fail (because the resource does exist), but allowing it to be automatically picked up again on the next Terraform run.
In my case this is not a new resource, it is a BigQuery dataset which was created on previous TF runs and already includes tables and data. In fact, if I were to try and delete it using Terraform, that would fail because of deletion protection.
It is also not a credentials issue. All the datasets were created using the same credentials (via Terraform, of course), and when I initiate a full Terraform run, only one or two out of tens of datasets suffer from this.
Is it possible that a rate limit on Google’s API could be causing this? I run my jobs with the -parallelism=30 flag.
I can try to reproduce this with trace or debug logging to see if that’s really the case. Would that help? Should I open this as a separate thread?
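A trace run can be captured like this. TF_LOG and TF_LOG_PATH are standard Terraform environment variables; the log file name is arbitrary, and the terraform command is printed via echo as a dry-run guard:

```shell
# TF_LOG sets Terraform's log verbosity; TF_LOG_PATH redirects it to a file.
export TF_LOG=trace
export TF_LOG_PATH=./terraform-trace.log

# Printed as a dry run; remove "echo" to actually run the plan.
# Lowering parallelism (e.g. from 30 to 10) helps rule out API rate limiting.
echo terraform plan -parallelism=10
```

If the disappearance really comes from a “not found” read, the trace log should show the provider’s request for the affected dataset and the error response it received.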
In the typical “disappears” scenario I was describing, Terraform will ask the provider during planning to read the updated data for the object in question. The provider can return null to indicate that the object no longer exists. In that case, Terraform knows the following:
- The object existed in the state snapshot produced by the previous run.
- The object doesn’t exist in the “refreshed state”, updated by asking the provider for current data.
- The instance is still declared in the configuration.
In that case, Terraform will typically produce a plan which includes both a recorded “change outside of Terraform” (the object was deleted) and a proposal to create a new object to bind to the instance declared in the configuration.
Until you actually apply that plan, nothing would’ve changed outside yet: the most recent state snapshot still includes the old object, and Terraform has not yet asked the provider to create a new object. If what Terraform is proposing doesn’t make sense, you can decline the plan and everything should be essentially as it was beforehand, aside from some incidental side-effects such as having possibly consumed some read request rate limit and some logging of the read requests on the server.
It’s only if you tell Terraform to apply the plan that the effect of removing the prior object from the state will be “locked in” with a new state snapshot. After that, you would indeed then need to reimport the object if you want Terraform to begin tracking it again.
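The decline-or-apply choice described above corresponds to Terraform’s saved-plan workflow, sketched here with echo as a dry-run guard (tfplan is an arbitrary file name):

```shell
# Printed as a dry run; remove "echo" to execute.
# 1. Write the proposal to a file; nothing in the remote system changes yet.
echo terraform plan -out=tfplan
# 2. Review it; a vanished object appears as a deletion detected outside of
#    Terraform plus a proposed create for the same resource address.
echo terraform show tfplan
# 3. Apply only if the proposal makes sense; otherwise just delete tfplan
#    and the previous state snapshot remains authoritative.
echo terraform apply tfplan
```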
It’s hard to say from just the high-level description if what you’re experiencing is the behavior I’ve described above or something else. I think the best next step would be to open an issue in the Google Cloud Platform provider’s repository and share the information they request in their new bug report template. The provider’s development team can then use their knowledge about the specific resource types you are using to determine if this is a bug in the implementation of those resource types or if it’s something more general.