V2 SDK Provider is unexpectedly removing a nested block from state

So I’ve discovered an issue with our provider (still currently using the v2 SDK) that I’m not sure how to resolve because it’s unclear to me how it is even possible in the first place.

I have a ‘service’ resource that contains two nested blocks, ‘A’ and ‘B’. Within our provider these nested blocks are set up as resources of a sort, so they have their own CRUD lifecycle functions.

When it comes to our provider processing these nested blocks, they are done in the order ‘A’, then ‘B’.

I add the nested block ‘B’ to my service resource config and I successfully run a terraform apply. This results in the main ‘service’ resource and the nested ‘B’ block resource both being created via our API.

The problem occurs when I add to my config an incorrectly configured nested block ‘A’, while at the same time removing the nested block ‘B’ from the configuration.

With TF_LOG=TRACE set, when I run terraform apply I see that our Terraform provider calls our API for the nested block ‘A’ (remember, ‘A’ is processed first by our provider) and the API returns an error, because we configured the nested block with an invalid attribute value. The Terraform provider then reports the error and stops. There is no call to our API to delete the nested block ‘B’, so that item still exists as far as our API is concerned.

Inside our Terraform provider it is set up such that ‘A’ is always processed before ‘B’. That is why the logs show the API call for ‘A’ first and, because that call failed, no API call for ‘B’.

But if I now run terraform show, I can see two things:

  1. Our ‘A’ nested block attribute exists in our state (with invalid value).
  2. Our ‘B’ nested block attribute is removed from the state.

Neither of which I expected because of the error returned when trying to create ‘A’.

Once our Terraform provider had returned the error from the ‘A’ block’s Create function, I expected no state changes to have been made.

Similarly, because the ‘B’ block’s Delete function was never actually called, I again expected there to be no state changes.

But clearly the state has been updated and this is confusing to me.

Any ideas why this might happen?

Thanks.

Can you share some code or logs?

My understanding is that only whole resources have Create functions (that Terraform knows about) - but you seem to be suggesting your blocks have Create functions too… is that an extra abstraction you have built on top of SDKv2 yourself?

Correct it’s an extra abstraction layer. I don’t fully understand it as I’ve inherited this code base but you can see the implementation here…

An example resource is resource_fastly_service_vcl.go on the main branch of fastly/terraform-provider-fastly on GitHub, and an example nested block would be any of the listed items in that file, like…

Your code is incredibly complex, and I couldn’t figure out what was going on… but what you said made me think of some other weird behaviour I have seen, and I was successfully able to mock up a dummy provider that reproduced the issue.

It appears that when a resource’s Update function returns Diagnostics containing an error, Terraform SDKv2 still commits the planned change to the state, even though that error is reported to the user!!!

This feels like a massive bug to me.

AHA!

  // Although confusing,
  // it has been discovered that during an update when an error is returned, the
  // proposed config is set into state, even without any calls to d.Set.

Confusing is an understatement…

So it appears the answer is one of the cryptic secrets of Terraform provider development in SDKv2: you have to call

			data.Partial(true)

before returning an error from an Update function, or the planned change will be written to the state anyway, even though the error is reported!

Though, if you’re mutating a complex object, that might mean that some changes you did apply successfully, before the error occurred, don’t get persisted either, if you were relying on the default behaviour. You’d presumably have to figure out what did get applied, and data.Set(...) it.

Oh wow! Thank you @maxb for this excellent debugging work. Very much appreciated.

I recall reading about that partial function but for whatever reason I’ve never had cause to use it …until now!

It actually gets worse, I’m afraid… this Partial-ness applies to Update, but not to Create and Delete.

If one of your complex resources experiences an error during creation, none of the parts created already will be recorded in the state - they’ll be orphan objects running in production outside of Terraform control.

If one of your complex resources experiences an error during deletion, the state will not record the partial deletion - which may be OK, provided your resource read logic deals gracefully with things having been deleted outside of Terraform control.


Do we know if this is an issue with the new framework?

And regardless, how is a provider supposed to handle an API error if the ‘resource’ isn’t a 1:1 mapping with an API but a 1:many mapping (one resource == multiple API calls)?

I’ve tried to use Partial and (as you say, due to it not being supported in Create/Delete) it hasn’t helped.

I then tried to manually re-run the Read operation from within the provider but the state file updates don’t get persisted because ultimately our primary ‘Update’ function returns the error triggered from the nested block that had the API error.

I even removed the line that returns the API error and instead returned the result of calling Read, thinking that the state refreshed from our API (which, for my example, is correct) would be persisted. But the state I wanted still isn’t there by the time I run terraform show.

I’ve replied to your other post
