Access to the plan with the new terraform-plugin-framework

:wave:t2:

With the new terraform-plugin-framework, is it possible for a single resource to see the complete plan and not just the plan for itself?

I won’t go into the details of it but I have a situation/provider design where it would be useful for a resource to be able to see what is in the overall plan (i.e. identify how many other resources are currently being modified).

Thanks!

No, it is not - and it doesn’t matter which libraries you’re working with; it’s just not part of the Terraform<->provider contract, which is defined in terms of protobuf and so is fairly easy to find in the Terraform code.

Hmm. Instead of an individual resource seeing the complete plan, is there a way for the plan as a whole to be seen by the provider somewhere/somehow?

I imagine not, but without getting into the weeds, it’s something that would help me simplify a large/complex provider. The current design hacks ‘nested resources’ into a single top-level resource, and I’d like to flatten that all out so the provider starts looking more like a typical provider.

Otherwise, for me to ‘flatten’ the design of this provider would require a two-step apply process, which just kills automation.

No, again, my previous answer remains true in full.

Can you disclose more information about what this provider does?

Unless you do, there’s not much I can do, other than wonder if Terraform is even the right tool for whatever you’re doing.

Hi @maxb

So I absolutely can provide more details about the provider, the reason I’ve been cagey about providing details is that it can end up being a lot of ‘textual noise’ vs just asking a more direct question :slight_smile:

But in this case I think yeah it may well help to elaborate.

So, the API I’m dealing with has lots of endpoints for different resources. One of these resources is a ‘service’ resource.

Now the way a ‘service’ works is that when you create it, it is ‘inactive’, meaning that as a consumer of the service you can’t access it until the service has been ‘activated’.

A service has different ‘versions’. If you activate your service, and you then want to make changes to it, you have to ‘clone’ the service which produces a new version and that new version is inactive. You make your changes to the new version, then you activate it.

All other resources (there are about 43 separate ones, such as a ‘backend’ resource, a ‘domain’ resource, etc.) must be created within an ‘inactive’ service version (each resource has its own separate CRUD API endpoints).

So in the old provider implementation the developers had hacked in the concept of ‘nested resources’: the ‘service’ resource would have a ‘backend’ block, a ‘domain’ block, and so on. Although each looks like an ordinary ‘block’ to a user of the provider, internally the provider was doing full CRUD/resource-style handling of each block, and once every block had been processed, the service resource triggered its ‘activation’.

So when it came time for me to rewrite the provider to use the new Terraform framework I started implementing the same ‘nested resource’ design to keep things consistent and not break the interface too much for users.

But ultimately the logic required to implement such a design is gnarly and complex and I’d rather avoid it by ‘flattening’ the design.

So instead of…

resource "service" "example {
  backend {
    ...
  }
  domain {
    ...
  }
}

I’d prefer something like…

resource "service" "example" {
  ...
}

resource "domain" "example" {
  service_version = service.example.version
  name = "demo.example.com"
}

resource "backend" "example" {
  service_version = service.example.version
  ...
}

Now for this design to work I need a way to ‘activate’ the service once all the resources are complete (e.g. once the domain has been created inside the inactive service version, the backend has been created inside the inactive service version, and so on for the other 40+ resources a user might define in their config).

So initially I was thinking…

resource "service_activation" "example" {
  service_id = service.example.id
  activate   = true
}

So we know that referencing the service resource’s output attribute creates an implicit dependency on it, but that isn’t going to cause the service_activation resource to wait for the domain and backend resources to have been completed.

The initial way we looked at solving this was to just use depends_on in the final service_activation resource. The reason we were considering avoiding that was the Terraform documentation, which states…

You should use depends_on as a last resort because it can cause Terraform to create more conservative plans that replace more resources than necessary. For example, Terraform may treat more values as unknown “(known after apply)” because it is uncertain what changes will occur on the upstream object. – Terraform docs

The other reason was that I wanted to avoid users having to define a potentially very long depends_on list (like I say, there are 43 or more resources they could end up needing to list).

But considering I was unable to figure out a way to avoid this with internal logic alone (see below for the ideas I had), I might have to rely on depends_on and just hope it doesn’t mess up the diffs to the point where they’re full of “known after apply”.

So to avoid depends_on I started thinking about using Go’s sync.WaitGroup: each resource would have access to the WaitGroup so it could call wg.Add(1) and then defer wg.Done(), while the service_activation resource called wg.Wait(). But I discovered that Terraform processes resources in parallel, so we have a race condition: maybe the domain resource calls wg.Add(1) before the service_activation resource reaches its wg.Wait(), but the backend resource hasn’t been processed yet and hasn’t called its wg.Add(1), so the Wait returns before all the work is done.
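Roughly what I had in mind was the following sketch (illustrative names only: backendResource, activationResource and the shared providerData are hypothetical, and I’m assuming the WaitGroup gets handed to each resource via its Configure method):

package provider

import (
    "context"
    "sync"

    "github.com/hashicorp/terraform-plugin-framework/resource"
)

// Assumed to be shared with every resource via Configure.
type providerData struct {
    wg sync.WaitGroup
}

type backendResource struct{ data *providerData }

func (r *backendResource) Create(ctx context.Context, req resource.CreateRequest, resp *resource.CreateResponse) {
    r.data.wg.Add(1) // may run before OR after activation's Wait below
    defer r.data.wg.Done()
    // ... create the backend inside the inactive service version ...
}

type activationResource struct{ data *providerData }

func (r *activationResource) Create(ctx context.Context, req resource.CreateRequest, resp *resource.CreateResponse) {
    // Race: Terraform runs these Creates concurrently, so if no sibling
    // has called Add yet, Wait returns immediately and we activate too early.
    r.data.wg.Wait()
    // ... call the service 'activate' API ...
}

The underlying problem is that there’s no point at which the provider knows how many wg.Add(1) calls to expect, which is really the same missing information as not being able to see the plan.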

So now I’m trying to figure out: how can I have this better, ‘flatter’ design that’s more idiomatic to Terraform, while having the service_activation resource wait for a non-deterministic number of resources to complete? I don’t know what the user’s TF config looks like, and even if I did I wouldn’t want to try to parse it, because ultimately what I need to know is what the ‘plan’ looks like (e.g. if the domain resource was edited and a new backend resource was added, then I know there are two resources that are going to be processed and need to be waited on before trying to activate the service).

Hopefully this all made sense. Would be interested in your thoughts.

Thanks!

Ah. Well, that does explain the situation :-/

I understand why you want to flatten into multiple resources. The aws_s3_bucket resource is a well-known example of a resource that did a lot of things in one place, and it is now on a journey towards being multiple resources.

However the big problem here is this inactive/active publication model.

AFAICS, you’re stuck handling this as one complicated resource, because otherwise you don’t have enough control over the processing order to do things in the way the upstream API demands.

For example, if you do have a separate domain resource, and something in it changes, Terraform isn’t going to know that it then needs to generate a new inactive version in which to make the changes.

Sorry I can’t be more help… it’s just that the architectures of Terraform and this upstream API don’t mesh very well.

That’s OK, no worries. I think I’m going to trial it out flattened and see what sort of effect using depends_on with the service_activation resource results in. Hopefully the diffs are still useful.

But yeah that other issue you mentioned was something I realised as well :grimacing:

If there are no ‘service’ changes, but a ‘backend’ changes, then what’s going to happen is that the backend resource will be passed a service version that is ‘active’, and so its call to create a backend is going to fail.

So I was thinking I could work around that. The idea I had was for each resource to take the version that’s passed in (which would be ‘active’ in this scenario, as there were no changes to the service to make it run and generate an inactive service version) and use it in a call to our API to get back metadata about the service, such as whether the given service version is ‘active’. If we have an active service version, then we call our ‘clone service version’ API.

Now the problem with that is: if the ‘backend’ resource clones the service version, what happens when another resource (like a ‘domain’ resource) is processed? It’s going to end up cloning the same ‘active’ service version, resulting in yet another inactive version, different from the inactive version the backend resource is now working with.

It’s a similar issue for the service_activation resource, which would not be able to activate the reported service version (that’s the active one), and which also isn’t going to know which ‘inactive’ version to use (it can’t just clone the service, as backend and domain have likely cloned it twice already).

So one way to solve that issue would be to have a global integer that’s unset by default (an int type’s zero value in Go is zero). Each resource acquires a lock around the value; if it’s zero, the resource clones the service, updates the variable, and releases the lock. The next resource then sees that the variable isn’t zero, so it doesn’t clone anything and just uses whatever value is present as the service version.

Then the service_activation resource would do a similar thing: it checks the service version that’s passed in, and if that version is ‘active’ it knows there were no changes to the service resource itself, so it looks at the global integer, discovers it’s set to a number, and uses that for its activation call.
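As a rough sketch of what I mean (isVersionActive and cloneServiceVersion here are just stand-ins for our real API client calls, and the shared providerData is again assumed to be wired up via Configure):

package provider

import (
    "context"
    "sync"
)

// Stand-ins for the provider's real API client calls.
func isVersionActive(ctx context.Context, serviceID string, version int64) (bool, error) {
    return false, nil // placeholder
}

func cloneServiceVersion(ctx context.Context, serviceID string, version int64) (int64, error) {
    return version + 1, nil // placeholder
}

// Shared across all resources for a single apply.
type providerData struct {
    mu            sync.Mutex
    clonedVersion int64 // Go's zero value: nothing cloned yet this apply
}

// ensureInactiveVersion returns the version every resource should write to,
// cloning the active version at most once per apply.
func (d *providerData) ensureInactiveVersion(ctx context.Context, serviceID string, passedIn int64) (int64, error) {
    d.mu.Lock()
    defer d.mu.Unlock()

    if d.clonedVersion != 0 {
        // Another resource already cloned during this apply; reuse it.
        return d.clonedVersion, nil
    }

    active, err := isVersionActive(ctx, serviceID, passedIn)
    if err != nil {
        return 0, err
    }
    if !active {
        // The service resource itself already produced an inactive version.
        return passedIn, nil
    }

    v, err := cloneServiceVersion(ctx, serviceID, passedIn)
    if err != nil {
        return 0, err
    }
    d.clonedVersion = v
    return v, nil
}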

That’s going to break as soon as you have more than one ‘service’ resource in the same Terraform configuration.

I don’t think you should go down this path. I think you’ll ultimately regret it, and your users will find it to be too messy.

But if you do it anyway, you could potentially make use of replace_triggered_by

resource "service" "example" {
  lifecycle {
    replace_triggered_by = [
      domain.example,
      backend.example,
    ]
  }
}

resource "domain" "example" {
  service_version = service.example.version
  name = "demo.example.com"
}

resource "backend" "example" {
  service_version = service.example.version
  ...
}

I really don’t think you should actually do this. But the feature does exist.

Thanks @maxb for this info. Yeah, I think replace_triggered_by might be too heavy-handed, as we don’t want the service to be replaced (i.e. deleted/created).

And yeah, I see what you mean about the global variable. Again, it can be worked around using a map type, but it’s all sorts of complexity (though to be fair, no worse than the complexity we have in the current implementation, where we’ve got full CRUD lifecycle resources nested inside the top-level service resource :grimacing:).
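The map variant would look something like this (again just a sketch, reusing the made-up cloneServiceVersion stand-in from the earlier snippet):

// Shared across all resources for a single apply.
type providerData struct {
    mu             sync.Mutex
    clonedVersions map[string]int64 // service ID -> version cloned this apply
}

// cloneOncePerService clones a given service's active version at most once
// per apply, however many resources ask for it.
func (d *providerData) cloneOncePerService(ctx context.Context, serviceID string, activeVersion int64) (int64, error) {
    d.mu.Lock()
    defer d.mu.Unlock()

    if v, ok := d.clonedVersions[serviceID]; ok {
        return v, nil // this service was already cloned during this apply
    }

    v, err := cloneServiceVersion(ctx, serviceID, activeVersion)
    if err != nil {
        return 0, err
    }
    if d.clonedVersions == nil {
        d.clonedVersions = make(map[string]int64)
    }
    d.clonedVersions[serviceID] = v
    return v, nil
}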

Definitely something to stew over. Thanks again for your advice/feedback :+1: