Why did terraform not recognise that affected resources have been affected?

apparentlymart · January 8, 2024, 10:21pm

I think what happened here is that the provider planned replacement of snowflake_warehouse with the same name before and after, and so snowflake_warehouse.warehouse.name (assigned to warehouse_name in snowflake_warehouse_grant) didn’t change and therefore the provider considered the new desired state identical to the old desired state.

Of course, you have some additional information that Terraform did not: that apparently this API silently deletes all of the “warehouse grant” objects when a warehouse is deleted. Because Terraform isn’t aware of that hidden interaction, it understands the removal of the role from the warehouse grant as a change made outside of Terraform – by the remote API itself, in this case – and so tries to repair it on the next run as would be the case if you’d changed something manually in the system’s management UI.

In Terraform v1.2 and later you can inform Terraform about that hidden interaction using the replace_triggered_by lifecycle argument:

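```hcl
resource "snowflake_warehouse_grant" "warehouse_usage_grant" {
  warehouse_name = snowflake_warehouse.warehouse.name
  privilege      = "USAGE"
  roles          = [snowflake_role.role.name]

  lifecycle {
    # Reference the warehouse resource itself rather than its name. The name
    # is the same before and after replacement, so an attribute reference
    # would never see a change; a resource reference fires whenever the
    # warehouse is planned for update or replacement.
    replace_triggered_by = [
      snowflake_warehouse.warehouse,
    ]
  }
}
```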
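Terraform's documentation describes two kinds of replace_triggered_by reference: a reference to a whole resource instance (such as snowflake_warehouse.warehouse) fires whenever that instance is planned for update or replacement, while a reference to a single attribute (such as snowflake_warehouse.warehouse.name) fires only when that attribute's value changes. The Go program below is a simplified model of those two documented rules, not Terraform's actual implementation. The `plannedChange` type, the `triggers` function and the warehouse name `ANALYTICS` are hypothetical, and unknown values are not modelled.

```go
package main

import "fmt"

// action is a simplified stand-in for a planned change action.
type action string

const (
	noOp    action = "no-op"
	update  action = "update"
	replace action = "replace"
)

// plannedChange is a hypothetical, simplified view of the plan for one
// resource instance: its action plus string attribute values before and after.
type plannedChange struct {
	Action action
	Before map[string]string
	After  map[string]string
}

// triggers models the documented replace_triggered_by rules. An empty attr
// stands for a whole-resource reference, which fires on any update or
// replace; a non-empty attr fires only if that attribute's value changed.
func triggers(c plannedChange, attr string) bool {
	if c.Action == noOp {
		return false
	}
	if attr == "" {
		return true
	}
	return c.Before[attr] != c.After[attr]
}

func main() {
	// The warehouse is planned for replacement but keeps the same name.
	warehouse := plannedChange{
		Action: replace,
		Before: map[string]string{"name": "ANALYTICS"},
		After:  map[string]string{"name": "ANALYTICS"},
	}
	fmt.Println("snowflake_warehouse.warehouse fires:", triggers(warehouse, ""))
	fmt.Println("snowflake_warehouse.warehouse.name fires:", triggers(warehouse, "name"))
}
```

For a warehouse that is replaced but keeps its name, this model reports that only the whole-resource reference fires, which is the situation described in this thread.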

In an ideal world, the provider would’ve automatically informed Terraform about that relationship between these two objects so that you would not need to write anything. However, the Terraform provider protocol currently lacks any way to describe such relationships. That gap is what this design issue is about:

github.com/hashicorp/terraform: Global IDs for representing relationships between resource objects (object containment, name collision detection, etc)

Opened by apparentlymart, 16 Jul 2019

This is a description of a problem space and some initial sketches for how it might be solved. It's not yet a fully actionable proposal, since we need to gather more examples and do prototyping with them to figure out exactly what the problem cases are and thus how best to solve them. For now, this issue is here mainly so it can be mentioned in other issues (e.g. in provider repositories) that describe use-cases where this mechanism might be beneficial. As a consequence, anything here is subject to change in subsequent discussion.

----

Terraform currently considers each remote object to be entirely distinct from others. That includes (but is not limited to) the following incorrect assumptions:

* Deleting one object does not implicitly delete or modify any other objects, and can be done independently of the existence of other objects.
* Updating one object in-place never affects the state of another object.
* Creating a new object can never conflict with an existing object.

The above assumptions are clearly false for many real-world vendor APIs, though in practice we've been able to work around most of them in one way or another. In some cases that requires special care on the part of the user, though, which can be problematic if violating the assumption has a negative effect such as system downtime or Terraform becoming "stuck" and unable to make progress.

Based on real-world experience with existing APIs, it seems like Terraform could benefit from explicit modelling of relationships between resource objects that are richer than what can be inferred only from the user-provided dependency graph. In principle providers could use their knowledge about the remote system to give Terraform more information about these relationships, and then Terraform could use that information to prevent certain obviously-incorrect actions and to generate warnings about situations that are less certain.

The remainder of this issue is some notes about a possible way to achieve that, and some initial ideas about how it might be used. This initial sketch is mainly serving as a request for example use-cases to inform a next iteration of it, and not something that is currently ready to implement.

## Global Object IDs

Prior to Terraform 0.12, Terraform required all resource instance objects to have an associated `id` attribute, but imposed no requirement on how providers would use it other than that it must not be an empty string. In practice, that requirement didn't really serve any purpose from Terraform Core's standpoint, and so from Terraform 0.12 onwards there is no such requirement at the Core level, though as I write this the SDK does still impose that requirement for 0.11-compatibility reasons.

However, having a more strongly-defined sense of an ID for an object -- one that is global in scope and allows Terraform Core to make certain assumptions about it -- could be a useful building block for modelling relationships between objects.

Some of the remote systems we interact with already have a sense of ids that are global to their entire system. For example, AWS has the idea of an "ARN" which can uniquely identify a particular object across the whole of AWS, including not only the service-local unique identifier but also the overall AWS account the object belongs to and (where appropriate) the service region it was created in.

We can potentially generalize this idea by allowing each Terraform provider to define its own unique id scheme. The provider itself would control that scheme but Terraform Core would make certain assumptions about it that the provider must ensure are valid:

* Each remote object has an id completely distinct from all others.
* The unique id includes enough information to be unique across any possible provider configuration. (For example, for services that use regional namespaces selectable in the provider configuration, the id must include the region that was active when each object was created.)
* The ID for a particular object is stable over time. That is, upgrading to a new version of the provider won't cause references to an ID to become dangling, unless the target object legitimately no longer exists.

Because the requirements for each remote system are different, Terraform Core would impose only a simple syntax requirement on these ids: they must be strings and they must start with the provider type name followed by a colon. After the colon can come any valid sequence of Unicode printable characters. If the remote system already has a suitable global ID syntax, it may be best to just use that directly in case these ids are seen by users (though ideally they should not be). For example, any id generated by the `azurerm` provider must begin with `azurerm:` but can then be followed by any printable Unicode characters needed to fully describe the identity of an object.

In practice I suspect we might elect to allow each remote object to have potentially _multiple_ global object IDs, as a way to handle changes in the format over time (a provider could report both the old and new forms at once) and to deal with any other unavoidable ambiguity that might arise. In that case, though, each distinct ID string should still only be associated with one object.

Not all objects need to have global IDs. Firstly, if we were to introduce a feature like this then necessarily it would start with most existing providers not supporting it universally, and even after it's been around for a while the global ID mechanism would serve no purpose for certain object types. In particular, there's no reasonable global persistent ID for many of the transient in-state-only object types that are offered by providers like `null`, `tls`, etc.
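The only rule the issue proposes Terraform Core would enforce on these IDs is the `<provider type>:` prefix followed by printable Unicode characters. A minimal Go sketch of such a check might look like this; `validGlobalID` is a hypothetical name, treating an empty remainder as invalid is an assumption, and Go's `unicode.IsPrint` is used as one possible definition of "printable".

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// validGlobalID checks the hypothetical Global ID syntax from the issue:
// "<provider type>:" followed by printable Unicode characters. Everything
// after the colon is left to the provider's own ID scheme.
func validGlobalID(providerType, id string) error {
	prefix := providerType + ":"
	if !strings.HasPrefix(id, prefix) {
		return fmt.Errorf("global ID %q must start with %q", id, prefix)
	}
	rest := id[len(prefix):]
	// Assumption: the provider-specific part must not be empty.
	if rest == "" {
		return errors.New("global ID has nothing after the provider prefix")
	}
	// Ranging over invalid UTF-8 yields U+FFFD, which IsPrint accepts,
	// so check encoding validity explicitly first.
	if !utf8.ValidString(rest) {
		return errors.New("global ID is not valid UTF-8")
	}
	for _, r := range rest {
		if !unicode.IsPrint(r) {
			return fmt.Errorf("global ID contains non-printable character %U", r)
		}
	}
	return nil
}

func main() {
	candidates := []string{
		"azurerm:/subscriptions/0000/resourceGroups/example", // valid
		"azurerm:has\ttab", // tab is not printable
		"aws:vpc-abc123",   // wrong provider prefix
		"azurerm:",         // empty remainder
	}
	for _, id := range candidates {
		if err := validGlobalID("azurerm", id); err != nil {
			fmt.Println("invalid:", err)
		} else {
			fmt.Println("valid:", id)
		}
	}
}
```

Keeping the check this loose matches the issue's intent: Core only needs the prefix to know which provider owns an ID, and leaves uniqueness and stability to the provider.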
## Potential Uses for Global IDs

The following sections describe some situations we've already encountered that Global IDs might be useful for. There are likely other ways these problems could be addressed too, so this section is mainly here just as a set of examples to help us identify other problems that we might be able to address through the introduction of Global IDs.

### Detecting Object Collisions

A straightforward use of Global IDs is to automatically detect and flag when two objects in the same state have the same Global ID. That suggests a user error (defining the same object twice) and ought to be resolved somehow before proceeding, or Terraform's behavior would otherwise be unpredictable.

Another variant of this is situations where the provider has enough information available at plan time to predict one or more specific Global IDs for an object that hasn't been created yet. That would then potentially allow Terraform to detect collisions during planning and prevent them from occurring in the first place. Terraform will not always have sufficient information to detect this at plan time (if the Global ID is derived from values that won't be known until apply time), but in that case it would degenerate to the first case above of detecting the conflict during a subsequent plan and requiring some sort of resolution.

(Exactly what resolution would be possible/appropriate is an open question; perhaps Terraform would require removing all but one of the conflicting `resource` blocks but then skip creating `Delete` actions for those in the plan, assuming the user is intending the still-remaining `resource` block to be the "owner" of that previously-shared object.)

### "Containment" relationship

Many remote systems have a sense of one domain object being "contained within" another, which for the sake of this section we'll define as where the container object must outlive all of the contained objects. There are two main variants of this we've seen across many systems:

* As long as contained objects exist, the container cannot be deleted.
* Deleting the container implicitly deletes all of the contained objects.

Both of these situations violate Terraform's current assumptions. In the first case this can result in apply-time failures or timeouts, while the second case is more problematic in that it will tend to cause Terraform state to go out of sync with reality because Terraform cannot see that the contained objects have been deleted.

To address this, we could potentially augment the resource instance object state model so that each object can record:

* A set of Global IDs that the object is _contained within_.
* A set of Global IDs that the object _contains_.

While storing both directions of this relationship is redundant in the case where all objects are in the same configuration, it is possible (and, perhaps, common) for the objects to be split across two separate configurations by making use of data sources, and so the bidirectional tracking gives Terraform a fuller picture of the relationships in such cases.

The intent of these two sets is that they would be set by the provider during any changes, but also would be refreshed by the provider during a refresh operation, probably by calling an API to query the relationship.

As a specific example, consider that `aws_subnet` resources are always contained within `aws_vpc` resources: it's not possible to delete a VPC as long as at least one subnet exists. In this case it is a many-to-one relationship represented in the API as a foreign key on the subnet side, so the `aws_subnet` implementation can trivially determine the Global ID of the single VPC the subnet belongs to without any further queries (it's a transform of the `vpc_id` attribute), but the `aws_vpc` implementation would need to additionally call `DescribeSubnets` during refresh to properly populate the set of subnets that are contained within it, even if they were created in a different configuration.

Terraform Core can use this information to produce a more accurate plan whenever a container is planned for destruction. Terraform Core might see that a `Delete` action is planned for an `aws_vpc` and thus also automatically plan `Delete` actions for the associated subnets in the same configuration. If there are any contained subnets that are _not_ known in the current workspace state, Terraform could return an error saying that these contained objects must be destroyed first, and thus leave the human operator to decide which other Terraform configuration must be changed to achieve that.

The containment relationship also allows for improving Terraform's behavior in the more complex case of `DeleteThenCreate` or `CreateThenDelete` actions: this additional information might allow Terraform to understand both that it needs to replace all of the subnets when a containing VPC is replaced _and_ that these objects are related in a way that requires a specific ordering of the destroy and create actions to produce a correct result.

## Referring to Objects in the UI

The above use-cases include situations where Terraform Core must report a problem to the user that will include references to involved objects. Since the global IDs are not necessarily user-friendly, we might elect to have a mechanism to ask a provider to generate a human-friendly (but potentially slightly ambiguous) name for a given global ID. For example, while AWS VPC objects are a per-region namespace in principle, in practice collisions between regions are very unlikely within a particular user's infrastructure and so it is common to talk about VPCs and subnets using just their region-local ids, without qualifying them with a region. The AWS provider might elect to transform a full VPC ARN into just a `vpc-abc123`-like string for display to the user, assuming that the user will have enough context to understand which region is relevant, and intentionally excluding the AWS account id because VPC IDs never overlap between two AWS accounts.

## Relationships Between Providers

A key feature of Terraform is its ability to easily pass data between objects in entirely different systems. For example, the IP address of a created compute instance might be sent to a separate DNS vendor to create a DNS record.

It's not clear yet whether there are use-cases for representing Global ID-based relationships between objects in different providers. If there are then the global nature of these identifiers would make that possible, but that then imposes an additional compatibility constraint on each provider as the details of its global ID formats would be embedded in the logic of other providers.

Until we identify a specific use-case for representing a cross-provider relationship, I suggest we forbid it to start. Then if a use-case is found later we can use that real example to figure out what constraints ought to apply in that cross-provider case, rather than risking being constrained by a naive design not informed by use-cases.

## Sidebar: Global Object IDs for multi-instance systems

The idea of allocating global object ids maps nicely onto hosted (SaaS, etc) systems where the namespace of objects is physically fixed to a particular vendor and no other instances are available. It's trickier for self-hosted software and other situations where the physical location of the remote system is part of its unique identifier. For example, the `mysql` provider is configured with a hostname or IP address for the specific MySQL server to talk to. If the server has a stable, meaningful hostname then using that hostname as part of the identifier is reasonable, but in modern ephemeral environments such services often don't have stable locations and are instead located via a service discovery system, which may not be implemented via DNS lookups.

How to robustly allocate global object IDs for this class of remote system is an open question still to be resolved. A key requirement is that it be possible to move the system to another physical address without implicitly renaming all of its existing global IDs, which seems likely to involve introducing some sort of user-controlled "logical location" that is distinct from the physical location and can persist as the service moves between physical locations, but without imposing operational constraints on the service such as being at a stable hostname.
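The first use listed in that issue, flagging two objects in the same state that report the same Global ID, is simple to express once providers report such IDs. The Go sketch below is purely illustrative: state is modelled as a hypothetical map from resource instance address to the Global IDs its provider reported (several per object are allowed, as the issue suggests), `findCollisions` is an invented name, and the ARN-based IDs use the issue's `aws:` prefix convention only as an example.

```go
package main

import (
	"fmt"
	"sort"
)

// findCollisions returns, for each Global ID claimed by more than one
// resource instance, the sorted list of instance addresses claiming it.
// objects maps a resource instance address to the Global IDs its provider
// reported for that object.
func findCollisions(objects map[string][]string) map[string][]string {
	owners := make(map[string][]string)
	for addr, ids := range objects {
		seen := make(map[string]bool) // an object repeating its own ID is not a collision
		for _, id := range ids {
			if seen[id] {
				continue
			}
			seen[id] = true
			owners[id] = append(owners[id], addr)
		}
	}

	collisions := make(map[string][]string)
	for id, addrs := range owners {
		if len(addrs) > 1 {
			sort.Strings(addrs)
			collisions[id] = addrs
		}
	}
	return collisions
}

func main() {
	state := map[string][]string{
		"aws_vpc.main":   {"aws:arn:aws:ec2:us-east-1:123456789012:vpc/vpc-abc123"},
		"aws_vpc.legacy": {"aws:arn:aws:ec2:us-east-1:123456789012:vpc/vpc-abc123"},
		"aws_subnet.a":   {"aws:arn:aws:ec2:us-east-1:123456789012:subnet/subnet-111"},
	}
	for id, addrs := range findCollisions(state) {
		fmt.Printf("%s is claimed by %v\n", id, addrs)
	}
}
```

This covers only collisions already present in state; the plan-time variant the issue mentions would additionally need providers to predict IDs for objects not yet created.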

That issue discusses a way for providers to “talk about” related objects when planning changes for a specific object. In principle, that would allow a provider to express something like “deleting object A implicitly deletes object B”, and Terraform could then infer the extra actions required to deal with the implicit change, so that you’d no longer need to configure replace_triggered_by explicitly.
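To make the containment idea more concrete, here is a rough Go sketch of the inference the issue describes for a planned `Delete`: any object contained by something being deleted is also planned for deletion, and contained objects this state doesn't track produce an error. Everything here is hypothetical (the `object` type, `expandDeletes`, and the short `aws:` IDs); it ignores `DeleteThenCreate`/`CreateThenDelete` ordering and is not how Terraform works today.

```go
package main

import (
	"fmt"
	"sort"
)

// object is a hypothetical view of one resource instance in state: its
// address, its Global ID, and the Global IDs its provider says it contains.
type object struct {
	Addr     string
	GlobalID string
	Contains []string
}

// expandDeletes adds to the planned deletions every object that is
// (transitively) contained by an object already being deleted. If a
// container holds objects this state does not track, it returns an error so
// the operator can destroy them from whichever configuration owns them.
func expandDeletes(state []object, planned []string) ([]string, error) {
	byAddr := make(map[string]object, len(state))
	byID := make(map[string]object, len(state))
	for _, o := range state {
		byAddr[o.Addr] = o
		if o.GlobalID != "" {
			byID[o.GlobalID] = o
		}
	}

	deleting := make(map[string]bool)
	untracked := make(map[string]bool)
	queue := append([]string(nil), planned...)
	for len(queue) > 0 {
		addr := queue[0]
		queue = queue[1:]
		if deleting[addr] {
			continue
		}
		deleting[addr] = true
		for _, id := range byAddr[addr].Contains {
			if child, ok := byID[id]; ok {
				queue = append(queue, child.Addr)
			} else {
				untracked[id] = true
			}
		}
	}

	if len(untracked) > 0 {
		ids := make([]string, 0, len(untracked))
		for id := range untracked {
			ids = append(ids, id)
		}
		sort.Strings(ids)
		return nil, fmt.Errorf("contained objects must be destroyed first: %v", ids)
	}

	result := make([]string, 0, len(deleting))
	for addr := range deleting {
		result = append(result, addr)
	}
	sort.Strings(result)
	return result, nil
}

func main() {
	state := []object{
		{Addr: "aws_vpc.main", GlobalID: "aws:vpc-abc123", Contains: []string{"aws:subnet-a", "aws:subnet-b"}},
		{Addr: "aws_subnet.a", GlobalID: "aws:subnet-a"},
		{Addr: "aws_subnet.b", GlobalID: "aws:subnet-b"},
	}
	deletes, err := expandDeletes(state, []string{"aws_vpc.main"})
	fmt.Println(deletes, err)
}
```

Here deleting `aws_vpc.main` also plans deletion of both subnets; if `aws:subnet-b` were missing from `state`, the function would instead return an error naming it. In the Snowflake case from this thread, a warehouse reporting that it contains its grants would let the same kind of inference plan the grant's recreation without replace_triggered_by.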

However, that issue is really just a problem statement with only very early ideas on how to solve it. I have some further work on this in a non-public place, where I collected more specific examples from discussions with provider developers. It still seems like a promising direction, so hopefully we’ll eventually have time to do some more concrete design work and then implement something in this area.
