Terraform not adding dependencies on provider initialisation for data resources

Greetings,

What is the timeline for fixing data resource dependencies on initialised providers? Consider the following case:

resource "azurerm_databricks_workspace" "this" {
  # ...
}

// explicitly depends on azurerm
provider "databricks" {
  host = azurerm_databricks_workspace.this.workspace_url
}

// implicitly depends on databricks
data "databricks_current_user" "me" {}

// explicitly depends on databricks and data resource
resource "databricks_secret_scope" "this" {
  name = "demo-${data.databricks_current_user.me.alphanumeric}"
}

The problem gets fixed by adding depends_on, but not all users read the docs:

data "databricks_current_user" "me" {
  depends_on = [azurerm_databricks_workspace.this]
}

So I feel that the Terraform DAG must start discovering implicit dependencies.

Hi @nfx,

I’m not sure what exactly you mean by “data resource dependencies on initialised providers”. Since implicit dependencies are not something that exists in Terraform, there unfortunately isn’t any timeline I can give you for fixing them. Was there a specific issue you were referring to?

In the example here, there is no way for Terraform to infer that data.databricks_current_user.me depends on azurerm_databricks_workspace.this. In order to connect these nodes, something needs to inform Terraform of their relation, and currently all dependency relationships come from the configuration.

But why can’t the provider instance be part of the DAG?

Or if databricks_secret_scope discovers a dependency on azurerm_databricks_workspace.this, why can’t the data resource data.databricks_current_user.me also discover it? A data resource should logically depend on its provider being properly initialised.

Provider instances are certainly part of the dependency graph already, but I see where the confusion lies now.

There is no hidden discovery going on here; Terraform is using the dependencies declared in the configuration itself. The databricks resources all depend on their provider, which in turn depends on azurerm_databricks_workspace.this.workspace_url. This dependency is not missing from the graph; what changes when you add depends_on is the behavior of the data source.

Data sources are intended to be read as soon as possible during the planning process, so that their attributes can be used for planning purposes. However, the databricks provider in your example requires a value from a managed resource which can’t be known until after the apply is complete. This means that by default data.databricks_current_user.me will attempt to read the data during planning, but using an incomplete configuration (this is allowed for compatibility with providers that can plan correctly with incomplete configuration).

If the resulting configuration still works as desired, then adding depends_on is the correct workaround for this situation. From the Data Resource Dependencies documentation:

Setting the depends_on meta-argument within data blocks defers reading of the data source until after all changes to the dependencies have been applied

In most cases, this style of multi-level infrastructure is better suited to being applied from multiple configurations, so that you can ensure the required azurerm infrastructure is in place before building out the databricks infrastructure on top of it. If combining the configurations works for your purposes, you can continue using it in this way, but understanding where the individual layers are separated conceptually will help diagnose similar issues that may arise.
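To make the layering concrete, here is a minimal sketch of splitting the example into two configurations. This is an illustration, not something from the thread: the directory names, the local backend, and the output name are all assumed for the example, and a real setup would typically use a remote backend instead.

```hcl
# Layer 1 ("azure" configuration): creates the workspace and
# exports the value the next layer's provider will need.
resource "azurerm_databricks_workspace" "this" {
  # ...
}

output "workspace_url" {
  value = azurerm_databricks_workspace.this.workspace_url
}

# Layer 2 ("databricks" configuration): applied only after layer 1
# exists, so the provider configuration is always fully known at plan time.
data "terraform_remote_state" "azure" {
  backend = "local" # illustrative; a remote backend is more typical
  config = {
    path = "../azure/terraform.tfstate"
  }
}

provider "databricks" {
  host = data.terraform_remote_state.azure.outputs.workspace_url
}

data "databricks_current_user" "me" {}
```

Because the second configuration can only be applied once the first has produced state, the data source never has to read through an incomplete provider configuration.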

@jbardin is there a way for a provider to mark all of its data resources to be resolved only after managed resources are available? It makes sense for the “first layer” providers (aws, azurerm, google, …) to read data resources as soon as possible, but “second layer” providers (k8s, databricks, azuredevops, …) sometimes depend on infrastructure that has yet to be created.

I was looking through sdkv2, but wasn’t able to find anything. Forgetting to add a depends_on block is one of the most common user complaints.

No, the order of operations is solely determined by the configuration. The output of data sources is often used in the planning process, so having a data source that could not be read during planning would be quite surprising to Terraform users. There aren’t any similar options in any SDK, since it would require a change to the provider protocol to implement.

I think most of these types of examples I’ve seen could not be solved by depends_on alone because the data source is feeding into a provider config or expansion expression. In many cases the problem can be solved by not using a data source at all. If the same configuration is creating the managed resource represented by the data source, computed attributes of that managed resource should be used instead (we usually recommend in general that a managed resource and data resource representing the same object not be used in the same configuration). This ensures the dependencies are ordered correctly, with the new values only being available during apply.
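As a concrete illustration of that recommendation (a sketch only, reusing the workspace from the original example): when the same configuration manages the object, reference the managed resource’s computed attribute directly instead of reading the object back through a data source.

```hcl
resource "azurerm_databricks_workspace" "this" {
  # ...
}

# Avoid: a data source reading back an object this same
# configuration manages; it races the resource's creation.
# data "azurerm_databricks_workspace" "this" { ... }

# Prefer: the managed resource's computed attribute, which
# carries the correct dependency ordering automatically and
# is simply "(known after apply)" until the workspace exists.
provider "databricks" {
  host = azurerm_databricks_workspace.this.workspace_url
}
```

The managed-resource reference gives the graph exactly one node representing the object, so ordering never depends on the module author remembering an extra depends_on.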

Unfortunately there are still many details about how individual resources interact that must be known to the module author. For example, there is no way for Terraform to know when create_before_destroy needs to be used in order for resource replacement to work with certain types of dependencies.
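For completeness, the create_before_destroy behavior mentioned above is opted into per resource via a lifecycle block; a minimal sketch, applied here to the workspace from the example:

```hcl
resource "azurerm_databricks_workspace" "this" {
  # ...

  lifecycle {
    # Create the replacement before destroying the old object, so
    # dependents are never pointed at a resource that no longer exists.
    create_before_destroy = true
  }
}
```

Whether this ordering is safe (e.g. whether two instances of the object can coexist, or whether a name must be unique) is exactly the kind of resource-specific detail the module author has to know.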