Seeing very bad performance when using for_each over ~3k resources

Hey all,

I’m developing a terraform provider and I’m hitting issues I didn’t expect, namely the performance of terraform itself when I use it to provision ~3k instances of one of my provider’s resources.

The code looks like this:

resource "incident_catalog_entry" "service" {
  for_each = local.catalog

  catalog_type_id = incident_catalog_type.service.id

  name = each.value["type_data"]["name"]

  attribute_values = [
    {
      attribute = incident_catalog_type_attribute.service_owner.id,
      value     = try(each.value["type_data"]["owner"][0], null)
    },
  ]
}

Where the catalog is loaded from a JSON file with ~3k entries. The confusing thing (for me, at least) is that this appears to be a performance issue in provider SDK code, which I’ve concluded by:

  • Confirming that using a different resource, such as a plain file resource, plans just fine, even for 3k files
  • Profiling the binary: the provider isn’t making any calls to the resource’s external API; it appears to be spinning, and is wedged consuming all CPU cores on my machine
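For reference, local.catalog is built from the JSON export along these lines (the filename, the entry ID key and the exact structure here are an approximation rather than my real config):

locals {
  # Decode the exported catalog and key it by entry ID so it can drive for_each.
  catalog = {
    for entry in jsondecode(file("${path.module}/catalog.json")) :
    entry["id"] => entry
  }
}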

I’m actively looking into this now, including an alternative solution where I offer an entries resource that takes all entries as a single attribute, but wanted to post here to gut check this with people.
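To illustrate, here’s a rough sketch of what that alternative could look like (the incident_catalog_entries resource and its schema are hypothetical, not something the provider exposes today):

resource "incident_catalog_entries" "service" {
  catalog_type_id = incident_catalog_type.service.id

  # All ~3k entries live under a single resource instance, so terraform plans
  # one resource and the provider reconciles the individual entries itself.
  entries = [
    for entry in local.catalog : {
      name  = entry["type_data"]["name"]
      owner = try(entry["type_data"]["owner"][0], null)
    }
  ]
}

The trade-off is that the provider then has to do its own diffing of entries, rather than leaning on terraform to plan each one.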

Should I expect ~3k entries to be out of scope for terraform to handle? I’ll restate: planning wedges my computer at full CPU use, and I’ve waited up to 15m without seeing any results.

Thanks!

3,000 resources in a single root module is generally a lot. I’d expect something of that scale to often be split into multiple root modules (i.e. different state files).

3,000 resources is a lot

Yeah, having been a long-time terraform user I’d never have expected this. But from looking at the code I think there’s something accidentally quadratic in the SDK, which means up to ~1,000 resources is fine, then you fall off a cliff fast.

It’s useful to know people would view this as too many resources.

I wonder: have you seen any providers handle this type of thing, where you’re trying to bulk load data into a resource? It’s extremely convenient for us to allow provisioning like this via terraform and it seems terraform itself can handle this cardinality, just not the SDK.

So any other examples might be useful for me.

Hi @lawrencejones,

Sorry you are running into this issue. I believe that we technically don’t have an “upper limit” on the number of resources we support, although 3,000 resources is not a typical use case. If you believe that the root cause is in the SDK and the issue is easily reproducible, then I would suggest you open an issue on the SDK’s GitHub repo with more info (schema, trace logs, config, etc.) so that we can take a look at it.

Many providers can’t handle anything like that number of resources, for various reasons. For example, doing a refresh with the AWS provider when there are lots of resources is painfully slow due to all the API calls that need to be made. I think for some providers it is even worse, as API rate limits get tripped and things just don’t work above a certain level.

I usually find reasons to split things at a much lower number of resources anyway - for example splitting by functional area, responsible team or update cadence.
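As a rough sketch, assuming the catalog entries carried something like a team field (purely illustrative), each root module could filter the shared data down to its own slice:

variable "team" {
  type = string
}

locals {
  # Each root module only manages the entries owned by its team.
  team_catalog = {
    for key, entry in local.catalog :
    key => entry
    if try(entry["type_data"]["team"], null) == var.team
  }
}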

I once attempted to benchmark Terraform in a configuration used to create vault_policy resources, and another for various Vault group/group-alias resources.
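Roughly along these lines, though this is a sketch rather than my exact configuration (the count and the policy body are placeholders):

variable "policy_count" {
  type    = number
  default = 1000
}

resource "vault_policy" "benchmark" {
  count = var.policy_count

  name   = "benchmark-${count.index}"
  policy = <<-EOT
    path "secret/data/benchmark-${count.index}/*" {
      capabilities = ["read"]
    }
  EOT
}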

I observed a pathological scaling behaviour of O(n³) with the number of resources.

I did not, at the time, dig deep enough to see whether the problem was in the core or provider side.

Unfortunately, it does look like Terraform is not currently well suited to managing things of which there are legitimately thousands of instances, and which don’t break down conveniently into separate Terraform workspaces.

I think what you found is exactly what I’ve hit, where performance goes off a cliff once for_each passes a certain threshold.

I tried fixing this by moving to a single terraform resource which manages many sub-resources by reconciling itself internally. That worked from a terraform perspective, but I soon hit an issue with the provider SDK, where the logging library has really poor performance.

If we can sort out the logging issue then I can see this strategy being a viable solution for large resources, though obviously the for_each issue for large numbers remains.