Seeing very bad performance when using for_each over ~3k resources

Hey all,

I’m developing a terraform provider and I’m hitting issues I didn’t expect, namely the performance of terraform itself when I use it to provision ~3k instances of one of my provider’s resources.

The code looks like this:

resource "incident_catalog_entry" "service" {
  for_each = local.catalog

  catalog_type_id = incident_catalog_type.service.id

  name = each.value["type_data"]["name"]

  attribute_values = [
    {
      attribute = incident_catalog_type_attribute.service_owner.id,
      value     = try(each.value["type_data"]["owner"][0], null)
    },
  ]
}

Where the catalog is loaded from a JSON file with ~3k entries. The confusing thing (for me, at least) is that this appears to be a performance issue in provider SDK code, which I’ve concluded by:

  • Confirming that using a different resource, such as a plain file resource, plans just fine, even for 3k files
  • Profiling the binary: the provider isn’t making any calls to the resource’s external API; it appears to be spinning, and is wedged consuming all CPU cores on my machine
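For reference, local.catalog is built from the JSON export along these lines (the filename, the entry ID key and the exact structure here are an approximation rather than my real config):

locals {
  # Decode the exported catalog and key it by entry ID so it can drive for_each.
  catalog = {
    for entry in jsondecode(file("${path.module}/catalog.json")) :
    entry["id"] => entry
  }
}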

I’m actively looking into this now, including an alternative solution where I offer an entries resource that takes all entries as a single attribute, but wanted to post here to gut check this with people.
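To illustrate, here’s a rough sketch of what that alternative could look like (the incident_catalog_entries resource and its schema are hypothetical, not something the provider exposes today):

resource "incident_catalog_entries" "service" {
  catalog_type_id = incident_catalog_type.service.id

  # All ~3k entries live under a single resource instance, so terraform plans
  # one resource and the provider reconciles the individual entries itself.
  entries = [
    for entry in local.catalog : {
      name  = entry["type_data"]["name"]
      owner = try(entry["type_data"]["owner"][0], null)
    }
  ]
}

The trade-off is that the provider then has to do its own diffing of entries, rather than leaning on terraform to plan each one.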

Should I expect ~3k entries to be out of scope for terraform to handle? I’ll restate: planning wedges my computer at full CPU use, and I’ve waited up to 15m without seeing any results.

Thanks!

3,000 resources in a single root module is generally a lot. I’d expect something of that scale to often be split into multiple root modules (i.e. different state files).

3,000 resources is a lot

Yeah, having been a long-time terraform user I’d never have expected this. But from looking at the code I think there’s something accidentally quadratic in the SDK, which means up to ~1,000 resources is fine, then you fall off a cliff fast.

It’s useful to know people would view this as too many resources.

I wonder: have you seen any providers handle this type of thing, where you’re trying to bulk load data into a resource? It’s extremely convenient for us to allow provisioning like this via terraform and it seems terraform itself can handle this cardinality, just not the SDK.

So any other examples might be useful for me.

Hi @lawrencejones,

Sorry you are running into this issue. I believe that we technically don’t have an “upper limit” on the number of resources we support, although 3,000 resources is not a typical use case. If you believe that the root cause is in the SDK and the issue is easily reproducible, then I would suggest you open an issue on the SDK’s GitHub repo with more info (schema, trace logs, config, etc.) so that we can take a look at it.

Many providers can’t handle anything like that number of resources, for various reasons. For example, doing a refresh with the AWS provider when there are lots of resources is painfully slow due to all the API calls that need to be made. I think for some providers it is even worse, as API rate limits get tripped and things just don’t work above a certain level.

I usually find reasons to split things at a much lower number of resources anyway - for example splitting by functional area, responsible team or update cadence.
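As a rough sketch, assuming the catalog entries carried something like a team field (purely illustrative), each root module could filter the shared data down to its own slice:

variable "team" {
  type = string
}

locals {
  # Each root module only manages the entries owned by its team.
  team_catalog = {
    for key, entry in local.catalog :
    key => entry
    if try(entry["type_data"]["team"], null) == var.team
  }
}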

I once attempted to benchmark Terraform in a configuration used to create vault_policy resources, and another for various Vault group/group-alias resources.
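Roughly along these lines, though this is a sketch rather than my exact configuration (the count and the policy body are placeholders):

variable "policy_count" {
  type    = number
  default = 1000
}

resource "vault_policy" "benchmark" {
  count = var.policy_count

  name   = "benchmark-${count.index}"
  policy = <<-EOT
    path "secret/data/benchmark-${count.index}/*" {
      capabilities = ["read"]
    }
  EOT
}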

I observed a pathological scaling behaviour of O(n³) with the number of resources.

I did not, at the time, dig deep enough to see whether the problem was in the core or provider side.

Unfortunately, it does look like Terraform is not currently well suited to managing things of which there are legitimately thousands of instances, and which don’t break down conveniently into separate Terraform workspaces.

I think what you found is exactly what I’ve hit, where performance goes off a cliff once for_each passes a certain threshold.

I tried fixing this by moving to a single terraform resource which manages many sub-resources by reconciling itself internally. That worked from a terraform perspective, but I soon hit an issue with the provider SDK, where the logging library has really poor performance.

If we can sort out the logging issue then I can see this strategy being a viable solution for large resources, though obviously the for_each issue for large numbers remains.