Please suggest how to profile terraform execution to get some useful metrics.
The origin of the question is that we have ~1K modules wrapped into a tree structure with Terragrunt and executed with Python.
I can time the execution itself and parse the resources section of the state file (rough sketch at the end of this post).
But I’m missing internal metrics that would let me come up with a module-size metric, a count of API calls, etc.
The two things I’m trying to achieve:
optimize modules to avoid unnecessary calls
deduce the complexity of modules from some defensible numbers
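Roughly what I mean by the timing I can already do, as a minimal sketch (the module path is a made-up placeholder; in reality the runs go through Terragrunt):

```python
import json
import subprocess
import time

# Time one terraform run and count resources from the state.
# Assumes `terraform show -json` is available (Terraform >= 0.12);
# the working directory below is a placeholder.
workdir = "modules/example"

start = time.monotonic()
subprocess.run(["terraform", "plan", "-out=tfplan"], cwd=workdir, check=True)
elapsed = time.monotonic() - start

# `terraform show -json` renders the latest state as a JSON document.
state = json.loads(
    subprocess.run(
        ["terraform", "show", "-json"],
        cwd=workdir, check=True, capture_output=True, text=True,
    ).stdout
)

resources = state.get("values", {}).get("root_module", {}).get("resources", [])
print(f"plan took {elapsed:.1f}s, {len(resources)} resources in root module")
```

That gives me wall-clock time and a resource count, but nothing about what Terraform actually spent the time on.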
For most (but not necessarily all) Terraform configurations, the most significant delays at runtime come from waiting for remote network APIs to respond, or to eventually become consistent.
You can potentially measure those by running terraform plan -json -out=tfplan and terraform apply -json tfplan, where the -json argument will ask Terraform to produce machine-readable output in the form of a stream of JSON objects written to stdout.
If you write a wrapper program to consume that output and record the arrival times of certain interesting events then you should be able to determine which operations are taking the longest.
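For example, a minimal wrapper in Python might look like the sketch below. The apply_start/apply_complete message types and the hook fields are taken from the machine-readable UI format; treat the exact field names as something to verify against your Terraform version:

```python
import json
import subprocess
import time

# Sketch of a wrapper that times each resource operation by recording
# the arrival times of apply_start/apply_complete events emitted by
# `terraform apply -json tfplan` (Terraform's machine-readable UI).
proc = subprocess.Popen(
    ["terraform", "apply", "-json", "tfplan"],
    stdout=subprocess.PIPE, text=True,
)

started = {}
durations = {}
for line in proc.stdout:
    try:
        msg = json.loads(line)
    except json.JSONDecodeError:
        continue  # skip any non-JSON noise on stdout
    addr = msg.get("hook", {}).get("resource", {}).get("addr")
    if msg.get("type") == "apply_start" and addr:
        started[addr] = time.monotonic()
    elif msg.get("type") == "apply_complete" and addr in started:
        durations[addr] = time.monotonic() - started[addr]

proc.wait()
# Slowest operations first.
for addr, seconds in sorted(durations.items(), key=lambda kv: -kv[1]):
    print(f"{seconds:8.1f}s  {addr}")
```

I believe recent versions also include an elapsed_seconds field on the apply_complete messages themselves, which you could read directly instead of timing arrivals yourself.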
I have some workspaces that reliably take multiple hours to plan, even with -refresh=false (and confirming with TF_LOG=trace that they’re making no API calls). It’s not even a provider issue, as far as I can tell: the provider completes its work in a millisecond, and instead it’s Terraform’s core that eats up all my cores doing nothing interesting.
It’d be really nice to have a way to dig into profiles to try and understand what’s happening.
If it’s an inefficiency in Terraform Core itself then I expect we’d need to use Go profiling tools to get into that, since the Terraform language runtime only has hooks around the external events it’s orchestrating, not around its own CPU-bound work.
A full execution profile isn’t usually needed, and won’t help much until one is working on some specific optimization. The first thing I would do is look for gaps in the trace log timestamps. That will usually narrow down the problem sufficiently to indicate what the slow operation is.
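A rough sketch of that gap-finding, assuming the hclog-style timestamp prefix that trace log lines normally start with (the exact format can vary between versions):

```python
import re
import sys
from datetime import datetime

# Scan a TF_LOG=trace log for the largest gaps between consecutive
# timestamps. Assumes line prefixes like
# "2024-01-15T10:23:45.123-0500 [TRACE] ..."; adjust the pattern and
# strptime format if your log lines differ.
TS = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+[+-]\d{4})")

gaps = []
prev_time, prev_line = None, None
with open(sys.argv[1]) as log:
    for line in log:
        m = TS.match(line)
        if not m:
            continue
        t = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S.%f%z")
        if prev_time is not None:
            gaps.append(((t - prev_time).total_seconds(), prev_line))
        prev_time, prev_line = t, line.rstrip()

# The line just before each big gap points at the slow operation.
for seconds, line in sorted(gaps, reverse=True)[:10]:
    print(f"{seconds:8.2f}s after: {line[:120]}")
```

Capture the log with TF_LOG=trace TF_LOG_PATH=trace.log terraform plan, then run the script against trace.log.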
Given the known limitations of Terraform, there are two common sources of slowness during the plan:
an excessively large and highly connected configuration graph, which can be slow to process
many references to resources with very large numbers of instances
Unfortunately, the usual recommendation of “break this workspace apart into smaller workspaces” isn’t viable here. We have a monolith, and we end up with lots of security rules that all apply to the same context. I could split them up alphabetically, but that seems like a really silly workaround.
I get that performance isn’t the most important concern with Terraform, but having plan time scale exponentially with the number of resource instances is really troublesome for large-scale deployments. Sure, we can sometimes break things apart (and we do, as much as possible, despite the operational overhead we incur by doing that), but sometimes there really isn’t a better way to model the resources.