My 25-resource Terraform Project is Taking 3 Minutes to Plan

Hi there, I have a new Terraform repo I’ve been experimenting with:

https://github.com/lancejpollard/cloud

I am new to Terraform and have run into all kinds of gotchas around its syntax and capabilities; it feels very limiting. Currently I have an example that creates 26 resources in the California region for AWS, in one availability zone, but it’s set up to handle all regions and availability zones. It takes 3-5 minutes to compute the plan…

Is this normal? What might I be doing wrong? It’s hard to iterate on the configuration if each terraform plan takes 3-5 minutes to run… I’d like it to take 20 seconds or less.

Hi @lancejpollard,

I wasn’t able to review your configuration in enough detail to explain exactly what is going on here; perhaps you can spot in the output some specific operations that are taking particularly long, in which case I could try to explain why that might be.

With that said, it’s not typical to manage an entire multi-region infrastructure deployment in a single Terraform configuration. Usually a larger deployment would be split into smaller units – one per region is a common first level of decomposition, possibly followed by further decomposition within each region by functional area or by frequency/risk of change – so that the potential impact of a particular change is reduced, and so Terraform does not have to resynchronize the entire space of objects on every change.

(Performance and “blast radius” considerations aside, there is also the concern of limiting the total dependency space of a particular configuration so that you don’t limit your ability to respond with configuration changes during partial outages. If your single Terraform configuration covers all of your regions and one region has an outage, you’d need to run Terraform in an unusual way to keep work on reconfiguring the other regions from being blocked by the outage. This is a reason for the first levels of decomposition to align with your system’s failure domains.)

I would typically use something like your region module as the first level of decomposition, either by writing a separate root module per region that each call into that shared module, or by using the region module itself as the root and using workspaces in its backend to keep each region’s state separated. For AWS in particular, the former is usually preferable from a failure domain standpoint, because otherwise all of your state storage will be colocated in a single S3 bucket in a single AWS region. Different tradeoffs can apply to other platforms.
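
To make that first option concrete, here’s a minimal sketch of what one per-region root module might look like, assuming a shared module at `./modules/region` and one state bucket per region (all of the names and inputs here are hypothetical):

```hcl
# Hypothetical layout:
#   modules/region/     - the shared region module
#   regions/us-west-1/  - one root module per region (this file)
#   regions/us-east-1/  - ...and so on for each active region

# regions/us-west-1/main.tf

terraform {
  # Each region's state lives in a bucket in that same region, so a
  # regional outage can't block Terraform work on the other regions.
  backend "s3" {
    bucket = "example-terraform-state-us-west-1"
    key    = "region/terraform.tfstate"
    region = "us-west-1"
  }
}

provider "aws" {
  region = "us-west-1"
}

module "region" {
  source = "../../modules/region"

  # Whatever per-region inputs your shared module declares.
  availability_zones = ["us-west-1a"]
}

output "vpc_id" {
  # Exposed so that a separate "global" configuration can read it
  # via the terraform_remote_state data source (see below).
  value = module.region.vpc_id
}
```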

There are also often some extra objects that don’t belong to just one region. It looks like your “GlobalAccellerator” objects are examples of that. For these, it’s common to have a separate “global” configuration that can be applied once all of the other regions are active, to produce whatever global objects are needed to make the parts appear as a single system. The global configuration can use data sources to retrieve information about the region-specific objects as necessary to complete the global object configurations. The global configuration will of course span multiple failure domains, so it’s best to keep it as small as possible within your other constraints.
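
As a sketch of that pattern, assuming each region’s configuration publishes outputs as in the earlier example (the bucket names, output names, and specific resources here are just illustrative):

```hcl
# global/main.tf - applied after the per-region configurations.

provider "aws" {
  # Global Accelerator's control-plane API is homed in us-west-2,
  # even though the accelerator itself is a global object.
  region = "us-west-2"
}

# Read another configuration's outputs from its stored state.
data "terraform_remote_state" "us_west_1" {
  backend = "s3"

  config = {
    bucket = "example-terraform-state-us-west-1"
    key    = "region/terraform.tfstate"
    region = "us-west-1"
  }
}

resource "aws_globalaccelerator_accelerator" "main" {
  name            = "example"
  ip_address_type = "IPV4"
  enabled         = true
}

# Listener and endpoint group resources would then refer to the
# per-region objects through the remote state outputs, e.g.:
#   data.terraform_remote_state.us_west_1.outputs.vpc_id
```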

I hope that helps!


Excellent info, thanks for all the tips, I will have to subdivide this further. Would you recommend creating 1 Terraform “project” or configuration per AZ even? Or is per-region good enough?

Thanks again!

Hi @lancejpollard,

Sorry I didn’t reply sooner. I’ve been taking some personal time for the last couple of weeks.

The question of per-region vs. per-availability-zone is a good one, and not one I can answer definitively – the answer will depend on what tradeoffs you want to make – but I will share how I made this decision some years ago when I was implementing Terraform for my former employer, and you can decide whether the same thought process applies to your situation:

For AWS, each region has an entirely separate set of API endpoints, and is therefore largely independent of the other regions with the exception of some global concepts like IAM. Therefore regions seem like the primary failure domains in AWS.

Availability zones are a second level of failure domain that is exposed for object types that interact with EC2/VPC, but that separation applies primarily to the behavior of objects already running; the AWS APIs that Terraform interacts with are not explicitly separated per availability zone, and so the handling of a single-AZ outage situation is an implementation detail of AWS rather than something we can directly control as users of the AWS APIs.
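
That asymmetry shows up directly in how Terraform’s AWS provider models things: the region is a property of the provider configuration (selecting a set of API endpoints), while the availability zone is just an argument on certain resources. A small sketch, with hypothetical names and CIDR ranges:

```hcl
# The region selects which set of AWS API endpoints Terraform talks
# to, so it is fixed for a whole provider configuration...
provider "aws" {
  region = "us-west-1"
}

resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

# ...while the availability zone is merely an argument on certain
# EC2/VPC resources, all created through that same regional API.
resource "aws_subnet" "a" {
  vpc_id            = aws_vpc.main.id
  availability_zone = "us-west-1a"
  cidr_block        = "10.0.1.0/24"
}

resource "aws_subnet" "b" {
  vpc_id            = aws_vpc.main.id
  availability_zone = "us-west-1b"
  cidr_block        = "10.0.2.0/24"
}
```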

For that reason, I elected to use region as the first level of decomposition and the primary failure domain, but made the second level of decomposition by functional area rather than by failure domain, because I expected that in the event of a whole-AZ outage the failure behavior would be undefined and not something I’d be able to predict and model in the Terraform architecture anyway; it would have been additional complexity with no clear benefit.