Optimizing terraform + aws to deploy and destroy 4000+ instances

Hello, I’m looking for some advice and general best practices on how to speed up the deployment (and eventual destruction) of 4000+ EC2 instances. I ran into a couple of different issues, so I’ll mention them all here, and perhaps separate into multiple topics or tickets later, if needed.

Here’s what the terraform code looks like:

resource "aws_instance" "my_endpoints" {
    ...
    count = var.instance_count
    ...
    timeouts {
    create = "60m" # default 10m times out on 160 instances
    delete = "60m" # default 20m is OK but increasing just in case
  }
}

And my version information:

$ terraform --version
Terraform v0.15.4
on linux_amd64
+ provider registry.terraform.io/hashicorp/aws v3.41.0

Provisioning Parallelism

The first issue I ran into was that even after increasing the create timeout to several hours, it would still time out when count=4000. I ended up calling terraform multiple times, increasing count in steps of 500 until I reached 4000 (roughly the loop sketched below). That works fine, but it takes about 6 hours.
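For reference, the batching is roughly this (a simplified sketch of what I run by hand; the 500-instance step size is just what I settled on):

# Grow instance_count in 500-instance steps so each apply stays within the create timeout.
for n in 500 1000 1500 2000 2500 3000 3500 4000; do
  terraform apply -auto-approve -var="instance_count=${n}"
done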

I’m running terraform on a system with 4 cores and 8 GB of RAM. Do you think increasing the parallelism would help? If it would not help due to only having 4 cores, is there a recommended cores-to-parallelism mapping that I could consult? For example, if I wanted to reduce the total time to 1 hour, could I multiply the default parallelism of 10 by 6 (to 60), and would that require increasing to 4x6=24 cores?

Skip State Refresh?

When I went to destroy the 4000 instances, “terraform destroy” automatically refreshes the state. This took about an hour. Is it possible to ask terraform to skip the refresh and just do the delete?

Destroy Cancelled!

After the refresh finally finished, I got the prompt to enter “yes”, and I believe I entered exactly “yes”, but Terraform came back and said the destroy was cancelled. I did some experimentation and found that leading whitespace (before “yes”) is not stripped and makes the answer invalid, while trailing whitespace is ignored. Thus " yes" is rejected but "yes " is fine. In any case, I don’t believe I typed any whitespace, yet it still cancelled my operation (which was devastating, since trying again introduces another 1-hour delay while refreshing state). I could add -auto-approve next time (shown after the output below), but has anyone else seen this problem?

...
Plan: 0 to add, 0 to change, 4001 to destroy.

Do you really want to destroy all resources?
  Terraform will destroy all your managed infrastructure, as shown above.
  There is no undo. Only 'yes' will be accepted to confirm.

  Enter a value: yes

Destroy cancelled.
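Next time I’ll probably just skip the interactive prompt entirely:

terraform destroy -auto-approve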

Error: InvalidVolume.NotFound

4000 instances is of course pretty expensive, so I didn’t want to leave them running longer than necessary. After the destroy was cancelled, I re-ran “terraform destroy” with -auto-approve, but also started deleting instances in batches of 50 from the AWS web interface. I figured I’d destroy as much as I could manually while the state refresh was occurring. However, the state refresh failed this time with:

...
aws_instance.my_endpoints[3356]: Refreshing state... [id=i-066973663064bd9f6]
aws_instance.my_endpoints[1230]: Refreshing state... [id=i-07774eb5e583493ed]
│ Error: InvalidVolume.NotFound: The volume 'vol-0038ca962fb0ef4b5' does not exist.
│       status code: 400, request id: 0a9c78fb-46dd-4623-8dab-13dacadcc362
│ 
│   with aws_instance.my_endpoints[1387],
│   on main.tf line 51, in resource "aws_instance" "my_endpoints":
│   51: resource "aws_instance" "my_endpoints" {

This was a fatal error for terraform, so I had to keep deleting the remaining instances manually in batches of 50 from the web interface. Is there a way to tell terraform to ignore those types of errors?
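One idea I’m considering for next time is dropping the already-deleted instances from state before retrying, so the refresh never looks for them (the index here is just an example):

terraform state rm 'aws_instance.my_endpoints[1387]'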

Terraform vs AWS CLI?

I would appreciate any other best practices or tips the community can provide on speeding up these operations. In general, do you think I’d be better off scripting with awscli for this type of work? I don’t need to manage complex infrastructure over time or share code with others; I just need to spin up a lot of instances, use them for a short amount of time, and then destroy them. Is terraform the right tool for that job?
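If it helps frame the question, the awscli version I have in mind is roughly this (the AMI ID, instance type, and tag values are placeholders):

# Launch a batch of instances, tagged so they can be found and deleted later.
aws ec2 run-instances \
  --image-id ami-12345678 \
  --instance-type t3.micro \
  --count 500 \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Role,Value=my-endpoints}]'

# Later, look up everything with that tag and terminate it in batches.
aws ec2 describe-instances \
  --filters "Name=tag:Role,Values=my-endpoints" "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].InstanceId' --output text \
  | xargs -n 100 aws ec2 terminate-instances --instance-ids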

Hi @iMHLv2,

I think any configuration involving thousands of objects is going to run into significant overhead making thousands of API requests, but as you say it’s probably the refresh step that’s the worst offender here, since Terraform normally needs to do that regardless of whether you’ve changed anything in the configuration or not (to see if there’s any drift).

Fortunately Terraform does offer a way to disable that step: you can run terraform plan -refresh=false or terraform apply -refresh=false to disable the refresh step, telling Terraform to just trust its own records of what all of these objects looked like at the end of the most recent terraform apply. Of course, in that case it won’t detect any drift, but fortunately for EC2 instances there isn’t a whole lot that can drift anyway, because most settings are immutable once an instance is running.

Multiple instances of the same resource can never depend on one another (because dependencies are between resources, not between resource instances), and so increasing the parallelism should indeed give Terraform the opportunity to deal with more of these at the same time during the initial create, and also during refresh if you don’t turn that off. At some point you may hit some sort of limit in the remote API for maximum number of outstanding requests, but I’m not sure what exactly the rules are for the EC2 API.

Thanks a lot @apparentlymart - I didn’t know about -refresh=false, and it seems to work with destroy too. Increasing the parallelism is making things better as well; experimenting with a batch of 100 instances, my sweet spot looks to be somewhere between 50 and 100.

$ time terraform apply -refresh=false -var="instance_count=100" -parallelism=10
real	9m24.612s

$ time terraform apply -refresh=false -var="instance_count=100" -parallelism=50
real	1m56.143s

$ time terraform apply -refresh=false -var="instance_count=100" -parallelism=100
real	3m14.839s

And for destroy:

$ time terraform destroy -refresh=false -parallelism=10
real	10m10.371s

$ time terraform destroy -refresh=false -parallelism=50
real	2m7.008s

$ time terraform destroy -refresh=false -parallelism=100
real	2m27.538s

When creating instances, I got “Error waiting for instance (i-002ddf72d4e6388c4) to become ready: Failed to reach target state. Reason: Server.InternalError: Internal error on launch” on two different tests, which meant only 99 of the 100 instances were created each time. So the higher parallelism may be affecting the timeouts, because that never happened before.

Hi @iMHLv2,

I’m no expert on the internals of the AWS provider, but the error you mentioned seems to be passing on an explicit error report from the remote API rather than just a timeout; “Server.InternalError” looks like the typical way the EC2 API formats its own server-side errors.

I don’t know what that error means (it seems like a generic error that is perhaps reporting something they don’t have an explicit check for, kinda like a 500 Internal Server Error in REST-style APIs), but I doubt it’s directly related to the timeout settings you chose, because those timeouts are typically handled client-side within the provider code itself.