Hello, I’m looking for some advice and general best practices on how to speed up the deployment (and eventual destruction) of 4000+ EC2 instances. I ran into a couple of different issues, so I’ll mention them all here, and perhaps separate into multiple topics or tickets later, if needed.
Here’s what the terraform code looks like:
resource "aws_instance" "my_endpoints" {
  ...
  count = var.instance_count
  ...
  timeouts {
    create = "60m" # default 10m times out on 160 instances
    delete = "60m" # default 20m is OK but increasing just in case
  }
}
And my version information:
$ terraform --version
Terraform v0.15.4
on linux_amd64
+ provider registry.terraform.io/hashicorp/aws v3.41.0
Provisioning Parallelism
The first issue I ran into was that even after increasing the create timeout to several hours, the run would still time out with count=4000. I ended up calling terraform multiple times, 500 instances at a time, until I reached 4000. That works fine, but it takes about 6 hours.
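For reference, the batching I did by hand could be scripted roughly like this. It is a minimal sketch, assuming each run raises instance_count by 500 through -var and that -auto-approve is acceptable; the helper names batch_targets and apply_command are hypothetical, and the script only prints the commands instead of running them.

```python
import shlex


def batch_targets(total, step):
    """Return the cumulative instance counts to apply, ending exactly at total."""
    if total <= 0 or step <= 0:
        raise ValueError("total and step must be positive")
    targets = list(range(step, total, step))
    targets.append(total)  # the last run always lands exactly on the total
    return targets


def apply_command(count):
    """Build the argv for one batch (assumes the instance_count variable from my config)."""
    return ["terraform", "apply", "-auto-approve", f"-var=instance_count={count}"]


if __name__ == "__main__":
    # Dry run: print each command; pass the list to subprocess.run to execute it.
    for target in batch_targets(4000, 500):
        print(shlex.join(apply_command(target)))
```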
I’m running terraform on a system with 4 cores and 8 GB of RAM. Do you think increasing the parallelism would help? If it would not help because I only have 4 cores, is there a recommended cores-to-parallelism mapping that I could consult? For example, if I wanted to reduce the total time to 1 hour, could I multiply the default parallelism by 6 (10 × 6 = 60), and would that require scaling up to 4 × 6 = 24 cores?
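To make the arithmetic concrete, this sketch spells out the linear scaling I’m assuming. That assumption is unverified and may well be wrong, since the work could be bound by the AWS API rather than by CPU. The -parallelism flag and its default of 10 are real; the function names are hypothetical.

```python
import math
import shlex

DEFAULT_PARALLELISM = 10  # Terraform's default for -parallelism


def scaled_parallelism(current_hours, target_hours, base=DEFAULT_PARALLELISM):
    """Scale parallelism by the desired speed-up, assuming run time scales linearly (unverified)."""
    if current_hours <= 0 or target_hours <= 0:
        raise ValueError("durations must be positive")
    speedup = current_hours / target_hours
    return math.ceil(base * speedup)


def apply_with_parallelism(parallelism):
    """Build the terraform apply argv with an explicit -parallelism value."""
    return ["terraform", "apply", f"-parallelism={parallelism}"]


if __name__ == "__main__":
    # 6 hours today, 1 hour wanted: a 6x speed-up under the linear assumption.
    n = scaled_parallelism(current_hours=6, target_hours=1)
    print(shlex.join(apply_with_parallelism(n)))
```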
Skip State Refresh?
When I went to destroy the 4000 instances, “terraform destroy” automatically refreshes the state. This took about an hour. Is it possible to ask terraform to skip the refresh and just do the delete?
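What I’m hoping for is something like the command built below. This is only a sketch: -refresh=false is documented as a planning option for terraform plan, and I’m assuming terraform destroy on v0.15.4 accepts it as well, which I have not verified. The helper name destroy_command is hypothetical.

```python
import shlex


def destroy_command(skip_refresh=True, auto_approve=False):
    """Build a terraform destroy argv; optional flags are added only when requested."""
    cmd = ["terraform", "destroy"]
    if skip_refresh:
        cmd.append("-refresh=false")  # assumption: accepted by destroy on v0.15.4
    if auto_approve:
        cmd.append("-auto-approve")  # skips the interactive "yes" prompt
    return cmd


if __name__ == "__main__":
    # Dry run: print the command rather than executing it.
    print(shlex.join(destroy_command(skip_refresh=True, auto_approve=True)))
```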
Destroy Cancelled!
After the refresh finally finished, I got the prompt to enter “yes”. I believe I entered exactly “yes”, but it came back and said the destroy was cancelled. I did some experimentation and found that leading whitespace (before “yes”) is not stripped, so the answer is rejected, while trailing whitespace is ignored. Thus, " yes" is bad but "yes " is fine. Either way, I don’t believe I typed any whitespace, yet it still cancelled my operation (which was devastating, since trying again means another 1-hour delay while the state refreshes). I could add -auto-approve next time, but has anyone heard of this problem?
...
Plan: 0 to add, 0 to change, 4001 to destroy.
Do you really want to destroy all resources?
Terraform will destroy all your managed infrastructure, as shown above.
There is no undo. Only 'yes' will be accepted to confirm.
Enter a value: yes
Destroy cancelled.
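To pin down the whitespace behaviour I observed, here is a small model of it. To be clear, this is my guess based on what I saw at the prompt, not Terraform’s actual input handling: trailing whitespace is dropped, leading whitespace is kept, and only an exact "yes" confirms. The function name is_confirmed is hypothetical.

```python
def is_confirmed(answer):
    """Model of the prompt as I observed it: trailing whitespace ignored, leading whitespace kept."""
    return answer.rstrip() == "yes"


if __name__ == "__main__":
    # The first three inputs are the ones I actually tried at the prompt.
    for answer in ["yes", "yes ", " yes"]:
        outcome = "confirmed" if is_confirmed(answer) else "cancelled"
        print(repr(answer), "->", outcome)
```

None of this explains the cancellation shown above, since under this model a plain "yes" should have been accepted.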
Error: InvalidVolume.NotFound
Running 4000 instances is of course pretty expensive, so I didn’t want to leave them running longer than necessary. After the destroy was cancelled, I re-ran “terraform destroy” with -auto-approve, but also started deleting batches of 50 from the AWS web interface. I figured I’d destroy as much as I could manually while the state refresh was occurring. However, the state refresh failed this time with:
...
aws_instance.my_endpoints[3356]: Refreshing state... [id=i-066973663064bd9f6]
aws_instance.my_endpoints[1230]: Refreshing state... [id=i-07774eb5e583493ed]
│ Error: InvalidVolume.NotFound: The volume 'vol-0038ca962fb0ef4b5' does not exist.
│ status code: 400, request id: 0a9c78fb-46dd-4623-8dab-13dacadcc362
│
│ with aws_instance.my_endpoints[1387],
│ on main.tf line 51, in resource "aws_instance" "my_endpoints":
│ 51: resource "aws_instance" "my_endpoints" {
This was a fatal error for terraform, so I had to continue manually deleting the 4000 instances in batches of 50 from the web interface. Is there a way to tell terraform to ignore those types of errors?
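One workaround I’m considering is to drop the already-deleted instances from the state before retrying, so the refresh never looks them up. This is a sketch, assuming terraform state rm is appropriate here (it only makes Terraform forget the resource and does not delete anything in AWS). The index 1387 comes from the error above; the helper names are hypothetical.

```python
import shlex


def instance_address(index, resource="aws_instance.my_endpoints"):
    """Return the state address of one counted instance."""
    return f"{resource}[{index}]"


def state_rm_command(indices):
    """Build a terraform state rm argv for instances already deleted outside Terraform."""
    if not indices:
        raise ValueError("need at least one index")
    return ["terraform", "state", "rm"] + [instance_address(i) for i in indices]


if __name__ == "__main__":
    # Dry run: shlex.join quotes the square brackets so the line is safe to paste into a shell.
    print(shlex.join(state_rm_command([1387])))
```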
Terraform vs AWS CLI?
I would appreciate any other best practices or tips the community can provide on speeding up these operations. In general, do you think I’d be better off scripting with awscli for this type of work? I don’t need to manage complex infrastructure over time or share code with others…I just need to spin up a lot of instances, use them for a short amount of time, and then destroy them. Is terraform the right tool for that job?
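For comparison, the manual batches of 50 I deleted through the web interface could be scripted against the AWS CLI roughly as follows. This is a sketch only: aws ec2 terminate-instances with --instance-ids is the real subcommand, but the batch size is just what I happened to use, the helper names are hypothetical, and the commands are printed rather than executed.

```python
import shlex


def chunks(items, size):
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def terminate_commands(instance_ids, batch_size=50):
    """Build one aws ec2 terminate-instances argv per batch of instance IDs."""
    return [
        ["aws", "ec2", "terminate-instances", "--instance-ids", *batch]
        for batch in chunks(instance_ids, batch_size)
    ]


if __name__ == "__main__":
    # Two IDs taken from the refresh log above; a real run would pass the full list.
    ids = ["i-066973663064bd9f6", "i-07774eb5e583493ed"]
    for cmd in terminate_commands(ids):
        print(shlex.join(cmd))
```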