Hi everyone,
I have an AWS infrastructure defined with Terraform. However, when the tags or the AMI of its Batch compute environment are updated, Terraform replaces (i.e. destroys and recreates) the compute environment, which causes downtime in Batch. Details:
Context
- An AWS infrastructure defined by Terraform.
- The infrastructure includes a Batch compute environment and a job queue that is mapped to the compute environment.
- Default tags are applied to both the compute environment and the job queue.
- The image of the compute environment is set to a specific AMI defined in a tfvars.json file.
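For reference, the setup looks roughly like the following sketch. All resource, variable, and role names here are hypothetical placeholders, not the actual configuration:

```hcl
# Hypothetical sketch of the current setup; names and values are illustrative.
provider "aws" {
  region = "us-east-1"
  default_tags {
    tags = { Project = "example" } # default tags applied to both resources
  }
}

variable "batch_ami_id" {
  type = string # supplied via terraform.tfvars.json
}

resource "aws_batch_compute_environment" "main" {
  compute_environment_name = "main-ce"
  type                     = "MANAGED"
  service_role             = aws_iam_role.batch_service.arn

  compute_resources {
    type               = "EC2"
    image_id           = var.batch_ami_id # changing this forces replacement
    min_vcpus          = 0
    max_vcpus          = 16
    instance_type      = ["optimal"]
    subnets            = var.subnet_ids
    security_group_ids = var.security_group_ids
    instance_role      = aws_iam_instance_profile.batch.arn
  }
}

resource "aws_batch_job_queue" "main" {
  name                 = "main-queue"
  state                = "ENABLED"
  priority             = 1
  compute_environments = [aws_batch_compute_environment.main.arn]
}
```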
Problem
- When the tags or the AMI of the compute environment are updated, Terraform destroys and recreates the compute environment.
- Before Terraform can destroy the compute environment, it must be detached from its job queue. Thus, today, we destroy the entire job queue prior to `terraform apply` if the compute environment needs to be replaced.
- This causes Batch downtime, which is critical since there are scheduled jobs to be run in Batch.
Idea: Have two pairs of Batch compute environment and job queue (blue-green deployment)
- Description: During an upgrade of the infrastructure, have Terraform update only one pair while the other pair handles scheduled Batch jobs.
- Limitation: This requires the `lifecycle.ignore_changes` attribute of `aws_batch_compute_environment` to be set dynamically (i.e. when updating the blue pair, set the `lifecycle.ignore_changes` attribute of the green compute environment to `all`). However, `lifecycle.ignore_changes` only takes static expressions. Otherwise, Terraform fails with the error `A static list expression is required.`
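To illustrate the limitation (resource and variable names are hypothetical): Terraform rejects any non-literal expression inside `ignore_changes`, so a per-color toggle like the first resource below fails to validate, while only literal forms like the second are accepted:

```hcl
# Hypothetical attempt: toggle ignore_changes per deployment color.
# Terraform rejects this with "A static list expression is required."
resource "aws_batch_compute_environment" "green" {
  # ... compute environment arguments ...
  lifecycle {
    ignore_changes = var.active_color == "blue" ? all : []
  }
}

# Only static expressions are accepted:
resource "aws_batch_compute_environment" "green_static" {
  # ... compute environment arguments ...
  lifecycle {
    ignore_changes = all # or a static list such as [tags, compute_resources]
  }
}
```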
How can we prevent Batch downtime when the compute environment and job queue need to be updated with new tags and the AMI of the compute environment needs to be updated? Any suggestions/ideas would be appreciated. Thanks!