We use a GitLab pipeline to run terraform init, plan and apply, which creates multiple resources on AWS. We also use S3 as the remote backend to store the Terraform state file. Normally everything works fine.
However, if the runner is killed or crashes in the middle of terraform apply, we find that some resources have been created in AWS and others have not, while the state file on S3 has not been updated. This leaves the state file out of sync with the resources that actually exist.
Can you please advise on how to deal with this scenario?
If a run doesn’t complete correctly, you may indeed find that the state file and reality are out of sync. You will need to manually investigate what is missing or extra, and then use terraform import and terraform state rm, and/or remove resources manually from AWS, to get things back in line.
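As a rough illustration of that investigation step, here is a minimal Python sketch that compares two lists of resource IDs: those recorded in state and those actually found in AWS. The function name and the idea of collecting the IDs beforehand (for example from the state file and from the AWS console or API) are assumptions made for this example, not part of Terraform. The output is only a list of candidates to review by hand; an ID missing from state may not belong to this configuration at all.

```python
def reconcile(state_ids, actual_ids):
    """Compare IDs recorded in Terraform state with IDs observed in AWS.

    Both arguments are plain iterables of resource ID strings that you
    have gathered yourself (hypothetical inputs for this sketch).
    """
    state_ids = set(state_ids)
    actual_ids = set(actual_ids)
    return {
        # Exist in AWS but are not in state: candidates for `terraform import`
        # (or manual deletion), after checking they belong to this config.
        "missing_from_state": sorted(actual_ids - state_ids),
        # Recorded in state but no longer exist: candidates for `terraform state rm`.
        "missing_from_aws": sorted(state_ids - actual_ids),
    }


if __name__ == "__main__":
    report = reconcile(
        state_ids=["i-aaa", "i-bbb"],
        actual_ids=["i-bbb", "i-ccc"],
    )
    for category, ids in report.items():
        print(category, ids)
```

Each ID in the report still needs a human decision about which resource address it maps to before running terraform import or terraform state rm.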
Thanks stuart-c for your reply.
Actually, we are looking for a solution in which the Terraform state file can be restored if the runner crashes, or in which all resources created by the pipeline are destroyed unless the run fully succeeds.
Any ideas for such a solution are welcome.
The issue is that nobody knows what needs doing without investigation.
The issue is that API calls have been sent to create, delete or modify resources, but before the state file is updated the Terraform program is terminated. Therefore there is no record of those new resources having been created.
You could revert the state file to a previous version (assuming you are using something like an S3 bucket with versioning), but you still wouldn’t know about resources which are in state but were removed, or which were created but not recorded.
In general it needs some level of intelligence (i.e. it is not something that is easy to do programmatically) to figure out which resources map to what inside Terraform.
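For the reverting part, a minimal Python sketch of choosing which version to go back to is shown here. It assumes you already have the version listing for the state object, shaped like the entries S3 returns when listing object versions (VersionId, IsLatest, LastModified), and that the listing has been filtered to the state file's key. The function name and sample data are made up for this example.

```python
from datetime import datetime, timezone


def previous_version_id(versions):
    """Return the VersionId of the newest non-current version, or None.

    `versions` is a list of dicts with the keys VersionId, IsLatest and
    LastModified, all for the same object key (an assumption of this sketch).
    """
    older = [v for v in versions if not v["IsLatest"]]
    if not older:
        return None  # nothing to roll back to
    return max(older, key=lambda v: v["LastModified"])["VersionId"]


if __name__ == "__main__":
    # Hypothetical listing for a single state file.
    sample = [
        {"VersionId": "v3", "IsLatest": True,
         "LastModified": datetime(2024, 1, 3, tzinfo=timezone.utc)},
        {"VersionId": "v1", "IsLatest": False,
         "LastModified": datetime(2024, 1, 1, tzinfo=timezone.utc)},
        {"VersionId": "v2", "IsLatest": False,
         "LastModified": datetime(2024, 1, 2, tzinfo=timezone.utc)},
    ]
    print(previous_version_id(sample))
```

Restoring that version only rewinds what Terraform believes; it does nothing about resources that were created or deleted after that version was written.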
For example, before you run Terraform there are 10 EC2 instances, all of the same size. You run Terraform and it starts creating some new instances, but it is terminated before the state file is updated. There are now 20 instances. Some of those could be from API calls Terraform sent before being aborted, but how do you know which instance maps to which Terraform resource? There might also be instances created by an autoscaling group (and therefore not something you should map to Terraform state), things created by a separate Terraform root module, or things which are managed totally outside of Terraform. You need to look at those instances and the code and try to figure it all out, which is not something that can be done easily by a generic program.
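A script can at most narrow the list of suspects in a case like this. The Python sketch below filters a list of instances down to those that are not in state, were launched after the aborted run started, and do not carry the aws:autoscaling:groupName tag that AWS puts on instances launched by an Auto Scaling group. The dict layout (id, launch_time, tags) and the function name are assumptions for this example. It cannot tell apart instances from another root module or from outside Terraform, so the result still needs manual review.

```python
from datetime import datetime, timezone

# Tag AWS adds to instances launched by an Auto Scaling group.
ASG_TAG = "aws:autoscaling:groupName"


def candidate_orphans(instances, state_ids, run_started_at):
    """Return IDs of instances that *might* come from the aborted run.

    `instances` is a list of dicts with hypothetical keys:
    id (str), launch_time (datetime), tags (dict of str to str).
    """
    state_ids = set(state_ids)
    candidates = []
    for inst in instances:
        if inst["id"] in state_ids:
            continue  # already tracked by Terraform
        if ASG_TAG in inst.get("tags", {}):
            continue  # launched by an autoscaling group, not directly by Terraform
        if inst["launch_time"] < run_started_at:
            continue  # existed before the aborted run began
        candidates.append(inst["id"])
    return candidates


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    before = datetime(2024, 1, 1, 11, 0, tzinfo=timezone.utc)
    after = datetime(2024, 1, 1, 12, 5, tzinfo=timezone.utc)
    instances = [
        {"id": "i-old", "launch_time": before, "tags": {}},
        {"id": "i-asg", "launch_time": after, "tags": {ASG_TAG: "web"}},
        {"id": "i-tracked", "launch_time": after, "tags": {}},
        {"id": "i-new", "launch_time": after, "tags": {}},
    ]
    print(candidate_orphans(instances, ["i-tracked"], start))
```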
I was thinking of using a PVC with the GitLab runner, so that if the runner crashes, a new runner can pick up the state file stored on the PVC. Terraform could then use it to create the remaining resources, and we would have a persistent state file for the runner.
Does this make sense? What do you think of this?
The state is stored in memory before being sent to the remote state location (assuming you are using a remote backend). So it wouldn’t help as nothing has changed on disk.