Provisioning with Terraform over SSH can result in corrupted state files if there is an interruption, and the backup isn't reliable

With GitLab CI, my tests use a bash script to SSH into a Vagrant VM that is responsible for running a Terraform deployment.

If a test must be cancelled from the GitLab CI UI, this almost always results in a corrupted tfstate file, because the SSH connection is killed almost immediately.

I’ve also found the backup isn’t always usable in this scenario either, and it’s not something I’d like to rely on.

What alternatives are there? Possible paths of enquiry I’ve had are:

  1. Is it possible to get Terraform to automatically import all resources that match a given tag? That way I could perhaps recover by building a new state.
  2. Or could Terraform operations be run via a service that would continue or stop the process cleanly if the SSH connection ended?
  3. Is there some other way to repair a corrupted file, or to back up / checkpoint the file more aggressively? If a tfstate file gets busted, it needs to be repaired automatically somehow.

I’m getting orphaned resources daily because the SSH connection to the VM running Terraform is being killed.

Could terraform update its state file in a more robust way?

  1. A new write to update the tfstate would first go to a temp file A.
  2. The current tfstate file is moved to a backup temp file B.
  3. Temp file A (the new state) is then moved to replace the .tfstate file.
  4. Temp file B is moved to replace the tfstate.backup.

This way, an interruption at any of these points would still leave a means to recover, hopefully automatically and without any intervention (not currently the case).
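The four steps above could be sketched as a shell function (a minimal sketch; `rotate_state` and the temp-file names are my own placeholders, not anything Terraform provides). The key property is that `mv` within a single filesystem is an atomic rename, so an interruption between any two steps still leaves a complete copy of either the old or the new state on disk:

```shell
# Hypothetical sketch of the four-step atomic state rotation.
rotate_state() {
  local state="$1"   # e.g. terraform.tfstate
  local new="$2"     # freshly written state snapshot

  cp "$new" "${state}.tmp.new"             # 1. new state goes to temp file A
  mv "$state" "${state}.tmp.old"           # 2. current state moves to temp file B
  mv "${state}.tmp.new" "$state"           # 3. temp A replaces the live state
  mv "${state}.tmp.old" "${state}.backup"  # 4. temp B replaces the backup
}

# Usage sketch:
#   rotate_state terraform.tfstate snapshot.json
```

At any interruption point, at least one of `terraform.tfstate`, `*.tmp.new`, or `*.tmp.old` holds an intact copy of the state, so recovery is a matter of picking the newest complete file rather than repairing a half-written one.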

Could this approach reduce the likelihood of a corrupted tfstate file?
If not, could an improvement be made with a service to do a similar job?

Perhaps the handling could also be performed by a service; that way a killed SSH connection would still be recoverable.

I’m still very much stumped by this. Any time I have to cancel a job running in GitLab, it results in a corrupted Terraform state file. It’s painful because I can’t safely cancel a shell job in GitLab and I don’t know how it can be done, so all my jobs just have to run to the end even when I’ve made a mistake.

My understanding is that using a shell script to SSH into a Vagrant host, which then executes a Terraform deploy, circumvents Terraform’s ability to capture Ctrl+C (or SIGTERM?): a SIGTERM or Ctrl+C delivered to the shell script prevents Terraform from exiting gracefully.

Are there any clues I could use to trap the SIGTERM somehow and still allow terraform to exit gracefully?

The article I found was educational, but it still looks like any trap operation will forcefully interrupt a terraform apply…

Perhaps I need to background the terraform apply somehow, trap SIGTERM in the shell script, and forward the signal to the Terraform PID? If that’s the plan, I’m not sure what it would look like.
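A minimal sketch of that plan, assuming a Bash wrapper (`run_gracefully` is a hypothetical helper name, not an existing tool). It forwards SIGINT rather than SIGTERM to the child, since Terraform treats SIGINT as a request to flush state and shut down gracefully:

```shell
# Run a long-lived command (e.g. terraform apply) in the background,
# trap TERM/INT on the wrapper, and forward SIGINT to the child so it
# can exit gracefully.
run_gracefully() {
  "$@" &                      # start the real command in the background
  local pid=$!

  # On TERM or INT, ask the child to stop gracefully with SIGINT.
  trap "kill -INT $pid 2>/dev/null" TERM INT

  # wait can return early when a trapped signal arrives, so keep waiting
  # until the child has actually exited, then report its exit status.
  local status
  wait "$pid"; status=$?
  while kill -0 "$pid" 2>/dev/null; do
    wait "$pid"; status=$?
  done

  trap - TERM INT             # restore default signal handling
  return "$status"
}

# Usage sketch (flags are illustrative):
#   run_gracefully terraform apply -auto-approve
```

Whether this helps in practice depends on how the CI runner kills the job: if it sends SIGTERM to the wrapper, the trap fires and Terraform gets a chance to exit cleanly; if the runner force-kills the whole process group with SIGKILL, nothing can be trapped.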

Hi @queglay,

From your second comment it sounds like you are using local state files. That mode of operation is primarily for local development and so it doesn’t try to optimize for robustness in a hostile environment where Terraform can potentially be terminated without warning.

You might have better luck with a remote state backend, because for those the state snapshots are typically written atomically so that they are either updated in entirety or not updated at all. (Details do vary by backend though, because they are generally subject to the behavior of whatever data store they are using.)

Even with atomic state updates you’ll still have to contend with the fact that interrupting Terraform during a terraform apply without giving it an opportunity to exit gracefully creates the very likely risk that there will be actions that have been taken by Terraform whose results are not yet committed to remote state.

If you send Terraform SIGINT then it will flush a snapshot of the current state to the backend and signal all open providers to cancel their current operations if possible, and then wait for all of the in-progress provider operations to complete before exiting.

Unfortunately this does then leave Terraform at the mercy of the providers themselves, which are in turn often limited by the capabilities of underlying APIs: most REST APIs have no first-class cancellation mechanism, and so just abruptly terminating a write request means that it’s undefined whether the write eventually completed or not. Because of this, Terraform providers typically take a conservative approach and don’t support cancellation at all, so that they can be sure to see the results of any pending requests and have them committed to a new state snapshot before Terraform finally exits.

The upshot of all of this is that running Terraform in an environment where it will be routinely terminated without an opportunity to gracefully exit is not really practical. Even if you address your problem with corrupted local state snapshots, there are other problems awaiting you downstream. :confounded:

My first idea here would be to try to arrange for Terraform to be run in a different way that will allow it to terminate gracefully except in the most rare/catastrophic situations (e.g. power failure, kernel panic). I’m not familiar enough with GitLab CI to give specific ideas here, but as you mentioned perhaps it might look something like running Terraform indirectly via a service that runs in the background and can outlive a specific CI job. For example, running Terraform remotely in Terraform Cloud and then politely polling the Terraform Cloud API for the success/fail outcome would make it Terraform Cloud’s responsibility to terminate Terraform gracefully if the job gets terminated early. If a cloud thing isn’t appropriate, you could potentially implement a similar service for running Terraform internally.

If running Terraform in this hostile environment is unavoidable, I think as you said you will probably need to circumvent Terraform’s usual state tracking and do something custom instead. A script that queries the remote APIs and runs a series of terraform import commands is one possibility.
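A sketch of what such a custom recovery script might look like, assuming tagged resources can be discovered via the provider’s API (the `import_all` helper, the address/ID pairs, and the discovery step are all placeholders; `terraform import` itself is the real command, taking a resource address and the provider’s ID for it):

```shell
# Reads "resource_address resource_id" pairs on stdin and imports each
# one into a fresh state with `terraform import`.
import_all() {
  local tf="${1:-terraform}"   # terraform command; substitutable for dry runs
  local address id
  while read -r address id; do
    [ -n "$address" ] || continue          # skip blank lines
    "$tf" import "$address" "$id"
  done
}

# Usage sketch: a hypothetical discovery step (e.g. a tag-based API query)
# feeds address/ID pairs into the importer:
#   discover_tagged_resources | import_all
```

The hard part is the discovery step and mapping each remote object to the right resource address in your configuration; the import loop itself is mechanical.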

Thanks so much for your detailed answer @apparentlymart. I don’t know how others handle this or why it isn’t a more common issue; I thought GitLab CI and Terraform played well together, but it looks like shell runners aren’t great here. Terraform Cloud isn’t an option because I have to run as much as possible on private hardware.

I’ll have a think about my options. It does look like I’ll try to background the run, trap the signal, and attempt a more polite exit that way.

If we have an OSS Vault instance running, is it possible to persist Terraform remote state to Vault?