Provisioning with Terraform over SSH can result in corrupted state files if there is an interruption; the backup isn’t usable either

With GitLab CI, my tests use a bash script to SSH into a Vagrant VM that is responsible for running a Terraform deployment.

In the event a test must be cancelled from the GitLab CI UI, this almost always results in a corrupted tfstate file, because the SSH connection is killed almost immediately.

I’ve also found the backup isn’t always usable in this scenario either, and it’s not something I’d like to rely on.

What alternatives are there? Possible paths of enquiry I’ve considered:

  1. Is it possible to get Terraform to automatically import all resources that match a tag? That way I could perhaps recover into a fresh state.
  2. Could Terraform operations be run via a service that would carry on, or stop the process gracefully, if the SSH connection ended?
  3. Is there some other way to repair a corrupted file, or to back up / checkpoint it more aggressively? If a tfstate file gets busted, it needs to be repaired automatically somehow.

I’m getting orphaned resources daily because the SSH connection to the VM running Terraform keeps getting killed.

Could Terraform update its state file in a more robust way?
e.g.:

  1. A new write to update the tfstate would instead go to a temporary file A.
  2. The existing tfstate file is moved to a backup temporary file B.
  3. Temp file A (the current state) is then moved to replace the .tfstate file.
  4. Temp file B is moved to replace the tfstate.backup.

This way, an interruption at any of these points would still leave a means to recover, hopefully automatically and without any intervention (which is not currently the case).
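
A rough shell sketch of what I mean (write_new_state is a placeholder for however the new snapshot gets produced; the rename step relies on mv being atomic when source and destination are on the same POSIX filesystem):

    # 1. Write the new state to a temp file next to the real one.
    tmp_new=$(mktemp ./terraform.tfstate.new.XXXXXX)
    write_new_state > "$tmp_new"   # placeholder for the real write
    sync                           # flush to disk before renaming

    # 2-4. Each rename either fully happens or not at all, so an
    # interruption at any point leaves at least one intact copy behind.
    mv terraform.tfstate terraform.tfstate.backup.tmp
    mv "$tmp_new" terraform.tfstate
    mv terraform.tfstate.backup.tmp terraform.tfstate.backup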

Could this approach reduce the likelihood of a corrupted tfstate file?
If not, could a similar improvement be made with a service?

The handling could also be performed by a service; that way a killed SSH connection would still be recoverable.

I’m still very much stumped by this. Any time I have to cancel a job running in GitLab, it results in a corrupted Terraform state file. It’s painful because I can’t safely cancel a shell job in GitLab and don’t know how it could be done, so all my jobs have to run to the end even when I’ve made a mistake.

My understanding is that using a shell script that SSHes into a Vagrant host, which then executes a Terraform deploy, circumvents Terraform’s ability to catch Ctrl+C (or SIGTERM?): a SIGTERM or Ctrl+C sent to the shell script prevents Terraform from exiting gracefully.

Are there any tricks I could use to trap the SIGTERM somehow and still allow Terraform to exit gracefully?

This article below was educational:
https://linuxconfig.org/how-to-modify-scripts-behavior-on-signals-using-bash-traps
But it still looks like any trap operation will forcefully interrupt a terraform apply…

Perhaps I need to background the terraform apply somehow, trap a SIGTERM in the shell script, and send a signal to the Terraform PID? If that’s the plan, I’m not sure what it would look like.
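
Maybe something like this (an untested sketch; I’m forwarding the trapped signal as a single SIGINT, i.e. what Ctrl+C would send, so Terraform gets a chance to exit gracefully):

    #!/usr/bin/env bash
    # Untested sketch: run terraform in the background so this wrapper
    # keeps control of signal handling, then forward a trapped
    # SIGTERM/SIGINT to terraform as a single SIGINT.
    terraform apply -auto-approve &
    tf_pid=$!

    trap 'kill -INT "$tf_pid" 2>/dev/null' TERM INT

    # wait returns early when a trapped signal arrives, so keep
    # waiting until terraform has actually exited.
    while :; do
      wait "$tf_pid"
      status=$?
      kill -0 "$tf_pid" 2>/dev/null || break
    done
    exit "$status"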

Hi @queglay,

From your second comment it sounds like you are using local state files. That mode of operation is primarily for local development and so it doesn’t try to optimize for robustness in a hostile environment where Terraform can potentially be terminated without warning.

You might have better luck with a remote state backend, because for those the state snapshots are typically written atomically, so that they are either updated in their entirety or not updated at all. (Details do vary by backend though, because they are generally subject to the behavior of whatever data store they are using.)

Even with atomic state updates you’ll still have to contend with the fact that interrupting Terraform during a terraform apply, without giving it an opportunity to exit gracefully, creates a very likely risk that Terraform has taken actions whose results are not yet committed to the remote state.

If you send Terraform SIGINT then it will flush a snapshot of the current state to the backend and signal all open providers to cancel their current operations if possible, and then wait for all of the in-progress provider operations to complete before exiting.

Unfortunately this does then leave Terraform at the mercy of the providers themselves, which are in turn often limited by the capabilities of underlying APIs: most REST APIs have no first-class cancellation mechanism, and so just abruptly terminating a write request means that it’s undefined whether the write eventually completed or not. Because of this, Terraform providers typically take a conservative approach and don’t support cancellation at all, so that they can be sure to see the results of any pending requests and have them committed to a new state snapshot before Terraform finally exits.

The upshot of all of this is that running Terraform in an environment where it will be routinely terminated without an opportunity to gracefully exit is not really practical. Even if you address your problem with corrupted local state snapshots, there are other problems awaiting you downstream. :confounded:

My first idea here would be to try to arrange for Terraform to be run in a different way that will allow it to terminate gracefully except in the most rare/catastrophic situations (e.g. power failure, kernel panic). I’m not familiar enough with GitLab CI to give specific ideas here, but as you mentioned perhaps it might look something like running Terraform indirectly via a service that runs in the background and can outlive a specific CI job. For example, running Terraform remotely in Terraform Cloud and then politely polling the Terraform Cloud API for the success/fail outcome would make it Terraform Cloud’s responsibility to terminate Terraform gracefully if the job gets terminated early. If a cloud thing isn’t appropriate, you could potentially implement a similar service for running Terraform internally.
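
For example (sketching from memory, so treat the exact endpoint and status values as approximate), a CI job that merely polls a Terraform Cloud run might look like:

    # Poll a Terraform Cloud run until it reaches a terminal status.
    # RUN_ID and TOKEN come from whatever queued the run.
    while :; do
      status=$(curl -s \
        -H "Authorization: Bearer $TOKEN" \
        "https://app.terraform.io/api/v2/runs/$RUN_ID" \
        | jq -r '.data.attributes.status')
      case "$status" in
        applied|planned_and_finished) exit 0 ;;
        errored|canceled|discarded)   exit 1 ;;
        *) sleep 10 ;;   # still running; keep waiting
      esac
    done

Killing this polling job is harmless: the run itself carries on (or is shut down gracefully) on the Terraform Cloud side.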

If running Terraform in this hostile environment is unavoidable, I think as you said you will probably need to circumvent Terraform’s usual state tracking and do something custom instead. A script that queries the remote APIs and runs a series of terraform import commands is one possibility.
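
For instance, a hypothetical sketch assuming AWS EC2 instances that all carry a known tag; the tag key, filter values, and resource address here are placeholders, not something Terraform provides out of the box:

    # Find all running instances carrying a hypothetical "deployment" tag.
    ids=$(aws ec2 describe-instances \
      --filters "Name=tag:deployment,Values=ci-test" \
                "Name=instance-state-name,Values=running" \
      --query 'Reservations[].Instances[].InstanceId' \
      --output text)

    # Import each one into a matching address in a fresh state.
    # (aws_instance.recovered must be declared in config with count.)
    i=0
    for id in $ids; do
      terraform import "aws_instance.recovered[$i]" "$id"
      i=$((i + 1))
    done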

Thanks so much for your detailed answer @apparentlymart. I don’t know how others are handling this or why it isn’t a more common issue; I thought GitLab CI and Terraform played well together, but it looks like shell runners aren’t great here. Terraform Cloud isn’t an option because I have to run as much as possible on private hardware.

I’ll have a think about my options; it does look like I’ll try backgrounding the run, trapping the signal, and attempting to exit more politely that way.

If we have an OSS Vault instance running, is it possible to persist Terraform remote state to Vault?

Since this post I’ve migrated to using a deployer EC2 instance (with AWS CodePipeline and CodeDeploy), and I configure remote state using Terragrunt.

Now the problem is different: if I force-stop a terraform apply, either through the CodeDeploy console or by running the apply through a bash script and sending Ctrl+C (SIGINT), I end up with Terraform locks that don’t get released. This prevents any further automation, since the locks have to be manually released.

How can we then interrupt the Terraform process and ensure the locks get released? AFAICT, at the end of any Terraform run, all locks taken by that run should be released; this doesn’t seem hard to automate, yet the solution is not clear to me.
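
For reference, the manual release step I’m doing today is Terraform’s force-unlock command, with the lock ID copied from the error output (it prompts for confirmation unless you pass -force):

$ terraform force-unlock <LOCK_ID>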

Yes, if you just forcefully quit an instance of Terraform while it is in the middle of applying changes, it won’t have a chance to remove any locks. More importantly, it probably won’t have updated the state file correctly, meaning the next time Terraform runs you might need to do some manual work (e.g. if Terraform has started creating something that doesn’t end up in the state, you could end up making another one during the next run, which would either fail as a duplicate or result in multiple instances).

If you stop an apply more gracefully, it will keep going until the currently in-progress changes are completed, and will release the locks afterwards.

I’d never suggest trying to automatically remove stuck locks. If they are still there because of a forcefully-quit run, you really need to do a quick check to ensure nothing manual is needed (e.g. a state rm or import). The lock might also be there because another run is in progress (e.g. someone manually running plan on their machine).

What options are there to gracefully stop the run?

From a technical perspective it is the difference between a SIGTERM & SIGKILL.

For Terraform in particular there is a special handling of SIGINT where the first such signal will cause Terraform to enter a shutdown mode where it will wait for current actions to complete but not start any new ones, whereas a second SIGINT will cause Terraform to immediately abort, interrupting whatever is going on without any attempts to release anything.

For graceful shutdown then, the easiest answer is to hit Ctrl+C only once and then let Terraform run to completion. However, that won’t help if the reason you want to abort is that the current operation is stuck, since then it cannot exit gracefully. In that case you can at least wait a little while after the first SIGINT, to give other concurrent actions a chance to complete before you forcefully abort, but you will still need to manually clean up any leftover mess and force-remove any active locks.
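
As a sketch, assuming you have the Terraform process ID available (e.g. from $! in a wrapper script), the escalation could look like:

    # First SIGINT: terraform finishes in-progress operations but
    # starts no new ones.
    kill -INT "$tf_pid"

    # Give it up to 10 minutes to wind down before escalating.
    for _ in $(seq 600); do
      kill -0 "$tf_pid" 2>/dev/null || exit 0   # exited cleanly
      sleep 1
    done

    # Second SIGINT: immediate abort; expect to clean up state and
    # force-remove locks manually afterwards.
    kill -INT "$tf_pid"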

I’m a bit late to the discussion here, but I’ll address your point #2, which I haven’t seen anyone else do. Yes, there are several ways of letting Terraform finish doing its thing even if the SSH connection goes down. If the connection were interactive, one obvious answer would be to run Terraform under tmux, and indeed tmux can also be used in scripts, but it’s a bit of an effort.

But Unix also has an easier way: the nohup command exists to deal with exactly this problem. Use it like this:

$ nohup terraform apply -auto-approve > tf.out

This makes the process immune to the hangup signal (SIGHUP) that is sent when the TTY goes away; the redirect above captures the output in a file (if you don’t care about the output you can skip the redirect, and nohup will append it to nohup.out instead).

You can test this by running it in a terminal window and closing the window while it’s running: the process won’t abort and will run to completion. The same goes if you drop out of an SSH connection.
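
If you also want your prompt back right away (e.g. so a CI script can carry on), background it and watch the log:

$ nohup terraform apply -auto-approve > tf.out 2>&1 &
$ tail -f tf.out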
