Provisioning Terraform via SSH can result in a corrupted state file if the connection is interrupted, and the backup is not reliable either

With GitLab CI, my tests use a bash script to SSH into a Vagrant VM that is responsible for running a Terraform deployment.

When a test is cancelled from the GitLab CI UI, the SSH connection is killed almost immediately, and this almost always results in a corrupted tfstate file.

I’ve also found the backup isn’t always usable in this scenario, and it’s not something I’d like to rely on.

What alternatives are there? Possible paths of enquiry so far:

  1. Is it possible to get Terraform to automatically import all resources that match a tag? That way I could perhaps recover by rebuilding a fresh state.
  2. Could Terraform be run via a service that would continue, or stop gracefully, if the SSH connection ended?
  3. Is there some other way to repair a corrupted file, or to back up / checkpoint it more aggressively? If a tfstate file gets corrupted, it needs to be repaired automatically somehow.
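For enquiry (2), a lighter-weight option than a full service is to detach the Terraform process from the SSH session, so a killed connection no longer kills the apply. A minimal POSIX sh sketch (file names `run.log` / `run.pid` and the terraform arguments are illustrative):

```shell
#!/bin/sh
# Hypothetical helper: run a command fully detached from the current
# (possibly SSH) session, so killing the connection does not kill it.
detach_run() {
  # nohup + background + redirected stdio detaches the child from the
  # terminal; the saved PID lets a later session poll for completion.
  nohup "$@" > run.log 2>&1 < /dev/null &
  echo $! > run.pid
}

# Intended usage from the CI bash script (illustrative):
#   detach_run terraform apply -auto-approve -input=false
#   while kill -0 "$(cat run.pid)" 2>/dev/null; do sleep 10; done
```

A cancelled job would then leave Terraform running to completion on the VM, and a follow-up job could reconnect and read `run.log`.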

I’m getting orphaned resources daily because the SSH connection to the VM running Terraform is being killed.
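On enquiry (1): Terraform has no built-in "import everything matching a tag", but a wrapper script can approximate it by listing tagged resource IDs and importing them one by one into a fresh state. Everything in this sketch is an assumption to adapt: `list_tagged_ids` is a stub for a provider CLI call (e.g. the AWS resource-tagging API), and the `aws_instance.recovered` address must match a resource block in your configuration.

```shell
#!/bin/sh
# Hypothetical recovery loop: import every tagged resource into state.
TF="${TF:-terraform}"   # override TF for a dry run

list_tagged_ids() {
  # Stub -- replace with something like:
  #   aws resourcegroupstaggingapi get-resources \
  #     --tag-filters "Key=Project,Values=$1" \
  #     --query 'ResourceTagMappingList[].ResourceARN' --output text
  :
}

import_tagged() {
  for id in $(list_tagged_ids "$1"); do
    # Address on the left must match a resource in the configuration.
    "$TF" import "aws_instance.recovered[\"$id\"]" "$id" || return 1
  done
}
```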

Could Terraform update its state file in a more robust way? For example:

  1. A state update is first written to a temp file A.
  2. The current .tfstate file is moved to a temp file B.
  3. Temp file A (the new state) is moved to replace the .tfstate file.
  4. Temp file B is moved to replace the .tfstate.backup.

This way, an interruption at any of these points would still leave a means to recover, and hopefully automatically, without any intervention (which is not currently the case).
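The four steps above can be sketched in shell, assuming the standard `terraform.tfstate` / `terraform.tfstate.backup` names (`mv` within one filesystem is an atomic rename, so each step either happens completely or not at all):

```shell
#!/bin/sh
# Hypothetical rotation mirroring steps 1-4 above. This is the proposed
# behaviour, not what Terraform does today.
rotate_state() {
  tmp_a="$1"   # step 1: new state has already been written to temp file A

  # step 2: move the current state aside as temp file B
  if [ -f terraform.tfstate ]; then
    mv terraform.tfstate terraform.tfstate.B
  fi

  # step 3: temp file A replaces the live state (atomic rename)
  mv "$tmp_a" terraform.tfstate

  # step 4: temp file B replaces the backup
  if [ -f terraform.tfstate.B ]; then
    mv terraform.tfstate.B terraform.tfstate.backup
  fi
}
```

An interruption between any two steps leaves the old state, the new state, or both on disk under a predictable name, so recovery never depends on a half-written file.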

Could this approach reduce the likelihood of a corrupted tfstate file?
If not, could a similar improvement be made with a service?

The state handling could also be performed by a service, so that a killed SSH session would still leave things recoverable.