From your second comment it sounds like you are using local state files. That mode of operation is primarily for local development and so it doesn’t try to optimize for robustness in a hostile environment where Terraform can potentially be terminated without warning.
You might have better luck with a remote state backend, because for those the state snapshots are typically written atomically, so that they are either updated in their entirety or not updated at all. (Details do vary by backend though, because they are generally subject to the behavior of whatever data store they are using.)
Even with atomic state updates you’ll still have to contend with the fact that interrupting Terraform during a terraform apply, without giving it an opportunity to exit gracefully, creates a significant risk that Terraform has taken actions whose results are not yet committed to remote state.
If you send Terraform SIGINT then it will flush a snapshot of the current state to the backend, signal all open providers to cancel their current operations if possible, and then wait for all of the in-progress provider operations to complete before exiting.
Unfortunately this does then leave Terraform at the mercy of the providers themselves, which are in turn often limited by the capabilities of underlying APIs: most REST APIs have no first-class cancellation mechanism, and so just abruptly terminating a write request means that it’s undefined whether the write eventually completed or not. Because of this, Terraform providers typically take a conservative approach and don’t support cancellation at all, so that they can be sure to see the results of any pending requests and have them committed to a new state snapshot before Terraform finally exits.
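If your CI runner gives the job some warning before killing it, a small wrapper can pass that warning on to Terraform as SIGINT and then wait for it to finish. The following is only a sketch: it assumes the runner delivers SIGTERM to the wrapper process and allows enough time afterwards, and the function name `run_with_graceful_stop` is made up for illustration. It cannot help if the job is killed outright with SIGKILL.

```python
import signal
import subprocess
import sys


def run_with_graceful_stop(cmd):
    """Run cmd; if this process receives SIGTERM, send the child SIGINT and keep waiting."""
    child = subprocess.Popen(cmd)

    def forward(signum, frame):
        # Translate the runner's SIGTERM into the SIGINT described above,
        # but only while the child is still running.
        if child.poll() is None:
            child.send_signal(signal.SIGINT)

    previous = signal.signal(signal.SIGTERM, forward)
    try:
        # Blocks until the child exits on its own schedule, so any in-progress
        # provider operations get a chance to complete first.
        return child.wait()
    finally:
        signal.signal(signal.SIGTERM, previous)


# Hypothetical usage from a CI job script:
#   sys.exit(run_with_graceful_stop(["terraform", "apply", "-auto-approve"]))
```

The wrapper deliberately has no timeout of its own: how long to wait is decided by Terraform and the providers, for the reasons given above.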
The upshot of all of this is that running Terraform in an environment where it will be routinely terminated without an opportunity to gracefully exit is not really practical. Even if you address your problem with corrupted local state snapshots, there are other problems awaiting you downstream.
My first idea here would be to try to arrange for Terraform to be run in a different way that will allow it to terminate gracefully except in the rarest, most catastrophic situations (e.g. power failure, kernel panic). I’m not familiar enough with GitLab CI to give specific ideas here, but as you mentioned perhaps it might look something like running Terraform indirectly via a service that runs in the background and can outlive a specific CI job. For example, running Terraform remotely in Terraform Cloud and then politely polling the Terraform Cloud API for the success/fail outcome would make it Terraform Cloud’s responsibility to terminate Terraform gracefully if the job gets terminated early. If a cloud thing isn’t appropriate, you could potentially implement a similar service for running Terraform internally.
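To make the polling idea a bit more concrete, here is a rough sketch of what the CI job could do once a run has been started. It assumes the Terraform Cloud runs endpoint (`/api/v2/runs/<run id>`) and its `data.attributes.status` field, an API token with read access, and a workspace that applies without manual confirmation. The function names and the set of final states are my own choices for illustration, so check them against the current API documentation before relying on them.

```python
import json
import time
import urllib.request

# Run states after which no further work is expected on the run (assumed list).
FINAL_STATES = {
    "applied", "planned_and_finished", "errored",
    "discarded", "canceled", "force_canceled",
}


def fetch_run_status(run_id, token, base_url="https://app.terraform.io"):
    """Return the status string of one run from the runs API."""
    request = urllib.request.Request(
        f"{base_url}/api/v2/runs/{run_id}",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/vnd.api+json",
        },
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["data"]["attributes"]["status"]


def wait_for_run(get_status, interval=10.0, max_polls=360, sleep=time.sleep):
    """Poll get_status() until it reports a final state, then return that state."""
    for _ in range(max_polls):
        status = get_status()
        if status in FINAL_STATES:
            return status
        sleep(interval)
    raise TimeoutError("run did not reach a final state in time")


# Hypothetical usage (run ID and token come from wherever the run was created):
#   outcome = wait_for_run(lambda: fetch_run_status(run_id, token))
#   raise SystemExit(0 if outcome in ("applied", "planned_and_finished") else 1)
```

If the CI job is killed while this loop is sleeping, only the poller dies. The run itself carries on remotely and its result can be picked up by a later job.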
If running Terraform in this hostile environment is unavoidable, I think as you said you will probably need to circumvent Terraform’s usual state tracking and do something custom instead. A script that queries the remote APIs and runs a series of terraform import commands is one possibility.
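A minimal sketch of such a script might look like the following. The part that queries the remote APIs is specific to your provider, so it is represented here only by a hypothetical mapping from resource addresses to remote IDs. The rest relies on the standard `terraform import ADDRESS ID` form. It also assumes the state is rebuilt from empty each time, since importing to an address that is already tracked will fail.

```python
import shlex
import subprocess


def build_import_commands(discovered):
    """Turn {resource address: remote ID} into `terraform import` argument lists."""
    return [
        ["terraform", "import", address, remote_id]
        for address, remote_id in sorted(discovered.items())
    ]


def run_imports(commands, dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    for command in commands:
        # shlex.join quotes addresses such as aws_instance.web[0] for the shell.
        print(shlex.join(command))
        if not dry_run:
            subprocess.run(command, check=True)


# Hypothetical result of querying the remote API; the IDs are placeholders.
example = {"aws_instance.web": "i-0123456789abcdef0"}
run_imports(build_import_commands(example))  # dry run: only prints the command
```

Starting with a dry run makes it easy to review the generated commands before letting the script modify any state.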