Our timeout for the CI/CD Pipeline Runner on GitLab is 2 hours.
If the timeout is reached while “terraform apply” is still running, the runner kills the process and we are left with drift between the state file and the real environment.
Is there a way we can set up a graceful exit of “terraform apply” (maybe at 1 hour 45 minutes) so that it saves the state file and exits cleanly rather than waiting for the runner to time out?
I’m not familiar with GitLab’s execution environment and so I can only offer some general information about how Terraform itself behaves.
If Terraform receives the signal SIGINT then it will begin a graceful shutdown process which will:
- attempt to persist a snapshot of the current interim state to the configured storage
- ask all providers that are currently performing operations to stop as soon as they safely can, possibly returning an error
- try to persist an additional state snapshot promptly after each provider operation completes, to minimize the amount of loss if the process is then terminated more aggressively.
Once Terraform enters this “graceful abort” state, any further signals will be taken as a request to abort abruptly, without waiting for anything to complete.
With all of that said, perhaps you can find some way to introduce an additional, earlier timeout which signals Terraform as described above, early enough that Terraform will have time to complete its graceful abort before reaching the two-hour runner timeout.
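For example, if your runner image includes GNU coreutils, the timeout utility can send a chosen signal after a deadline. This is only a sketch; the 105-minute budget and the -auto-approve flag are illustrative assumptions for a 120-minute job timeout:

```yaml
script:
  # Send SIGINT after 105 minutes, leaving ~15 minutes of headroom for
  # Terraform's graceful abort; escalate to SIGKILL 5 minutes later if
  # the process is still alive.
  - timeout --signal=INT --kill-after=5m 105m terraform apply -auto-approve
```

Note that timeout exits with status 124 when the deadline was hit, so the job will still be reported as failed even if the state snapshot was saved cleanly.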
How about saving the tfstate file before terminating the terraform process?
If terraform is terminated by your CI/CD automation when it hits the timeout, you could copy the tfstate file to a temporary or permanent directory before terminating it.
Adding a shell command like “cp terraform.tfstate /YOUR_DIRECTORY” before killing the terraform process may help. However, drift may still occur because the terraform process gets interrupted.
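Note that this only applies when you use the local backend, where the state lives in terraform.tfstate in the working directory; remote backends keep no local state file to copy. A minimal sketch, with BACKUP_DIR standing in for wherever you want the copy to land:

```sh
# Only meaningful with the local backend; remote backends do not write
# terraform.tfstate into the working directory.
cp terraform.tfstate "${BACKUP_DIR:?}/terraform.tfstate.$(date +%s)"
```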
Thank you for your suggestion. Would SIGKILL work the same?
I am thinking of sending “kill -9 PID” at around the 100th minute, giving “terraform apply” 20 minutes to process the SIGKILL and save the state. My timeout is set to 120 minutes.
The “kill” signal is handled by the kernel rather than Terraform and so does not give Terraform any opportunity to perform any shutdown actions.
The “int” signal is short for “interrupt”, and that is the one that gives Terraform an opportunity to perform any needed actions to try to shut down early without data loss.
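In kill-command terms, with PID standing in for the terraform process ID:

```sh
kill -KILL "$PID"  # SIGKILL: handled by the kernel; Terraform never sees it
kill -INT  "$PID"  # SIGINT: runs Terraform's handler, starting graceful shutdown
```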
Thanks for confirming that SIGKILL will not work in this case.
“man 7 signal” gives me information about the different types of signals, and this is what I see:
SIGINT P1990 Term Interrupt from keyboard
If SIGINT is issued via keyboard manually, how can I programmatically issue this signal?
The kill command is a reasonable way to send arbitrary signals to a process, including the SIGINT signal.
That command would call the same Linux kernel function as your shell’s job control features would use to respond to you pressing Ctrl+C; Terraform won’t be able to tell the difference between these two as long as you make sure you’re sending the interrupt signal in particular.
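For example, assuming exactly one terraform process is running in the job (pgrep ships with the procps package on most Linux distributions):

```sh
# Find the terraform process by exact name and send it SIGINT,
# exactly as pressing Ctrl+C would.
kill -INT "$(pgrep -x terraform)"
```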
A process running as PID 1 inside a container is treated specially by Linux: it ignores any signal with the default action. As a result, the process will not terminate on SIGINT or SIGTERM unless it is coded to do so.
I’m not familiar with that specific situation – I typically run Terraform as a child process of a shell or other supervisor program – but based on the wording of that documentation I would expect this to work anyway.
The important part is the last statement, “unless it is coded to do so”. The statement isn’t clear about exactly what that means, but the previous part talks about the “default action” for a signal, which is a mechanism that allows the kernel to decide what to do with a signal when the recipient process hasn’t registered any code to run when that signal arrives.
Terraform does register some code to run when the signal arrives – the code that triggers the graceful shutdown behavior – and so I don’t think this statement applies to Terraform. The kernel should run Terraform’s handler for that signal regardless of which PID Terraform was assigned.
Thank you. I hope there is some sort of “trap” in the Terraform code to catch the signals.
As I was trying my script, I found that in a GitLab pipeline script we cannot run tasks in the background, e.g. with nohup or &.
So even if I have a script section in a job, I cannot run “terraform apply” in the background so that my script can progress to the next line with the “if” condition.
If I cannot do that, then “terraform apply” cannot be interrupted.
I was hoping there was some timeout parameter that we could pass to “terraform apply” as an optional argument.
Sample script:

script:
  - |
    end=6000
    # Running terraform in the background is the part that does not
    # seem to work in the GitLab pipeline script:
    terraform apply -auto-approve &
    pid=$!
    echo "terraform pid: $pid"
    # $SECONDS is a bash builtin counting seconds since the shell started.
    while kill -0 "$pid" 2>/dev/null; do
      if [ "$SECONDS" -gt "$end" ]; then
        echo "inside if: $SECONDS elapsed, limit $end, sending SIGINT"
        kill -INT "$pid"
        break
      fi
      sleep 10
    done
    wait "$pid"
I would typically expect this sort of interrupt behavior to be provided by the job system itself, rather than something you would need to implement as part of the job.
I’m not familiar with GitLab and so I can’t promise it has such a feature, but I would find it surprising if GitLab doesn’t offer something for this, because just immediately sending SIGKILL to a process without interrupting it first and giving it a chance to terminate itself would be far too strict for any software that has any kind of external state. I would suggest asking GitLab how you can configure their system to terminate jobs gracefully, as a first preference.
If you do find that GitLab’s execution environment is lacking such a feature, and you have no option of using any other execution platform, then my backup suggestion would be to run Terraform as a child process of a supervisor program that runs in the foreground.
The basic structure of such a program would be as follows (see the sketch at the end of this post):
1. Call the alarm system call to arrange for the kernel to send a signal after a given number of seconds.
2. Use the fork+exec system calls (or something equivalent) to launch Terraform as a child process.
3. Register a SIGALRM signal handler, which will run if the process runs long enough to get that alarm signal. That signal handler should send SIGINT to the Terraform child process, to tell it to start shutting down.
4. Use waitpid to block until the Terraform child process exits.
5. Use the exit code from Terraform as the exit code from the supervisor program so that the GitLab environment can still react to it.
As I mentioned above, I would expect any job execution system like GitLab’s to offer equivalent functionality itself anyway – this is the basic function of a job executor – but if GitLab is lacking this fundamental feature for some reason then it’s possible to implement the same thing yourself as a child process of theirs. You will probably need to do it in a language other than shell scripting, because shells are not really designed for this sort of thing.
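For concreteness, here is a minimal sketch of that structure in C. The 105-minute budget and the -auto-approve flag are assumptions for illustration, and error handling is mostly omitted:

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static pid_t tf_pid = -1;

/* Runs when the SIGALRM set up in main arrives: forward SIGINT to the
   terraform child so it begins its graceful shutdown. */
static void on_alarm(int sig) {
    (void)sig;
    if (tf_pid > 0) {
        kill(tf_pid, SIGINT);
    }
}

int main(void) {
    /* Register the SIGALRM handler before arming the timer, so the
       alarm can never fire while we are unprepared for it. */
    struct sigaction sa = {0};
    sa.sa_handler = on_alarm;
    sigaction(SIGALRM, &sa, NULL);

    /* Step 1: ask the kernel to deliver SIGALRM after 105 minutes. */
    alarm(105 * 60);

    /* Step 2: launch terraform as a child process. */
    tf_pid = fork();
    if (tf_pid < 0) {
        return 1; /* fork failed */
    }
    if (tf_pid == 0) {
        execlp("terraform", "terraform", "apply", "-auto-approve", (char *)NULL);
        _exit(127); /* exec failed */
    }

    /* Step 4: block until terraform exits. waitpid returns early with
       EINTR when the alarm fires, so retry until the child is gone. */
    int status;
    while (waitpid(tf_pid, &status, 0) < 0) {
        /* interrupted by SIGALRM – keep waiting */
    }

    /* Step 5: propagate terraform's exit code so GitLab can react. */
    return WIFEXITED(status) ? WEXITSTATUS(status) : 128 + WTERMSIG(status);
}
```

The same shape works in any language with access to signals and child processes; the essential pieces are the timer, forwarding SIGINT to the child, and propagating the child’s exit status.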