Our timeout for the CI/CD Pipeline Runner on GitLab is 2 hours.
If the timeout is reached while “terraform apply” is still running, the runner kills the process and we are left with drift between the state file and the real environment.
Is there a way we can set up a graceful exit of “terraform apply” (maybe at 1 hour 45 minutes) so that it saves the state file and exits cleanly rather than waiting for the runner to time out?
I’m not familiar with GitLab’s execution environment and so I can only offer some general information about how Terraform itself behaves.
If Terraform receives the signal SIGINT then it will begin a graceful shutdown process which will:
- attempt to persist a snapshot of the current interim state to the configured storage
- ask all providers that are currently performing operations to stop as soon as they safely can, possibly returning an error
- try to persist an additional state snapshot promptly after each provider operation completes, to minimize the amount of loss if the process is then terminated more aggressively.
Once Terraform enters this “graceful abort” state, any further signals will be taken as a request to abort abruptly, without waiting for anything to complete.
With all of that said, perhaps you can find some way to introduce an additional, earlier timeout which signals Terraform as described above, early enough that Terraform will have time to complete its graceful abort before reaching the two-hour runner timeout.
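For example, if your runner image includes GNU coreutils, the timeout utility can send a chosen signal after a deadline. This is only a sketch; the 105-minute budget and the -auto-approve flag are illustrative assumptions for a 120-minute job timeout:

```yaml
script:
  # Send SIGINT after 105 minutes, leaving ~15 minutes of headroom for
  # Terraform's graceful abort; escalate to SIGKILL 5 minutes later if
  # the process is still alive.
  - timeout --signal=INT --kill-after=5m 105m terraform apply -auto-approve
```

Note that timeout exits with status 124 when the deadline was hit, so the job will still be reported as failed even if the state snapshot was saved cleanly.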
How about saving the tfstate file before terminating the terraform process?
If terraform is terminated by your CI/CD automation when it hits the timeout, you could copy the tfstate file to a temporary or permanent directory before terminating it.
Adding a shell command like “cp terraform.tfstate /YOUR_DIRECTORY” before killing the terraform process may help. However, drift may still occur because the terraform process gets interrupted.
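Note that this only applies when you use the local backend, where the state lives in terraform.tfstate in the working directory; remote backends keep no local state file to copy. A minimal sketch, with BACKUP_DIR standing in for wherever you want the copy to land:

```sh
# Only meaningful with the local backend; remote backends do not write
# terraform.tfstate into the working directory.
cp terraform.tfstate "${BACKUP_DIR:?}/terraform.tfstate.$(date +%s)"
```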
Thank you for your suggestion. Would SIGKILL work the same?
I am thinking of sending “kill -9 PID” at around the 100th minute, giving “terraform apply” 20 minutes to process the SIGKILL and save the state. My timeout is set to 120 minutes.
The “kill” signal is handled by the kernel rather than Terraform and so does not give Terraform any opportunity to perform any shutdown actions.
The “int” signal is short for “interrupt”, and that is the one that gives Terraform an opportunity to perform any needed actions to try to shut down early without data loss.
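In kill-command terms, with PID standing in for the terraform process ID:

```sh
kill -KILL "$PID"  # SIGKILL: handled by the kernel; Terraform never sees it
kill -INT  "$PID"  # SIGINT: runs Terraform's handler, starting graceful shutdown
```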
Thanks for confirming that SIGKILL will not work in this case.
“man 7 signal” gives me information about the different types of signals, and this is what I see:
SIGINT P1990 Term Interrupt from keyboard
If SIGINT is issued via keyboard manually, how can I programmatically issue this signal?
The kill command is a reasonable way to send arbitrary signals to a process, including the SIGINT signal.
That command would call the same Linux kernel function as your shell’s job control features would use to respond to you pressing Ctrl+C; Terraform won’t be able to tell the difference between these two as long as you make sure you’re sending the interrupt signal in particular.
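For example, assuming exactly one terraform process is running in the job (pgrep ships with the procps package on most Linux distributions):

```sh
# Find the terraform process by exact name and send it SIGINT,
# exactly as pressing Ctrl+C would.
kill -INT "$(pgrep -x terraform)"
```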
A process running as PID 1 inside a container is treated specially by Linux: it ignores any signal with the default action. As a result, the process will not terminate on SIGINT or SIGTERM unless it is coded to do so.
I’m not familiar with that specific situation – I typically run Terraform as a child process of a shell or other supervisor program – but based on the wording of that documentation I would expect this to work anyway.
The important part is the last statement, “unless it is coded to do so”. The statement isn’t clear about exactly what that means, but the previous part talks about the “default action” for a signal, which is a mechanism that allows the kernel to decide what to do with a signal when the recipient process hasn’t registered any code to run when that signal arrives.
Terraform does register some code to run when the signal arrives – the code that triggers the graceful shutdown behavior – and so I don’t think this statement applies to Terraform. The kernel should run Terraform’s handler for that signal regardless of which PID Terraform was assigned.
Thank you. I hope there is some sort of “trap” in the Terraform code to catch the signals.
As I was trying my script, I found that in a GitLab pipeline script we cannot run tasks in the background, e.g. with nohup or &.
So even if I have a script section in a job, I cannot run “terraform apply” in the background so that my script can progress to the next line with the “if” condition.
If I cannot do that, then “terraform apply” cannot be interrupted.
I was hoping there was some timeout parameter that we could pass to “terraform apply” as an optional argument.
Sample script:

script:
  - |
    end=6000
    # Running terraform in the background is the part that does not
    # seem to work in the GitLab pipeline script:
    terraform apply -auto-approve &
    pid=$!
    echo "terraform pid: $pid"
    # $SECONDS is a bash builtin counting seconds since the shell started.
    while kill -0 "$pid" 2>/dev/null; do
      if [ "$SECONDS" -gt "$end" ]; then
        echo "inside if: $SECONDS elapsed, limit $end, sending SIGINT"
        kill -INT "$pid"
        break
      fi
      sleep 10
    done
    wait "$pid"
I would typically expect this sort of interrupt behavior to be provided by the job system itself, rather than something you would need to implement as part of the job.
I’m not familiar with GitLab and so I can’t promise it has such a feature, but I would find it surprising if GitLab doesn’t offer something for this, because just immediately sending SIGKILL to a process without interrupting it first and giving it a chance to terminate itself would be far too strict for any software that has any kind of external state. I would suggest asking GitLab how you can configure their system to terminate jobs gracefully, as a first preference.
If you do find that GitLab’s execution environment is lacking such a feature, and you have no option of using any other execution platform, then my backup suggestion would be to run Terraform as a child process of a supervisor program that runs in the foreground.
The basic structure of such a program would be as follows (see the sketch at the end of this post):
1. Call the alarm system call to arrange for the kernel to send a signal after a given number of seconds.
2. Use the fork+exec system calls (or something equivalent) to launch Terraform as a child process.
3. Register a SIGALRM signal handler, which will run if the process runs long enough to get that alarm signal. That signal handler should send SIGINT to the Terraform child process, to tell it to start shutting down.
4. Use waitpid to block until the Terraform child process exits.
5. Use the exit code from Terraform as the exit code from the supervisor program so that the GitLab environment can still react to it.
As I mentioned above, I would expect any job execution system like GitLab’s to offer equivalent functionality itself anyway – this is the basic function of a job executor – but if GitLab is lacking this fundamental feature for some reason then it’s possible to implement the same thing yourself as a child process of theirs. You will probably need to do it in a language other than shell scripting, because shells are not really designed for this sort of thing.
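For concreteness, here is a minimal sketch of that structure in C. The 105-minute budget and the -auto-approve flag are assumptions for illustration, and error handling is mostly omitted:

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static pid_t tf_pid = -1;

/* Runs when the SIGALRM set up in main arrives: forward SIGINT to the
   terraform child so it begins its graceful shutdown. */
static void on_alarm(int sig) {
    (void)sig;
    if (tf_pid > 0) {
        kill(tf_pid, SIGINT);
    }
}

int main(void) {
    /* Register the SIGALRM handler before arming the timer, so the
       alarm can never fire while we are unprepared for it. */
    struct sigaction sa = {0};
    sa.sa_handler = on_alarm;
    sigaction(SIGALRM, &sa, NULL);

    /* Step 1: ask the kernel to deliver SIGALRM after 105 minutes. */
    alarm(105 * 60);

    /* Step 2: launch terraform as a child process. */
    tf_pid = fork();
    if (tf_pid < 0) {
        return 1; /* fork failed */
    }
    if (tf_pid == 0) {
        execlp("terraform", "terraform", "apply", "-auto-approve", (char *)NULL);
        _exit(127); /* exec failed */
    }

    /* Step 4: block until terraform exits. waitpid returns early with
       EINTR when the alarm fires, so retry until the child is gone. */
    int status;
    while (waitpid(tf_pid, &status, 0) < 0) {
        /* interrupted by SIGALRM – keep waiting */
    }

    /* Step 5: propagate terraform's exit code so GitLab can react. */
    return WIFEXITED(status) ? WEXITSTATUS(status) : 128 + WTERMSIG(status);
}
```

The same shape works in any language with access to signals and child processes; the essential pieces are the timer, forwarding SIGINT to the child, and propagating the child’s exit status.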