Best practices for implementing SIGTERM/CTRL+C/cancel in Google provider for partial admission operations

Hi,

We’re looking for some guidance on best practices for handling SIGTERM/CTRL+C/cancel in the Google provider. We’re seeing a lot of 409s from the provider when it receives a SIGTERM. Below is the flow that reproduces our issue.

  1. User has no Terraform state.
  2. User creates a Terraform manifest for the resource google_compute_instance.
  3. User runs terraform apply.
  4. The Google Terraform provider first issues a request to insert this instance via GCloud instances.insert API.
  5. Google returns 200 OK and something called a GCloud Operation in the response. Note that this 200 OK only indicates partial admission.
  6. The provider then sets the id for the resource.
  7. The provider polls the GCloud Operation and waits until it finishes. Completion of the Operation indicates full admission.
  8. Suppose now a SIGTERM is issued before the Operation is finished. This cancels the context Terraform feeds to the provider.
  9. In the current implementation, the polling in step 7 is cancelled. Because we only know about partial admission, the provider cannot concretely determine that Terraform should manage this resource, so it sets the id to “” (see the sketch after this list).
  10. No Terraform state is written for this resource.
  11. Some time after Terraform was cancelled, the Operation completes and a Compute Instance is up and running.
  12. User runs terraform apply again.
  13. Terraform has an empty state, so it tries to create the resource google_compute_instance again.
  14. User gets a 409 because the resource already exists.
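For concreteness, here is a heavily simplified Go sketch of the flow above. This is not the actual provider code; the schema fields, the insert body, and the 10-second polling interval are illustrative assumptions only.

```go
package google

import (
	"context"
	"time"

	"github.com/hashicorp/terraform-plugin-sdk/v2/diag"
	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
	compute "google.golang.org/api/compute/v1"
)

// Simplified sketch of the create flow described above, not the real provider code.
func resourceComputeInstanceCreate(ctx context.Context, d *schema.ResourceData, meta interface{}) diag.Diagnostics {
	svc := meta.(*compute.Service)
	project := d.Get("project").(string)
	zone := d.Get("zone").(string)

	// Steps 4-5: instances.insert returns 200 OK plus an Operation.
	// That 200 only indicates partial admission.
	op, err := svc.Instances.Insert(project, zone, &compute.Instance{
		Name:        d.Get("name").(string),
		MachineType: d.Get("machine_type").(string),
	}).Context(ctx).Do()
	if err != nil {
		return diag.FromErr(err)
	}

	// Step 6: the resource id is set once the insert is admitted.
	d.SetId(d.Get("name").(string))

	// Step 7: poll the Operation until it reports DONE (full admission).
	for op.Status != "DONE" {
		select {
		case <-ctx.Done():
			// Steps 8-10: SIGTERM cancels the SDK-provided context, polling
			// stops, the id is cleared, and no state is written, even though
			// the Operation may still complete server-side (step 11).
			d.SetId("")
			return diag.FromErr(ctx.Err())
		case <-time.After(10 * time.Second):
		}

		op, err = svc.ZoneOperations.Get(project, zone, op.Name).Context(ctx).Do()
		if err != nil {
			return diag.FromErr(err)
		}
	}
	return nil // full admission reached; a real resource would now call Read
}
```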

A couple questions.

  1. In step 7, our polling loop uses the context provided by the Terraform Plugin SDK, which is why it is cancelled in step 9 when a SIGTERM is issued. Do you have any guidance on how we can ensure our provider eventually reaches full admission for our resources? Below are some thoughts.
    1. In step 7, should we instead use a context for our polling loop that is not a child of the context from the Terraform Plugin SDK? We would still bind this context to the timeouts configured for the resource.
    2. Is there some place we can save the GCloud Operation id so that we can resume on the next apply/plan? In PR 4501, we saved it to the schema, but this was deemed hacky.

The best way I can think of to handle this is:

  • When an operation is returned, persist it in the schema. Technically, there’s a private, non-state resource storage area you could put it in, but it’s not available in SDKv2 right now, so the schema is really the only place to hold on to information you’ll want later. You’re also going to need to set an ID for the resource, no matter what. You could, in theory, use the operation as the ID, though that may get messy.
  • When calling Read, if the operation is set, wait for the operation to complete before moving forward (see the sketch after this list).
  • When an operation completes, remove it from state.
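A rough Go sketch of the Read side of that pattern might look like the following. It assumes a top-level "operation" string field added to the resource schema purely to carry the pending Operation name, and it assumes Create leaves the resource ID set (rather than clearing it) when its context is cancelled, so that this partial state actually gets persisted. Names and the polling interval are illustrative, not the real provider code.

```go
package google

import (
	"context"
	"time"

	"github.com/hashicorp/terraform-plugin-sdk/v2/diag"
	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
	compute "google.golang.org/api/compute/v1"
)

// Sketch of the suggested pattern: if a previous apply was interrupted
// mid-operation, Read finishes waiting on the stored Operation before
// reading the instance, then clears it from state.
func resourceComputeInstanceRead(ctx context.Context, d *schema.ResourceData, meta interface{}) diag.Diagnostics {
	svc := meta.(*compute.Service)
	project := d.Get("project").(string)
	zone := d.Get("zone").(string)

	if opName := d.Get("operation").(string); opName != "" {
		for {
			op, err := svc.ZoneOperations.Get(project, zone, opName).Context(ctx).Do()
			if err != nil {
				return diag.FromErr(err)
			}
			if op.Status == "DONE" {
				break
			}
			select {
			case <-ctx.Done():
				return diag.FromErr(ctx.Err())
			case <-time.After(10 * time.Second):
			}
		}
		// Operation finished: remove it from state.
		if err := d.Set("operation", ""); err != nil {
			return diag.FromErr(err)
		}
	}

	// ... normal Read logic (instances.get, mapping fields into state) ...
	return nil
}
```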

Using a second context that’s not derived from the gRPC contexts is not advisable: Terraform may get impatient and just terminate the provider process.

As a general rule of thumb, however, I believe the recommendation is against running Terraform in environments where it regularly receives SIGTERM; it’s supposed to be a relatively rare event, not an everyday occurrence, as I understand it.

Thanks for the responses. I have a few follow-up questions.

  1. Any idea when this would be available? :slight_smile:
  2. How long does Terraform wait before forcefully killing the provider process? Is this something we can configure?
  3. Some of the resources have timeouts, and I don’t see why creating a new context that respects the user-defined timeouts is not advisable. Because the user sets these timeouts, they represent the user’s intentions better than the current behavior, which is to ignore the timeouts and cancel immediately. I think on SIGTERM the user would expect everything to wrap up, at the cost of waiting on the timeouts they set themselves via Terraform manifests, as opposed to the current behavior, which commonly produces 409 errors. If the user gets impatient, they can always issue another SIGTERM or just SIGKILL the process. WDYT? (See the sketch after this list.)
    • Ideally we could somehow tie the timeouts to the original context that Terraform cancels on SIGTERM.
  4. Not sure if you would know or can disclose this information, but how does Terraform Enterprise handle this kind of behavior when, say, upgrading to a newer release? Is this considered a rare event, and does it just issue SIGTERM/cancel workflow jobs?
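To make item 3 concrete, here’s a rough sketch of the kind of thing we have in mind. waitWithGrace and poll are hypothetical names, and this is just an illustration of the proposal, not our implementation.

```go
package google

import (
	"context"
	"errors"

	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

// waitWithGrace illustrates the proposal in item 3: if the SDK-provided
// context is cancelled (e.g. on SIGTERM), finish the wait under a detached
// context that is bounded only by the user-configured create timeout.
func waitWithGrace(ctx context.Context, d *schema.ResourceData, poll func(context.Context) error) error {
	err := poll(ctx)
	if err == nil || !errors.Is(err, context.Canceled) {
		return err
	}
	// The SDK context was cancelled: retry the wait under a context limited
	// by the timeout the user set in their manifest.
	graceCtx, cancel := context.WithTimeout(context.Background(), d.Timeout(schema.TimeoutCreate))
	defer cancel()
	return poll(graceCtx)
}
```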

It is unclear at this time if it will be added to SDKv2, and if so, when. It’s more likely that it would be added to the framework, but it’s also not on that roadmap, so I can’t promise it will happen on a certain timeline, or at all.

You should be prepared for the provider process to be killed at any time, as the first context cancellation will come from Terraform when a user sends SIGINT, and if the user sends SIGINT again Terraform will immediately terminate the process, as described in the message Terraform shows after the first SIGINT. I don’t know that Terraform has implemented a hard time limit on provider processing after the first SIGINT today, but I also don’t know that I would want to rely on it never doing so in the future, as I don’t believe the core team has ever stated that providers should be given unlimited time to shut down. As a general rule of thumb, Terraform started the process and owns its lifecycle, so if I were working on Terraform core (I am not) I would, at least, feel it was fair for me to force a process to end after a certain amount of “clean up” time. If you’d like to get a promise for unlimited clean up time from Terraform core, an issue is probably the right place to start.

The timeouts are not intended to describe a grace period after SIGINT is received during which a resource is allowed to continue operating; they are meant to describe the maximum amount of time a resource will be allowed to operate. There’s some confusion over whether that time should start before or after any concurrency limitations, and why, so that’s left for provider developers to determine, but the intent is that a timeout represents an outer bound. So I’m not sure they relate to a situation in which a SIGINT is sent.

I’m not sure what you mean by “newer release”, but I’m also not sure it’s wise for me to speculate, as I don’t work on those projects or have any details on them, and don’t want to misinform or mislead.

Apologies, it’s occurring to me that I confused SIGINT and SIGTERM and have muddied the situation horribly. Let me try to unwind it.

SIGINT (Ctrl-C) has special handling that will tell Terraform to interrupt what it’s doing. Terraform will cancel the context, and tell the user to send SIGINT again if they would like to shut down immediately. So you should be prepared for the process to be killed at any moment, because the end user can get impatient and send SIGINT again, and Terraform will dutifully kill the process.

SIGTERM has special handling that will tell Terraform to wind down. I don’t know (I haven’t tried it) but it may or may not print out the “send it again to halt immediately” message. Terraform will cancel the context, and (if I’m reading this code right) forward the SIGTERM on to the provider.

I’m not seeing any code that would terminate the provider after a deadline following SIGTERM, but I would still not assume it can’t happen, as that seems (to me, at least) to be in line with the purpose of SIGTERM. I’d still advocate for an issue if that’s behavior you’d like to rely on.

Apologies again for the confusion there.