We are looking into building an automated pipeline rollback-on-failure process for when an apply fails (AWS provider) for any reason. My plan was as follows:
1. The tf init job does its usual provider/module pulls, plus an additional step that collects the current state file and caches it as the previous state.
2. Plan runs as normal.
3. Apply runs.
4. If apply fails, get the cached state and put the remote resources back to what's in the previously cached state file.
I don't have an issue with the state file collection/caching, but I have a couple of issues with the approach:
I can't seem to tell Terraform to just apply what's in the state file.
Terraform seems to require the local .tf files, which in this case potentially contain the problem, so it needs to ignore these files and just apply what's in the state file.
It just needs to make AWS look like what's in the provided state file: no plan, no Terraform config files.
The other thought we had was that Terraform could just undo the planned/computed changes that were applied successfully, but I think this is probably more of a Terraform roadmap item than something a user would implement.
In general “rolling back” isn’t a well-defined operation for the kinds of objects Terraform is typically managing, because changes made during apply are often destructive. For example, if Terraform proposes to destroy an object and create a new replacement, there isn’t any way to roll that back; the closest approximation would be to again destroy the object and create a new one that is configured similarly to the original, but it would still not actually be the original object.
Instead when using Terraform it’s more common to roll forward by correcting whatever made the apply fail and then planning and applying again. One possible way to “correct” your configuration would be to change it back to how it was before the problem, but that’s not always sufficient and so human intervention is typically required.
Therefore I would recommend that you don’t try to implement any sort of automated actions on error. Instead, the usual approach is to have a human participate in the change process, notice any errors, and then respond based on their judgement and possibly based on run books describing solutions to common problems.
Terraform is not designed for completely unattended usage. Instead, it works best in partnership with a human operator who can review what’s being proposed, approve it if it looks correct, and react by making further changes if anything goes wrong.
Thanks for your detailed response. I think, though, that there are quite a few scenarios where it's not destructive, for example network/security misconfigurations, where user intervention would not be needed. An automated rollback would drastically reduce potential service outages because it doesn't wait on user intervention, and a post-rollback testing job within the pipeline can double-check that the service is functional again and report accordingly. The aim is to have a trustworthy test package for deployment handling, so this test shouldn't be any different.
Even in the case of a destructive loss, for example an EBS volume being replaced (you would have a loss whether you had a rollback option or not), user intervention would still be easier if the service/infrastructure was back as it was, rather than in a halfway-house state with the user possibly deviating from the TF state further. Running a rollback under these circumstances would not necessarily make anything worse; generally it would be better than not rolling back.
I get that there are scenarios that I am not aware of, but potentially those resources could be controlled with a lifecycle flag to disable any rollback if it was an issue.
If you do wish to implement a system that attempts rollback with Terraform then this would be something you’d implement outside of Terraform using Terraform only as the core execution engine, with your automation driving it.
You cannot use the state as configuration because the state is not complete enough to act as configuration – Terraform uses it only as a supplement to configuration to track information that the configuration cannot. But you could instead attempt to switch back to the previously-applied configuration and apply that.
Here’s a high-level flow that might work and be a reasonable compromise:
1. Try applying the current set of changes. If it succeeds, then save the current commit id as the most recent success and terminate here.
2. Look up the commit id of the most recent successful apply. If there is no known id, terminate with an error.
3. Switch back to the discovered commit id and create a plan using terraform plan -out=tfplan. If this fails, terminate with an error.
4. Use terraform show -json tfplan to retrieve a machine-readable plan description. Use a program you have written yourself to analyze the plan and decide if it seems like a safe rollback. It is up to you to decide what “safe” means. If you decide unsafe, terminate with an error.
5. Try to apply the saved plan with terraform apply tfplan. It will either succeed or fail, and either way this process concludes here.
A process similar to the above should achieve a “best effort” rollback mechanism which will roll back if possible and raise an error inviting human intervention if not.
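As a rough illustration of that flow, here is a minimal Python sketch of a driver script. It is not an official pattern: it assumes the configuration lives in a Git repository, that the script runs in the working directory of the root module, and that the caller supplies the last known good commit and a `plan_is_safe` callback (both hypothetical names). The `run` parameter exists only so the command runner can be swapped out for testing.

```python
import subprocess
from typing import Callable, Optional, Sequence

# A runner takes an argv list and returns the process exit code.
Runner = Callable[[Sequence[str]], int]


def run_command(argv: Sequence[str]) -> int:
    """Run a command, returning its exit code without raising on failure."""
    return subprocess.run(list(argv)).returncode


def apply_with_rollback(
    last_good_commit: Optional[str],
    plan_is_safe: Callable[[], bool],
    run: Runner = run_command,
) -> str:
    """Best-effort rollback: returns a short outcome string."""
    # Step 1: try the current changes. On success the caller should
    # record the current commit as the new "last good" commit.
    if run(["terraform", "apply", "-auto-approve"]) == 0:
        return "applied"

    # Step 2: without a known good commit there is nothing to roll back to.
    if last_good_commit is None:
        return "error: no known good commit"

    # Step 3: switch to the old configuration (assumes Git) and plan it.
    if run(["git", "checkout", "--detach", last_good_commit]) != 0:
        return "error: checkout failed"
    if run(["terraform", "plan", "-out=tfplan"]) != 0:
        return "error: rollback plan failed"

    # Step 4: let caller-defined logic decide whether the plan is acceptable.
    if not plan_is_safe():
        return "error: rollback plan judged unsafe"

    # Step 5: apply exactly the plan that was inspected.
    if run(["terraform", "apply", "tfplan"]) == 0:
        return "rolled back"
    return "error: rollback apply failed"
```

Every "error" outcome is a point where the pipeline should stop and ask for human intervention. Depending on what changed between commits, a `terraform init` may also be needed after the checkout; that is left out here for brevity.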
You should be careful how you define whether a particular plan is a “safe rollback” to minimize the risk of the rollback attempt making things worse rather than better. What is “safe” will depend on the characteristics of your system.
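To make that concrete, here is one possible sketch of a plan check in Python. It reads the JSON produced by terraform show -json tfplan and flags any resource change whose actions fall outside an allow-list. The allow-list itself (`SAFE_ACTIONS`) is purely an assumption for illustration: it treats in-place updates as safe and anything that creates, deletes, or replaces an object as unsafe. Your own definition may need to be stricter or looser.

```python
import json
from typing import List

# Assumption for this sketch: only these action lists are acceptable
# in a rollback plan. Creates, deletes, and replacements are flagged.
SAFE_ACTIONS = (["no-op"], ["read"], ["update"])


def unsafe_changes(plan_json: str) -> List[str]:
    """Return a description of every resource change considered unsafe."""
    plan = json.loads(plan_json)
    flagged = []
    # "resource_changes" may be absent if the plan contains no resources.
    for rc in plan.get("resource_changes", []):
        actions = rc["change"]["actions"]
        if actions not in SAFE_ACTIONS:
            flagged.append(f'{rc["address"]}: {"+".join(actions)}')
    return flagged
```

A pipeline could feed it the output of terraform show -json tfplan and refuse to continue if the returned list is non-empty, printing the flagged addresses so the operator can see why the rollback was declined.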
From Terraform’s perspective this is a “roll forward” to a new configuration that just happens to match the old configuration. This can work in many cases but there are some situations where it will not work. For example, if the latest commit added an entirely new provider and a resource for that provider, switching back to the old configuration will remove both the resource and the provider configuration, and so Terraform won’t have the provider settings needed to destroy the remote object. There are other similar situations where rolling back will remove information Terraform would need to plan or apply. You should expect the rollback to fail sometimes when making changes like this.
We were hoping to avoid the commit tracking, but it is good to know about the state file and its limits as configuration. Potentially the last known good commit id could be something that Terraform stores in the state and only updates on a fully successful apply.
In the meantime though, I will look into the tracking option.
Terraform itself has no notion of “commits”; it just uses whatever files are in the local directory where you run it. So without a very significant change to Terraform’s execution model, somehow making it aware of version control systems, I don’t expect Terraform alone would be able to support this sort of rollback situation.
One possible way to hack it would be to include an input variable where the caller provides a commit ID and then you echo that ID back into an output value which is written to depend on everything that might potentially fail:
variable "commit_id" {
  type = string
}

output "commit_id" {
  value = var.commit_id

  depends_on = [
    # List every resource and module call in
    # the root module here.
  ]
}
The depends_on is important here so that Terraform will delay updating the output value in the state until everything else has completed.
With this in place, you can expect that terraform output -raw commit_id will return the most recent successfully-applied commit ID from the state, since Terraform would not get the opportunity to update the commit_id output value if something that it depends on fails to apply.
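For what it's worth, the automation side of this might look something like the Python sketch below. The helper names are hypothetical, and it assumes that a non-zero exit from terraform output means no commit ID has been recorded yet (for example, on the very first run).

```python
import subprocess
from typing import Callable, List, Optional


def apply_argv(commit_id: str) -> List[str]:
    """Build the apply command that passes the commit ID as an input variable."""
    return ["terraform", "apply", "-auto-approve", f"-var=commit_id={commit_id}"]


def last_good_commit(run: Callable = subprocess.run) -> Optional[str]:
    """Read the commit ID recorded by the last fully successful apply."""
    result = run(
        ["terraform", "output", "-raw", "commit_id"],
        capture_output=True,
        text=True,
    )
    # Assumption: a failure here means the output does not exist yet.
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None
```

The pipeline would call `last_good_commit()` before attempting the new apply, keep the result aside, and then run the command from `apply_argv` with the commit it is about to deploy.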
Personally I think it would be easier to implement this commit ID tracking outside of Terraform using a separate system, but with a hack like the above you can lightly abuse root module output values as a sort of “memory” between runs, as long as you carefully constrain when Terraform is allowed to update them to be ordered correctly relative to your configuration’s other side-effects.