How to completely reverse and destroy ALL resources if a local-exec or remote-exec provisioner fails?

The documentation for “Provisioners” and for “Taint (deprecated)” states, respectively, “Terraform does this [taints a resource] because a failed provisioner can leave a resource in a semi-configured state” and “… other users could create a new plan against your tainted object before you can review the effects”.

Both of these are happening to us, exactly as described, and I’m looking for guidance on how to address them.

We spin up multiple VMs (nodes) for HA Kubernetes from scratch and use kubeadm and other basic, native Kubernetes tools, driven by Terraform’s local-exec and remote-exec provisioners, to set up the cluster and worker nodes along with separate etcd and API servers. The number of VMs is arbitrary, based on the user’s needs and the intensity of the workload.
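Simplified, the pattern for each worker node is roughly this (the provider, resource type, and variable names below are illustrative placeholders, not our real configuration):

```hcl
# Illustrative placeholder resource; our real compute resource differs.
resource "openstack_compute_instance_v2" "worker" {
  count       = var.worker_count
  name        = "k8s-worker-${count.index}"
  image_name  = var.image_name
  flavor_name = var.flavor_name

  connection {
    type = "ssh"
    user = var.ssh_user
    host = self.access_ip_v4
  }

  # If this command fails on any node, Terraform taints that instance
  # and the apply ends with an error.
  provisioner "remote-exec" {
    inline = [
      "sudo kubeadm join ${var.api_endpoint} --token ${var.join_token} --discovery-token-ca-cert-hash ${var.ca_cert_hash}",
    ]
  }
}
```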

The problem is that sometimes, for whatever reason (a network problem, or the number of VMs requested can’t be met at that particular time), one or more nodes (worker nodes, for example) fail to be integrated into the cluster. Terraform rightfully taints those resources to be recreated on the next apply. But the next apply never happens because, one, this is all automated, and, two, the end user just interacts with the working cluster. They have zero visibility to inspect resources.

Behind the scenes, one or more nodes are orphaned. Technically the infrastructure was created, but it’s just sitting there doing nothing except costing the user money unnecessarily. We only find out about it when the end user reports that the infrastructure can’t meet the expected needs of the workloads.

In our opinion, this provisioning has to be part of the infrastructure, not configuration (and can’t be addressed by Ansible or other tools for the same reasons).

Tainting a resource for recreation, per the Terraform docs, isn’t enough. How can I structure my automation code to COMPLETELY destroy every single resource that was created by terraform apply up to that point?

Is there a way to do this within Terraform itself, like a full rollback, or would I have to create external scripts? If the latter, what output or exit status from the terraform command can I parse to decide whether to automatically run terraform destroy? Ultimately, I need to detect that one or more of the provisioner exec resources failed, destroy the entire rollout, and notify the admins.

Thanks!

Hi @bluepresleycom,

My first thought when reflecting on your description of the requirements is to rely on the Terraform CLI returning a nonzero exit code when it encounters an error during apply, with pseudocode something like this (a Bash sketch follows the list):

  • Run terraform apply. If the exit code is zero, terminate successfully.
  • If we get here then there was an error during apply, so run terraform destroy to destroy everything that was created. If the exit code is zero, terminate as unsuccessful but cleaned up.
  • If we get here then destroy failed too. Generate a notification of some kind to alert an operator of the problem, and then terminate as unsuccessful and unclean.
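A minimal Bash sketch of that wrapper (notify-admins is a placeholder for whatever alerting mechanism you have):

```bash
#!/usr/bin/env bash
set -u

# Apply non-interactively; any provisioner failure makes this exit nonzero.
if terraform apply -auto-approve; then
  exit 0  # success
fi

echo "terraform apply failed; attempting full rollback" >&2

# Destroy everything recorded in state, including tainted resources.
if terraform destroy -auto-approve; then
  exit 1  # unsuccessful, but cleaned up
fi

# Placeholder: replace notify-admins with your real alerting mechanism.
notify-admins "terraform destroy failed after a failed apply; manual cleanup required"
exit 2  # unsuccessful and unclean
```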

However, you also mentioned using Terraform provisioner blocks to do some of this work, instead of using normal Terraform resources. Terraform cannot know what actions a provisioner is taking and so it cannot automatically undo those actions. terraform destroy can potentially run destroy-time provisioners during the destroy phase, but at that point you’re not really using Terraform as anything other than a shell script with less convenient syntax.
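For reference, a destroy-time provisioner is an ordinary provisioner block with when = destroy. Note that destroy-time provisioners (and their connection) can only refer to the resource through self, which is why this sketch copies the host into triggers; the names and commands here are illustrative:

```hcl
resource "null_resource" "worker_join" {
  triggers = {
    # Copied here so the destroy-time provisioner can reach it via self.
    host = var.worker_ip
  }

  connection {
    type = "ssh"
    user = "ubuntu"
    host = self.triggers.host
  }

  # Create-time: join the node to the cluster.
  provisioner "remote-exec" {
    inline = [
      "sudo kubeadm join ${var.api_endpoint} --token ${var.join_token} --discovery-token-ca-cert-hash ${var.ca_cert_hash}",
    ]
  }

  # Destroy-time: best-effort undo of the join, run by terraform destroy.
  provisioner "remote-exec" {
    when   = destroy
    inline = ["sudo kubeadm reset --force"]
  }
}
```

It might still be worth it, but I would also evaluate each of the following alternatives: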

  • Don’t use Terraform for the Kubernetes setup parts of this process at all. Instead, just script the Kubernetes-specific tools directly, running in a shell script the same commands you would’ve run through provisioners (the same pattern as the wrapper sketch above). If any of the commands fail, run a second “destroy everything” script that matches what the destroy provisioners would have done. This would be functionally equivalent to, and considerably simpler than, a Terraform configuration that does all of this same work using provisioners.
  • Use the hashicorp/kubernetes or hashicorp/helm provider to configure Kubernetes declaratively using resource blocks. These providers implement both “create” and “destroy” actions for each resource type, and so terraform destroy can automatically destroy whatever exists and ignore what doesn’t exist, without you having to manually script the create and destroy actions separately. A small example follows this list.
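To illustrate the second option: with the hashicorp/kubernetes provider a Kubernetes object is just another managed resource (the auth details here are illustrative):

```hcl
provider "kubernetes" {
  config_path = "~/.kube/config" # illustrative; use whatever auth fits your pipeline
}

# An ordinary managed resource: terraform destroy removes it if it exists
# in state, and there is nothing to script by hand.
resource "kubernetes_namespace" "workloads" {
  metadata {
    name = "workloads"
  }
}
```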

In all of this I’m assuming you only really use Terraform for creating and destroying the whole setup at once, and don’t use Terraform for ongoing operations once the cluster has been initially configured. Things get more interesting if you also need to do in-place updates sometimes, because in that case it would presumably not be appropriate to destroy everything just because one update action failed.