The documentation pages for “Provisioners” and “Taint (deprecated)” state, respectively, “Terraform does this [taints a resource] because a failed provisioner can leave a resource in a semi-configured state” and “… other users could create a new plan against your tainted object before you can review the effects”.
Both of these are happening to us, exactly as described, and I’m looking for guidance on how to address them.
We spin up multiple VMs (nodes) for HA Kubernetes from scratch and use kubeadm and other basic, native Kubernetes tools with Terraform’s local-exec and remote-exec provisioners to set up the cluster and worker nodes, along with separate etcd and API servers. The number of VMs is arbitrary, based on the user’s needs and the intensity of the workload.
The problem is that sometimes, for whatever reason (a network problem, or the number of VMs requested can’t be met at that particular time), one or more nodes (worker nodes, for example) fail to be integrated into the cluster. Terraform rightfully taints the resources so they are recreated on the next apply. But the next apply never happens because, one, this is all automated, and two, the end user only interacts with the working cluster. They have zero visibility into the underlying resources.
Behind the scenes, one or more nodes are orphaned. Technically the infrastructure was created, but it’s just sitting there doing nothing except costing the user money unnecessarily. We only find out about it when the end user reports that the infrastructure can’t meet the expected needs of the workloads.
In our opinion, this provisioning has to be part of the infrastructure, not configuration (and can’t be addressed by Ansible or other tools for the same reasons).
Tainting a resource for recreation, per the Terraform docs, isn’t enough. How can I structure my automation code to COMPLETELY destroy every single resource that was created by terraform apply up to that point?
Is there a way to do this within Terraform itself, like a full rollback, or would I have to create external scripts? If the latter, what output from the terraform command should I parse to decide whether to automatically run terraform destroy? Ultimately, I need to detect that one or more of the provisioner exec resources failed, destroy the entire rollout, and notify the admins.
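To make the external-script option concrete, here is a rough sketch of the kind of wrapper I have in mind. It is not a tested solution. It assumes that a failed provisioner makes terraform apply exit with a non-zero status (which, as far as I understand, is the default unless on_failure = continue is set), and it relies only on that exit status rather than parsing any output. The names apply_or_rollback, run_command, and notify_admins are hypothetical, and notify_admins is just a placeholder for whatever alerting we would really use.

```python
import subprocess
from typing import Callable, Sequence


def run_command(cmd: Sequence[str]) -> int:
    """Run a command and return its exit status without raising on failure."""
    return subprocess.run(list(cmd), check=False).returncode


def notify_admins(message: str) -> None:
    """Hypothetical placeholder: swap in email, chat, or paging as needed."""
    print(f"[ALERT] {message}")


def apply_or_rollback(
    workdir: str,
    run: Callable[[Sequence[str]], int] = run_command,
    notify: Callable[[str], None] = notify_admins,
) -> bool:
    """Apply the configuration; on any failure, destroy everything and alert.

    Returns True if the apply succeeded, False if a rollback was attempted.
    """
    base = ["terraform", f"-chdir={workdir}"]

    # Assumption: a non-zero exit status means the apply failed, including
    # the case where a local-exec or remote-exec provisioner failed.
    if run(base + ["apply", "-auto-approve", "-input=false"]) == 0:
        return True

    notify(f"terraform apply failed in {workdir}; destroying the rollout")

    # Tear down everything recorded in state, not just the tainted resources.
    if run(base + ["destroy", "-auto-approve", "-input=false"]) != 0:
        notify(f"terraform destroy also failed in {workdir}; manual cleanup needed")

    return False
```

The automation would call something like apply_or_rollback("/path/to/rollout") in place of a bare terraform apply. The run and notify parameters exist only so the logic can be exercised without touching real infrastructure.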
Thanks!