I would like to confirm/correct my understanding about TF execution.
Suppose I need to manage a compute instance in a public cloud.
I have the machine definition in my config file (.tf). I apply it and the machine eventually gets created. While this happens, the resulting state is recorded in the state file.
After the machine is created, I go to the web console and manually delete the machine.
Now the state file still records the machine as existing, and the config file still has the configuration spec of the machine.
If I apply the config file, I see that the machine is created again.
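To make the scenario concrete, here is a minimal sketch of the kind of configuration involved. The `aws_instance` resource and its argument values are illustrative assumptions only; the same behaviour applies to any provider:

```hcl
# Illustrative sketch -- the AMI ID and instance type are hypothetical.
resource "aws_instance" "example" {
  ami           = "ami-0abcdef1234567890"
  instance_type = "t3.micro"
}
```

After `terraform apply`, the created instance's ID is recorded in the state file. If the instance is then deleted in the web console, a subsequent `terraform plan` refreshes the state, finds that the recorded instance no longer exists, and plans to create it again.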
So it seems that during plan, what is specified in the config file is always compared with the real existing infrastructure, and the plan is made from that comparison. After apply, the created infrastructure is recorded in the state file.
Then what is the use of the state file here? It seems it is not used in the plan stage.
When working with Terraform, consider there are actually 3 states:
The ‘desired state’: what the Terraform module/configuration you are applying declares should exist.
The ‘current state’: the state of the resources as they actually exist in the cloud provider.
The ‘last known’ state: what is recorded in the Terraform state file.
Yes, in principle a plan could be generated as a ‘diff’ between the current state and the desired state alone, but such a plan may not be adequate:
Terraform is able to use the stored state to indicate whether anything has changed outside of Terraform, i.e. whether the current state differs from the last stored state. This is good for tracking configuration drift and detecting manual configuration changes, which are typically not desired when managing resources with Terraform.
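For example, drift can be surfaced explicitly with a refresh-only plan (available since Terraform v0.15.4), which compares the real infrastructure against the stored state without proposing any changes toward the configuration:

```sh
# Report changes made outside Terraform, without planning any actions
terraform plan -refresh-only

# Optionally accept the detected drift into the state file
terraform apply -refresh-only
```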
Additionally, the state file records dependencies between resources, which is needed when deleting a resource from the configuration.
Without the stored state, the plan to make the ‘current state’ match the ‘desired state’ would be to remove the resources. But in what order? Terraform does not understand the dependencies of the resources for each provider.
As an example, suppose you remove a subnet from your VNET configuration, along with a last, now-redundant VM. Without the dependency information from the stored state (it is no longer defined anywhere else, since the resources have been removed from the Terraform config), Terraform may instruct the cloud platform to destroy the resources in an order that causes the platform API to throw an error, for instance removing the subnet before removing the connections to it from the VM NIC. Dependency management is even more complex when using multiple providers.
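A sketch of that dependency, using the azurerm provider (resource names and argument values here are illustrative assumptions, not a complete working configuration):

```hcl
# Subnet inside an existing virtual network (names are hypothetical)
resource "azurerm_subnet" "app" {
  name                 = "app-subnet"
  resource_group_name  = "rg-example"
  virtual_network_name = "vnet-example"
  address_prefixes     = ["10.0.1.0/24"]
}

# NIC attached to the subnet -- the subnet_id reference is the dependency edge
resource "azurerm_network_interface" "vm_nic" {
  name                = "vm-nic"
  location            = "westeurope"
  resource_group_name = "rg-example"

  ip_configuration {
    name                          = "primary"
    subnet_id                     = azurerm_subnet.app.id
    private_ip_address_allocation = "Dynamic"
  }
}
```

While these blocks exist in the configuration, Terraform can derive the destroy order from the `subnet_id` reference. Once both blocks are deleted from the configuration, only the dependency graph recorded in the state file tells Terraform to destroy the NIC before the subnet.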
Another benefit, as detailed in the docs, is that when managing a large infrastructure you can choose to use only the stored state as the source for the ‘as-is’ and plan directly against that. This can be more performant than having the Terraform provider query every managed resource (due to the cloud platforms’ API limitations). But in that case you should be very sure that unexpected changes to the deployed resources cannot be made outside of the Terraform workflow.
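In practice this is the `-refresh=false` flag, which skips querying the provider for the current state and plans purely against the last stored state:

```sh
# Plan against the stored state only; no provider API reads of current state
terraform plan -refresh=false
```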
There’s probably a bunch more but these are the ones that come to mind.
I have worked extensively with ARM templates and Bicep in Azure for IaC (neither of which has a ‘stored state’) and have run into multiple issues where having a ‘stored state’ would have helped.