Terraform Plan Auto Apply Risks

We need to provision multiple sets of predefined AWS infrastructure components. The set of components is fixed and defined by a database of properties, e.g. the properties may include a Swagger file for an API Gateway API, some URL patterns for a CloudFront distribution, etc.

There is only a limited and defined variation among these components and their properties. The properties may change over the lifetime of the components.

We were weighing the option of using Terraform to provision and maintain these resources, vs. writing scripts ourselves using the AWS SDK and APIs.

Terraform with the AWS provider makes things super easy compared to manually writing all the provisioning and modification code. However, the team is concerned about the degree of determinism in the process, especially since we want to auto-apply the changes without human review or interaction.

Let’s say my database of properties changes and I want to make changes to a few resource sets, and add a whole new resource set.

If I can guarantee that only this Terraform configuration is changing my AWS infrastructure (no one going to the console to change anything manually; strict infrastructure as code), will the plan always work in the same way? Of course it’s a computer program, so it’s ultimately deterministic and will always “work in the same way”, but what I want to ask those who know the internal workings of Terraform (and the generated CloudFormation) better is: what are the risks of using this approach to manage our infrastructure? What risk do we avoid by taking the more painstaking SDK/API-based change management approach instead of Terraform? Does Terraform have complex, conditional optimizations built in which can result in different upgrade and change paths for different change types, making human review of the plan always necessary to ensure the adopted change path is not risky?

Putting the same question differently: do you see any risk in using a database to auto-generate .tfvars files for fixed Terraform provisioning modules, and in assuming that repeated re-application of different variations of those tfvars, plus adding/removing whole modules, will always result in the same AWS infrastructure?
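For concreteness, here’s roughly the shape of what I mean (all names and values here are illustrative, not our real setup):

```hcl
# rendered from the properties database, one file per resource set
# (e.g. orders.auto.tfvars -- illustrative values only)
api_name            = "orders"
swagger_body_path   = "generated/orders-swagger.json"
cloudfront_patterns = ["/orders/*", "/orders-static/*"]
```

consumed by a fixed module that is identical for every resource set:

```hcl
variable "api_name" {
  type = string
}

variable "swagger_body_path" {
  type = string
}

variable "cloudfront_patterns" {
  type = list(string) # CloudFront distribution omitted for brevity
}

resource "aws_api_gateway_rest_api" "this" {
  name = var.api_name
  body = file(var.swagger_body_path)
}
```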

Thanks
Asif

Hi @raqsebismil,

It sounds like you’ve already identified the main source of risk with an unattended apply: Terraform makes its plan based both on what’s written in the current configuration and on the state of the existing objects that configuration is managing. Therefore it’s possible that differences in the state of existing objects can cause the same module to behave differently when applied in a different configuration/workspace.

A second, related source of risk is that if you have a sequence of configuration changes A, B, C that you apply separately on one workspace, but another workspace that you apply less often so that A+B+C is all applied at once, the result may not necessarily be the same if changes A or B had side-effects that are not directly visible to Terraform. This particular variant is uncommon, but it is possible, because not all remote API behaviors can be 100% encapsulated in Terraform’s abstraction. As a straightforward (though rather contrived) example, consider a sequence of changes A, B where A removes resource "aws_instance" "foo" and B re-introduces it with the same configuration: an EC2 instance is a stateful object, so destroying one and then creating a replacement (applying A, then B) may have a visibly different result than leaving it untouched (applying A and B together), even though the final Terraform configuration is unchanged. How significant this is would depend on what software is running in the EC2 instance.
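To make that concrete, a minimal sketch (the AMI ID is a placeholder):

```hcl
# This block exists before change A and again after change B. Applying
# A (which deletes it) and then B (which re-adds it) destroys the EC2
# instance and creates a fresh one, with a new instance ID and empty
# instance storage, whereas applying A+B together is a no-op that
# leaves the original instance untouched.
resource "aws_instance" "foo" {
  ami           = "ami-0123456789abcdef0" # placeholder
  instance_type = "t3.micro"
}
```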

One final risk is non-determinism caused by Terraform applying actions concurrently and by remote operations not always taking a consistent amount of time to complete. For example, if two objects have the same dependencies then Terraform is likely to try to apply their actions at the same time, which means that in practice the remote system(s) could perceive them as arriving in either order. In most cases this isn’t an issue, but it can be problematic if, for example, a module is missing a necessary dependency relationship and thus may either succeed or fail depending on what order those operations end up being taken in at runtime. Most dependencies come “for free” as a result of data flow between resources, but there are some cases where the design of the remote API makes a dependency invisible to Terraform unless it is explicitly recorded using the depends_on argument. For example, if you create an IAM role, attach a policy to it, and pass that role to an AWS service, Terraform can generally see automatically that the service and the policy attachment both depend on the role, but the role isn’t actually “ready” until the policy is attached, and so there is a hidden dependency between the service and the policy attachment.
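To illustrate that last example, here’s a sketch of how depends_on records the hidden dependency (the choice of Lambda as the AWS service, and all names, are just illustrative):

```hcl
resource "aws_iam_role" "lambda" {
  name = "example-lambda-role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "lambda.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "lambda_logs" {
  role       = aws_iam_role.lambda.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
}

resource "aws_lambda_function" "example" {
  function_name = "example"
  role          = aws_iam_role.lambda.arn # visible dependency on the role
  runtime       = "python3.12"
  handler       = "main.handler"
  filename      = "lambda.zip" # placeholder artifact

  # The function isn't usable until the policy is attached, but nothing
  # in the data flow tells Terraform that, so we record it explicitly:
  depends_on = [aws_iam_role_policy_attachment.lambda_logs]
}
```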

With all of that said, if I were building a system like the one you are describing, I would plan to ensure that the following invariants hold for the full life of the system:

  • Remote objects are changed only by running terraform apply with changes to the configuration.
  • Take care when writing your modules to consider all of the necessary dependencies between resources.
  • Consider carefully the implications of any non-idempotent actions the Terraform configuration takes (see the sketch after this list). Provisioners are a prominent cause of non-idempotence, but some of the APIs Terraform wraps can have non-idempotent behaviors too.
  • Ensure that the same sequence of changes is applied to every instance of the system. If you apply commits A, B, C sequentially to one instance of the system, make sure to do the same for all other instances of the system too, rather than skipping ahead and trying to apply A+B+C all at once.
  • If you have any independently-versioned modules as part of your overall configurations, the above rules must apply to changes to those modules too: if you tested module changes D and E separately during development, make sure that you apply D and E separately in production too, rather than applying D+E together or applying in the opposite order E, D for some callers. (This will probably require extra coordination in your development process to make sure that every module change is tested against the result of the one before it, which means you won’t practically be able to develop two changes to the same module concurrently.)
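For the non-idempotence point above, a minimal sketch of the kind of action to watch for (the AMI ID and the registration script are hypothetical):

```hcl
resource "aws_instance" "app" {
  ami           = "ami-0123456789abcdef0" # placeholder
  instance_type = "t3.micro"

  provisioner "local-exec" {
    # Runs only when the instance is created; Terraform cannot see,
    # verify, or reverse its effects afterwards. If the instance is
    # ever destroyed and re-created, the script runs again from
    # scratch, so two workspaces that took different paths to the
    # same configuration can end up observably different.
    command = "scripts/register-instance.sh ${self.id}" # hypothetical script
  }
}
```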

The above set of constraints is pretty conservative. In practice you can probably get away with being a little more liberal, depending on the characteristics of the resource types you plan to use. In the end, Terraform’s behavior depends a lot on the behavior of remote APIs, and since we’re talking about the general case here I’m taking a pessimistic outlook. A module written with your use-case in mind can mitigate some of the worst-case scenarios through careful design and testing, but that will tend to require deep familiarity with the behaviors of the remote services in question and the interactions between them.


Incidentally, you mentioned in passing in your question the idea that Terraform generates CloudFormation configuration. I just wanted to note, separately from the rest of this answer, that Terraform does not use CloudFormation unless you explicitly use the aws_cloudformation_stack resource type: instead, it calls directly into the underlying AWS APIs, the same way that CloudFormation itself would.
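For reference, that resource type is the one exception, where Terraform does hand a template off to the CloudFormation API (the template here is just a stub):

```hcl
resource "aws_cloudformation_stack" "example" {
  name = "example-stack"

  # Terraform submits this template to CloudFormation and waits for the
  # stack to converge; every other resource type in the AWS provider
  # talks to the underlying service APIs directly.
  template_body = jsonencode({
    Resources = {
      ExampleTopic = {
        Type = "AWS::SNS::Topic"
      }
    }
  })
}
```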