Controlling wait duration between module deployments

I have a setup which mandates the sequential installation of EC2 instances, each mapped to an Auto Scaling group. Each instance is treated as a component of a larger stack and needs to be brought up one after the other, once its respective userdata.sh script has finished successfully. Each component takes several minutes to fully install all of its required services (via userdata.sh).

Currently I am using the “time_sleep” resource to give each component enough time to install before proceeding to the next one. Ideally I would like to trigger the next module only after the previous module has installed successfully. I am looking for an elegant solution to handle this sequential deployment.

To elaborate further, the setup uses Terraform modules:

module componentA → resource “time_sleep” “wait_20m” → module componentB → …and so on.
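Roughly, the wiring looks like this (module sources and the fixed 20-minute wait are simplified placeholders for my actual setup):

```hcl
module "componentA" {
  source = "./modules/componentA"
  # Launch Template + Auto Scaling group; userdata.sh runs on each instance at boot
}

# The hack: wait a fixed 20 minutes for componentA's userdata.sh to (hopefully) finish
resource "time_sleep" "wait_20m" {
  depends_on      = [module.componentA]
  create_duration = "20m"
}

module "componentB" {
  source     = "./modules/componentB"
  depends_on = [time_sleep.wait_20m]
}
```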

For instance, module componentA in turn creates a Launch Template with an Auto Scaling group, so Terraform doesn’t have direct control over the instances that get created and can’t invoke the SSM (Systems Manager) agent on them to check the status of cloud-init. I guess this is the core challenge!

I am looking for some way for the EC2 instances to notify an in-progress Terraform deployment (if there is one happening) before it proceeds to the next module (in this example, module componentB). Are there any webhooks I can use to notify an ongoing terraform apply, in lieu of the “time_sleep” resource?

Appreciate your inputs.

Thanks.

Terraform just isn’t designed for that sort of orchestration. It builds a dependency graph from looking at the code and how things link together, and then makes changes from the bottom of that graph upwards, running things in parallel when it can. Once a change is complete it will move on to the next one. It can only tell that a change is complete once the provider says so (which depends on whatever APIs are used by the service being managed). In general it has no way of knowing whether that means the thing is actually ready; as you’ve seen, this is often not the case, such as for EC2 instances which need to run lengthy cloud-init scripts, or resources which take a while to fully replicate across the globe.

Anything you try within Terraform is likely to just be some form of “hack”. As you’ve mentioned, one common approach is to have something which causes a delay, but those aren’t great because the delay has no direct relationship to the process that’s actually happening (usually you set it to something pretty long, which just slows everything down).

If you have complex orchestration needs I’d generally suggest bringing that out of Terraform. One way is to split your code so you have multiple root modules, then have some sort of script/tool that runs them one after another as needed. You can add in whatever custom checks you need to ensure each step has completed.
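For example (just a sketch, assuming an S3 backend and directory names like componentA/ and componentB/, which you’d adapt to your layout), the second root module can consume the first one’s outputs via a remote state data source, and the wrapper script/pipeline becomes the place where the “is componentA actually ready?” check lives, e.g. polling SSM or an application health endpoint before running terraform apply in componentB/:

```hcl
# componentB/main.tf -- reads outputs published by the componentA root module.
# Backend type, bucket and key names here are assumptions; adjust to your setup.
data "terraform_remote_state" "componentA" {
  backend = "s3"

  config = {
    bucket = "my-terraform-state"            # placeholder bucket name
    key    = "componentA/terraform.tfstate"  # placeholder state key
    region = "us-east-1"
  }
}

module "componentB" {
  source = "./modules/componentB"

  # hypothetical output/variable names, purely for illustration
  upstream_asg_name = data.terraform_remote_state.componentA.outputs.asg_name
}
```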

Thanks Stuart for your reply. I guess I will have to split my code into multiple root modules, as you have hinted! I’m kind of new to Terraform, so I was keen to fish out some ideas to solve this particular problem. Thanks for sharing your thoughts.

I broadly agree with @stuart-c’s assessment above; this sort of tightly-sequenced orchestration isn’t really within Terraform’s capabilities due to how it operates “at arm’s length” from the internals of the infrastructure objects you’re deploying. It often cannot even see the internal state of an object like an EC2 instance or a database, so cannot react to it.

In previous jobs, when I was a Terraform user rather than a Terraform developer, a typical pattern I used and saw others use was to try to make the different components “self-organize” in some way. The details of that depend on what software we’re talking about, but the general idea is for the software running in the EC2 instances to be responsible for finding and learning the status of the other components they depend on.

This can be a good idea for other reasons anyway: if one of your instances fails and gets replaced by autoscaling then Terraform won’t be around to help the new instance connect with the others, but if they can discover each other and connect automatically then the system can potentially heal itself without any manual intervention.

A relatively simple implementation of that is to design the software so that if it cannot reach its dependencies it will keep retrying (at a polite rate) until it can. A server might also refuse to take any requests from its own clients until its dependencies are available, so that the load balancer (or similar) only routes requests to nodes that are ready to work. This can often be sufficient for getting things running, but it doesn’t give you any central place to observe the health of the individual components and their connectivity.
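On the Terraform side, the main thing needed to support that pattern is a health check on the load balancer, so that instances which are still installing simply stay out of rotation. A minimal sketch, assuming an ALB target group and a hypothetical /healthz endpoint that the application only starts answering once its dependencies are reachable:

```hcl
resource "aws_lb_target_group" "component_a" {
  name     = "component-a"   # placeholder name
  port     = 8080            # placeholder application port
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  health_check {
    path                = "/healthz"  # hypothetical readiness endpoint
    interval            = 15
    healthy_threshold   = 2
    unhealthy_threshold = 5
    matcher             = "200"
  }
}

# Assumes an aws_autoscaling_group.component_a defined elsewhere; new instances
# launched by the autoscaling group only receive traffic once the check passes.
resource "aws_autoscaling_attachment" "component_a" {
  autoscaling_group_name = aws_autoscaling_group.component_a.name
  lb_target_group_arn    = aws_lb_target_group.component_a.arn
}
```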

A more complex answer would be to use something like HashiCorp Consul so that your systems can form a cluster and announce themselves into a service catalog so they can discover one another, and possibly connect to one another using Consul’s network overlay. Systems like Consul allow you to have a central place to see what’s going on and understand that e.g. one particular service didn’t start up correctly and needs you to intervene, without inspecting each server individually.
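For a sense of what that looks like in practice, a service announces itself into Consul’s catalog with a small agent configuration file on each instance, something like the following (service name, port, and health endpoint are made up for illustration); other components can then find it via Consul DNS or the catalog API:

```hcl
# /etc/consul.d/component-a.hcl -- loaded by the Consul agent on each instance
service {
  name = "component-a"   # hypothetical service name
  port = 8080

  check {
    id       = "component-a-http"
    http     = "http://localhost:8080/healthz"  # hypothetical health endpoint
    interval = "10s"
    timeout  = "2s"
  }
}
```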

However you end up achieving it, the idea is that Terraform only arranges for the individual autoscaling groups to get started, and is “hands off” for the rest of the startup process and for any reconnections that need to happen at runtime when instances fail. If you need to reconfigure the instance templates or other settings then you’d use Terraform to do that, but Terraform would not be directly involved in controlling the startup of individual software or the connections between nodes.