Terraform AWS ec2 doesn't gracefully shut down

I’m running terraform apply, which destroys an EC2 instance and creates a new one.

The instance that gets destroyed has a shutdown script that takes several minutes to complete in order to gracefully shut down running software.

Normal reboots and power cycles seem to trigger the script properly. However, when the instance is destroyed and re-created, there are signs that the machine did not shut down properly during destruction.

My shutdown script listens for these events:
WantedBy=halt.target reboot.target shutdown.target

Does Terraform fire these events and wait for a graceful EC2 shutdown before destruction? How can I make sure terraform apply allows my machine to shut itself down gracefully before it is destroyed?

Have you tried adjusting the timeouts?

Timeouts

The timeouts block allows you to specify timeouts for certain actions:

  • create - (Defaults to 10 mins) Used when launching the instance (until it reaches the initial running state)
  • update - (Defaults to 10 mins) Used when stopping and starting the instance when necessary during update - e.g. when changing instance type
  • delete - (Defaults to 20 mins) Used when terminating the instance

I looked at those properties, but they seem to be timeouts that Terraform uses for its own purposes, to know when an operation has failed, as opposed to giving the instance itself a set amount of time to shut down.

Terraform’s AWS provider implements destroying an individual aws_instance instance by calling ec2:TerminateInstances and then polling periodically until the instance status shows as “terminated” as far as the EC2 API is concerned.

Terraform has no direct control over how EC2 implements that shutdown, how the software inside the EC2 instance responds to being asked to shut down, or how long EC2 will wait for the shutdown to complete.
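If it helps to reproduce this outside Terraform, the same sequence can be approximated with the AWS CLI. This is only a rough sketch of the behaviour described above, not the provider's actual code; terminate_and_wait is a hypothetical helper name and the instance ID you pass in is a placeholder.

```shell
#!/bin/bash
# Ask EC2 to terminate an instance, then block until EC2 reports it as terminated.
# terminate_and_wait is a hypothetical helper, not part of Terraform or the AWS CLI.
terminate_and_wait() {
    local instance_id="$1"
    # Same API call the provider makes: ec2:TerminateInstances.
    aws ec2 terminate-instances --instance-ids "$instance_id" || return 1
    # Polls the EC2 API until the instance state is "terminated".
    aws ec2 wait instance-terminated --instance-ids "$instance_id"
}
```

Calling terminate_and_wait with a test instance's ID and watching how long the wait takes should show whether the behaviour is specific to Terraform or comes from EC2 and the instance itself.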

The EC2 guide Troubleshooting Terminating (Shutting Down) Your Instance suggests that EC2 will give the instance an opportunity to run shutdown scripts before the instance is finally forcefully terminated.

Elsewhere in the EC2 docs, there is another section What Happens When You Terminate an Instance, which explains that TerminateInstances causes the EC2 system to send an ACPI Shutdown event (similar to what happens when you press a power button on a physical computer) which software in the instance must listen for and respond to. In your case it sounds like you are using systemd, in which case it’s systemd that would respond to that event, as you described. Although it’s impossible to say for certain what’s going on with your system from here, my first theory would be that the systemd configuration isn’t quite right and so systemd is not running the script as you intended.
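As a starting point for checking that theory, something like the following could show whether the unit is enabled and what it logged during the last shutdown. This is a sketch under assumptions: check_shutdown_unit and the unit name used in the usage note are made up for illustration, and journalctl -b -1 only has data if the journal is persisted across boots and the root volume still exists, so it would need to be run after a reboot rather than after a termination.

```shell
#!/bin/bash
# Show whether a shutdown-hook unit is enabled, how systemd sees its definition,
# and what it logged during the previous boot (which includes that boot's shutdown).
# check_shutdown_unit is a hypothetical helper name.
check_shutdown_unit() {
    local unit="$1"
    # Prints "enabled" only if the [Install] section has been applied.
    systemctl is-enabled "$unit"
    # Prints the unit file (and any drop-ins) exactly as systemd loaded it.
    systemctl cat "$unit"
    # Requires a persistent journal; otherwise there is no previous boot to show.
    journalctl -b -1 -u "$unit" --no-pager
}
```

For example, check_shutdown_unit remnode-shutdown.service, where remnode-shutdown.service stands in for whatever the unit file is actually called.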


While not directly related to your question, I want to note that I’d recommend using aws_autoscaling_group to launch EC2 instances from Terraform rather than aws_instance directly. In that case, Terraform simply configures EC2 autoscaling and then autoscaling in turn manages your instances. This is helpful in many situations because EC2 autoscaling can then constantly monitor your instances and replace them if any fail, whereas Terraform can only react to changing infrastructure when you explicitly run it.

Although it’s impossible to say for certain what’s going on with your system from here, my first theory would be that the systemd configuration isn’t quite right and so systemd is not running the script as you intended.

Can you elaborate on this any further? Maybe with a link to a proper implementation example?

This is my shutdown service:

[Unit]
Description=Gracefully shut down remnode to avoid database dirty flag
DefaultDependencies=no
Before=shutdown.target reboot.target halt.target

[Service]
Type=oneshot
ExecStart=/root/node_shutdown.sh

[Install]
WantedBy=halt.target reboot.target shutdown.target

and this is the script it calls

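#!/bin/bash
# Ask remnode to exit with SIGINT, then wait until every remnode process is gone.
remnode_pids=$(pgrep remnode)

# Nothing to do if remnode is not running.
if [ -z "$remnode_pids" ]; then
    exit 0
fi

# $remnode_pids is intentionally unquoted: pgrep may return several PIDs.
kill -SIGINT $remnode_pids

# Block until no remnode process remains, so shutdown waits for a clean exit.
while pgrep remnode > /dev/null
do
    sleep 1
done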

I’m not knowledgeable enough about systemd to give a definitive answer here, but some quick searching showed various examples of using services with ExecStop set on them pointing to a script that systemd would run when shutting down that service. It looks like you can just set ExecStop without also setting ExecStart. I don’t know if that will work, but hopefully it’s relatively easy to try and see!
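To make that concrete, here is an untested sketch of what such a unit might look like, written as a small shell function that generates the file. The assumptions are mine: write_shutdown_unit and the file name remnode-shutdown.service are hypothetical, the 600 second TimeoutStopSec is an arbitrary placeholder, and RemainAfterExit=yes is there so that the unit counts as active after boot and systemd has something to stop (and therefore runs ExecStop) at shutdown.

```shell
#!/bin/bash
# Write a oneshot unit with no ExecStart: it does nothing at boot, stays "active",
# and runs the shutdown script via ExecStop when systemd stops it during shutdown.
# write_shutdown_unit and the unit file name are hypothetical.
write_shutdown_unit() {
    local unit_dir="$1"   # e.g. /etc/systemd/system
    cat > "$unit_dir/remnode-shutdown.service" <<'EOF'
[Unit]
Description=Gracefully shut down remnode to avoid database dirty flag

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStop=/root/node_shutdown.sh
# Placeholder value: how long systemd lets ExecStop run before giving up.
TimeoutStopSec=600

[Install]
WantedBy=multi-user.target
EOF
}
```

After running write_shutdown_unit /etc/systemd/system as root, the unit would need systemctl daemon-reload followed by systemctl enable --now remnode-shutdown.service, since ExecStop only runs for a unit that is active at shutdown. Even then, EC2 may still cut the shutdown short, so this does not guarantee the script finishes.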