I have a module that creates an AWS Aurora cluster (aurora_module) and a module that creates a security group (sg_module). The security group id is an output of sg_module and that is passed as a parameter into the aurora_module. The aws_security_group resource in sg_module has the create_before_destroy lifecycle enabled.
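For reference, a minimal sketch of the wiring described above. Every name here (the `./sg_module` and `./aurora_module` paths, the `vpc_id` and `security_group_id` variables, the `security_group_id` output, and the `aurora-` prefix) is an assumption for illustration, not the actual configuration.

```hcl
# --- sg_module/main.tf (hypothetical layout) ---
variable "vpc_id" {
  type = string
}

resource "aws_security_group" "this" {
  # name_prefix rather than name, so the replacement group can exist
  # alongside the old one while create_before_destroy is in effect.
  name_prefix = "aurora-"
  vpc_id      = var.vpc_id

  lifecycle {
    create_before_destroy = true
  }
}

output "security_group_id" {
  value = aws_security_group.this.id
}

# --- root module main.tf (hypothetical layout) ---
variable "vpc_id" {
  type = string
}

module "sg_module" {
  source = "./sg_module"
  vpc_id = var.vpc_id
}

module "aurora_module" {
  source = "./aurora_module"

  # The ID flows from one module's output into the other module's input.
  security_group_id = module.sg_module.security_group_id
}
```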
I made a change to the sg_module that is causing the security group to be recreated. It creates the new aws_security_group before trying to destroy the old one. However, the aurora_module is not picking up that there is a change in the parameter that specifies the security group ID. Because of this, the apply fails with a timeout trying to delete the old security group that is still associated with the aurora cluster.
Does this need to be a multi-step process?
1. Apply to create the new security group.
2. Apply to update the security group ID passed into the aurora module and destroy the old security group.
Or is there some method I am missing (besides triggering a recreation of the aurora cluster)?
The overall structure of what you’re describing should work within a single plan and apply, but without any details I can’t say where things might be going wrong.
When the aws_security_group is planned for replacement, does the aurora cluster show a corresponding change at all? Were the resources all applied with the create_before_destroy option before making this change?
EC2 won’t allow deleting a security group that’s associated with at least one network interface, and it sounds like the Aurora cluster’s network interfaces are associated with the security group.
Therefore the correct order of operations would need to be something like this:
1. create the new security group
2. update the Aurora cluster’s network interfaces to refer to the new security group
3. wait for EC2 to become consistent with the Aurora change
4. destroy the old security group
A common gotcha is what I’ve labelled as step 3 above: it takes some time after destroying or reconfiguring a network interface before the associated security group becomes unblocked for deletion, and (as far as I know) there is no way to know it’s ready except to keep trying until it succeeds. I think the AWS provider tries to deal with this by effectively incorporating step 3 into step 4, polling the “delete security group” API until it eventually succeeds. But if step 2 didn’t happen then it can never succeed, and so will poll until the operation eventually hits a timeout and returns an error.
With that said, I think what @jbardin asked is the crucial point. Did Terraform mention the need to update the Aurora cluster as part of the plan? That’s a different way of asking whether your apply phase is performing what I labelled as step 2 in the above list, since Terraform will not attempt to update the Aurora cluster during the apply phase unless it said it would during the planning phase.
If you’re not sure, then it might help to share the entire output of terraform plan showing the proposal to replace the security group, along with anything else that was proposed at the same time.
No, the aurora cluster does not show a corresponding change at all in the plan. The aws_security_group resource has a create_before_destroy lifecycle on it. It did create a new SG before trying to delete the old one.
If there’s no change in the aurora cluster, then there is either a mistake in the configuration which isn’t directly linking the security group output to the cluster, or there is a bug in the resource which is ignoring the change in configuration. The fact that the security group id is changing because the security group is being replaced entirely should definitely show up as a change elsewhere in the configuration where that id is referenced (even if the order of operations was somehow incorrect).
You are hitting a bug in the aws_rds_cluster resource, which is not detecting a change in vpc_security_group_ids, partly because the attribute is both optional and computed, and partly because the legacy SDK cannot differentiate at that point between unknown and unset.
Normally it wouldn’t matter, but the value is entirely unknown here because of compact. compact removes empty and null elements, and it cannot tell which elements it will remove until all unknown values are resolved, so its entire result becomes unknown. In this particular case you could leave out the compact call (provided none of the inputs can be empty), so that the value sent to the provider is a set containing an unknown element rather than an unknown set. Because the attribute is a set, duplicate values are removed implicitly anyway.
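The thread does not show the configuration in question, so the following is only a sketch of what dropping compact might look like inside the Aurora module. The variable names, the resource name `this`, and the "before" expression are all assumptions.

```hcl
# Hypothetical inputs inside aurora_module; the names are assumptions.
variable "security_group_id" {
  type = string
}

variable "additional_security_group_ids" {
  type    = list(string)
  default = []
}

resource "aws_rds_cluster" "this" {
  # Engine, credentials, subnet group and other arguments omitted for brevity.

  # Before (assumed): compact() returns a wholly unknown list while any
  # element is unknown, so the provider sees an unknown set during planning.
  # vpc_security_group_ids = compact(concat([var.security_group_id], var.additional_security_group_ids))

  # After: only the replaced group's ID is unknown; the rest of the set is known.
  vpc_security_group_ids = concat(
    [var.security_group_id],
    var.additional_security_group_ids,
  )
}
```

Without compact, empty strings and nulls are no longer filtered out, so this only fits if the inputs never contain them.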