Best Practices for Terraform Modules

Hi,

Our team has undertaken a project to refactor a large Terraform project by extracting several modules. We have opted to keep each module in its own VCS repository, so that we can build a testing Terraform project for each and implement an automated CI/CD workflow that will apply the testing Terraform project and validate the resources using various API clients (e.g. AWS SDK) in code. Much like what was suggested in this thread: Best practices of Terraform staging testing
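For illustration, the per-module testing project could be as small as the sketch below. The `tests/` directory, the `run_id` variable, the `name` input, and the `db_instance_identifier` output are all assumptions made for this example, not settled parts of our design.

```hcl
# tests/main.tf (hypothetical location inside the module's own repository)

variable "run_id" {
  description = "Unique suffix supplied by the CI job so parallel runs do not collide."
  type        = string
}

# Instantiate the module under test from the repository root.
module "rds_under_test" {
  source = "../"

  # Hypothetical input; the real module's variables may differ.
  name = "ci-${var.run_id}"
}

# Expose what the validation code (e.g. an AWS SDK client) needs to look up.
output "db_instance_identifier" {
  value = module.rds_under_test.db_instance_identifier
}
```

The CI job would then run `terraform apply` in this directory, read the values with `terraform output -json`, run its API checks, and finish with `terraform destroy`.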
But beyond simply carving up the large Terraform project into component-sized modules, we wanted each module to include all of its ancillary resources. For example, our rds module would have not only AWS RDS resources, but also Datadog resources to manage the dashboards that monitor our RDS DB instances, and possibly Vault resources to manage our database secrets engine.
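A rough sketch of what such a multi-provider module might look like follows. The variable name, instance settings, version constraints, and the Datadog metric query are assumptions chosen for illustration; Vault resources are left out to keep it short.

```hcl
# Hypothetical rds module that declares two providers.
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 5.0" # assumed constraint
    }
    datadog = {
      source  = "DataDog/datadog"
      version = ">= 3.0" # assumed constraint
    }
  }
}

variable "name" {
  description = "Identifier used for the DB instance and its dashboard."
  type        = string
}

# The database itself. All settings here are placeholder values.
resource "aws_db_instance" "this" {
  identifier                  = var.name
  engine                      = "postgres"
  instance_class              = "db.t3.micro"
  allocated_storage           = 20
  username                    = "app_admin"
  manage_master_user_password = true # RDS keeps the password in Secrets Manager
  skip_final_snapshot         = true
}

# The monitoring that travels with the database.
resource "datadog_dashboard" "this" {
  title       = "RDS: ${var.name}"
  layout_type = "ordered"

  widget {
    timeseries_definition {
      title = "CPU utilization"
      request {
        # Assumed metric and tag names from the Datadog AWS integration.
        q            = "avg:aws.rds.cpuutilization{dbinstanceidentifier:${aws_db_instance.this.identifier}}"
        display_type = "line"
      }
    }
  }
}

output "db_instance_identifier" {
  value = aws_db_instance.this.identifier
}
```

Credentials for both providers would come from the calling root configuration, so any plan or apply that includes this module needs both the AWS and Datadog APIs to be reachable.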

As part of this project we are also adopting Terraform Cloud to track our Terraform state and host our private modules. However, the naming requirement for the repository (Publishing Private Modules - Private Registry - Terraform Cloud and Terraform Enterprise | Terraform by HashiCorp) is causing us to doubt our approach, specifically the requirement to include a provider in the name.
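To make the concern concrete: as we understand it, the repository has to be named `terraform-<PROVIDER>-<NAME>`, and the provider segment then becomes part of the module's registry address. In the sketch below, the organization `example-org`, the module name, the version, and the `name` input are placeholders.

```hcl
# Repository name (hypothetical): terraform-aws-rds
# Registry address pattern: <HOSTNAME>/<ORGANIZATION>/<MODULE NAME>/<PROVIDER>
module "rds" {
  source  = "app.terraform.io/example-org/rds/aws"
  version = "~> 1.0"

  # Hypothetical input; the real module's variables may differ.
  name = "orders-db"
}
```

A module that also manages Datadog and Vault resources does not map neatly onto the single `aws` segment, which is where our doubt comes from.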

We would very much appreciate any comments on our approach of using Terraform modules as a way of aggregating all resources across different providers that are related to a common component in our system, so that they can be tested as a unit.

Marc

Hi @marcboudreau,

With broad architectural questions like this there is never a universal correct answer, but rather a set of tradeoffs to be made. For that reason, I’m interpreting your question here as gathering more data to inform your tradeoffs and will respond to it with that in mind; I can’t recommend any particular approach for your case because I don’t know all of the requirements and goals, but I can describe some facts about Terraform that might help you make your decision.

When considering the management of long-lived infrastructure objects, it’s often important to consider failure domains: at some point something in your system will fail either temporarily or permanently, and that will have consequences both for the behavior of other subsystems that rely on the failed system and potentially for your ability to respond to the problem.

The second part of that is often an important consideration for Terraform configuration architecture: we typically expect to work with a particular Terraform configuration as if it were a single unit, planning and applying changes across all of its managed objects at once. Because of that, a failure of a control plane or other system you might need to interact with in order to make changes can make it very inconvenient or impossible to work with a Terraform configuration that makes use of that system.

For that reason, one of the axes commonly used to decompose infrastructure into multiple configurations is failure-domain isolation. What exactly that means depends on how your system is built, but for control-plane-like mechanisms the first few levels of failure domain are commonly separate vendors with distinct infrastructure, and then separate regions or datacenters within a particular vendor.

With all of that theoretical background out of the way, consider the following real example based on your question: would it be problematic if an RDS failure blocked you from updating your Datadog settings that represent your monitoring of that system? If so, I would err on the side of decomposing those two even though they are thematically related.

This isn’t a universal rule, of course, and even if the answer is yes there is room for compromise. For example, you might decide that there is still value in grouping together all of the infrastructure for a particular subsystem, but do so in a way that splits each component across multiple configurations (in Terraform Cloud, multiple workspaces), so that the pieces remain thematically connected but are independently operable when needed.
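As one possible shape for that compromise, here is a minimal sketch of a monitoring configuration that lives in its own workspace and only reads the RDS workspace's stored outputs. The organization, workspace names, output name, and metric query are hypothetical, and the RDS workspace would need to share its state with this one.

```hcl
# Root configuration for a hypothetical "orders-rds-monitoring" workspace.
# It talks to Datadog and Terraform Cloud, but never to the AWS API.
terraform {
  required_providers {
    datadog = {
      source = "DataDog/datadog"
    }
  }
}

# Read the outputs last recorded by the separate "orders-rds" workspace.
data "terraform_remote_state" "rds" {
  backend = "remote"

  config = {
    organization = "example-org"
    workspaces = {
      name = "orders-rds"
    }
  }
}

locals {
  # Assumes the RDS workspace declares an output with this name.
  db_id = data.terraform_remote_state.rds.outputs.db_instance_identifier
}

resource "datadog_dashboard" "rds" {
  title       = "RDS: ${local.db_id}"
  layout_type = "ordered"

  widget {
    timeseries_definition {
      title = "CPU utilization"
      request {
        # Assumed metric and tag names from the Datadog AWS integration.
        q            = "avg:aws.rds.cpuutilization{dbinstanceidentifier:${local.db_id}}"
        display_type = "line"
      }
    }
  }
}
```

Because this configuration has no AWS provider, a plan or apply here should not depend on the RDS control plane being healthy. The `tfe_outputs` data source from the `hashicorp/tfe` provider is an alternative way to read another workspace's outputs.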

As I say, I can’t recommend anything in particular here because I don’t have all the context, but I hope the above is a useful additional design tradeoff to consider as you design your overall system architecture; you may still ultimately decide that other concerns take priority, and that’s fine! That’s what system architecture is all about. :grinning:

Thank you for your reply. Your point about considering the failure domains was very helpful, since that hasn’t come up yet in our discussions.

The main takeaway is that the approach of grouping thematically related resources that use different providers isn’t a bad approach so long as we have carefully considered all of the cases where a failure from one of those providers prevents updating the entire configuration.

Cheers,
Marc