We have a fairly robust terraform config in our company. However, all resources for a given environment (master, prod, stage) sit in a single state file.
It seems that the recommended best practice is to fragment the terraform config out such that there are a bunch of smaller state files, each responsible for a few resource clusters (vpc, service A, service B).
However, what I can’t find is any info on how plan-time relationships are handled in this highly fragmented Terraform setup.
Say I want to deploy changes to service A, which includes changes in the vpc. What strategies, if any, are there for seeing the collective plan output of multiple states, for a given change, when they’re fragmented?
Yes, I’m aware of remote state.
That, however, doesn’t address the problem of running terraform plan against a highly fragmented state.
Say I make a change in State1, which I needed to make to get a service in State2 to work. I won’t know if my change in State1 worked until I apply it and run plan against State2.
If State1 and State2 are the same state file, you don’t have this problem: all relationships are in the same state, so the plan output will encompass all the changes.
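To make the State1/State2 relationship concrete, here is a minimal sketch of how two separate configurations are typically wired together with the terraform_remote_state data source. It assumes AWS and an S3 backend; the file layout, bucket, key, region, CIDR ranges, and all resource, variable, and output names are hypothetical and only for illustration.

```hcl
# --- State1: vpc/main.tf (hypothetical layout) ---
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_subnet" "private" {
  vpc_id     = aws_vpc.main.id
  cidr_block = "10.0.1.0/24"
}

# Only root-module outputs are visible to other configurations.
output "private_subnet_id" {
  value = aws_subnet.private.id
}

# --- State2: service-a/main.tf (a separate configuration, hypothetical layout) ---
variable "ami_id" {
  type        = string
  description = "AMI for the service instance (assumed input)."
}

# Reads the outputs stored in State1's most recently written state snapshot.
data "terraform_remote_state" "vpc" {
  backend = "s3"
  config = {
    bucket = "example-terraform-states" # assumed bucket name
    key    = "stage/vpc.tfstate"        # assumed state key
    region = "us-east-1"                # assumed region
  }
}

resource "aws_instance" "service_a" {
  ami           = var.ami_id
  instance_type = "t3.micro"
  subnet_id     = data.terraform_remote_state.vpc.outputs.private_subnet_id
}
```

Because the data source only sees what State1 has already written to its state, a plan for State2 cannot reflect a State1 change that has been planned but not yet applied, which is exactly the gap described above.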
If you are building a system from decomposed subsystems then you’ll typically need to define explicitly what each of your subsystems is responsible for and have mechanisms in place so that you can change and test each subsystem separately.
If you have a system where objects A and B are so coupled together that you can’t test A without also having B then that is probably not a good candidate for an interface boundary between subsystems. If two things often change together then better to keep them together in the same subsystem (that is, the same configuration).
The key to all this is carefully selecting where you draw the lines between your subsystems. That’s unfortunately something that depends a lot on the situation, and so it’s hard to give general advice about it that is universally applicable. Rate of change is often a good first criterion to split things up by though, which is why for example separating the configuration of the network fabric (VPCs and subnets in AWS) is often the first split teams do: the underlying network topology generally changes very infrequently in relation to the other items that make use of it.
However, doing that does require having some procedures in place in order to verify that, for example, a newly-provisioned subnet has the appropriate routing rules configured, even though nothing “real” is running in that subnet yet. That can be handled in a number of ways, such as configuring the subnets fully systematically (so that they’ll all have the same routing rules by definition), by reviewing the created subnets manually in the console or API (often reasonable because network topology changes infrequently), or by writing a one-shot Terraform configuration that you can temporarily apply into the new subnet in order to exercise it and then destroy it before deploying anything else there.
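As a rough illustration of the last option, a one-shot configuration for exercising a new subnet might look something like the sketch below. It assumes AWS; the variable names, the instance type, the tag, and the idea of using a single throwaway instance as the probe are all assumptions for the example, not a prescribed method.

```hcl
# One-shot smoke-test configuration (hypothetical): apply, inspect, destroy.

variable "subnet_id" {
  type        = string
  description = "ID of the newly provisioned subnet to exercise."
}

variable "ami_id" {
  type        = string
  description = "Any small AMI available in the region (assumed input)."
}

# Fails at plan time if the subnet does not exist.
data "aws_subnet" "target" {
  id = var.subnet_id
}

# Throwaway instance placed in the subnet so its routing can be checked.
resource "aws_instance" "probe" {
  ami           = var.ami_id
  instance_type = "t3.micro"
  subnet_id     = data.aws_subnet.target.id

  tags = {
    Name = "subnet-smoke-test"
  }
}

output "probe_private_ip" {
  value = aws_instance.probe.private_ip
}
```

You would run terraform apply with the two variables set, check from the probe instance whatever connectivity the subnet is supposed to have, and then run terraform destroy so nothing is left behind. How the connectivity check itself is performed is left open here, since it depends on the routing rules you expect.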
These approaches won’t necessarily generalize to other subsystem boundaries, and they might not even work for your particular system if your workflow is different than what I’m imagining, but I hope there are some ideas here that you can adapt.
With all of that said, if you’re happy with having each environment entirely described in one Terraform configuration then I wouldn’t rush to decompose it just because “best practices” say so. It’ll be easier to decompose well if you have a specific motivation in mind for doing so, because you can then use that information to decide how to decompose. It is true that single “whole environment” configurations tend to become clunky to use over time, but I’d always advise waiting until you have an actual problem to address before making an architectural change, because otherwise it can be hard to know whether your change is actually addressing a real problem.
Thank you for taking the time to share your insights, @apparentlymart.
Good to know for a fact, then, that there’s no out-of-the-box solution for the problem I’m describing.
I’ll have a think about what you said and see if our services could fit into the approaches you’re describing (we do have some problems with our current approach, so there is a push to explore other approaches for our state structure).