"Configuration Driven Development" - did I paint myself into a corner?

mfcooney · September 16, 2020, 4:03pm

Hello all,

I’m a software engineer, so when I took on our company’s Terraform project, I looked at it as a software project, rather than a scripting project, that needed to be handed off to other teams. I took a ‘configuration driven development’ approach, where I described the full and final system in the .tfvars file and then built a framework behind it. This allows our delivery teams to build out complex environments without needing to have any knowledge of AWS or Terraform – but they get our best practices baked in.

The attached file, perf.tfvars (at the bottom), is a typical configuration file. It has objects for describing the systems and AMIs needed, but then has an environment variable with an array of ‘subsystems’, which are essentially security groups in AWS (by keeping the names generic, I can keep the same configuration for Azure, etc. and just build out a supporting framework for it). Each subsystem describes the inputs, machines, scaling groups, etc. that it needs.

I then have 20+ modules that walk this structure in turn, pulling out the bits they care about and building them out. By maintaining a careful naming structure for everything, I can keep doing lookups on other modules to get the ids/names I need to make it work. I included two modules as samples. One is for security groups, which is very simple. It walks the structure, creates the array of everything that needs to be created and then creates the resources. The other file is the security group rules and shows that it can get pretty salty, as it has to pull security group rules from multiple places in the array in various formats. It does multiple lookups on the established security groups to make sure the rules are assigned properly. Even though it is one of the more complex modules, it’s still only 100 lines or so. It all works extremely well with my team, which has a fully defined and understood infrastructure.

The problem I’m running into, however, is when I need to make modifications to an infrastructure. It seems like Terraform keeps all this information more or less in an array, the same way I create it, and if something in the middle of the array needs to be adjusted, everything from there down needs to be destroyed and recreated. So, for example, if I have 10 security groups and need to change a rule in the 4th, Terraform wants to destroy security groups 5 thru 10 as well (and everything related). That tends to upset the other teams to the point where they don’t want to use this framework (they’re still developing their infrastructure and need to make iterative improvements regularly).

I’d like to know, generally speaking, if the approach I’m using is conducive to being able to modify pieces within the array without destroying the rest of it in the process. For example, I’m wondering if I should query the infrastructure each time for security group resources rather than passing that data from module to module, if that would help. Or maybe this is the wrong approach altogether for modifications and I should start over with a more ‘scripted’ approach.

Hopefully this post is detailed enough to understand what I’m trying to do. I’m just looking for a thumbs up/down on this approach in general from the people that have been working with Terraform longer than I have.

Files: perf.tfvars.txt (5.1 KB) security-group-rules.tf.txt (4.4 KB) security-groups.tf.txt (1.1 KB)

Thank you,

Michael

apparentlymart · September 16, 2020, 5:29pm

Hi @mfcooney,

From looking at the configuration examples you shared, it seems like the crux of the problem you are describing is that this configuration uses count to produce multiple instances of a resource.

The original intent of count was to create multiple “copies” of what would be functionally the same object. For example, five identical EC2 instances that all run the same software, and where destroying one of them is functionally the same as destroying any other one.

Terraform historically had no other repetition construct though, and so it became typical to use count in situations like you showed here, where the goal is only to reduce repetition in the configuration itself and not to declare that these multiple objects are all fungible. As you’ve seen, that creates a problem when you want to modify the set, because Terraform considers them all to be fungible and so it doesn’t correctly understand your intent.
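To make the index-shifting behavior concrete, here is a minimal sketch. The variable names `subsystem_names` and `vpc_id` and their values are hypothetical, chosen for illustration, and are not taken from the attached files.

```hcl
# Hypothetical inputs, for illustration only.
variable "subsystem_names" {
  type    = list(string)
  default = ["web", "app", "db"]
}

variable "vpc_id" {
  type = string
}

# With count, each instance is tracked by its position in the list:
#   aws_security_group.subsystem[0] -> "web"
#   aws_security_group.subsystem[1] -> "app"
#   aws_security_group.subsystem[2] -> "db"
resource "aws_security_group" "subsystem" {
  count  = length(var.subsystem_names)
  name   = var.subsystem_names[count.index]
  vpc_id = var.vpc_id
}
```

If "app" is later removed from the list, index 1 now refers to "db". Terraform then plans to change instance [1] from "app" to "db" (a name change on a security group forces replacement) and to destroy instance [2], instead of simply removing the "app" group.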

Recent versions of Terraform (0.12.6 and later) support resource for_each, an alternative to count intended for situations like the one you’ve described here, where you want to create several instances of the same resource that are each functionally distinct from the others. It lets you add, remove, and modify individual instances in place without necessarily disturbing the others.

The underlying mechanism for this is to give each of the instances a unique string key which Terraform will use to identify it, which replaces the numeric indices used for count. Because these unique keys are under your control, you can ensure that they capture what makes each of the instances distinct, and so if the input changes in future in a way that introduces a new key then Terraform will understand that as the intent to create a new instance, without any changes to existing instances.
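As a rough sketch of that mechanism, the security group resource could be keyed by name rather than by position. The variable names and values here are hypothetical and only meant to show the shape of the change.

```hcl
# Hypothetical inputs, for illustration only.
variable "subsystem_names" {
  type    = list(string)
  default = ["web", "app", "db"]
}

variable "vpc_id" {
  type = string
}

# With for_each, each instance is tracked by a string key:
#   aws_security_group.subsystem["web"]
#   aws_security_group.subsystem["app"]
#   aws_security_group.subsystem["db"]
resource "aws_security_group" "subsystem" {
  # for_each accepts a map or a set of strings, so the list is converted.
  for_each = toset(var.subsystem_names)

  name   = each.key
  vpc_id = var.vpc_id
}
```

Removing "app" from the list now affects only `aws_security_group.subsystem["app"]`; the "web" and "db" instances keep their keys and are left alone.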

Switching to for_each from your current count-based configuration is not trivial but also not impossible. The tricky part is that if you need to do it without disturbing the existing remote objects then you’ll need to explicitly tell Terraform the new identities of those instances in your state, so that Terraform won’t see it as a request to destroy the existing instances and create new ones:

terraform state mv 'aws_security_group_rule.security_group_rule[0]' 'aws_security_group_rule.security_group_rule_cidr["10.1.0.0/24 80 80"]'

In the above, for the sake of example only, I’ve assumed that the unique keys for aws_security_group_rule.security_group_rule_cidr would be a concatenation of the CIDR block, the “from port”, and the “to port”, though I expect from reading your configuration that you might also want to include things like the protocol.
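One way such keys might be built is sketched below. The input variable `cidr_rules`, its object shape, and `security_group_id` are assumptions made for this example and will differ from the structure in perf.tfvars.

```hcl
# Hypothetical input shape, for illustration only.
variable "cidr_rules" {
  type = list(object({
    cidr      = string
    from_port = number
    to_port   = number
    protocol  = string
  }))
}

variable "security_group_id" {
  type = string
}

locals {
  # Build a map whose keys look like "10.1.0.0/24 80 80".
  cidr_rules_by_key = {
    for r in var.cidr_rules :
    "${r.cidr} ${r.from_port} ${r.to_port}" => r
  }
}

resource "aws_security_group_rule" "security_group_rule_cidr" {
  for_each = local.cidr_rules_by_key

  type              = "ingress" # assumed; the real rules may also include egress
  security_group_id = var.security_group_id
  cidr_blocks       = [each.value.cidr]
  from_port         = each.value.from_port
  to_port           = each.value.to_port
  protocol          = each.value.protocol
}
```

Terraform reports an error if two elements of a for expression produce the same map key, so with this key format a TCP rule and a UDP rule on the same CIDR block and ports would collide. That is one reason to add the protocol to the key.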

Another thing I would add here is that I’d typically recommend against just wrapping the full functionality of an existing resource type. If the input to this module is just the full, raw capabilities of aws_security_group_rule then it could be more straightforward to just have the users of the module write their own resource "aws_security_group_rule" blocks directly. You mentioned this module encoding best practices, which can be a good reason to write a shared module, but it might be worth taking a look at the Module Composition section of the docs to see if you can find opportunities to decompose into smaller modules that can be combined together in different ways, and thus use modules as the abstraction rather than trying to handle everything within a single .tfvars.

I’m not meaning to suggest that any of these approaches are universally right or wrong: as usual in systems design, there is no magic bullet. I do hope, though, that this gives some additional ideas to consider when you are making your system design tradeoffs.

mfcooney · September 16, 2020, 5:52pm

This is fantastic, @apparentlymart, thanks for taking the time to write this out. I looked at the resource for_each and it almost looks like they saw me coming with that one. I’m going to play with that and see how it works (fortunately, I don’t have to deal with existing infrastructure). I’ll also read through the Module Composition page and will consider that as well - technically everything is one module right now with 20+ modules under that. That may be too aggressive - maybe each environment should be a separate module. Something to noodle on…

Thank you very much!