How should an AWS Cloud Engineer start learning Terraform effectively?

Hello everyone,
I’m currently working as an AWS Cloud Engineer and most of my infrastructure is already built and managed directly through the AWS Management Console. Recently, I realized the importance of Infrastructure as Code (IaC) and I want to start adopting Terraform to manage our AWS environment more systematically.

At the moment, our company is operating in three AWS regions, and the environment is already somewhat complex. This makes me wonder how to properly adopt Terraform without breaking existing setups.

Here are my main questions:

  • Should I begin with writing small Terraform modules from scratch, or should I first import existing AWS resources into Terraform state?

  • For a medium-sized environment (multiple VPCs, 20+ EC2 instances, RDS, S3, and WAF across 3 regions), what is a recommended directory structure or best practice to organize Terraform code?

  • Are there any learning paths, hands-on labs, or practical resources that you would recommend for someone coming from a “console-driven” background?

  • Finally, are there common pitfalls I should avoid when introducing Terraform into an existing AWS environment?

I’d really appreciate guidance, advice, or even examples from your own experience. Thanks in advance!

I would say that learning when / how to DRY stuff up and / or use modules is something you may develop instincts for over time. So, it may be smart to import stuff “resource-like” (i.e., just defining all the resources as they are) and refactor into modules later. If you have a lot of similar things, you can also do iteration directly in your terraform code (with for_each), without fully getting into creating modules.
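As a minimal sketch of iterating with for_each without writing a module (the bucket names and the `log_buckets` local are made up for illustration):

```hcl
# Hypothetical bucket names; replace them with buckets you actually have.
locals {
  log_buckets = toset([
    "example-app-logs",
    "example-audit-logs",
  ])
}

# One resource block, one bucket per element of the set.
resource "aws_s3_bucket" "logs" {
  for_each = local.log_buckets

  bucket = each.key
}
```

Each instance gets an address keyed by its name, e.g. `aws_s3_bucket.logs["example-app-logs"]`, which is also the address you would import an existing bucket into.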

Importing existing resources (that weren’t created with Terraform originally) has its own set of challenges, so this is another argument in favor of having your first pass be just getting the stuff into Terraform at all.
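For a first pass, an `import` block (Terraform 1.5 or later) is one way to bring a console-built resource under management. This is only a sketch: the VPC ID, CIDR, and tag values are placeholders, and the arguments have to match what actually exists in your account.

```hcl
# Placeholder ID; use the ID of the VPC that already exists.
import {
  to = aws_vpc.main
  id = "vpc-0123456789abcdef0"
}

# The arguments below should describe the VPC exactly as it is today,
# otherwise the plan will propose changes to it.
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true

  tags = {
    Name = "main"
  }
}
```

Run `terraform plan` and aim for an import with no other changes before applying. If you leave the resource block out, `terraform plan -generate-config-out=generated.tf` can draft it for you, though the generated code usually needs cleanup.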

But, as you start importing stuff, you may see places where there’s a lot of repetition, and that may be a good point to start looking at using modules in one of a few ways:

  • Module that can be instantiated once per actual resource
  • Module that packages up a bunch of related (different) resources
  • Module that uses a data structure (object, map, or set/list) passed in to create a bunch of similar resources.

(you can also look at third party modules, which may be useful for some things, though may also have builtin complexity and configurability that your use case won’t demand)
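To illustrate the third style, here is a rough sketch of a module that takes a map and creates one bucket per entry. The module path, variable name, and bucket names are all hypothetical, and the three sections would live in separate files as the comments indicate.

```hcl
# modules/buckets/variables.tf (hypothetical module path)
variable "buckets" {
  description = "Map of bucket name to per-bucket settings."
  type = map(object({
    versioning = bool
  }))
}

# modules/buckets/main.tf
resource "aws_s3_bucket" "this" {
  for_each = var.buckets

  bucket = each.key
}

resource "aws_s3_bucket_versioning" "this" {
  for_each = var.buckets

  bucket = aws_s3_bucket.this[each.key].id

  versioning_configuration {
    status = each.value.versioning ? "Enabled" : "Suspended"
  }
}

# Root configuration: one module call, many buckets.
module "buckets" {
  source = "./modules/buckets"

  buckets = {
    "example-app-logs"   = { versioning = false }
    "example-app-assets" = { versioning = true }
  }
}

# If a bucket was first imported as a plain resource, a moved block lets you
# refactor it into the module without destroying and recreating it.
moved {
  from = aws_s3_bucket.logs
  to   = module.buckets.aws_s3_bucket.this["example-app-logs"]
}
```

The `moved` block is what makes "import flat first, refactor into modules later" a fairly low-risk path: the plan should show the resource changing address rather than being replaced.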

I put some thoughts on modules broadly in this message:

This is probably one of the hardest things about using Terraform properly, and there isn’t a one-size-fits-all approach. Generally, I would say start somewhat flatter, and work on breaking things up more as it becomes risky or slow to apply changes. Even within that flat state, you can still use file names to help keep resources grouped together (for example, by region, application, or resource type, depending on what makes the most sense for you).

As state becomes more partitioned, you have a smaller blast radius and can plan / apply things faster, but you introduce a few new problems:

  • “Drift” is more likely to creep in unnoticed if individual states are planned less often.
  • You now have to use outputs / remote state references (or data sources) to pass things defined in one state to another, or use other workarounds (like hard-coding) if you’re referencing things within one state from another.
  • You are also more likely to run into circular issues in terms of the order things need to be applied in, or needing to apply one state before being able to reference it from another.
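A sketch of what passing values between states can look like, assuming an S3 backend and made-up names throughout (the bucket, account ID, key, and the `aws_subnet.private` resource are all assumptions):

```hcl
# In the network state: expose the values other states need.
# Assumes aws_subnet.private is defined with for_each elsewhere in this state.
output "private_subnet_ids" {
  value = [for subnet in aws_subnet.private : subnet.id]
}

# In the database state: read the network state's outputs.
data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    bucket = "example-terraform-state" # hypothetical state bucket
    key    = "aws/123456789012/01_network/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_db_subnet_group" "main" {
  name       = "main"
  subnet_ids = data.terraform_remote_state.network.outputs.private_subnet_ids
}

# Alternative: look the resource up in AWS directly with a data source,
# which avoids coupling to the other state's layout.
data "aws_vpc" "main" {
  tags = {
    Name = "main"
  }
}
```

The remote state approach ties the consumer to the producer's backend location and output names, while the data source approach depends on consistent tagging; either beats hard-coding IDs.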

The new “stacks” feature is probably worth a look. And, especially if you’ve got dev / prod / etc. environments in separate accounts or VPCs, and are stamping out similar resources within each one, looking at wrapper tools like Terragrunt is probably also worth it, even with the slight risks of adding another layer of abstraction.

If you have a really clear boundary in terms of “all these things belong to x team or y application”, you could look at breaking up state vertically that way.

In my experience, the more common case is that the boundaries are a little fuzzier, in which case, you might want to do nested states like account_id/region/vpc, or grouping related resources together in an “onion” model, with more foundational layers applied first. For example, 01 would be applied before 03, and you’d try to avoid having something in 01 reference something created in a higher numbered layer, but might frequently reference something (e.g., a VPC ID) created in a lower-numbered layer from a higher-numbered one.

aws/account_id/01_network – this is the most foundational
aws/account_id/03_storage - might contain s3 buckets
aws/account_id/05_database - might contain RDS instances
and so on. The idea here is that you’re (where and how to define IAM permissions then becomes another complicated situation).

If there are a lot of things that are shared by the various accounts (e.g., a DNS zone with lots of records) or with things that have a lot of relationships (for example, defining a bunch of VPCs / networks, and creating peering connections between them, or defining permissions that cut across accounts), I sometimes do a meta/ or shared/ directory and state structure.

I would suggest not versioning modules to start with (for this size of environment), keeping all your configs / code in a single repo (but across multiple states), and matching the state prefixes to your repo’s filesystem layout.
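For example, if the directory is `aws/123456789012/01_network` in the repo, the backend key could mirror it. This assumes an S3 backend; the bucket name and account ID are placeholders.

```hcl
# File: aws/123456789012/01_network/backend.tf
terraform {
  backend "s3" {
    bucket = "example-terraform-state" # hypothetical state bucket
    key    = "aws/123456789012/01_network/terraform.tfstate"
    region = "us-east-1"
  }
}
```

With that convention, anyone looking at a state object in the bucket knows which directory owns it, and vice versa.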

One thing is that, while you could have a period of transition where some things are managed by Terraform and others are not, you really want to avoid mixing clickops with IaC. So, maybe you focus on defining and importing some of the foundational items (like VPCs) first; this will build your experience and comfort with the tools, maybe give you some more ideas about how you want to structure things, and reduce the effort involved with later refactors.

Similarly, some things may make sense to not manage with Terraform (for example, a cloud function that gets deployed by a CD system)… this can make sense sometimes, but try to avoid managing the same thing in two places, i.e., if you’re managing the resource via another tool, in most cases, avoid managing it with Terraform.

Esp. if you have more than 2-3 people: don’t be afraid to occasionally do state surgery or local applies, but I do strongly recommend finding something (whether it’s a tool like Atlantis, a TACO provider like Spacelift, or a simple homebrew CD pipeline) to make sure that most of your changes are applied in a standard way, from some sort of pipeline rather than from someone’s laptop.

Set up good validation checks on your code and formatting (including tools like tflint) in CI, as well as pre-commit hooks, to catch mistakes / bugs earlier rather than later, and to help with overall readability and code quality.
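As a starting point, a tflint config might look something like the sketch below. The pinned AWS ruleset version is only an example; check the ruleset's releases and pin whatever is current for you.

```hcl
# File: .tflint.hcl
# Typical CI / pre-commit steps alongside this file:
#   terraform fmt -check -recursive
#   terraform validate
#   tflint --init && tflint

# Rules for the Terraform language itself (bundled with tflint).
plugin "terraform" {
  enabled = true
  preset  = "recommended"
}

# AWS-specific rules, e.g. invalid instance types.
plugin "aws" {
  enabled = true
  version = "0.30.0" # example pin, not a recommendation
  source  = "github.com/terraform-linters/tflint-ruleset-aws"
}
```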


Thank you so much for the detailed explanation!
Your advice about starting simple without over-engineering modules and focusing first on importing existing resources into Terraform state makes a lot of sense. I also appreciate the warning about not mixing console changes with Terraform, that’s something I’ll keep in mind.

I’ll follow your suggestion to begin with a small part of the infrastructure, get it stable with Terraform, and then gradually move toward modules and pipeline integration.

Really appreciate you taking the time to share your experience — this helps me clarify my learning path a lot!
By the way, if you have any recommendations for high-quality Terraform courses or video tutorials that you found useful, I’d be very grateful if you could share them.

The most important thing is to shift EVERYONE on your team towards a declarative mindset. A single person making changes through the console can end up fighting Terraform, which will revert those changes on the next apply.

My approach for that is:

  1. Clearly divide areas of responsibility (like certain accounts’ networking layers) between Terraform managed and human managed.
    1. If necessary, for Terraform managed areas create three sets of roles: Terraform’s own role, the engineers’ regular role (mostly view, can’t change Terraform’s stuff), and a super-override role which allows bypassing Terraform (and should rarely be used).
    2. Make sure it’s clear to everyone this isn’t punitive but is to keep all of us from interfering with terraform out of habit.
  2. Pick a new area where there isn’t anything. I particularly like the networking and/or KMS of a new AWS account since it’s almost impossible to change later.
  3. Have people start shifting to the new paradigm there. No manual changes ever.
  4. Give each terraform author or ops person their own AWS account as a sandbox to play with terraform. Eventually your most prolific authors may need more accounts to model multi-account things like Peer Exchanges.
  5. As your team becomes confident, grow out from that base. Maybe you give databases, ALB, or ASG (most complex) to terraform, maybe you adjust your networking & KMS modules to run under other AWS accounts.
  6. Midway, maybe you have ALB & ASG modules you’re confident in. You can use them to build new, terraform managed ALB or ASG and their bits in parallel while live. Test them, then slide over to the new ones w/o disruption (a bit tricky to have 0 downtime but doable).
  7. After more of this, you have much of your footprint under terraform. But there are things like pre-existing VPC & KMS you can’t recreate.
    1. You can go through the process of importing them but the results will rarely match the things terraform created and your standards. I strongly dislike such exceptions and avoid this.
    2. Or, once you have everything in that type of account terraformed and proven for other, newer accounts, consider building a clean, new account with just terraform and performing a migration from the hand-built account to the terraformed one. It may help to piggyback this on another migration or database upgrade.
    3. You also may be able to justify the rebuild/migration on a security basis:
      1. The new one may have some better features
      2. Terraform will apply the rules consistently, whereas there may be exceptions missed when done by hand.
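
One small thing that can support the Terraform managed vs. human managed split in step 1 is tagging everything Terraform creates, so ownership is visible in the console. This is an assumption on my part rather than something the steps above require; the region and tag key / value are arbitrary.

```hcl
provider "aws" {
  region = "us-east-1" # placeholder region

  # Applied to every taggable resource this provider creates.
  default_tags {
    tags = {
      ManagedBy = "terraform" # hypothetical tag key and value
    }
  }
}
```

Anyone about to edit a resource by hand can then see at a glance that it belongs to Terraform.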