Working in teams with GitOps - Rollback strategies

Hi there,
I use Terraform for my whole infrastructure.
The current setup is that the master branch represents the applied state of the production infrastructure.

So the idea is to evaluate everything in a dev environment and create a pull request when everything is ready.
In the pull request I can only run a terraform plan and some other checks, but I cannot directly verify the apply.

So here comes the actual question:
How to deal with broken apply attempts?

It has happened to me quite a few times that an apply broke the infrastructure because of missing permissions, wrong dependencies, deadlocks, timeouts… all things that cannot be caught during a plan.

Ideally one would simply revert the PR that broke things, but I have also experienced situations where this was not possible and the only way to recover was manual intervention from a local laptop/PC.

What are strategies to resolve those situations in bigger teams? In my opinion the PR-to-master approach with a fast-forward merge strategy is, in theory, a great way to organize and queue applies.

Would Terraform Cloud, Terragrunt, or any of these tools solve this specific issue, and if so, how? What other alternatives exist? I have found very few resources on how to handle infrastructure in a team where multiple people can apply changes. For me the state file is only part of the equation, but maybe I am missing something here.

Thanks for your help!

Hi @schlumpfit,

Unfortunately as you have seen there are inevitably various problems that a Terraform provider cannot check for ahead of time which therefore cause failures during the apply phase. Provider developers typically try to minimize these but a provider is limited in what information it has available during planning and so the planning process is necessarily incomplete.

A common approach to mitigate this is to have a separate “staging” environment which matches the production environment as closely as possible (though often making some concessions to cost, operability, etc) and then apply changes in that environment first, as a sort of rehearsal for what might happen in production.

This is far from a perfect solution, of course. In particular, it might still leave you with an incomplete set of changes to clean up in the staging environment. But the idea is to separate those problems from problems in the production environment so that they can be handled with a different level of urgency.
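
One common way to structure that (the directory layout and names below are just placeholders, not something Terraform requires) is to keep the shared configuration in a module and give each environment its own small root configuration that calls it, so that an apply in staging rehearses the same module change that will later be applied to production:

```hcl
# Hypothetical layout: one root configuration per environment, both calling
# the same shared module, so a change is rehearsed in staging before the
# same module version is applied to production.

# environments/staging/main.tf
module "app" {
  source = "../../modules/app"

  environment    = "staging"
  instance_count = 1   # concession to cost: smaller footprint than production
}

# environments/production/main.tf
module "app" {
  source = "../../modules/app"

  environment    = "production"
  instance_count = 3
}
```

Assuming each root configuration also has its own backend and therefore its own state, a failed apply in staging leaves only the staging objects in a partially-changed condition.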

Aside from adding an additional environment to rehearse in, it can also help to learn about how different kinds of changes affect Terraform and the remote system, so that you can either structure your changes to make them easier to revert, or know which situations might require you to roll back to a not-quite-identical previous configuration. For example:

  • Terraform needs a valid provider configuration for any provider that will have actions taken with it, including planning to destroy something that has been removed from the configuration. If you add a new provider configuration and resources for that provider in the same commit and the apply fails, then you will probably need to revert only the resource blocks but retain the provider configuration block (see the first sketch after this list).
  • If you use refactoring features like moved blocks, you may need to move objects back to their original addresses if you want to revert the change that added those blocks, because the objects may have already moved to their new addresses before the failure occurred (also shown in the first sketch).
  • If you add a resource with a destroy-time provisioner, then you would need to retain that provisioner in the configuration when reverting, because otherwise it will be absent when the associated object is destroyed (see the second sketch after this list).
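
To make the first two points above concrete, here is a minimal sketch of such a change; the provider alias, resource names, and AMI are all just placeholders:

```hcl
# Added in the PR: a new provider configuration plus a resource that uses it.
# If the apply fails partway, revert only the resource block at first; the
# provider block has to stay until the bucket is actually destroyed.
provider "aws" {
  alias  = "replica"
  region = "eu-west-1"
}

resource "aws_s3_bucket" "replica" {
  provider = aws.replica
  bucket   = "example-replica-bucket"   # placeholder name
}

# Also added in the PR: a rename, recorded with a moved block.
resource "aws_instance" "app_server" {
  ami           = "ami-0123456789abcdef0"   # placeholder
  instance_type = "t3.micro"
}

moved {
  from = aws_instance.app
  to   = aws_instance.app_server
}

# If the state move already happened before the failure, reverting the PR is
# not just a matter of deleting the block above; the revert needs the
# opposite move instead:
#
#   moved {
#     from = aws_instance.app_server
#     to   = aws_instance.app
#   }
```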

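And a similarly hypothetical sketch for the destroy-time provisioner case:

```hcl
# Hypothetical resource added in the PR. The provisioner only runs if it is
# still present in the configuration at the moment Terraform destroys the
# object, so a revert has to keep this block until the destroy has happened.
resource "null_resource" "registration" {
  triggers = {
    name = "example-service"   # placeholder
  }

  provisioner "local-exec" {
    when    = destroy
    command = "echo 'deregistering ${self.triggers.name}'"
  }
}
```
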
Since Terraform configuration is used in conjunction with remote stateful services, unfortunately typical application code approaches like just running git revert are often not sufficient. This is similar to how it can be hard to revert an application code change that involved changing the schema of a relational database: the database is something separate from the code and so the operator must decide how to deal with that and may need to “roll forward” to a new working version of the code, rather than reverting back to exactly the old code.


Hi @apparentlymart,

Thank you very much for your detailed answer. I was afraid that something like this would be the “general” answer. I have never worked in a big team that uses Terraform collaboratively.

Most often I see people working from local setups with the CLI and only merging once the infrastructure has already been applied.

I also get the idea of propagating a change through multiple stages/environments, but as you already mentioned this can be costly and/or time-consuming depending on the infrastructure.

I am wondering how bigger teams deal with these situations. Maybe some people can share their experiences or the setups with which they solved collaboration in Terraform.

From my experience, the fastest way in the beginning is obviously to only perform actions manually, collaborate with your colleagues (non-automated), and stick to conventions.

But this does not really scale.

On the other hand, a fully automated approach where tooling like Spacelift, Atlantis, Terrateam, HCP Terraform, etc. performs the plan and apply and goes through all stages from dev to prod seems kind of slow, and it still cannot guarantee that an apply will work in the last phase against prod (even though the chances are hopefully quite high).

So this still leaves me with the question: how do people deal with rollbacks or hotfixes (independent of the general workflow)? My guess is that the infra team jumps in and fixes things quickly using a local CLI.

One more thing that came to my mind is how to deal with multiple PRs. From what I understand, Atlantis, for example, just locks the folder/project against other Atlantis runs on the same resources. Are there any other options? Spinning up an entire dev environment does not really seem feasible to me (opinionated).

Last but not least: what about dependencies between projects?

Best
