Drift detection root cause analysis

I haven’t been able to find information on this without signing up, so I’m asking here in case anyone knows the answer.

In drift detection, is Terraform Cloud able to scan the existing branches/PRs and find out where the drift might have been introduced?

In some shared environments, people might be testing changes from PRs and it might not be clear who applied them. We can go to the git repo and scan the branches looking for commits that touched those files… so I was wondering if Terraform Cloud would make that task easier.
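For reference, the manual branch scan described above can already be scripted with plain git, outside of Terraform Cloud. A minimal, self-contained sketch (the throwaway repo, the branch name, and the alb.tf file are all made up for illustration):

```shell
# Build a throwaway repo with one feature branch that touches alb.tf,
# then ask git which branches contain commits touching that file.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q -b main
git -c user.email=a@b -c user.name=t commit -q --allow-empty -m "init"

git checkout -q -b feature/alb-tuning
echo 'idle_timeout = 120' > alb.tf
git add alb.tf
git -c user.email=a@b -c user.name=t commit -q -m "tune ALB"
git checkout -q main

# For each commit (on any branch) that touched alb.tf,
# list the branches whose history contains it:
for c in $(git log --all --format=%H -- alb.tf); do
  git branch --contains "$c"     # prints e.g. feature/alb-tuning
done
```

In a real setup you would run the `git log --all -- <file>` / `git branch --contains` part against the actual repo, with the paths of the drifted resources' files. It narrows the search, but it is still the manual task the question hopes Terraform Cloud could automate.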

Hi @gtirloni,

The current form of drift detection focuses on detecting when the remote system has changed without a corresponding change to the Terraform configuration. From that direction, the “problem” (if you consider it such) was caused outside of any version control process, by someone changing the remote system directly, outside of Terraform.

It sounds like you are interested in the opposite problem of the configuration changing in a way that you didn’t expect, and so you want to know where it changed and why. Is that right? If you can say a little more about the problem you are aiming to solve I’d be happy to pass this feedback on to the team which is building the drift detection features. Thanks!

The current form of drift detection is okay for our production environments because either the change was approved/merged/applied or, in the case of an emergency fix, it’ll go through that process fairly quickly, and it’s easy to track where it came from, what caused it, etc. Usually, people will be aware of production issues and what was done to fix them. There’s extensive communication when things break.

For dev environments, we don’t care, so it’s not applicable. Developers can do whatever they want, break it, etc., since they usually own the standalone environment.

The issue comes from staging/test environments. Imagine the developer is working on some change (say, changing some ALB parameters in AWS). The dev environments allow some testing, but we’ll have more confidence once the change is applied to some staging/test environment that mimics production more closely.

The workflow I was referring to is when a developer, from their feature branch, applies a Terraform configuration change to staging/test infrastructure that is not yet merged into the master branch.

They may quickly revert the change and be satisfied with the results. Or maybe they have to work with some other team to validate it, which could take days sometimes.

Now another developer works on an unrelated change, wants to test it in staging/test, and finds that configuration drift. They get blocked trying to figure out where the change came from: did they make a mistake, is the master branch broken and the infrastructure correct, or vice versa? They have to ask around among the several teams that share ownership, etc. Worst-case scenario, they have to run `terraform apply -target` on hundreds of resources.

That configuration drift might be explained if we find a branch with commits touching resource X, the one showing the drift. Something in the UI could say “look, I found configuration drift, and branches A/B/C have modifications to those resources/workspaces”. That would help narrow down what’s happening.

That said, yes, it looks more like a process/management issue. Maybe we’re not using staging/test environments in the best possible way, our infrastructure could be improved, etc. The more I write about this scenario, the more I think that’s the case, but still, if a tool can ease the pain, then why not? 🙂

Thanks for sharing those details, @gtirloni!

It does indeed sound like what you are hoping for is outside of the intended scope of the current drift detection feature, but I will pass this feedback on to the relevant team so that they can think about whether and how to incorporate this additional need into a future update.

Thanks again!
