The current form of drift detection is okay for our production environments because either the change has already been approved/merged/applied or, in the case of an emergency fix, it will go through that process fairly quickly, and it's easy to track where it came from, what caused it, etc. Usually, people will be aware of production issues and what was done to fix them. There's extensive communication when things break.
For dev environments, we don’t care, so it’s not applicable. Developers can do whatever they want, break it, etc., since they usually own the standalone environment.
The issue comes from staging/test environments. Imagine a developer is working on some change (say, changing some ALB parameters in AWS). The dev environments allow some testing, but we'll have more confidence once the change is applied to a staging/test environment that mimics production more closely.
The workflow I was referring to is one where a developer, from their feature branch, applies a Terraform configuration change to staging/test infrastructure before it has been merged into the master branch.
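Roughly, that sequence looks like the sketch below. The branch and workspace names are made up, and I'm assuming a workspace-per-environment layout, which may not match every setup:

```
git switch feature/alb-tuning          # the unmerged feature branch
terraform workspace select staging     # point at the shared staging/test state
terraform plan -out=staging.tfplan     # review the ALB parameter change
terraform apply staging.tfplan         # staging now differs from master
```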
They may be satisfied with the results and quickly revert the change. Or maybe they have to work with some other team to validate it, which can sometimes take days.
Now another developer works on an unrelated change, wants to test it in staging/test, and finds that configuration drift. They get blocked trying to figure out where the change came from, whether they made a mistake, whether the master branch is broken and the infrastructure is correct or vice versa; they have to ask around among the teams that share ownership, etc. Worst-case scenario, they have to run terraform apply -target on hundreds of resources.
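To illustrate that worst case: re-aligning staging from master means something like the following, with one -target flag per affected resource (the resource addresses here are hypothetical):

```
# Back on master, forcing staging back to the reviewed configuration,
# one resource at a time:
git switch master
terraform workspace select staging
terraform apply \
  -target=aws_lb.public \
  -target=aws_lb_listener.https
  # ...potentially hundreds more -target flags
```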
That configuration drift might be explained if we could find a branch that touches resource X, the one showing the drift. Something in the UI could say “look, I found a configuration drift and branches A/B/C have modifications to those resources/workspaces”. That would help a bit in narrowing down what's happening.
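As a rough approximation of that idea with plain git: list the unmerged branches whose changed .tf files declare the drifted resource. Everything below (resource declaration, base branch name) is a placeholder, and it's only a heuristic, not necessarily how a tool would do it:

```
RESOURCE='resource "aws_lb" "public"'   # the drifted resource's declaration (placeholder)
BASE="origin/master"

git fetch --all --quiet
for branch in $(git branch -r --no-merged "$BASE" | grep -v HEAD); do
  # .tf files this branch changes relative to master
  for f in $(git diff --name-only "$BASE...$branch" -- '*.tf'); do
    # flag the branch if a changed file declares the drifted resource
    if git show "$branch:$f" 2>/dev/null | grep -qF "$RESOURCE"; then
      echo "$branch changes $f (declares $RESOURCE)"
    fi
  done
done
```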
That said, yes, it looks more like a process/management issue. Maybe we’re not using staging/test environments in the best possible way, our infrastructure could be improved, etc… The more I write about this scenario, the more I think that’s the case, but still, if a tool can ease the pain, then why not? 