The current status and challenges of Terraform refactoring for upgrading providers

Hi all.

I’m Masayuki Morita (a.k.a. @minamijoyo), a community contributor to the Terraform ecosystem and the author of several third-party tools. I’m currently working on a refactoring tool for Terraform and an upgrade tool for AWS provider v4.

While the project is still a work in progress, I’m aware of many new challenges in the refactoring required by provider upgrades. Let me share what I’ve learned and what I’m thinking about now.

Background

In February 2022, the AWS provider team released a new major version, v4.0.0, which includes massive breaking changes to the aws_s3_bucket resource. Since it’s such a fundamental component, this change affects most users.

As a Terraform AWS provider user myself, the v4 upgrade is painful for me too. I understand why the breaking changes were needed and that they are necessary for long-term sustainability. That’s fine, but I have 60k+ lines of Terraform configuration including lots of aws_s3_bucket resources, and it’s hard to refactor them by hand.

Fortunately, I’m the author of several Terraform-related third-party tools, including hcledit, tfupdate, tfmigrate, etc. It was natural for me to start a new project, tfedit, which aims to make refactoring Terraform configurations easy in a scalable way.

Although the initial goal of this project is to provide a way to bulk-refactor the aws_s3_bucket resource as required by the breaking changes in AWS provider v4, the project scope is not limited to specific use cases. That said, it’s by no means intended to be an upgrade tool for all your providers. Instead of covering everything you need, it provides reusable building blocks for Terraform refactoring and shows examples of how to compose them in real-world use cases.

While the project is still a work in progress, I’m aware of many new challenges. Let me explain them separately for rewriting configurations and importing states.

Rewrite configurations

When I first read the v4 upgrade guide, my initial understanding was very simple: split an argument (e.g. acl) of the aws_s3_bucket resource out into a new separate resource type (e.g. aws_s3_bucket_acl) and import it. I thought it would be relatively easy for me as the author of hcledit and tfmigrate. So I wrote a small PoC and confirmed that it looked doable by adding more rules. However, as I implemented more and more rules, I realized that the problem is not as simple as I expected.
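
To give a concrete sense of the shape of this rewrite, here is a minimal before/after sketch for the acl case (the resource and bucket names are just placeholders):

```hcl
# Before (AWS provider v3): acl is an argument of aws_s3_bucket.
resource "aws_s3_bucket" "example" {
  bucket = "my-example-bucket"
  acl    = "private"
}

# After (AWS provider v4): the acl argument is split out into its own
# resource type, which then needs to be imported into the state.
resource "aws_s3_bucket" "example" {
  bucket = "my-example-bucket"
}

resource "aws_s3_bucket_acl" "example" {
  bucket = aws_s3_bucket.example.id
  acl    = "private"
}
```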

Since aws_s3_bucket is a very old resource type, it violated the current AWS provider standardization guidelines, such as the naming conventions for arguments and the structure of nested blocks. These issues were also fixed on this occasion. This means that simply splitting resources was not enough for the rewrite rules.

To make matters worse, some arguments changed not only their names but also their valid values (e.g. true => “Enabled”). In this case, if the value of the argument is a variable rather than a literal, it’s impossible to automatically rewrite the value of the variable: it could be passed in from outside the module or even overridden at runtime. In addition, some arguments cannot be converted correctly without knowing the current state of the AWS resources. However, we should avoid expecting an upgrade tool to make API calls as much as possible, because we can’t implicitly assume that the module author and the module user are the same person.
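
The versioning argument is a concrete example of this; the following is a minimal sketch with placeholder names. The new status argument takes strings (“Enabled” / “Suspended” / “Disabled”), and which string is correct for a bucket with enabled = false depends on whether versioning was ever enabled on it, which is exactly the kind of thing a rewrite tool cannot know from the configuration alone:

```hcl
# Before (v3): a nested block with a bool argument.
resource "aws_s3_bucket" "example" {
  bucket = "my-example-bucket"

  versioning {
    enabled = var.versioning_enabled
  }
}

# After (v4): a separate resource, a renamed block, and string values.
resource "aws_s3_bucket_versioning" "example" {
  bucket = aws_s3_bucket.example.id

  versioning_configuration {
    # A literal true can be rewritten to "Enabled" mechanically, but a
    # variable cannot: its value may be set by the module's caller or
    # overridden at runtime, so a tool can only guess at a mapping here.
    status = var.versioning_enabled ? "Enabled" : "Suspended"
  }
}
```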

If you’re curious how hard it is, see the actual implementation of the rewrite rules (the most complicated one for now is aws_s3_bucket_lifecycle_rule) and the known limitations I’m already aware of, though I’m probably still missing something because it has not been well tested yet.

For rewriting configurations, tfedit depends heavily on the hclwrite parser in the hcl library, which is required in order to keep comments in existing configurations.

However, the current implementation of hclwrite has very limited capabilities and many features are missing. Here are the features I found missing during implementation:

  • Add a block or an attribute in the middle of a body
  • Format a body vertically
  • Get the value of an attribute as a string
  • Rename a reference in an expression
  • Find and replace all references in a body for renaming
  • Edit elements in lists and objects
  • Get a comment attached to a block or an attribute
  • Insert a comment before a block or an attribute
  • A type for dynamic blocks

The above list is probably not exhaustive, but the current functionality is very primitive. If we had more of these features, building an upgrade tool would be easier.

Import states

As you know, rewriting the Terraform configuration is only half of the problem. We also need to import all the new resources. The moved block introduced in Terraform 1.1 doesn’t fit this case well, because we need to split a monolithic resource into multiple resources while the old one still remains as the parent.
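
For context, the moved block only expresses a one-to-one move of an existing resource address to a new one, something like:

```hcl
# What a moved block can express: renaming an existing resource address.
moved {
  from = aws_s3_bucket.old_name
  to   = aws_s3_bucket.new_name
}
```

There is no equivalent way to declare that part of aws_s3_bucket.example now lives in a brand-new aws_s3_bucket_acl.example and should be imported rather than created.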

The first idea I came up with was to generate a migration file with import commands to be applied while rewriting Terraform configurations. This probably works in simple cases. However, when the aws_s3_bucket resource is defined inside a module, the module maintainer doesn’t know the full resource address of the module instance, so we cannot generate a valid import command. Furthermore, the name of an S3 bucket, which is the unique identifier required for import, can be passed from outside the module as an input variable. In this case, all the module maintainer can say is “import an unknown bucket to an unknown address”, which doesn’t make sense at all.
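
As a rough sketch of the situation (all names here are made up), consider a module like this:

```hcl
# modules/storage/main.tf (what the module maintainer controls)
variable "bucket_name" {
  type = string
}

resource "aws_s3_bucket" "this" {
  bucket = var.bucket_name
}

resource "aws_s3_bucket_acl" "this" {
  bucket = aws_s3_bucket.this.id
  acl    = "private"
}

# The import command has to target the caller's module instance, e.g.
#
#   terraform import module.assets.aws_s3_bucket_acl.this <import-id>
#
# but both the module address (module.assets) and the bucket name are
# decided by the caller, not by the module maintainer.
```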

The next idea I’m thinking of (but have not implemented yet) is to parse a Terraform plan file and generate a reverse migration file which includes the import commands needed to converge to no changes. It will never be perfect for all resource types, because some of them require a magic argument instead of a simple identifier: its value is a string of multiple parameters concatenated with some delimiter, and the valid format depends on the resource type (e.g. aws_s3_bucket_acl requires the second argument in the form bucket-name,private). Even worse, this rule can’t be obtained from the provider’s schema metadata, that is, terraform providers schema -json. Having said that, I think it’s probably possible to generate import commands for the specific resource types required by the AWS v4 upgrade by hard-coding some special rules, except for some edge cases.

If we had a plannable import and could mark a resource as importable, driven by configuration like the moved block, we might not need to generate import commands at all. Or, in this case, all we would need is to import all the newly separated resources and expect no changes. That is, if we had a new flag for plan / apply such as an -import-only mode, similar to -refresh-only, which allows us to import a new resource instead of creating a new one, we could even eliminate the step of marking a new resource as importable.
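
As a purely hypothetical illustration (this syntax does not exist; it just mirrors the declarative shape of the moved block), marking a resource as importable from configuration might look something like this:

```hcl
# Hypothetical syntax, not a real Terraform feature: declare that this
# new resource should be imported during plan / apply instead of created.
import {
  to = aws_s3_bucket_acl.example
  id = "my-example-bucket,private"
}
```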

Ideally, it would be great if the provider itself could directly handle state upgrade rules on behalf of the user. I’m not sure how hard that is, but I guess it’s not so simple to implement, because one resource could be split into multiple resources and some of them change the structure of their arguments as described above. In addition, the above problem is just an example from the AWS v4 upgrade; different versions and providers will have different types of problems. I think the plannable import would be a more convenient solution for various purposes.

Wrap up

Terraform refactoring is required not only due to module changes but also provider changes, and I feel the latter is a more difficult problem. Even though breaking changes are inevitable to sustain and evolve a provider over the long term, what Terraform can do today in this area is very primitive. It’s a new frontier for Terraform refactoring.

I hope this memo helps someone who is interested in Terraform refactoring and helps the Terraform community move forward.

Thanks!
