The current status and challenges of Terraform refactoring for provider upgrades

Hi all.

I’m Masayuki Morita (a.k.a. @minamijoyo), a community contributor to the Terraform ecosystem and the author of several third-party tools. I’m currently working on a refactoring tool for Terraform and an upgrade tool for the AWS provider v4.

While the project is still a work in progress, I’m aware of many new challenges in the refactoring required by provider upgrades. Let me share what I’ve learned and what I’m thinking about now.

Background

In February 2022, the AWS provider team released a new major version, v4.0.0, which includes massive breaking changes to the aws_s3_bucket resource. Since it’s a fundamental component, this change affects most users.

As a Terraform AWS provider user myself, the v4 upgrade is quite painful for me too. I understand why the breaking changes were needed; they’re necessary for long-term sustainability. That’s fine, but I have 60k+ lines of Terraform configuration including lots of aws_s3_bucket resources, and it’s hard to refactor them by hand.

Fortunately, I’m the author of several Terraform-related third-party tools, including hcledit, tfupdate, and tfmigrate. It was natural for me to start writing a new project, tfedit, which aims to make refactoring Terraform configurations easy in a scalable way.

Although the initial goal of this project is to provide a way to bulk-refactor the aws_s3_bucket resource as required by the breaking changes in AWS provider v4, the project scope is not limited to specific use cases. That said, it’s by no means intended to be an upgrade tool for all your providers. Instead of covering everything you need, it provides reusable building blocks for Terraform refactoring and shows examples of how to compose them in real-world use cases.

As mentioned above, the project is still WIP and I’m aware of many new challenges. Let me explain rewriting configurations and importing states separately.

Rewrite configurations

When I first read the v4 upgrade guide, my initial understanding was very simple: split an argument (e.g. acl) of the aws_s3_bucket resource into a new separate resource type (e.g. aws_s3_bucket_acl) and import it. I thought it would be relatively easy for me as the author of hcledit and tfmigrate. So I wrote a small PoC and confirmed that it looked doable by adding more rules. However, as I implemented more and more rules, I realized that the problem is not as simple as I expected.
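
To illustrate the naive version of that first rule, here’s a minimal sketch using hclwrite (my own simplified illustration, not tfedit’s actual implementation): find the acl argument, remove it, and emit a new aws_s3_bucket_acl resource referencing the bucket.

```go
package main

import (
	"fmt"

	"github.com/hashicorp/hcl/v2"
	"github.com/hashicorp/hcl/v2/hclwrite"
)

const src = `
resource "aws_s3_bucket" "example" {
  bucket = "my-bucket" # comments survive the rewrite
  acl    = "private"
}
`

func main() {
	f, diags := hclwrite.ParseConfig([]byte(src), "main.tf", hcl.InitialPos)
	if diags.HasErrors() {
		panic(diags)
	}

	for _, b := range f.Body().Blocks() {
		labels := b.Labels()
		if b.Type() != "resource" || len(labels) != 2 || labels[0] != "aws_s3_bucket" {
			continue
		}
		attr := b.Body().GetAttribute("acl")
		if attr == nil {
			continue
		}

		// Capture the acl expression tokens, then drop the argument from
		// the old resource.
		tokens := attr.Expr().BuildTokens(nil)
		b.Body().RemoveAttribute("acl")

		// Emit a new aws_s3_bucket_acl resource referencing the bucket.
		nb := f.Body().AppendNewBlock("resource", []string{"aws_s3_bucket_acl", labels[1]})
		nb.Body().SetAttributeTraversal("bucket", hcl.Traversal{
			hcl.TraverseRoot{Name: "aws_s3_bucket"},
			hcl.TraverseAttr{Name: labels[1]},
			hcl.TraverseAttr{Name: "id"},
		})
		nb.Body().SetAttributeRaw("acl", tokens)
	}

	fmt.Printf("%s", f.Bytes())
}
```

Since hclwrite preserves tokens it doesn’t touch, existing comments survive the rewrite; the real rules in tfedit are of course far more involved than this.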

Since aws_s3_bucket is a very old resource type, it violated the current AWS provider standardization guidelines, such as the naming conventions for arguments and the structure of nested blocks. These issues were also fixed on this occasion, which means the rewrite rules could not simply split resources; they also had to rename arguments and restructure blocks.

To make matters worse, some arguments changed not only their names but also their valid values (e.g. true => “Enabled”). In this case, if the value of the argument is a variable rather than a literal, it’s impossible to automatically rewrite the value of the variable; it could be passed from outside the module or even overwritten at runtime. In addition, some arguments cannot be converted correctly without knowing the current state of the AWS resources. However, we should avoid expecting an upgrade tool to make API calls wherever possible, because we can’t implicitly assume that the module author and the module user are the same person.
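
To make this concrete, here’s a hypothetical sketch (my own illustration, not tfedit’s actual code) of what a value-mapping rule for the versioning argument runs into. The Enabled/Suspended mapping is my reading of the v4 resource; the false case shows why knowledge of the AWS state would be needed:

```go
import "fmt"

// convertVersioningStatus is a hypothetical helper (for illustration only)
// mapping the old boolean versioning argument onto the status value used by
// the new aws_s3_bucket_versioning resource.
func convertVersioningStatus(expr string) (string, error) {
	switch expr {
	case "true":
		return `"Enabled"`, nil
	case "false":
		// Already ambiguous: "Suspended" is correct only if versioning was
		// once enabled on the bucket; a bucket that never had versioning is
		// a different case, and we can't tell without calling the AWS API.
		return `"Suspended"`, nil
	default:
		// A non-literal expression (e.g. var.versioning_enabled) may be set
		// from outside the module or overridden at runtime, so it can't be
		// rewritten automatically.
		return "", fmt.Errorf("cannot convert non-literal value: %s", expr)
	}
}
```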

If you’re curious how hard it is, see the actual implementation of the rewrite rules (the most complicated one for now is aws_s3_bucket_lifecycle_rule) and the known limitations I’m already aware of. I’m probably still missing something, because it hasn’t been well tested yet.

For rewriting configurations, tfedit heavily depends on the hclwrite parser in the hcl library, which is required to keep comments in existing configurations.

However, the current implementation of hclwrite has very limited capability and many features are missing. Here are the features I found missing during implementation:

  • Add a block or an attribute in the middle of a body
  • Format a body vertically
  • Get the value of an attribute as a string
  • Rename a reference in an expression
  • Find and replace all references in a body (for renaming)
  • Edit elements in lists and objects
  • Get the comments attached to a block or attribute
  • Insert a comment before a block or attribute
  • A type for dynamic blocks

Even though the above list is probably not exhaustive, the current functionality is very primitive. If we had more features, building an upgrade tool would be easier.
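
As a concrete example of the third point, there is no direct way today to get an attribute value as a string. A rough workaround (my own sketch, not a library API) is to render the expression tokens back to bytes and trim the whitespace yourself:

```go
import (
	"strings"

	"github.com/hashicorp/hcl/v2/hclwrite"
)

// attributeValueAsString renders the attribute's expression tokens back to
// source text. Note this returns the raw expression source (including quotes
// for string literals), since hclwrite offers nothing higher-level.
func attributeValueAsString(attr *hclwrite.Attribute) string {
	return strings.TrimSpace(string(attr.Expr().BuildTokens(nil).Bytes()))
}
```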

Import states

As you know, rewriting Terraform configuration is only half of the problem. We also need to import all the new resources. The moved block introduced in Terraform 1.1 doesn’t fit this case well, because moved maps one old address to one new address, whereas here we need to split a monolithic resource into multiple resources while the old one still remains as the parent.

The first idea I came up with was to generate a migration file with import commands to be applied while rewriting the Terraform configurations. This probably works in simple cases. However, when the aws_s3_bucket resource is defined inside a module, the module maintainer doesn’t know the full resource address of the module instance; that is, we cannot generate a valid import command. Furthermore, the name of an S3 bucket, which is the unique identifier required for import, can be passed from outside the module as an input variable. In that case, all the module maintainer could say is to import an unknown bucket to an unknown address, which doesn’t make sense at all.

The next idea I’m thinking of (but have not implemented yet) is to parse a Terraform plan file and generate a reverse migration file containing the import commands needed to converge the plan to no changes. It will never be perfect for all resource types, because some of them require a magic argument instead of a simple identifier: its value is a string of multiple parameters concatenated with some delimiter, and the valid format depends on the resource type (e.g. aws_s3_bucket_acl requires the second argument in the form bucket-name,private). Worse still, this rule can’t be obtained from the provider’s schema metadata, I mean terraform providers schema -json. Having said that, I think it’s probably possible to generate import commands for the specific resource types required by the AWS v4 upgrade by hard-coding some special rules, except for some edge cases.
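
A minimal sketch of this idea (assuming the hashicorp/terraform-json library for parsing a plan exported with terraform show -json, and hard-coding the aws_s3_bucket_acl rule as an example) might look like this:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"

	tfjson "github.com/hashicorp/terraform-json"
)

func main() {
	// Read a plan exported with: terraform show -json plan.out > plan.json
	src, err := os.ReadFile("plan.json")
	if err != nil {
		panic(err)
	}
	var plan tfjson.Plan
	if err := json.Unmarshal(src, &plan); err != nil {
		panic(err)
	}

	for _, rc := range plan.ResourceChanges {
		// Resources planned to be created are candidates for import.
		if rc.Change == nil || !rc.Change.Actions.Create() {
			continue
		}
		after, ok := rc.Change.After.(map[string]interface{})
		if !ok {
			continue
		}
		// The import ID format depends on the resource type and can't be
		// derived from the provider schema, so hard-code known rules.
		switch rc.Type {
		case "aws_s3_bucket_acl":
			// Caveat: bucket or acl may be unknown at plan time (e.g. passed
			// as a variable), in which case they appear in AfterUnknown
			// instead and this simple approach breaks down.
			fmt.Printf("terraform import %q %v,%v\n", rc.Address, after["bucket"], after["acl"])
		}
	}
}
```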

If we had a plannable import and could mark a resource as importable, driven by configuration like the moved block, we might not need to generate import commands at all. In that case, all we would need to do is import all the new separated resources and expect no changes. Going further, if we had a new flag for plan / apply such as an -import-only mode, analogous to -refresh-only, which allows importing a new resource instead of creating one, we could even eliminate the step of marking a new resource as importable.

Ideally, it would be great if a provider itself could directly handle state upgrade rules on behalf of the user. I’m not sure how hard that would be, but I guess it’s not so simple to implement, because one resource can be split into multiple resources, and some of them change the structure of their arguments, as described above. In addition, the above problems are just examples from the AWS v4 upgrade; different versions and providers will have different types of problems. I think a plannable import would be a more convenient solution for various purposes.

Wrap up

Terraform refactoring is required not only for module changes but also for provider changes, and I feel the latter is the more difficult problem. Even though breaking changes are inevitable to sustain and evolve a provider over the long term, what Terraform can do today in this area is very primitive. It’s a new frontier for Terraform refactoring.

I hope this memo helps someone who is interested in Terraform refactoring and helps the Terraform community move forward.

Thanks!

Good stuff, thank you for sharing!

I’ve been working on a similar tool for refactoring Terraform modules, mainly motivated by the upgrade of terraform-provider-azurerm from v3 to v4, though I’m approaching it with a slightly different idea. I have encountered the same pain points as you did. In particular, I’d like to call out that hclwrite is quite limited; for providers based on the terraform-plugin-framework, the attribute nested object syntax is also not supported by it, which is quite sad…

Let me share some updates since April 2022, when I wrote the original post.

The AWS v4 upgrade tool was finally completed in June 2022. For the import commands, I implemented parsing the plan file and generating a tfmigrate migration file.

As you know, the import block feature was added in Terraform v1.5. However, as of this writing, in the era of Terraform v1.9, import blocks can only be defined in root modules, not in child modules. This means that if the module maintainer and the module user are different people, editing the tf files can solve only half of the problem.

Unfortunately, hclwrite’s functionality has not improved significantly since then. As you may have noticed, apparentlymart, who wrote most of the code for hclwrite, recently left HashiCorp. My prediction is that hclwrite is unlikely to improve in the near future.

Another possibly related area is the language server (terraform-ls). Although some tickets for refactoring functionality exist, they seem to be low priority and there is no ETA.

A third option worth mentioning is tflint’s autofix feature. They are going in a different direction: it does not rely on hclwrite, but implements modification of HCL files by rewriting the byte sequences contained in hcl.Range. tflint has its rulesets cut out as plugins, so you can add custom rules by implementing a tflint plugin. If your interest is the azurerm provider, you might consider adding autofix rules to the ruleset for the azurerm provider. I have never implemented a tflint plugin myself, so tflint may lack the API for what you want to do, but I think it is the most viable option for moving forward as of September 2024 and is worth investigating.

Thank you for sharing more information! I hadn’t noticed that Mart had left; that is a HUGE loss…

Echoing the terraform-ls point: its main developer radeksimko no longer works on that project. Also, some of the refactoring functionality that makes sense for the language server doesn’t necessarily make sense for the provider upgrade scenario, since the provider schema will change, which breaks the LS’s assumption that the configuration matches the provider schema.

I haven’t looked into tflint before, but in my new project, I’m also doing a two-level rewrite, similar to what you described tflint doing (sketch below):

  1. Find the hcl.Range of the target piece of HCL code and rewrite that snippet via hclwrite
  2. Replace the original range with the updated HCL code
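
A minimal sketch of the splicing step, assuming the caller has already located the target hcl.Range with the read-only parser, might look like this:

```go
import (
	"bytes"

	"github.com/hashicorp/hcl/v2"
)

// replaceRange splices a rewritten snippet back into the original source,
// using the byte offsets carried by hcl.Range. The rewrite callback is where
// hclwrite operates on just the extracted snippet, leaving the rest of the
// file byte-for-byte intact.
func replaceRange(src []byte, rng hcl.Range, rewrite func([]byte) ([]byte, error)) ([]byte, error) {
	snippet := src[rng.Start.Byte:rng.End.Byte]
	updated, err := rewrite(snippet)
	if err != nil {
		return nil, err
	}
	var buf bytes.Buffer
	buf.Write(src[:rng.Start.Byte])
	buf.Write(updated)
	buf.Write(src[rng.End.Byte:])
	return buf.Bytes(), nil
}
```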

Even so, the hclwrite part is still too limited. It would be ideal if the HCL parsing library were as good as Go’s, which lets people directly modify the AST and write it back to source code (with comments preserved as well).

Your two-phase rewrite is a reasonable and pragmatic approach under current technical constraints. It makes sense to me.

The repo is now public: GitHub - magodo/terrafix: A tool fixes user's terraform configurations to match the targeting provider's schema

Rewriting configurations on the provider side is an interesting approach. I had never thought of that.