I have a question about the size of the remote state file. I am using S3 to store the Terraform state file. The state file keeps getting bigger as we add more and more components to our infrastructure (it is now bigger than 5MB). I am curious whether there is any limit on the file size here. (I found this article from about 2 years ago, Getting error when tfstate is larger than 4MB.) I'm not sure whether we will run into problems as the state file keeps growing.
Also, it now takes longer to run “terraform plan” and “terraform apply”. I think this is because the state file is far bigger than it was at the beginning (only a few KB), and every run goes through the whole state file. Any suggestions for shortening the wait or improving the experience?
Indeed… Terraform itself does not enforce a fixed limit, but there are practical reasons to avoid letting a single state grow excessively large. What qualifies as “excessively large” is hard to pin down exactly because it depends on a bunch of subjective considerations, but it does sound like you’ve reached a point where the size has become an annoyance for you and so it’s probably worth planning to decompose into smaller configurations that each have their own state, exactly as @maxb suggested.
You specifically mentioned S3 and so the rest of this answer isn’t super relevant to you but I want to call it out because folks tend to find old questions via search when they are wondering similar things later.
Although Terraform itself does not enforce a hard limit, some of the remote systems targeted by Terraform’s state storage backends do have a remotely-enforced limit, which Terraform is then subject to.
For example, the HashiCorp Consul backend stores state as an object in Consul’s key/value store, which has a maximum object size of 512KiB at the time I’m writing this comment.
When deciding on a state storage backend, I recommend researching exactly how it uses the remote system it’s interacting with and then consulting that system’s documentation to find out what restrictions might apply. Along with maximum size limits, some remote systems also have request rate limits or may require authentication in a manner that could be inconvenient in an automated context.
For Amazon S3 in particular the maximum object size is currently 5TiB, and so you’re likely to hit other limits long before the object storage size limit becomes a problem.
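To make that comparison concrete, here is a minimal Python sketch that checks a saved state snapshot against the two object-size figures quoted above. The limits table and the `state_fits` helper are hypothetical names of my own, not part of Terraform, and the numbers are just the ones mentioned in this thread, so re-check them against the current documentation for your storage system. It also ignores any compression a backend might apply before storing the state.

```python
import os
import tempfile

# Limits as quoted in this thread; treat them as assumptions and
# re-check against the current docs for the remote system you use.
BACKEND_OBJECT_LIMITS = {
    "consul": 512 * 1024,  # 512 KiB per key/value entry
    "s3": 5 * 1024**4,     # 5 TiB per object
}


def state_fits(path: str, backend: str) -> bool:
    """Return True if the file at `path` is within the quoted limit for `backend`."""
    return os.path.getsize(path) <= BACKEND_OBJECT_LIMITS[backend]


if __name__ == "__main__":
    # Stand-in for a real snapshot, e.g. saved output of `terraform state pull`.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"x" * (5 * 1024 * 1024))  # 5 MiB, like the state in the question
        path = f.name
    try:
        for backend in BACKEND_OBJECT_LIMITS:
            print(backend, "ok" if state_fits(path, backend) else "too large")
    finally:
        os.remove(path)
```

A 5 MiB state is far below the S3 figure but well above the Consul one, which is why the choice of backend matters more than the raw number.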
@apparentlymart Thank you for your suggestion. Of course, I definitely won't let the state file grow to 5TB. But I will consider the suggestion from both you and @maxb to break the large Terraform configuration into multiple smaller ones. Do you think keeping a state file at around 10MB makes sense? Or, more generally, what's the best-practice size for a state file? (In my case it is stored in cloud storage, like S3.)
I think the challenge here is that a “reasonable size” for a state isn’t typically measured in bytes, although of course if the overall size is the main concern then it would be.
More commonly I see folks observe that once they have a large number of resource instances in their configuration the “refresh” step across all of them starts to take a significant amount of time. For those people it's the sheer number of resource instances that causes the problem (or perhaps the number of resource instances of a particular type, if the remote API happens to be slow) rather than the size of the state file measured in bytes.
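If you want to see where the instance count is concentrated, a rough Python sketch like the following can tally managed resource instances per type. It assumes the version 4 JSON state layout (a top-level "resources" list whose entries have "mode", "type", and "instances"), which is an internal format and could change, and the sample data is hand-written rather than taken from a real state.

```python
import json
from collections import Counter


def count_instances_by_type(state: dict) -> Counter:
    """Count managed resource instances per resource type in a parsed state."""
    counts = Counter()
    for resource in state.get("resources", []):
        # Only managed resources are counted here; remove this check
        # if you also want to include data sources.
        if resource.get("mode") != "managed":
            continue
        counts[resource["type"]] += len(resource.get("instances", []))
    return counts


# Tiny hand-written sample standing in for real state data (assumed layout).
SAMPLE_STATE = json.loads("""
{
  "version": 4,
  "resources": [
    {"mode": "managed", "type": "aws_instance", "name": "web",
     "instances": [{}, {}, {}]},
    {"mode": "managed", "type": "aws_subnet", "name": "private",
     "instances": [{}, {}]},
    {"mode": "data", "type": "aws_ami", "name": "ubuntu",
     "instances": [{}]}
  ]
}
""")

if __name__ == "__main__":
    # Print the types with the most instances first.
    for rtype, n in count_instances_by_type(SAMPLE_STATE).most_common():
        print(f"{rtype}: {n}")
```

To try it on real data, you could save the output of `terraform state pull` to a file and load it with `json.load` in place of the sample. The types at the top of the list are the likeliest contributors to a slow refresh, though a slow remote API can matter more than the count.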
My recommendation would be to start by thinking about what might be reasonable architectural boundaries within your system, rather than focusing immediately on size. Some typical points of consideration are:
If the remote system you’re deploying into already has physical “failure domains” then it can make sense to split Terraform configurations along the same boundaries. For example, if you are using multiple regions in AWS then it might make sense to split your configurations by AWS region, because AWS outages are often per-region rather than global. In that situation you’ll be able to continue operating infrastructure in the regions that are still up, and thus potentially use Terraform to respond to the outage.
You might try to split your system up into categories of infrastructure that tend to change together, or change at a similar cadence. For example, in many systems the virtual network infrastructure (VPCs and subnets in AWS) changes pretty infrequently whereas a particular service deployed into the VPC might change more often. In that case I might recommend splitting the VPC-related objects into a separate configuration which you will then work with only if you actually need to change the network topology, gateways, or route tables.
If you have multiple teams working on different parts of your infrastructure then it can make sense to try to draw boundaries that match each team’s realm of responsibility, so that each team will typically be making changes that only affect one configuration at a time.
As you can see, all three of these examples are subjective and whether they will make sense for you will depend on details of how you work and how your infrastructure is designed. I can only offer general advice here because I can’t see inside your organization and your infrastructure, but I hope the above is a useful starting point.