Why does `-refresh=false` not disable refresh of data sources?

In the olden days of Terraform 0.12, `-refresh=false` disabled the refreshing of data sources. Since Terraform 0.13, it has had no effect on data sources.

Does anyone know why this change was made?

I am investigating options for upgrading a legacy configuration, which uses a lot of data sources to look up (GitHub) groups by name, prior to using their IDs in other resources.

The relationship between names and IDs is static once created, and the data is cached in the state file anyway.
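
To illustrate the pattern (the data source and names here are stand-ins for what the real configuration does):

```hcl
# Look up an existing team by name, purely to obtain its ID…
data "github_team" "developers" {
  slug = "all-developers-in-my-org"
}

# …so that the ID can be used in other resources.
resource "github_team_repository" "developers_access" {
  team_id    = data.github_team.developers.id
  repository = "example-repository"
  permission = "push"
}
```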

The change from looking these up only once (with `-refresh=false`) to looking them up again on every plan is a really big deal, causing slowness and API rate-limit issues.

Can anyone provide insight into why there is no option now to have pre-existing data sources just re-use the data cached in the Terraform state?

Hi @maxb,

All data sources need to be read because they can be used within provider configurations, which in turn may need the information to complete a successful plan.
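
For example, in a configuration shaped like this (names illustrative), nothing managed by the kubernetes provider can be planned until the data source has actually been read:

```hcl
# The provider configuration below cannot be completed until this
# data source has been read.
data "aws_eks_cluster" "example" {
  name = "example-cluster"
}

provider "kubernetes" {
  host                   = data.aws_eks_cluster.example.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.example.certificate_authority[0].data)
}
```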

In old versions of Terraform, all resources were read during an entirely separate refresh phase, so if that phase was skipped altogether there was no way to read data sources on their own. Now that refreshing has been incorporated into the plan, we can ensure data sources are up to date even when the reads on managed resources are being skipped.

Thank you for answering! But…

Sure, but if there are already values stored in the state, it seems like they could satisfy this requirement?

Sure, no argument about this now being possible… But what about when it’s desirable not to do that?

Not refreshing managed resources is an optimisation provided for users who know drift is unlikely or impossible, and who are willing to sacrifice drift detection to gain faster runs with less remote API use.

Is there any conceptual reason not to extend the same opt-out optimisation to users for the re-reading of data sources?

Data sources are read from the provider, not from state, during the plan. We technically can’t guarantee that existing data source state can be decoded, because data sources have no schema upgrade mechanism. Managed resources have a protocol with which the existing state is upgraded to match the current schema before plan-time decoding. To do the same in a reliable manner for data sources, a new protocol would need to be created, and supported by providers, to allow the decoding of data source state created with unknown schemas.

Further to this: we keep the previous data source values in the state primarily for investigation/debugging purposes, such as with the following commands to see what was read:

  • terraform show
  • terraform console

If Terraform didn’t save them, the results would be discarded immediately after the run, and so there would be no way to inspect them to see if the result is what you expected when debugging a problem.
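
For example (names and values hypothetical):

```
$ terraform console
> data.github_team.developers.id
"1234567"
```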

But future runs of Terraform cannot rely on that for the reasons previously mentioned. Although both refreshing a managed resource and reading a data source seem mechanically similar, they are conceptually distinct: Terraform tracks managed resources as a sort of cache of the remote object so the provider has something to start from when making subsequent requests, but a provider never gets to see the “prior state” of a data resource: it must always make every request anew based only on what’s in the configuration, because it represents a dependency on an external object rather than something managed in this Terraform configuration.

The original implementation of data sources was buggy in that it tried to reuse stale data source data when refresh was disabled, due to disabling the entire refresh phase rather than just disabling the updates for cached managed resources. We fixed that bug (along with a number of others with a similar root cause) by combining refresh and plan into a single operation.

Now if you disable refreshing, the graph nodes representing managed resources skip their own refresh step, instead just asking the provider to upgrade the stored state to the latest schema version. But that setting does not affect anything except managed resources, because those are the only things which have a meaningful concept of being “refreshed”.
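
In practice that just means the usual flag on plan or apply:

```
$ terraform plan -refresh=false
```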

Thank you - this really helps understand why things are the way they are.

It’s surprising, considering how similar resources and data sources look to a Terraform user - even an advanced user looking at the contents of the state JSON!

It’s a bit of a shame really, as it makes data sources considerably less palatable to use at scale (let’s say you need to configure 5 different groups with access to 500 Git repositories, and 50 groups on another 10 … suddenly you’re doing 5 × 500 + 50 × 10 = 3000 API operations just to resolve group names to IDs, a mostly unchanging mapping, on every single Terraform run).

I guess I could cheat with a custom provider that implements the lookup semantics you’d expect from a data source, but is written as a managed resource instead.
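
Something along these lines, perhaps (the provider and resource type are entirely hypothetical - they’d have to be written from scratch):

```hcl
# Hypothetical managed resource wrapping a one-time lookup: Create
# would resolve the slug to an ID via the API and store it in state,
# and Read could be a no-op (or honour refresh as usual), so the
# lookup would not be repeated when refreshing is disabled.
resource "githublookup_team_id" "developers" {
  slug = "all-developers-in-my-org"
}

resource "github_team_repository" "developers_access" {
  team_id    = githublookup_team_id.developers.id
  repository = "example-repository"
  permission = "push"
}
```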

I think the real answer to this valid complaint is to extend the provider protocol so that providers can somehow declare that they are able to coalesce certain lookups into a single batch request, and then have Terraform Core detect those opportunities and ask the provider only a single question to get many results.

If by “Git repositories” you mean GitHub, then the GitHub provider in particular seems like it would have particularly good opportunities for batching using GitHub’s GraphQL API because, request size limits aside, you can in principle batch together any combination of queries into a single call.

This idea has been around for a long time but it’s behind various other work for our team which maintains the provider development libraries, and there’s no point in Terraform Core supporting it if there’s no API for provider developers to use it. The current umbrella issue for that (and various other coalescing/batching/caching/etc scenarios) is here, though:

I’d love to get to this eventually, since I agree that the current one-request-per-block design does make it hard to use providers where typical uses involve hundreds of instances of the same or similar resource types / data sources.

If you do want to write a custom provider though, you could potentially write one which offers a single data source that just takes a big GitHub GraphQL query and the parameter values for that query and returns the result. Then you could batch together as many requests as make sense given the dependencies between the objects into one lookup, and have it still be a data resource.
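
In configuration, that might look something like this (the provider and its arguments are hypothetical, since such a provider would need to be written):

```hcl
# One data source instance batching many lookups into a single
# GraphQL call, using aliases to name each result.
data "githubgraphql_query" "team_ids" {
  query = <<-EOT
    query($org: String!) {
      organization(login: $org) {
        developers: team(slug: "all-developers-in-my-org") { id }
        admins: team(slug: "platform-admins") { id }
      }
    }
  EOT

  variables = {
    org = "my-org"
  }
}
```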

There are rather a lot of ways Terraform’s core could be changed to address different subsets of this class of challenges, indeed.

1) Coalesce duplicate data source lookups within Terraform core

Without any change to the provider protocol at all, one incremental improvement might be to identify data source blocks which are exactly the same.

This would come into play when you have a configuration consisting of many module instances (e.g. provisioning a bunch of GitHub repositories, using a module to encapsulate certain conventions), and inside each of those module instances there is a data source looking up the IDs of groups to be granted access to the repository … except the same groups (e.g. “all-developers-in-my-org”) are used in many module instances.
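
For example (module layout and names illustrative):

```hcl
# Every instance of this module performs an identical read.
module "repo_a" {
  source = "./modules/conventional-repo"
  name   = "repo-a"
}

module "repo_b" {
  source = "./modules/conventional-repo"
  name   = "repo-b"
}

# Inside ./modules/conventional-repo:
# data "github_team" "developers" {
#   slug = "all-developers-in-my-org"  # identical in every instance
# }
```

Terraform Core could notice that these pending reads are exactly identical and perform only one of them, sharing the result.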

2) Providers able to handle multiple related data lookups in a batch

i.e. your suggestion from the start of your message

It could work, but it’s so dependent on the underlying infrastructure API having a viable “multi-lookup” API that I fear it won’t help much for anything that’s plain REST.

3) Going back to near the start of this topic - what if Terraform DID gain support for reading previous data source results out of state?

I know, you weren’t very enthusiastic about this one… but would it actually be prohibitively difficult to implement?

The values are already persisted to the state - what if the logic was:

  • If -refresh=false is set (or a new more nuanced option if preferred)…
  • And there is a previous value stored in the state…
  • And the stored value is compatible with the schema the current version of the provider advertises for the data source type…
  • Then just don’t bother asking the provider to perform the read
  • But if any of the above conditions is not met, just do the read from the provider again

What really attracts me to this idea is that:

  • It aligns well with the general desires of a user who would specify `-refresh=false` (they’re saying: “Terraform, just trust that the world won’t change underneath you in my environment” - and ideally Terraform would apply that general concept to both data sources and managed resources)
  • As far as I can tell, looking at terraform-provider-tls for an example of a terraform-plugin-framework provider with data sources, it seems like this can be done without needing the providers to change?
  • In a steady state, when only minor updates are being made to a large existing configuration, it cuts the number of API calls to infrastructure APIs far more than batching (which still reads on every run) ever could.

I can see why the last idea you shared would be attractive in your situation, where you’ve designed your system around the original behaviour of data sources. But our research indicates that a significant number of authors don’t consider it strange for Terraform to detect changes to data sources while planning, because the primary purpose of that feature is to respond to changes outside of the configuration: it effectively says “if this other thing changes, dynamically change my configuration in response so I don’t have to”.

That’s different from disabling refresh: when using Terraform robustly, it is often valid to assume that a managed resource will still be the same as it was last time, because nothing should be changing those objects except the current Terraform configuration. The refreshing Terraform does by default is just in case something weird has happened, and some teams prefer to set things up so that weird things cannot happen, and then turn off refreshing to speed up planning because they trust it will never yield anything useful.

Our research into batching (many years ago now, unfortunately) suggested that enough APIs commonly used with Terraform had some means of batching that should benefit at least situations involving reading a number of objects of the same type, and in some cases reading many objects of different types (as is the case for any GraphQL API, but also with multipart request proxy endpoints wrapping some REST APIs).

You are right that it can’t solve everything, but I don’t think it really needs to: there are certain object types whose usage patterns tend to encourage large numbers of objects, such as anything which scales with the number of people in an organisation. But there are also plenty of things where a typical configuration only interacts with one or a few objects of the same type, and batching would offer only a modest improvement for those anyway.


For your existing modules today, it seems like you might benefit from some refactoring so that your shared modules accept as input the relevant results of reading the groups, rather than each one reading the same information. In other words, that’s manually implementing within your module the sort of coalescing you described in your first point, rather than Terraform doing it automatically.
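
For example (the module’s input variable here is hypothetical):

```hcl
# Read each team once in the root module…
data "github_team" "developers" {
  slug = "all-developers-in-my-org"
}

# …and pass the resulting ID into each module instance, instead of
# having every instance repeat the same lookup internally.
module "repo_a" {
  source            = "./modules/conventional-repo"
  name              = "repo-a"
  developer_team_id = data.github_team.developers.id
}

module "repo_b" {
  source            = "./modules/conventional-repo"
  name              = "repo-b"
  developer_team_id = data.github_team.developers.id
}
```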

It would be interesting to explore Terraform doing that automatically, and it might even just come as a nice side-effect of batching because Terraform Core would already need to be comparing multiple pending reads to notice when they are batchable, and noticing that two are exactly identical is in theory an easy special case to implement once we’re already comparing and bucketing all of the reads anyway.

But in today’s Terraform it remains an author’s responsibility to trade off convenience vs. performance, just as in many other languages: while it is often useful to encapsulate all of the queries a given component needs inside that component, developers often need to compromise by centralising the lookups of some commonly-used values and passing them in as inputs to the other components, even though that weakens the encapsulation by exposing which components depend on that shared data.