Terraform CDK slow to refresh state

Terraform is starting to take 10+ minutes to refresh state for ~350 resources. I’m running terraform version 1.3.7 and cdktf version 0.12.0.

 9:03AM /Users/paymahn/code/goldsky/goldsky-infra/cdktf.out  ✘ 130 dev ✭ ✱
 ❯❯❯ time npm run diff:prod
goldsky-infra-prod  Releasing state lock. This may take a few moments...

npm run diff:prod  60.28s user 8.25s system 10% cpu 11:17.10 total

I’ve found some old mentions of for_each being slow, but I don’t see any instances of it in my cdktf.out directory (rg is grep rewritten in Rust):

 6:43PM /Users/paymahn/code/goldsky/goldsky-infra/cdktf.out dev ✭ ✱
 ❯❯❯ rg "for_each"
 6:43PM /Users/paymahn/code/goldsky/goldsky-infra/cdktf.out  ✘ 1 dev ✭ ✱

I also discovered the parallelism flag for terraform but when checking the cli docs for cdktf I see the following:

      --terraform-parallelism                   Forwards value as the `-parallelism` flag to Terraform. By default,
                                                the this flag is not forwarded to Terraform. Note: This flag is not
                                                supported by remote / cloud backend            [number] [default: -1]

I use the Terraform cloud state backend and when I experimented with passing --terraform-parallelism 50 to cdktf deploy I didn’t see any speedup.

Why does --terraform-parallelism not seem to speed things up and why doesn’t it work for remote/cloud backends? How can I improve the speed of refreshing state during a diff/deploy when using the terraform cdk?

Hi @naturita_ellertson :wave:

It is hard to reason about this without information about the underlying resources.
What kinds of resources are you using? For refreshing the resources, Terraform will do requests to the underlying cloud provider you might be using.

Could you share more about your Terraform config? Do the resources depend on each other or are they mostly independent? All of this has an influence on the time it takes to refresh your state.

I’m mostly using AWS resources like EKS, S3, IAM roles, etc. There are dependencies between my resources.

I tried spinning up a dev env on an EC2 instance, installed my CDKTF repo there with all the requisite dependencies, and found that running cdktf deploy from the EC2 instance is much faster.

Is it possible that my AWS CLI is being rate limited on my MacBook but not when it is used from an EC2 instance?

I would expect talking to AWS APIs to be faster if you run the calls from within AWS compared to from elsewhere on the Internet. I don’t know if any rate limits would be materially different, but the connections to the APIs will be over much shorter paths and are likely to be higher speed (EC2 instances can be on 10Gbps network connections).

When you say much faster, can you give some specifics? Are you talking many minutes of difference or a few seconds? How long do identical runs take in each location?

The diff time drops from ~10 minutes (from the original post) on my laptop (in Argentina without a VPN) to ~1.5 minutes on a Sao Paulo EC2 instance. My AWS resources are deployed in us-west-2.
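To put those two timings side by side, here is a rough back-of-the-envelope calculation. It only divides total wall-clock time by the resource count, so it ignores Terraform's parallelism and any fixed startup cost; treat the result as an order-of-magnitude comparison, not a measurement. The function name is made up for illustration.

```python
def seconds_per_resource(total_seconds: float, resource_count: int) -> float:
    """Average wall-clock seconds per resource for one refresh run."""
    if resource_count <= 0:
        raise ValueError("resource_count must be positive")
    return total_seconds / resource_count


# Laptop run from the original post: 11:17.10 total for ~350 resources.
laptop = seconds_per_resource(11 * 60 + 17.10, 350)

# EC2 run: ~1.5 minutes. The count of 332 comes from the later diff output,
# so the two runs are only approximately comparable.
ec2 = seconds_per_resource(1.5 * 60, 332)

print(f"laptop: {laptop:.2f} s/resource")
print(f"EC2:    {ec2:.2f} s/resource")
print(f"ratio:  {laptop / ec2:.1f}x")
```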

I just tried diffing on my laptop and got the following errors which are quite common:

[2023-02-05T15:10:24.600] [ERROR] default - ╷
│ Error: Get "https://78FBBFA50C5182DF54CBF222699F1025.gr7.us-west-2.eks.amazonaws.com/apis/apps/v1/namespaces/default/deployments/rpc-node-proxy": dial tcp connect: connection refused
│   with kubernetes_deployment.rpc-node-proxy_FE39A27F,
│   on cdk.tf.json line 9987, in resource.kubernetes_deployment.rpc-node-proxy_FE39A27F:
│ 9987:       },
goldsky-infra-prod  ╷
                    │ Error: Get "https://78FBBFA50C5182DF54CBF222699F1025.gr7.us-west-2.eks.amazonaws.com/apis/apps/v1/namespaces/default/deployments/rpc-node-proxy": dial tcp connect: connection refused
                    │   with kubernetes_deployment.rpc-node-proxy_FE39A27F (rpc-node-proxy/rpc-node-proxy),
                    │   on cdk.tf.json line 9987, in resource.kubernetes_deployment.rpc-node-proxy_FE39A27F (rpc-node-proxy/rpc-node-proxy):
                    │ 9987:       },

My internet connection is quite fast.

This error happened after seemingly successfully diffing 323 (of 332) resources (based on how many times I see Refreshing state... in the output of the diff).
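For anyone who wants to reproduce that count, this is a minimal sketch of how the "Refreshing state..." lines can be tallied from a saved copy of the diff output. The script and log file names are hypothetical, and it assumes each refreshed resource prints exactly one such line.

```python
import sys


def count_refreshed(lines) -> int:
    """Count output lines that report a resource refresh."""
    return sum(1 for line in lines if "Refreshing state..." in line)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (file names are placeholders):
    #   npm run diff:prod > diff.log 2>&1
    #   python count_refreshed.py diff.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        print(count_refreshed(log))
```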

Connection refused isn’t an error you should generally get. For example, if you were rate limited you’d see HTTP errors rather than being unable to connect at all. Are there any proxies or security devices between you and the Internet which are maybe restricting access, or are overloaded?

The connectivity between your EC2 instance in Sao Paulo and us-west-2 doesn’t cross the Internet (as AWS has an extensive internal fibre backbone), so it is unlikely to see issues. Maybe try a VPN to see if that bypasses whatever is causing your issues?

Unfortunately this sounds like a general Internet issue rather than something specific to AWS. One thing you could try is to run the AWS CLI in a loop doing something like an S3 bucket check, to see how the speed varies and whether any requests fail with errors.
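A minimal sketch of such a loop, assuming the AWS CLI is installed and configured and that you pass the name of a bucket you can access. It shells out to `aws s3api head-bucket` and records how long each call takes and whether it succeeded. The helper names (`probe`, `summarize`), the attempt count, and the pause are arbitrary choices for illustration.

```python
import statistics
import subprocess
import sys
import time


def probe(command, attempts=20, pause=1.0, timeout=30.0):
    """Run `command` repeatedly; return a list of (succeeded, seconds)."""
    results = []
    for i in range(attempts):
        start = time.monotonic()
        try:
            completed = subprocess.run(
                command,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout,
            )
            ok = completed.returncode == 0
        except subprocess.TimeoutExpired:
            ok = False  # count a hung request as a failure
        results.append((ok, time.monotonic() - start))
        if i < attempts - 1:
            time.sleep(pause)
    return results


def summarize(results):
    """Summarize failures and timing spread for a list of probe results."""
    durations = [seconds for _, seconds in results]
    return {
        "attempts": len(results),
        "failures": sum(1 for ok, _ in results if not ok),
        "min_s": min(durations),
        "mean_s": statistics.mean(durations),
        "max_s": max(durations),
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python probe_s3.py <bucket-name>   (script name is a placeholder)
    bucket = sys.argv[1]
    cmd = ["aws", "s3api", "head-bucket", "--bucket", bucket, "--region", "us-west-2"]
    print(summarize(probe(cmd)))
```

Running it once from the laptop and once from the EC2 instance should show whether the difference is mostly latency (higher but stable timings) or instability (failures and a wide min/max spread).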

I’ve tried running with a VPN, and it usually makes the issues worse. I typically see timeouts while refreshing resources, and I seem to have gotten unlucky with the connection refused error on the last run.