Intermittent connectivity to registry.terraform.io

Hi!

Since last Wednesday, we’ve been experiencing intermittent connectivity from our Azure DevOps pipelines to registry.terraform.io. This has occurred for two different IPv4 addresses at Fastly (151.101.29.183 and 151.101.30.49), and affects (my rough estimate) about 3-5% of our connection attempts. This usually means at least 1-2 failures per run of terraform in our environment.
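To put that estimate in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes every connection fails independently with the same probability, and the figure of 30 connections per run is purely an illustrative guess, not something we measured.

```python
def expected_failures(p_fail, connections):
    """Average number of failed connections in one run."""
    return p_fail * connections


def p_at_least_one_failure(p_fail, connections):
    """Chance that a run sees one or more failed connections,
    assuming failures are independent of each other."""
    return 1.0 - (1.0 - p_fail) ** connections


# Hypothetical workload: 30 connections per terraform run (an assumption).
CONNECTIONS_PER_RUN = 30

for p_fail in (0.03, 0.05):
    avg = expected_failures(p_fail, CONNECTIONS_PER_RUN)
    risk = p_at_least_one_failure(p_fail, CONNECTIONS_PER_RUN)
    print(f"p_fail={p_fail:.0%}: ~{avg:.1f} failures/run, "
          f"{risk:.0%} of runs hit at least one")
```

Under those assumptions, even a per-connection failure rate of a few percent is enough to break most runs, which matches what we see.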

This manifests in our terraform runs in a few different ways:

I/O timeouts:

Initializing provider plugins...
- Finding microsoft/azuredevops versions matching "~> 0.1.1"...
...
- Installing hashicorp/null v2.1.2...
- Installing terraform-providers/bitbucket v1.2.0...
- Installed terraform-providers/bitbucket v1.2.0 (signed by HashiCorp)
Partner and community providers are signed by their developers.
If you'd like to know more about provider signing, you can read about it here:
https://www.terraform.io/docs/plugins/signing.html
Error: Failed to install provider
Error while installing hashicorp/null v2.1.2: Get
"https://releases.hashicorp.com/terraform-provider-null/2.1.2/terraform-provider-null_2.1.2_linux_amd64.zip":
dial tcp 151.101.29.183:443: i/o timeout

Error verifying checksum:

Error verifying checksum for provider "azurerm"
The checksum for provider distribution from the Terraform Registry
did not match the source. This may mean that the distributed files
were changed after this version was released to the Registry.

Error: unable to verify checksum

Registry service unreachable:

Initializing provider plugins...
- Checking for available provider plugins...
Registry service unreachable.
This may indicate a network issue, or an issue with the requested Terraform Registry.

Error: registry service is unreachable, check https://status.hashicorp.com/ for status updates

We have firewall logs and packet captures, and they’re pretty uninteresting. They show SYN retransmissions in the TCP handshake, usually for two connections in sequence to the same IP, then normal connectivity to that IP resumes. This usually occurs after several successful connections to the same IP.

We have investigated our environment thoroughly over the last few days on the assumption that it was our problem, but our firewalls are showing no signs of session exhaustion, and this doesn’t affect any other sites - our pipelines make use of Ubuntu, Docker, Kubernetes, Helm, and various container registries, including mcr.microsoft.com, all of which are working perfectly; this behaviour is only persistent on these two Fastly IPs.

I suspect either:

  • Connectivity issues between Azure (Australia East region) and Fastly, or
  • Rate limiting of our public IP address by Fastly and/or HashiCorp.

How can we get through to the right operational teams to get this looked at?

Thanks in advance,
Paul

Hi Paul,

Thanks for getting in touch about these issues. I’m very sorry that you’re experiencing these run failures. We don’t enforce any rate limits in Fastly or on behalf of the upstream Terraform Registry, and we haven’t seen an increase in errors in Fastly over the past week.

One thing that I noticed in your logs is that the issue does not seem to be unique to registry.terraform.io – one of the timeouts is from downloading the hashicorp/null provider from releases.hashicorp.com, which is a different service that is also fronted by Fastly. Failures with the same IP address could indicate either a problem with a Fastly edge server or a networking issue.

I’d ask you to collect some more information to determine whether the problem is specific to HashiCorp services. You may need to use some network-level debugging tools (e.g. traceroute in a loop to various IPs) to figure out whether the connection is failing at a specific hop, whether it is random or happens at certain times, etc.
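Where traceroute is not informative, one alternative is to measure the TCP handshake failure rate directly and compare the affected IPs against a control host. The following is only a minimal sketch using Python's standard `socket` module; the function names `probe` and `failure_rate`, the 3-second timeout, and the attempt counts are all assumptions chosen for illustration.

```python
import socket
import time


def probe(host, port=443, timeout=3.0):
    """Attempt one TCP handshake; return (ok, seconds_taken, error_text)."""
    start = time.monotonic()
    try:
        # create_connection raises OSError (including timeouts) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, ""
    except OSError as exc:
        return False, time.monotonic() - start, str(exc)


def failure_rate(targets, attempts=20, port=443, timeout=3.0, pause=0.0):
    """Probe each target `attempts` times; return {target: fraction_failed}."""
    rates = {}
    for host in targets:
        failures = 0
        for _ in range(attempts):
            ok, _elapsed, _err = probe(host, port, timeout)
            if not ok:
                failures += 1
            time.sleep(pause)  # spread probes out to avoid bursts
        rates[host] = failures / attempts
    return rates
```

For example, calling `failure_rate(["151.101.29.183", "151.101.30.49", "mcr.microsoft.com"], attempts=50, pause=1.0)` from a pipeline agent would show whether handshakes to the two Fastly addresses fail more often than those to a service that is known to be working.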

Cheers,
Jeff

Thanks Jeff. I’ve just checked, and from our Azure environment, releases.hashicorp.com resolves to one of the same IPs as registry.terraform.io.
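For anyone who wants to repeat that check, here is a small sketch of how the comparison could be done in Python. The helper names `resolve_ipv4` and `shared_addresses` are made up for this example, and the result depends entirely on which resolver the machine uses, so it may differ between an Azure VNet and a laptop.

```python
import socket


def resolve_ipv4(name, resolver=socket.getaddrinfo):
    """Return the set of IPv4 addresses the local resolver gives for `name`."""
    infos = resolver(name, 443, socket.AF_INET, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, (address, port))
    return {info[4][0] for info in infos}


def shared_addresses(name_a, name_b, resolver=socket.getaddrinfo):
    """Addresses that both hostnames currently resolve to."""
    return resolve_ipv4(name_a, resolver) & resolve_ipv4(name_b, resolver)
```

Running `shared_addresses("registry.terraform.io", "releases.hashicorp.com")` on a pipeline agent returns the addresses the two names have in common at that moment; an empty set means they currently resolve to different IPs.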

As I mentioned earlier, we’re successfully using external resources from Ubuntu, Docker, Kubernetes, Helm, and Microsoft (both container registry and package repos), which is why I believe this is specific to either the path between Azure and Fastly in Sydney, or to something else behind Fastly’s frontend network.

Azure doesn’t support traceroute internally, and I’ve never been successful in getting anything useful from outbound traceroutes - happy to be pointed to a better tool than mtr --tcp -P 443 -bw4 registry.terraform.io if one exists. :slight_smile:

I’ve just redirected DNS in our Azure environment to point registry.terraform.io to 151.101.2.49 (one of the addresses I get when querying from my laptop - appears to be in Brisbane), and releases.hashicorp.com to 151.101.1.183 (same deal). We’ll see if we continue to get the same errors.

Hi @paulgear,

Thanks for the reply. It might be helpful to run mtr in a loop, something like:

while true; do mtr --tcp -P 443 -rnc 60 registry.terraform.io > "$(date +%s).trace" 2>&1; done

It sounds like nmap may be another option worth considering.

Hopefully those DNS changes will resolve the issue! Feel free to reach out with any other questions.

Cheers,
Jeff

Hi Jeff,

Perhaps I was not clear about the traceroute issue - what I was saying is that mtr doesn’t produce anything useful at all when run from inside an Azure VNet. Here’s an example:

azureuser@pgtest1:~$ mtr --tcp -P 443 -c5 -bw4 registry.terraform.io
Start: 2021-03-22T04:50:33+0000
HOST: pgtest1       Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- ???           100.0     5    0.0   0.0   0.0   0.0   0.0
  2.|-- ???           100.0     5    0.0   0.0   0.0   0.0   0.0
  3.|-- ???           100.0     5    0.0   0.0   0.0   0.0   0.0
  4.|-- ???           100.0     5    0.0   0.0   0.0   0.0   0.0
  5.|-- ???           100.0     5    0.0   0.0   0.0   0.0   0.0
  6.|-- ???           100.0     5    0.0   0.0   0.0   0.0   0.0
  7.|-- ???           100.0     5    0.0   0.0   0.0   0.0   0.0
  8.|-- ???           100.0     5    0.0   0.0   0.0   0.0   0.0
  9.|-- 151.101.82.49  0.0%     5    1.2   1.5   1.2   1.7   0.2

I’m not sure how nmap would help diagnose any problems between our instance and registry.terraform.io.

For the record, the DNS changes did not resolve the issue, and I’m now assuming once again that there’s a problem on our end. I don’t have a good explanation for why other sites work and Terraform’s don’t, but I suspect it’s something to do with the number and throughput of the connections to registry.terraform.io.

We’ll keep investigating.

Regards,
Paul

Hi Paul, I think I have a similar issue with a GitLab runner on an Azure VM. Did you find anything on your side?

Hi Frederic,

I’m sorry, but we never found a solution to this; after a few days of frantic changes on our part, it just went away. We ended up on exactly the same configuration we started on, with no clue as to why it started or stopped. Our best guess at the end of it was that our runner VMs were allocated to a faulty Azure VM host.

Regards,
DenverCoder9 :smiley:

Hi @frederic.husson,

I have heard in other forums today that folks are having trouble with Azure VM hosts not being able to reach some of the services Terraform uses to install providers. So far this seems limited to just requests from Azure and so I believe some users opened help tickets with Azure support about it, since Azure support is in a better position to debug connectivity issues from their networks.

Unfortunately I don’t have any specific information about what is going on, but I do know that folks at HashiCorp are aware of it and monitoring to try to determine why this problem seems to be affecting only Azure VMs.