WARN addrConn.createTransport followed by Consul servers entering critical state

I started fiddling with a Consul cluster configuration today and came across a failure I hadn’t seen before. Now I run into this issue even when I use the previously working configuration of the cluster.

I started seeing this issue in clusters where clients are failing to connect to the servers.

Server B

Attempts to connect to Server C.

Aug 16 22:16:18 [WARN]  agent: [core]grpc: addrConn.createTransport failed to connect to {us-west1-10.138.0.3:8300 gcp-rpc-cluster-servers-sp4p.us-west1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 10.138.0.4:0->10.138.0.3:8300: operation was canceled". Reconnecting...

Thereafter, the check is critical. Consul breaks.

Client A

Attempts to connect to one of Server A, B, or C.

Aug 16 22:21:56 [INFO]  agent: Discovered servers: cluster=LAN cluster=LAN servers="10.138.0.4 10.138.0.2 10.138.0.3"
Aug 16 22:21:56 [INFO]  agent: (LAN) joining: lan_addresses=[10.138.0.4, 10.138.0.2, 10.138.0.3]
Aug 16 22:21:59 [WARN]  agent.router.manager: No servers available

I am deploying on GCP using Terraform.

EDIT:
Upon client retry the following error is raised:

agent: (LAN) couldn't join: number_of_nodes=0 error="Serf can't Join after Leave or Shutdown"

Figured it out. Somehow some VPC names got swapped.

Mind sharing more on the VPC issue that was causing that error? I've been really stuck on it; it is showing in both the consul-server pods and the consul-client pods.

2022-10-16T13:06:06.658Z [INFO]  agent: Joining cluster...: cluster=LAN
2022-10-16T13:06:06.658Z [INFO]  agent: (LAN) joining: lan_addresses=[consul-consul-server-0.consul-consul-server.plo.svc:8301, consul-consul-server-1.consul-consul-server.plo.svc:8301, consul-consul-server-2.consul-consul-server.plo.svc:8301]
2022-10-16T13:06:06.658Z [WARN]  agent.router.manager: No servers available
2022-10-16T13:06:06.658Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No known Consul servers"
2022-10-16T13:06:06.758Z [WARN]  agent.client.memberlist.lan: memberlist: Failed to resolve consul-consul-server-0.consul-consul-server.plo.svc:8301: lookup consul-consul-server-0.consul-consul-server.plo.svc on 10.100.0.10:53: no such host
2022-10-16T13:06:06.774Z [WARN]  agent.client.memberlist.lan: memberlist: Failed to resolve consul-consul-server-1.consul-consul-server.plo.svc:8301: lookup consul-consul-server-1.consul-consul-server.plo.svc on 10.100.0.10:53: no such host
2022-10-16T13:06:07.255Z [WARN]  agent.router.manager: No servers available
2022-10-16T13:06:07.255Z [ERROR] agent.http: Request error: method=GET url=/v1/status/leader from=127.0.0.1:36786 error="No known Consul servers"
2022-10-16T13:06:08.780Z [WARN]  agent.client.memberlist.lan: memberlist: Failed to resolve consul-consul-server-2.consul-consul-server.plo.svc:8301: lookup consul-consul-server-2.consul-consul-server.plo.svc on 10.100.0.10:53: no such host
2022-10-16T13:06:08.780Z [WARN]  agent: (LAN) couldn't join: number_of_nodes=0 error="3 errors occurred:
        * Failed to resolve consul-consul-server-0.consul-consul-server.plo.svc:8301: lookup consul-consul-server-0.consul-consul-server.plo.svc on 10.100.0.10:53: no such host
        * Failed to resolve consul-consul-server-1.consul-consul-server.plo.svc:8301: lookup consul-consul-server-1.consul-consul-server.plo.svc on 10.100.0.10:53: no such host
        * Failed to resolve consul-consul-server-2.consul-consul-server.plo.svc:8301: lookup consul-consul-server-2.consul-consul-server.plo.svc on 10.100.0.10:53: no such host

"
2022-10-16T13:06:08.780Z [WARN]  agent: Join cluster failed, will retry: cluster=LAN retry_interval=30s error="3 errors occurred:
        * Failed to resolve consul-consul-server-0.consul-consul-server.plo.svc:8301: lookup consul-consul-server-0.consul-consul-server.plo.svc on 10.100.0.10:53: no such host
        * Failed to resolve consul-consul-server-1.consul-consul-server.plo.svc:8301: lookup consul-consul-server-1.consul-consul-server.plo.svc on 10.100.0.10:53: no such host
        * Failed to resolve consul-consul-server-2.consul-consul-server.plo.svc:8301: lookup consul-consul-server-2.consul-consul-server.plo.svc on 10.100.0.10:53: no such host

"
2022-10-16T13:06:11.963Z [WARN]  agent.router.manager: No servers available
2022-10-16T13:06:11.963Z [ERROR] agent.http: Request error: method=GET url=/v1/status/leader from=127.0.0.1:37104 error="No known Consul servers"
2022-10-16T13:06:21.933Z [WARN]  agent.router.manager: No servers available
2022-10-16T13:06:21.933Z [ERROR] agent.http: Request error: method=GET url=/v1/status/leader from=127.0.0.1:55054 error="No known Consul servers"
2022-10-16T13:06:24.006Z [WARN]  agent.router.manager: No servers available
2022-10-16T13:06:24.006Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No known Consul servers"
2022-10-16T13:06:31.955Z [WARN]  agent.router.manager: No servers available
2022-10-16T13:06:31.955Z [ERROR] agent.http: Request error: method=GET url=/v1/status/leader from=127.0.0.1:54526 error="No known Consul servers"
2022-10-16T13:06:38.781Z [INFO]  agent: (LAN) joining: lan_addresses=[consul-consul-server-0.consul-consul-server.plo.svc:8301, consul-consul-server-1.consul-consul-server.plo.svc:8301, consul-consul-server-2.consul-consul-server.plo.svc:8301]
2022-10-16T13:06:38.789Z [INFO]  agent.client.serf.lan: serf: EventMemberJoin: ip-172-31-34-164.us-west-2.compute.internal 172.31.36.203
2022-10-16T13:06:38.789Z [INFO]  agent.client.serf.lan: serf: EventMemberJoin: consul-consul-server-2 172.31.21.131
2022-10-16T13:06:38.789Z [INFO]  agent.client.serf.lan: serf: EventMemberJoin: consul-consul-server-0 172.31.47.142
2022-10-16T13:06:38.790Z [INFO]  agent.client: adding server: server="consul-consul-server-2 (Addr: tcp/172.31.21.131:8300) (DC: dc1)"
2022-10-16T13:06:38.790Z [INFO]  agent.client.serf.lan: serf: EventMemberJoin: consul-consul-server-1 172.31.50.93
2022-10-16T13:06:38.790Z [INFO]  agent.client: adding server: server="consul-consul-server-0 (Addr: tcp/172.31.47.142:8300) (DC: dc1)"
2022-10-16T13:06:38.790Z [INFO]  agent.client: adding server: server="consul-consul-server-1 (Addr: tcp/172.31.50.93:8300) (DC: dc1)"
2022-10-16T13:06:38.790Z [WARN]  agent: [core]grpc: addrConn.createTransport failed to connect to {dc1-172.31.21.131:8300 consul-consul-server-2 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp <nil>->172.31.21.131:8300: operation was canceled". Reconnecting...
2022-10-16T13:06:38.851Z [INFO]  agent: (LAN) joined: number_of_nodes=3
2022-10-16T13:06:38.851Z [INFO]  agent: Join cluster completed. Synced with initial agents: cluster=LAN num_agents=3
2022-10-16T13:06:38.946Z [INFO]  agent.client.serf.lan: serf: EventMemberJoin: ip-172-31-57-221.us-west-2.compute.internal 172.31.52.120
2022-10-16T13:06:40.148Z [INFO]  agent: Synced node info
2022-10-16T13:15:57.624Z [WARN]  agent: [core]grpc: addrConn.createTransport failed to connect to {dc1-172.31.50.93:8300 consul-consul-server-1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp <nil>->172.31.50.93:8300: operation was canceled". Reconnecting...
2022-10-16T13:26:17.412Z [WARN]  agent: [core]grpc: addrConn.createTransport failed to connect to {dc1-172.31.21.131:8300 consul-consul-server-2 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp <nil>->172.31.21.131:8300: operation was canceled". Reconnecting...
2022-10-16T15:20:02.196Z [WARN]  agent: [core]grpc: addrConn.createTransport failed to connect to {dc1-172.31.50.93:8300 consul-consul-server-1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp <nil>->172.31.50.93:8300: operation was canceled". Reconnecting...
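The first part of that log shows DNS failures ("no such host") and only the later part shows the createTransport warning, so it may help to check name resolution on its own. Below is a minimal Python sketch, assuming it is run from inside a pod that uses the same cluster DNS as the agent. The function name `check_dns` and its `lookup` parameter are made up for illustration and are not part of Consul.

```python
import socket


def check_dns(hosts, lookup=socket.getaddrinfo):
    """Resolve each hostname and report either its IPs or the resolver error.

    `lookup` is a hypothetical injection point (defaults to the standard
    library resolver) so the function can be exercised without real DNS.
    """
    results = {}
    for host in hosts:
        try:
            infos = lookup(host, None)
            # Each getaddrinfo entry ends with a sockaddr tuple whose first
            # element is the IP address; de-duplicate and sort for stable output.
            results[host] = sorted({info[4][0] for info in infos})
        except socket.gaierror as exc:
            results[host] = f"unresolved: {exc}"
    return results
```

Calling `check_dns(["consul-consul-server-0.consul-consul-server.plo.svc"])` would show whether the name from the log resolves at that moment. If it only fails for the first seconds after the pod starts, as the timestamps above suggest, the later warnings probably have a different cause.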

Wow, I forgot how terse my follow-up was. Sorry about that.

If I recall correctly, I was literally deploying Nomad to the wrong VPC (one that did not have permissive enough firewall rules).

From what I saw in most threads related to this issue, this is rarely a Nomad configuration problem and more likely an indication that your surrounding network has not been configured properly.
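One rough way to test the network theory is to probe the server ports from the machine that logs the warning. This is a minimal Python sketch, not an official diagnostic. It assumes the default Consul ports (8300 for server RPC, 8301 for Serf LAN), and the helper names `probe_tcp` and `probe_servers` are made up for illustration. It only checks TCP, so a firewall that blocks UDP gossip on 8301 would not show up here.

```python
import socket

# Default Consul ports: 8300 (server RPC) and 8301 (Serf LAN gossip).
# Adjust if your agents are configured with non-default ports.
CONSUL_TCP_PORTS = (8300, 8301)


def probe_tcp(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


def probe_servers(servers, ports=CONSUL_TCP_PORTS, timeout=2.0):
    """Probe every (server, port) pair and return {(server, port): reachable}."""
    return {
        (host, port): probe_tcp(host, port, timeout)
        for host in servers
        for port in ports
    }
```

For the cluster in the first post, something like `probe_servers(["10.138.0.2", "10.138.0.3", "10.138.0.4"])` run from the client would show which pairs are blocked. A `False` for every server points at firewall rules or VPC placement rather than at the agent configuration.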

Consul* configuration.