Vault & Consul goroutine / HTTP request hanging issues

Recently, we’ve noticed that on one of our busier Vault clusters, the primary jumps from 2-3k goroutines to over 50k and continues to grow until we restart the nodes. Right before the start of the goroutine spike, we also see (via Prometheus metrics) an HTTP request with a duration of about 36 minutes. At the same time, we see a massive influx of HTTP calls to Consul, yet no change in the amount of incoming requests to the Vault cluster. The influx of Consul queries triggers Consul’s rate limiting and seems to put the Consul cluster into an odd state where it can no longer communicate with the other Consul nodes until I restart them.

Overall, I’m seeing three issues:

  • Goroutine spike on the primary.
  • Large increase in calls to Consul that doesn’t seem to be triggered by client HTTP requests to Vault.
  • Consul gets into an unrecoverable state.

I’m not exactly sure whether this is more on Consul than on Vault.

  • After a large spike in goroutines, I’d expect the goroutine count to go back down (or up) depending on traffic/requests/etc., but it just continues to grow (see the debug sketch after this list).
  • After a large spike in HTTP requests, I’d expect rate limiting to kick in and, once the volume of requests dropped, for requests to be processed as normal.
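What I’m planning to capture next time, rather than just restarting, is a goroutine dump from the active node while it’s in this state. This is only a sketch of the approach (we haven’t actually run it during an incident yet), and it assumes the node is still responsive and that I’ve got the signal/flags right:

# Ask Vault to write goroutine stack traces into its server log
# (needs shell access to the box running the active node).
pkill -USR2 -x vault

# Or pull a pprof/metrics bundle over the API with the vault CLI;
# this needs a token with sufficient (sudo) privileges.
VAULT_ADDR=https://127.0.0.1:8200 vault debug -duration=2m -interval=30s -output=vault-debug.tar.gz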

Vault server configuration file(s):

listener "tcp" {
  address = "0.0.0.0:8200"
  tls_cert_file = "/vault/etc/certs/ssl.cert"
  tls_key_file = "/vault/etc/certs/ssl.key"
  tls_min_version = "tls12"
  tls_disable_client_certs = true

  telemetry {
    unauthenticated_metrics_access = true
  }
}

storage "consul" {
  address = "127.0.0.1:8500"
  path    = "/vault/data/vault"
  datacenter = "<TRUNCATED>"
}

telemetry {
  prometheus_retention_time = "24h"
  disable_hostname = true
}

cluster_name = "<TRUNCATED>"
disable_mlock = true
ui = true

# 32 days
default_lease_ttl = "768h"

# 5 years
max_lease_ttl = "43800h"

default_max_request_duration = "20s" # added this after 2nd incident, but made no difference.

Consul server configuration file(s):

{
  "datacenter": "<TRUNCATED>",
  "data_dir": "/vault/data",
  "log_level": "INFO",
  "node_name": "<TRUNCATED>",
  "advertise_addr": "<TRUNCATED>",
  "leave_on_terminate": false,
  "rejoin_after_leave": true,
  "server": true,
  "bind_addr": "0.0.0.0",
  "client_addr": "0.0.0.0",
  "bootstrap_expect": 3,
  "telemetry": {
    "prometheus_retention_time": "60s",
    "disable_hostname": true
  },
  // Added query times, autopilot config, and limits after 2nd incident, didn't seem to make a difference.
  "default_query_time": "60s",
  "max_query_time": "300s",
  "autopilot": {
    "last_contact_threshold": "10s",
    "max_trailing_logs": 250,
    "server_stabilization_time": "15s"
  },
  "limits": {
    "http_max_conns_per_client": 400,
    "rpc_max_conns_per_client": 200
  },
  "rejoin_after_leave": true,
  "retry_join": [<TRUNCATED>],
  "start_join": [<TRUNCATED> ]
}

Additional context

We recently upgraded from Consul 1.0.0 (yes… I know, very old) to Consul 1.9.2, and around the same time upgraded Vault from 1.6.1 to 1.6.2. One thing this could be related to is that we weren’t aware of the stepped upgrades we were supposed to do, and we upgraded directly from 1.0.0 to 1.9.2. We only realized this afterwards; however, Consul appeared to be working as expected, the Raft protocol was upgraded to 3, etc., and none of our tests showed any issues with performance or reliability.

We began noticing that the Vault nodes in question would all become unavailable around the same time. When accessing the UI, we’d see:

This is a standby Vault node but can't communicate with the active node via request forwarding. Sign in at the active node to use the Vault UI.

When reviewing Prometheus metrics, we would see vault_core_handle_request_sum / vault_core_handle_request_count showing a request taking about 36 minutes, where the average is normally < 3s.
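For reference, that average is just the usual sum-over-count ratio; this is roughly the query behind our dashboard (the 5m window is simply what we graph, nothing special):

rate(vault_core_handle_request_sum[5m]) / rate(vault_core_handle_request_count[5m])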

Number of goroutines, the large spike being the primary (and the drop when I restarted all vault nodes):

Number of requests to Vault (according to Vault’s own metrics):

Number of requests to Consul (according to Consul’s own metrics):

For the Consul request spikes at 7:15, 9:20, and 10:00, we don’t see a corresponding spike in Vault requests. When we restarted both Vault and Consul, I believe the 10:45 spike is the new Vault primary re-reading state, maybe?

In the Consul logs, one thing I do notice is:

Mar 25 10:22:03 <TRUNCATED> consul[64832]: 2021-03-25T10:22:03.356-0400 [WARN]  agent.server: Raft has a leader but other tracking of the node would indicate that the node is unhealthy or does not exist. The network may be misconfigured.: leader=<TRUNCATED>:8300
Mar 25 10:22:03 <TRUNCATED> consul[64832]: 2021-03-25T10:22:03.381-0400 [WARN]  agent.server: Raft has a leader but other tracking of the node would indicate that the node is unhealthy or does not exist. The network may be misconfigured.: leader=<TRUNCATED>:8300
Mar 25 10:22:03 <TRUNCATED> consul[64832]: 2021-03-25T10:22:03.389-0400 [WARN]  agent.server: Raft has a leader but other tracking of the node would indicate that the node is unhealthy or does not exist. The network may be misconfigured.: leader=<TRUNCATED>:8300
Mar 25 10:22:03 <TRUNCATED> consul[64832]: 2021-03-25T10:22:03.412-0400 [WARN]  agent.server: Raft has a leader but other tracking of the node would indicate that the node is unhealthy or does not exist. The network may be misconfigured.: leader=<TRUNCATED>:8300
Mar 25 10:22:03 <TRUNCATED> consul[64832]: 2021-03-25T10:22:03.414-0400 [WARN]  agent.server: Raft has a leader but other tracking of the node would indicate that the node is unhealthy or does not exist. The network may be misconfigured.: leader=<TRUNCATED>:8300
Mar 25 10:22:03 <TRUNCATED> consul[64832]: 2021-03-25T10:22:03.440-0400 [WARN]  agent.server: Raft has a leader but other tracking of the node would indicate that the node is unhealthy or does not exist. The network may be misconfigured.: leader=<TRUNCATED>:8300
Mar 25 10:22:03 <TRUNCATED> consul[64832]: 2021-03-25T10:22:03.479-0400 [WARN]  agent.server: Raft has a leader but other tracking of the node would indicate that the node is unhealthy or does not exist. The network may be misconfigured.: leader=<TRUNCATED>:8300

However, when testing network connectivity between the nodes, I have no issues making connections in both directions.
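For context, this is roughly how I’ve been checking connectivity between the nodes (<peer-address> below is just a placeholder). Serf LAN uses both TCP and UDP on 8301, so I try both, although nc can’t really prove UDP traffic is flowing end-to-end:

# Consul server RPC (TCP 8300) and Serf LAN (TCP/UDP 8301), run from each node toward each peer
nc -vz <peer-address> 8300
nc -vz <peer-address> 8301
nc -vzu <peer-address> 8301   # UDP; a "succeeded" here isn't conclusive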

We see this in the Vault logs:

Mar 25 10:48:03 <TRUNCATED> vault[90847]: 2021-03-25T10:48:03.157-0400 [ERROR] core: forward request error: error="error during forwarding RPC request"
Mar 25 10:48:03 <TRUNCATED> vault[90847]: 2021-03-25T10:48:03.157-0400 [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing remote error: tls: internal error""
Mar 25 10:48:03 <TRUNCATED> vault[90847]: 2021-03-25T10:48:03.157-0400 [ERROR] core: forward request error: error="error during forwarding RPC request"
Mar 25 10:48:03 <TRUNCATED> vault[90847]: 2021-03-25T10:48:03.157-0400 [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing remote error: tls: internal error""
Mar 25 10:48:03 <TRUNCATED> vault[90847]: 2021-03-25T10:48:03.157-0400 [ERROR] core: forward request error: error="error during forwarding RPC request"
Mar 25 10:48:03 <TRUNCATED> vault[90847]: 2021-03-25T10:48:03.166-0400 [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing remote error: tls: internal error""
Mar 25 10:48:03 <TRUNCATED> vault[90847]: 2021-03-25T10:48:03.167-0400 [ERROR] core: forward request error: error="error during forwarding RPC request"
Mar 25 10:48:03 <TRUNCATED> vault[90847]: 2021-03-25T10:48:03.167-0400 [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing remote error: tls: internal error""
Mar 25 10:48:03 <TRUNCATED> vault[90847]: 2021-03-25T10:48:03.167-0400 [ERROR] core: forward request error: error="error during forwarding RPC request"
Mar 25 10:48:03 <TRUNCATED> vault[90847]: 2021-03-25T10:48:03.167-0400 [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing remote error: tls: internal error""
Mar 25 10:48:03 <TRUNCATED> vault[90847]: 2021-03-25T10:48:03.167-0400 [ERROR] core: forward request error: error="error during forwarding RPC request"
Mar 25 10:48:03 <TRUNCATED> vault[90847]: 2021-03-25T10:48:03.167-0400 [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing remote error: tls: internal error""

Anyone have any thoughts on what this could be, or what I could try? Could this be related to the incorrect Consul upgrade path? Would it make sense to take a backup, wipe the Consul cluster, and restore? Could we downgrade Consul back to 1.0.0 without breaking anything, to eliminate that as a potential problem?
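For the backup/wipe/restore idea, what I had in mind is Consul’s snapshot tooling; this is just a sketch (I haven’t tried it on this cluster, and I’m assuming a snapshot covers everything Vault has written to the KV store):

# On a Consul server, take a point-in-time snapshot of Raft state (KV, etc.)
consul snapshot save vault-backup.snap

# After rebuilding the Consul cluster, restore it
consul snapshot restore vault-backup.snap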

Can provide any additional info as necessary.

In the meantime, we are going to try Consul 1.9.4 and Vault 1.7.0 to see if either happens to resolve the issues we’re having. We may also attempt to increase the rate-limiting thresholds in Consul (rough sketch below), though this may make the situation worse…
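If we do raise the limits, it would just be bumping the same keys we already added after the second incident; the numbers here are guesses on my part, not recommendations:

  "limits": {
    "http_max_conns_per_client": 800,
    "rpc_max_conns_per_client": 400
  }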

Almost forgot: here are some additional logs from the Consul server reporting connectivity issues:

Mar 25 10:11:32 <TRUNCATED> consul[64832]: 2021-03-25T10:11:32.914-0400 [INFO]  agent: Synced node info
Mar 25 10:11:33 <TRUNCATED> consul[64832]: 2021-03-25T10:11:33.215-0400 [INFO]  agent: Synced service: service=vault:<TRUNCATED>:8200
Mar 25 10:11:33 <TRUNCATED> consul[64832]: 2021-03-25T10:11:33.369-0400 [INFO]  agent: Synced node info
Mar 25 10:11:33 <TRUNCATED> consul[64832]: 2021-03-25T10:11:33.739-0400 [INFO]  agent: Synced service: service=vault:<TRUNCATED>:8200
Mar 25 10:11:33 <TRUNCATED> consul[64832]: 2021-03-25T10:11:33.859-0400 [INFO]  agent: Synced check: check=vault:<TRUNCATED>:8200:vault-sealed-check
Mar 25 10:11:36 <TRUNCATED> consul[64832]: 2021-03-25T10:11:36.366-0400 [INFO]  agent: Newer Consul version available: new_version=1.9.4 current_version=1.9.2
Mar 25 10:11:55 <TRUNCATED> consul[64832]: 2021-03-25T10:11:55.128-0400 [WARN]  agent.server.memberlist.lan: memberlist: Refuting a suspect message (from: <TRUNCATED>)
Mar 25 10:11:57 <TRUNCATED> consul[64832]: 2021-03-25T10:11:57.069-0400 [WARN]  agent.server.memberlist.lan: memberlist: Was able to connect to <TRUNCATED> but other probes failed, network may be misconfigured
Mar 25 10:11:58 <TRUNCATED> consul[64832]: 2021-03-25T10:11:58.640-0400 [WARN]  agent.server.memberlist.lan: memberlist: Refuting a suspect message (from: <TRUNCATED>)
Mar 25 10:11:59 <TRUNCATED> consul[64832]: 2021-03-25T10:11:59.073-0400 [ERROR] agent.server.memberlist.lan: memberlist: Failed fallback ping: read tcp <TRUNCATED>:51186-><TRUNCATED>:8301: i/o timeout
Mar 25 10:12:01 <TRUNCATED> consul[64832]: 2021-03-25T10:12:01.085-0400 [ERROR] agent.server.memberlist.lan: memberlist: Failed fallback ping: read tcp <TRUNCATED>:51200-><TRUNCATED>:8301: i/o timeout
Mar 25 10:12:01 <TRUNCATED> consul[64832]: 2021-03-25T10:12:01.088-0400 [INFO]  agent.server.memberlist.lan: memberlist: Suspect <TRUNCATED> has failed, no acks received
Mar 25 10:12:03 <TRUNCATED> consul[64832]: 2021-03-25T10:12:03.070-0400 [ERROR] agent.server.memberlist.lan: memberlist: Failed fallback ping: read tcp <TRUNCATED>:51232-><TRUNCATED>:8301: i/o timeout
Mar 25 10:12:03 <TRUNCATED> consul[64832]: 2021-03-25T10:12:03.070-0400 [INFO]  agent.server.memberlist.lan: memberlist: Suspect <TRUNCATED> has failed, no acks received
Mar 25 10:12:04 <TRUNCATED> consul[64832]: 2021-03-25T10:12:04.951-0400 [INFO]  agent.server.serf.lan: serf: EventMemberFailed: <TRUNCATED> <TRUNCATED>
Mar 25 10:12:04 <TRUNCATED> consul[64832]: 2021-03-25T10:12:04.953-0400 [INFO]  agent.server: Removing LAN server: server="<TRUNCATED> (Addr: tcp/<TRUNCATED>:8300) (DC: cc-dev)"
Mar 25 10:12:05 <TRUNCATED> consul[64832]: 2021-03-25T10:12:05.081-0400 [ERROR] agent.server.memberlist.lan: memberlist: Failed fallback ping: read tcp <TRUNCATED>:51234-><TRUNCATED>:8301: i/o timeout
Mar 25 10:12:05 <TRUNCATED> consul[64832]: 2021-03-25T10:12:05.081-0400 [INFO]  agent.server.memberlist.lan: memberlist: Suspect <TRUNCATED> has failed, no acks received
Mar 25 10:12:05 <TRUNCATED> consul[64832]: 2021-03-25T10:12:05.489-0400 [WARN]  agent: Check missed TTL, is now critical: check=vault:<TRUNCATED>:8200:vault-sealed-check
Mar 25 10:12:08 <TRUNCATED> consul[64832]: 2021-03-25T10:12:08.091-0400 [INFO]  agent.server.serf.lan: serf: EventMemberJoin: <TRUNCATED> <TRUNCATED>
Mar 25 10:12:08 <TRUNCATED> consul[64832]: 2021-03-25T10:12:08.097-0400 [INFO]  agent.server: Adding LAN server: server="<TRUNCATED> (Addr: tcp/<TRUNCATED>:8300) (DC: cc-dev)"
Mar 25 10:12:15 <TRUNCATED> consul[64832]: 2021-03-25T10:12:15.058-0400 [WARN]  agent: Check missed TTL, is now critical: check=vault:<TRUNCATED>:8200:vault-sealed-check
Mar 25 10:12:15 <TRUNCATED> consul[64832]: 2021-03-25T10:12:15.365-0400 [WARN]  agent.server.memberlist.lan: memberlist: Refuting a suspect message (from: <TRUNCATED>)
Mar 25 10:12:23 <TRUNCATED> consul[64832]: 2021-03-25T10:12:23.069-0400 [ERROR] agent.server.memberlist.lan: memberlist: Failed fallback ping: read tcp <TRUNCATED>:51246-><TRUNCATED>:8301: i/o timeout
Mar 25 10:12:23 <TRUNCATED> consul[64832]: 2021-03-25T10:12:23.561-0400 [WARN]  agent.server.memberlist.lan: memberlist: Refuting a suspect message (from: <TRUNCATED>)
Mar 25 10:12:31 <TRUNCATED> consul[64832]: 2021-03-25T10:12:31.069-0400 [ERROR] agent.server.memberlist.lan: memberlist: Failed fallback ping: read tcp <TRUNCATED>:51252-><TRUNCATED>:8301: i/o timeout
Mar 25 10:12:31 <TRUNCATED> consul[64832]: 2021-03-25T10:12:31.070-0400 [INFO]  agent.server.memberlist.lan: memberlist: Suspect <TRUNCATED> has failed, no acks received
Mar 25 10:12:49 <TRUNCATED> consul[64832]: 2021-03-25T10:12:49.075-0400 [ERROR] agent.server.memberlist.lan: memberlist: Failed fallback ping: read tcp <TRUNCATED>:51264-><TRUNCATED>:8301: i/o timeout
Mar 25 10:12:49 <TRUNCATED> consul[64832]: 2021-03-25T10:12:49.076-0400 [INFO]  agent.server.memberlist.lan: memberlist: Suspect <TRUNCATED> has failed, no acks received
Mar 25 10:13:09 <TRUNCATED> consul[64832]: 2021-03-25T10:13:09.118-0400 [INFO]  agent: Synced check: check=vault:<TRUNCATED>:8200:vault-sealed-check
Mar 25 10:14:11 <TRUNCATED> consul[64832]: 2021-03-25T10:14:11.263-0400 [WARN]  agent: Check missed TTL, is now critical: check=vault:<TRUNCATED>:8200:vault-sealed-check
Mar 25 10:14:11 <TRUNCATED> consul[64832]: 2021-03-25T10:14:11.342-0400 [INFO]  agent: Synced check: check=vault:<TRUNCATED>:8200:vault-sealed-check
Mar 25 10:14:11 <TRUNCATED> consul[64832]: 2021-03-25T10:14:11.798-0400 [INFO]  agent: Synced check: check=vault:<TRUNCATED>:8200:vault-sealed-check
Mar 25 10:14:55 <TRUNCATED> consul[64832]: 2021-03-25T10:14:55.890-0400 [INFO]  agent: Synced check: check=vault:<TRUNCATED>:8200:vault-sealed-check
Mar 25 10:21:59 <TRUNCATED> consul[64832]: 2021-03-25T10:21:59.075-0400 [ERROR] agent.server.memberlist.lan: memberlist: Failed fallback ping: read tcp <TRUNCATED>:35800-><TRUNCATED>:8301: i/o timeout
Mar 25 10:21:59 <TRUNCATED> consul[64832]: 2021-03-25T10:21:59.081-0400 [INFO]  agent.server.memberlist.lan: memberlist: Suspect <TRUNCATED> has failed, no acks received
Mar 25 10:22:02 <TRUNCATED> consul[64832]: 2021-03-25T10:22:02.089-0400 [ERROR] agent.server.memberlist.lan: memberlist: Failed fallback ping: read tcp <TRUNCATED>:35806-><TRUNCATED>:8301: i/o timeout
Mar 25 10:22:02 <TRUNCATED> consul[64832]: 2021-03-25T10:22:02.091-0400 [INFO]  agent.server.memberlist.lan: memberlist: Suspect <TRUNCATED> has failed, no acks received
Mar 25 10:22:03 <TRUNCATED> consul[64832]: 2021-03-25T10:22:03.085-0400 [INFO]  agent.server.memberlist.lan: memberlist: Marking <TRUNCATED> as failed, suspect timeout reached (0 peer confirmations)
Mar 25 10:22:03 <TRUNCATED> consul[64832]: 2021-03-25T10:22:03.088-0400 [INFO]  agent.server.serf.lan: serf: EventMemberFailed: <TRUNCATED> <TRUNCATED>
Mar 25 10:22:03 <TRUNCATED> consul[64832]: 2021-03-25T10:22:03.090-0400 [INFO]  agent.server: Removing LAN server: server="<TRUNCATED> (Addr: tcp/<TRUNCATED>:8300) (DC: cc-dev)"
Mar 25 10:22:03 <TRUNCATED> consul[64832]: 2021-03-25T10:22:03.199-0400 [WARN]  agent.server: Raft has a leader but other tracking of the node would indicate that the node is unhealthy or does not exist. The network may be misconfigured.: leader=<TRUNCATED>:8300
Mar 25 10:22:03 <TRUNCATED> consul[64832]: 2021-03-25T10:22:03.199-0400 [WARN]  agent.server: Raft has a leader but other tracking of the node would indicate that the node is unhealthy or does not exist. The network may be misconfigured.: leader=<TRUNCATED>:8300
Mar 25 10:22:03 <TRUNCATED> consul[64832]: 2021-03-25T10:22:03.199-0400 [WARN]  agent.server: Raft has a leader but other tracking of the node would indicate that the node is unhealthy or does not exist. The network may be misconfigured.: leader=<TRUNCATED>:8300
Mar 25 10:22:03 <TRUNCATED> consul[64832]: 2021-03-25T10:22:03.204-0400 [WARN]  agent.server: Raft has a leader but other tracking of the node would indicate that the node is unhealthy or does not exist. The network may be misconfigured.: leader=<TRUNCATED>:8300
Mar 25 10:22:03 <TRUNCATED> consul[64832]: 2021-03-25T10:22:03.204-0400 [WARN]  agent.server: Raft has a leader but other tracking of the node would indicate that the node is unhealthy or does not exist. The network may be misconfigured.: leader=<TRUNCATED>:8300
Mar 25 10:22:03 <TRUNCATED> consul[64832]: 2021-03-25T10:22:03.240-0400 [WARN]  agent.server: Raft has a leader but other tracking of the node would indicate that the node is unhealthy or does not exist. The network may be misconfigured.: leader=<TRUNCATED>:8300
Mar 25 10:22:03 <TRUNCATED> consul[64832]: 2021-03-25T10:22:03.293-0400 [WARN]  agent.server: Raft has a leader but other tracking of the node would indicate that the node is unhealthy or does not exist. The network may be misconfigured.: leader=<TRUNCATED>:8300
Mar 25 10:22:03 <TRUNCATED> consul[64832]: 2021-03-25T10:22:03.302-0400 [WARN]  agent.server: Raft has a leader but other tracking of the node would indicate that the node is unhealthy or does not exist. The network may be misconfigured.: leader=<TRUNCATED>:8300
Mar 25 10:22:03 <TRUNCATED> consul[64832]: 2021-03-25T10:22:03.336-0400 [WARN]  agent.server: Raft has a leader but other tracking of the node would indicate that the node is unhealthy or does not exist. The network may be misconfigured.: leader=<TRUNCATED>:8300
Mar 25 10:22:03 <TRUNCATED> consul[64832]: 2021-03-25T10:22:03.356-0400 [WARN]  agent.server: Raft has a leader but other tracking of the node would indicate that the node is unhealthy or does not exist. The network may be misconfigured.: leader=<TRUNCATED>:8300
Mar 25 10:22:03 <TRUNCATED> consul[64832]: 2021-03-25T10:22:03.381-0400 [WARN]  agent.server: Raft has a leader but other tracking of the node would indicate that the node is unhealthy or does not exist. The network may be misconfigured.: leader=<TRUNCATED>:8300
Mar 25 10:22:03 <TRUNCATED> consul[64832]: 2021-03-25T10:22:03.389-0400 [WARN]  agent.server: Raft has a leader but other tracking of the node would indicate that the node is unhealthy or does not exist. The network may be misconfigured.: leader=<TRUNCATED>:8300
Mar 25 10:22:03 <TRUNCATED> consul[64832]: 2021-03-25T10:22:03.412-0400 [WARN]  agent.server: Raft has a leader but other tracking of the node would indicate that the node is unhealthy or does not exist. The network may be misconfigured.: leader=<TRUNCATED>:8300
Mar 25 10:22:03 <TRUNCATED> consul[64832]: 2021-03-25T10:22:03.414-0400 [WARN]  agent.server: Raft has a leader but other tracking of the node would indicate that the node is unhealthy or does not exist. The network may be misconfigured.: leader=<TRUNCATED>:8300
Mar 25 10:22:03 <TRUNCATED> consul[64832]: 2021-03-25T10:22:03.440-0400 [WARN]  agent.server: Raft has a leader but other tracking of the node would indicate that the node is unhealthy or does not exist. The network may be misconfigured.: leader=<TRUNCATED>:8300
Mar 25 10:22:03 <TRUNCATED> consul[64832]: 2021-03-25T10:22:03.479-0400 [WARN]  agent.server: Raft has a leader but other tracking of the node would indicate that the node is unhealthy or does not exist. The network may be misconfigured.: leader=<TRUNCATED>:8300
Mar 25 10:22:03 <TRUNCATED> consul[64832]: 2021-03-25T10:22:03.501-0400 [INFO]  agent.server.serf.lan: serf: EventMemberJoin: <TRUNCATED> <TRUNCATED>
Mar 25 10:22:03 <TRUNCATED> consul[64832]: 2021-03-25T10:22:03.502-0400 [INFO]  agent.server: Adding LAN server: server="<TRUNCATED> (Addr: tcp/<TRUNCATED>:8300) (DC: cc-dev)"
Mar 25 10:28:58 <TRUNCATED> consul[64832]: 2021-03-25T10:28:58.069-0400 [ERROR] agent.server.memberlist.lan: memberlist: Failed fallback ping: read tcp <TRUNCATED>:39842-><TRUNCATED>:8301: i/o timeout
Mar 25 10:28:58 <TRUNCATED> consul[64832]: 2021-03-25T10:28:58.075-0400 [INFO]  agent.server.memberlist.lan: memberlist: Suspect <TRUNCATED> has failed, no acks received
Mar 25 10:29:00 <TRUNCATED> consul[64832]: 2021-03-25T10:29:00.069-0400 [ERROR] agent.server.memberlist.lan: memberlist: Failed fallback ping: read tcp <TRUNCATED>:40002-><TRUNCATED>:8301: i/o timeout
Mar 25 10:29:00 <TRUNCATED> consul[64832]: 2021-03-25T10:29:00.069-0400 [INFO]  agent.server.memberlist.lan: memberlist: Suspect <TRUNCATED> has failed, no acks received
Mar 25 10:30:03 <TRUNCATED> consul[64832]: 2021-03-25T10:30:03.071-0400 [ERROR] agent.server.memberlist.lan: memberlist: Failed fallback ping: read tcp <TRUNCATED>:43624-><TRUNCATED>:8301: i/o timeout
Mar 25 10:30:03 <TRUNCATED> consul[64832]: 2021-03-25T10:30:03.075-0400 [INFO]  agent.server.memberlist.lan: memberlist: Suspect <TRUNCATED> has failed, no acks received
Mar 25 10:30:28 <TRUNCATED> consul[64832]: 2021-03-25T10:30:28.072-0400 [WARN]  agent.server.memberlist.lan: memberlist: Refuting a suspect message (from: <TRUNCATED>)
Mar 25 10:30:34 <TRUNCATED> consul[64832]: 2021-03-25T10:30:34.069-0400 [ERROR] agent.server.memberlist.lan: memberlist: Failed fallback ping: read tcp <TRUNCATED>:43642-><TRUNCATED>:8301: i/o timeout
Mar 25 10:30:34 <TRUNCATED> consul[64832]: 2021-03-25T10:30:34.070-0400 [INFO]  agent.server.memberlist.lan: memberlist: Suspect <TRUNCATED> has failed, no acks received

@lrstanley - I don’t suppose you ever figured this out, did you? I’ve got a cluster in the exact same state and I can’t recover it!