I have a cluster of 3 servers running, each with a persistent volume attached and the data directory is on it. I then killed one server and the autoscaling group brought up a new server (10.27.54.6
), mounted the same volume from the killed server, and tried to join the cluster. However, it ended up throwing a bunch of No cluster leader
errors and couldn’t join.
From the logs that I paste below, you can see that it discovered the other two live nodes first (10.27.52.37
, 10.27.53.68
), started joining:
agent: (LAN) joining: lan_addresses=[10.27.54.6, 10.27.52.37, 10.27.53.68]
looks like it joined:
agent: Join cluster completed. Synced with initial agents: cluster=LAN num_agents=3
then it printed out some synced service (the IP in this does not exist anymore and probably belonged to a server that was rotated out before):
agent: Synced service: service=vault:10.27.54.216:8200
Then it was trying to join some previously known nodes on LAN:
agent.server.serf.wan: serf: Attempting re-join to previously known node: ip-10-27-54-109...
agent.server.serf.lan: serf: Attempting re-join to previously known node: ip-10-27-54-216...
agent.server.serf.lan: serf: Attempting re-join to previously known node: ip-10-27-54-149.us-west-2.compute.internal...
none of which exists anymore. In between, I do see it trying to join one of the live ones:
agent.server.serf.lan: serf: Attempting re-join to previously known node: ip-10-27-52-37.us-west-2.compute.internal: 10.27.52.37:8301
and was able to join:
agent.server.serf.lan: serf: Re-joined to previously known node: ip-10-27-52-37.us-west-2.compute.internal: 10.27.52.37:8301
At this point, since we already have 2 servers (10.27.52.37
, 10.27.53.68
) that formed a cluster with 10.27.52.37
as leader, I wouldn’t expect it to try to do a leader election. But it seems that this node 10.27.54.6
only sees 10.27.52.37
and tried to kick off leader election:
agent.server.raft: heartbeat timeout reached, starting election: last-leader=10.27.52.37:8300
agent.server.raft: entering candidate state: node="Node at 10.27.54.6:8300 [Candidate]" term=8502
agent.server.raft: Election timeout reached, restarting election
agent.server.raft: entering candidate state: node="Node at 10.27.54.6:8300 [Candidate]" term=8503
and then I saw the dreaded:
agent.http: Request error: method=GET url=/v1/kv/vault/core/lock from=127.0.0.1:58578 error="No cluster leader"
I suspect that the node has persisted some stale view of previous known hosts (all those dead ones) and somehow that is interfering with the new node coming up.
Here’s the full log:
# journalctl -f -u consul
-- Logs begin at Tue 2021-03-02 17:36:36 UTC. --
2021-04-14T19:16:16.028Z [WARN] agent: The 'ui' field is deprecated. Use the 'ui_config.enabled' field instead.
2021-04-14T19:16:16.028Z [WARN] agent: Node name "ip-10-27-54-6.us-west-2.compute.internal" will not be discoverable via DNS due to invalid characters. Valid characters include all alpha-numerics and dashes.
2021-04-14T19:16:16.028Z [WARN] agent: bootstrap_expect > 0: expecting 3 servers Apr 14 19:16:16 ip-10-27-54-6.us-west-2.compute.internal chpst[31466]: 2021-04-14T19:16:16.037Z [WARN] agent.auto_config: skipping file /mnt/apps/consul/config/consul.ca, extension must be .hcl or .json, or config format must be set
2021-04-14T19:16:16.037Z [WARN] agent.auto_config: skipping file /mnt/apps/consul/config/consul.crt, extension must be .hcl or .json, or config format must be set
2021-04-14T19:16:16.037Z [WARN] agent.auto_config: skipping file /mnt/apps/consul/config/consul.key, extension must be .hcl or .json, or config format must be set
2021-04-14T19:16:16.037Z [WARN] agent.auto_config: The 'acl_datacenter' field is deprecated. Use the 'primary_datacenter' field instead.
2021-04-14T19:16:16.037Z [WARN] agent.auto_config: The 'ui' field is deprecated. Use the 'ui_config.enabled' field instead.
2021-04-14T19:16:16.037Z [WARN] agent.auto_config: Node name "ip-10-27-54-6.us-west-2.compute.internal" will not be discoverable via DNS due to invalid characters. Valid characters include all alpha-numerics and dashes.
2021-04-14T19:16:16.037Z [WARN] agent.auto_config: bootstrap_expect > 0: expecting 3 servers
2021-04-14T19:16:17.792Z [INFO] agent.server.raft: restored from snapshot: id=8482-52383256-1618427004408
2021-04-14T19:16:17.820Z [INFO] agent.server.raft: initial configuration: index=52387055 servers="[{Suffrage:Voter ID:34fe974a-dd7e-a82d-54aa-f4c36fb2ca8a Address:10.27.52.37:8300} {Suffrage:Nonvoter ID:09b8e65b-cf0e-4578-c78a-e2737ef5b0eb Address:10.27.53.68:8300}]"
2021-04-14T19:16:17.821Z [INFO] agent.server.raft: entering follower state: follower="Node at 10.27.54.6:8300 [Follower]" leader=
2021-04-14T19:16:17.821Z [INFO] agent.server.serf.wan: serf: EventMemberJoin: ip-10-27-54-6.us-west-2.compute.internal.vault-wei 10.27.54.6
2021-04-14T19:16:17.824Z [INFO] agent.server.serf.lan: serf: EventMemberJoin: ip-10-27-54-6.us-west-2.compute.internal 10.27.54.6
2021-04-14T19:16:17.824Z [INFO] agent.router: Initializing LAN area manager
2021-04-14T19:16:17.826Z [INFO] agent: Started DNS server: address=127.0.0.1:8600 network=udp Apr 14 19:16:17 ip-10-27-54-6.us-west-2.compute.internal chpst[31466]: 2021-04-14T19:16:17.827Z [INFO] agent.server.serf.wan: serf: Attempting re-join to previously known node: ip-10-27-54-233.us-west-2.compute.internal.vault-wei: 10.27.54.233:8302
2021-04-14T19:16:17.827Z [INFO] agent.server.serf.lan: serf: Attempting re-join to previously known node: ip-10-27-54-233.us-west-2.compute.internal: 10.27.54.233:8301
2021-04-14T19:16:17.828Z [INFO] agent.server: Adding LAN server: server="ip-10-27-54-6.us-west-2.compute.internal (Addr: tcp/10.27.54.6:8300) (DC: vault-wei)"
2021-04-14T19:16:17.828Z [INFO] agent.server: Raft data found, disabling bootstrap mode
2021-04-14T19:16:17.829Z [INFO] agent.server: Handled event for server in area: event=member-join server=ip-10-27-54-6.us-west-2.compute.internal.vault-wei area=wan
2021-04-14T19:16:17.829Z [INFO] agent: Started DNS server: address=127.0.0.1:8600 network=tcp
2021-04-14T19:16:17.831Z [INFO] agent: Starting server: address=127.0.0.1:8500 network=tcp protocol=http
2021-04-14T19:16:17.831Z [INFO] agent: Starting server: address=127.0.0.1:8501 network=tcp protocol=https
2021-04-14T19:16:17.832Z [WARN] agent: DEPRECATED Backwards compatibility with pre-1.9 metrics enabled. These metrics will be removed in a future version of Consul. Set `telemetry { disable_compat_1.9 = true }` to disable them.
2021-04-14T19:16:17.832Z [INFO] agent: Retry join is supported for the following discovery methods: cluster=LAN discovery_methods="aliyun aws azure digitalocean gce k8s linode mdns os packet scaleway softlayer tencentcloud triton vsphere"
2021-04-14T19:16:17.833Z [INFO] agent: Joining cluster...: cluster=LAN
2021-04-14T19:16:17.834Z [INFO] agent: discover-aws: Address type is not supported. Valid values are {private_v4,public_v4,public_v6}. Falling back to 'private_v4': cluster=LAN
2021-04-14T19:16:17.834Z [INFO] agent: discover-aws: Region not provided. Looking up region in metadata...: cluster=LAN
2021-04-14T19:16:17.833Z [INFO] agent: started state syncer
==> Consul agent running!
2021-04-14T19:16:17.836Z [INFO] agent: discover-aws: Region is us-west-2: cluster=LAN
2021-04-14T19:16:17.837Z [INFO] agent: discover-aws: Filter instances with ReplicaResourceId=vault-wei: cluster=LAN
2021-04-14T19:16:18.021Z [INFO] agent: discover-aws: Instance i-043af3cc8de188114 has private ip 10.27.54.6: cluster=LAN
2021-04-14T19:16:18.021Z [INFO] agent: discover-aws: Instance i-08e7c7acc017852f4 has private ip 10.27.52.37: cluster=LAN
2021-04-14T19:16:18.021Z [INFO] agent: discover-aws: Instance i-0d8c8c35b5d4931f0 has private ip 10.27.53.68: cluster=LAN
2021-04-14T19:16:18.021Z [INFO] agent: Discovered servers: cluster=LAN cluster=LAN servers="10.27.54.6 10.27.52.37 10.27.53.68"
2021-04-14T19:16:18.021Z [INFO] agent: (LAN) joining: lan_addresses=[10.27.54.6, 10.27.52.37, 10.27.53.68]
2021-04-14T19:16:18.024Z [INFO] agent.server.serf.lan: serf: EventMemberJoin: ip-10-27-53-68.us-west-2.compute.internal 10.27.53.68
2021-04-14T19:16:18.025Z [INFO] agent.server.serf.lan: serf: EventMemberJoin: ip-10-27-52-37.us-west-2.compute.internal 10.27.52.37
2021-04-14T19:16:18.025Z [WARN] agent.server.memberlist.lan: memberlist: Refuting a suspect message (from: ip-10-27-54-6.us-west-2.compute.internal)
2021-04-14T19:16:18.025Z [INFO] agent.server: Adding LAN server: server="ip-10-27-53-68.us-west-2.compute.internal (Addr: tcp/10.27.53.68:8300) (DC: vault-wei)"
2021-04-14T19:16:18.025Z [INFO] agent.server: Adding LAN server: server="ip-10-27-52-37.us-west-2.compute.internal (Addr: tcp/10.27.52.37:8300) (DC: vault-wei)"
2021-04-14T19:16:18.028Z [INFO] agent: (LAN) joined: number_of_nodes=3
2021-04-14T19:16:18.028Z [INFO] agent: Join cluster completed. Synced with initial agents: cluster=LAN num_agents=3
2021-04-14T19:16:18.137Z [WARN] agent.server.raft: failed to get previous log: previous-index=52389681 last-index=52387075 error="log not found"
2021-04-14T19:16:18.179Z [WARN] agent: Node info update blocked by ACLs: node=f56d7f64-dbcf-506b-5fa3-115b6011226c accessorID=00000000-0000-0000-0000-000000000002
2021-04-14T19:16:18.180Z [INFO] agent.server.serf.lan: serf: EventMemberUpdate: ip-10-27-54-6.us-west-2.compute.internal
2021-04-14T19:16:18.180Z [INFO] agent.server: Updating LAN server: server="ip-10-27-54-6.us-west-2.compute.internal (Addr: tcp/10.27.54.6:8300) (DC: vault-wei)"
2021-04-14T19:16:18.186Z [INFO] agent: Synced service: service=vault:10.27.54.216:8200
2021-04-14T19:16:18.198Z [INFO] agent: Synced service: service=vault:10.27.54.6:8200
2021-04-14T19:16:18.425Z [INFO] agent.server.serf.wan: serf: EventMemberUpdate: ip-10-27-54-6.us-west-2.compute.internal.vault-wei
2021-04-14T19:16:18.425Z [INFO] agent.server: Handled event for server in area: event=member-update server=ip-10-27-54-6.us-west-2.compute.internal.vault-wei area=wan
2021-04-14T19:16:18.491Z [WARN] agent: Node info update blocked by ACLs: node=f56d7f64-dbcf-506b-5fa3-115b6011226c accessorID=00000000-0000-0000-0000-000000000002
2021-04-14T19:16:18.497Z [INFO] agent: Synced service: service=vault:10.27.54.216:8200
2021-04-14T19:16:18.502Z [INFO] agent: Synced service: service=vault:10.27.54.6:8200
2021-04-14T19:16:20.889Z [INFO] agent.server.serf.wan: serf: Attempting re-join to previously known node: ip-10-27-54-109.us-west-2.compute.internal.vault-wei: 10.27.54.109:8302
2021-04-14T19:16:20.889Z [INFO] agent.server.serf.lan: serf: Attempting re-join to previously known node: ip-10-27-54-216.us-west-2.compute.internal: 10.27.54.216:8301
2021-04-14T19:16:22.826Z [WARN] agent: Check missed TTL, is now critical: check=vault:10.27.54.216:8200:vault-sealed-check
2021-04-14T19:16:23.961Z [INFO] agent.server.serf.lan: serf: Attempting re-join to previously known node: ip-10-27-54-149.us-west-2.compute.internal: 10.27.54.149:8301
2021-04-14T19:16:23.961Z [INFO] agent.server.serf.wan: serf: Attempting re-join to previously known node: ip-10-27-54-149.us-west-2.compute.internal.vault-wei: 10.27.54.149:8302
2021-04-14T19:16:27.033Z [INFO] agent.server.serf.lan: serf: Attempting re-join to previously known node: ip-10-27-52-37.us-west-2.compute.internal: 10.27.52.37:8301
2021-04-14T19:16:27.033Z [INFO] agent.server.serf.wan: serf: Attempting re-join to previously known node: ip-10-27-54-216.us-west-2.compute.internal.vault-wei: 10.27.54.216:8302
2021-04-14T19:16:27.036Z [INFO] agent.server.serf.lan: serf: Re-joined to previously known node: ip-10-27-52-37.us-west-2.compute.internal: 10.27.52.37:8301
2021-04-14T19:16:30.105Z [INFO] agent.server.serf.wan: serf: Attempting re-join to previously known node: ip-10-27-54-64.us-west-2.compute.internal.vault-wei: 10.27.54.64:8302
2021-04-14T19:16:33.177Z [INFO] agent.server.serf.wan: serf: Attempting re-join to previously known node: ip-10-27-54-104.us-west-2.compute.internal.vault-wei: 10.27.54.104:8302
2021-04-14T19:16:36.249Z [WARN] agent.server.serf.wan: serf: Failed to re-join any previously known node
2021-04-14T19:16:40.431Z [WARN] agent: Coordinate update blocked by ACLs: accessorID=00000000-0000-0000-0000-000000000002
2021-04-14T19:16:44.095Z [INFO] agent: Newer Consul version available: new_version=1.9.4 current_version=1.9.3
2021-04-14T19:16:55.722Z [WARN] agent: Coordinate update blocked by ACLs: accessorID=00000000-0000-0000-0000-000000000002
2021-04-14T19:17:00.901Z [WARN] agent.server.raft: heartbeat timeout reached, starting election: last-leader=10.27.52.37:8300
2021-04-14T19:17:00.901Z [INFO] agent.server.raft: entering candidate state: node="Node at 10.27.54.6:8300 [Candidate]" term=8502
2021-04-14T19:17:08.393Z [WARN] agent.server.raft: Election timeout reached, restarting election
2021-04-14T19:17:08.393Z [INFO] agent.server.raft: entering candidate state: node="Node at 10.27.54.6:8300 [Candidate]" term=8503
2021-04-14T19:17:09.921Z [ERROR] agent.http: Request error: method=GET url=/v1/kv/vault/core/lock from=127.0.0.1:58578 error="No cluster leader"
2021-04-14T19:17:09.995Z [ERROR] agent.http: Request error: method=PUT url=/v1/session/renew/e62f764e-670c-40ef-d91e-2bd8125b7417 from=127.0.0.1:58592 error="No cluster leader"
2021-04-14T19:17:10.464Z [ERROR] agent.http: Request error: method=GET url=/v1/kv/vault/core/lock from=127.0.0.1:58626 error="No cluster leader"
2021-04-14T19:17:10.547Z [ERROR] agent.http: Request error: method=GET url=/v1/kv/vault/core/lock from=127.0.0.1:58628 error="No cluster leader"
2021-04-14T19:17:10.904Z [ERROR] agent.http: Request error: method=GET url=/v1/kv/vault/core/lock from=127.0.0.1:58630 error="No cluster leader"