New server with the same persistent volume can't join cluster

I have a cluster of 3 servers running, each with a persistent volume attached that holds the Consul data directory. I then killed one server, and the autoscaling group brought up a new server (10.27.54.6), which mounted the volume from the killed server and tried to join the cluster. However, it ended up throwing a bunch of No cluster leader errors and couldn't join.

From the logs pasted below, you can see that it first discovered the other two live nodes (10.27.52.37, 10.27.53.68) and started joining:

agent: (LAN) joining: lan_addresses=[10.27.54.6, 10.27.52.37, 10.27.53.68]

looks like it joined:

agent: Join cluster completed. Synced with initial agents: cluster=LAN num_agents=3

Then it printed out a synced service (the IP in it no longer exists and probably belonged to a server that was rotated out earlier):

agent: Synced service: service=vault:10.27.54.216:8200

Then it tried to re-join some previously known nodes:

agent.server.serf.wan: serf: Attempting re-join to previously known node: ip-10-27-54-109...
agent.server.serf.lan: serf: Attempting re-join to previously known node: ip-10-27-54-216...
agent.server.serf.lan: serf: Attempting re-join to previously known node: ip-10-27-54-149.us-west-2.compute.internal...

none of which exist anymore. In between, I do see it trying to re-join one of the live ones:

agent.server.serf.lan: serf: Attempting re-join to previously known node: ip-10-27-52-37.us-west-2.compute.internal: 10.27.52.37:8301

and was able to join:

agent.server.serf.lan: serf: Re-joined to previously known node: ip-10-27-52-37.us-west-2.compute.internal: 10.27.52.37:8301

At this point, since we already have 2 servers (10.27.52.37, 10.27.53.68) forming a cluster with 10.27.52.37 as leader, I wouldn't expect the new node to attempt a leader election. But it seems that 10.27.54.6 only sees 10.27.52.37 and tried to kick off an election anyway:

agent.server.raft: heartbeat timeout reached, starting election: last-leader=10.27.52.37:8300
agent.server.raft: entering candidate state: node="Node at 10.27.54.6:8300 [Candidate]" term=8502
agent.server.raft: Election timeout reached, restarting election
agent.server.raft: entering candidate state: node="Node at 10.27.54.6:8300 [Candidate]" term=8503

and then I saw the dreaded:

agent.http: Request error: method=GET url=/v1/kv/vault/core/lock from=127.0.0.1:58578 error="No cluster leader"

I suspect the node has persisted a stale view of previously known hosts (all those dead ones) and that this is somehow interfering with the new node coming up.

Here’s the full log:

# journalctl -f -u consul
-- Logs begin at Tue 2021-03-02 17:36:36 UTC. --
2021-04-14T19:16:16.028Z [WARN]  agent: The 'ui' field is deprecated. Use the 'ui_config.enabled' field instead.
2021-04-14T19:16:16.028Z [WARN]  agent: Node name "ip-10-27-54-6.us-west-2.compute.internal" will not be discoverable via DNS due to invalid characters. Valid characters include all alpha-numerics and dashes.
2021-04-14T19:16:16.028Z [WARN]  agent: bootstrap_expect > 0: expecting 3 servers
2021-04-14T19:16:16.037Z [WARN]  agent.auto_config: skipping file /mnt/apps/consul/config/consul.ca, extension must be .hcl or .json, or config format must be set
2021-04-14T19:16:16.037Z [WARN]  agent.auto_config: skipping file /mnt/apps/consul/config/consul.crt, extension must be .hcl or .json, or config format must be set
2021-04-14T19:16:16.037Z [WARN]  agent.auto_config: skipping file /mnt/apps/consul/config/consul.key, extension must be .hcl or .json, or config format must be set
2021-04-14T19:16:16.037Z [WARN]  agent.auto_config: The 'acl_datacenter' field is deprecated. Use the 'primary_datacenter' field instead.
2021-04-14T19:16:16.037Z [WARN]  agent.auto_config: The 'ui' field is deprecated. Use the 'ui_config.enabled' field instead.
2021-04-14T19:16:16.037Z [WARN]  agent.auto_config: Node name "ip-10-27-54-6.us-west-2.compute.internal" will not be discoverable via DNS due to invalid characters. Valid characters include all alpha-numerics and dashes.
2021-04-14T19:16:16.037Z [WARN]  agent.auto_config: bootstrap_expect > 0: expecting 3 servers
2021-04-14T19:16:17.792Z [INFO]  agent.server.raft: restored from snapshot: id=8482-52383256-1618427004408
2021-04-14T19:16:17.820Z [INFO]  agent.server.raft: initial configuration: index=52387055 servers="[{Suffrage:Voter ID:34fe974a-dd7e-a82d-54aa-f4c36fb2ca8a Address:10.27.52.37:8300} {Suffrage:Nonvoter ID:09b8e65b-cf0e-4578-c78a-e2737ef5b0eb Address:10.27.53.68:8300}]"
2021-04-14T19:16:17.821Z [INFO]  agent.server.raft: entering follower state: follower="Node at 10.27.54.6:8300 [Follower]" leader=
2021-04-14T19:16:17.821Z [INFO]  agent.server.serf.wan: serf: EventMemberJoin: ip-10-27-54-6.us-west-2.compute.internal.vault-wei 10.27.54.6
2021-04-14T19:16:17.824Z [INFO]  agent.server.serf.lan: serf: EventMemberJoin: ip-10-27-54-6.us-west-2.compute.internal 10.27.54.6
2021-04-14T19:16:17.824Z [INFO]  agent.router: Initializing LAN area manager
2021-04-14T19:16:17.826Z [INFO]  agent: Started DNS server: address=127.0.0.1:8600 network=udp
2021-04-14T19:16:17.827Z [INFO]  agent.server.serf.wan: serf: Attempting re-join to previously known node: ip-10-27-54-233.us-west-2.compute.internal.vault-wei: 10.27.54.233:8302
2021-04-14T19:16:17.827Z [INFO]  agent.server.serf.lan: serf: Attempting re-join to previously known node: ip-10-27-54-233.us-west-2.compute.internal: 10.27.54.233:8301
2021-04-14T19:16:17.828Z [INFO]  agent.server: Adding LAN server: server="ip-10-27-54-6.us-west-2.compute.internal (Addr: tcp/10.27.54.6:8300) (DC: vault-wei)"
2021-04-14T19:16:17.828Z [INFO]  agent.server: Raft data found, disabling bootstrap mode
2021-04-14T19:16:17.829Z [INFO]  agent.server: Handled event for server in area: event=member-join server=ip-10-27-54-6.us-west-2.compute.internal.vault-wei area=wan
2021-04-14T19:16:17.829Z [INFO]  agent: Started DNS server: address=127.0.0.1:8600 network=tcp
2021-04-14T19:16:17.831Z [INFO]  agent: Starting server: address=127.0.0.1:8500 network=tcp protocol=http
2021-04-14T19:16:17.831Z [INFO]  agent: Starting server: address=127.0.0.1:8501 network=tcp protocol=https
2021-04-14T19:16:17.832Z [WARN]  agent: DEPRECATED Backwards compatibility with pre-1.9 metrics enabled. These metrics will be removed in a future version of Consul. Set `telemetry { disable_compat_1.9 = true }` to disable them.
2021-04-14T19:16:17.832Z [INFO]  agent: Retry join is supported for the following discovery methods: cluster=LAN discovery_methods="aliyun aws azure digitalocean gce k8s linode mdns os packet scaleway softlayer tencentcloud triton vsphere"
2021-04-14T19:16:17.833Z [INFO]  agent: Joining cluster...: cluster=LAN
2021-04-14T19:16:17.834Z [INFO]  agent: discover-aws: Address type  is not supported. Valid values are {private_v4,public_v4,public_v6}. Falling back to 'private_v4': cluster=LAN
2021-04-14T19:16:17.834Z [INFO]  agent: discover-aws: Region not provided. Looking up region in metadata...: cluster=LAN
2021-04-14T19:16:17.833Z [INFO]  agent: started state syncer
==> Consul agent running!
2021-04-14T19:16:17.836Z [INFO]  agent: discover-aws: Region is us-west-2: cluster=LAN
2021-04-14T19:16:17.837Z [INFO]  agent: discover-aws: Filter instances with ReplicaResourceId=vault-wei: cluster=LAN
2021-04-14T19:16:18.021Z [INFO]  agent: discover-aws: Instance i-043af3cc8de188114 has private ip 10.27.54.6: cluster=LAN
2021-04-14T19:16:18.021Z [INFO]  agent: discover-aws: Instance i-08e7c7acc017852f4 has private ip 10.27.52.37: cluster=LAN
2021-04-14T19:16:18.021Z [INFO]  agent: discover-aws: Instance i-0d8c8c35b5d4931f0 has private ip 10.27.53.68: cluster=LAN
2021-04-14T19:16:18.021Z [INFO]  agent: Discovered servers: cluster=LAN cluster=LAN servers="10.27.54.6 10.27.52.37 10.27.53.68"
2021-04-14T19:16:18.021Z [INFO]  agent: (LAN) joining: lan_addresses=[10.27.54.6, 10.27.52.37, 10.27.53.68]
2021-04-14T19:16:18.024Z [INFO]  agent.server.serf.lan: serf: EventMemberJoin: ip-10-27-53-68.us-west-2.compute.internal 10.27.53.68
2021-04-14T19:16:18.025Z [INFO]  agent.server.serf.lan: serf: EventMemberJoin: ip-10-27-52-37.us-west-2.compute.internal 10.27.52.37
2021-04-14T19:16:18.025Z [WARN]  agent.server.memberlist.lan: memberlist: Refuting a suspect message (from: ip-10-27-54-6.us-west-2.compute.internal)
2021-04-14T19:16:18.025Z [INFO]  agent.server: Adding LAN server: server="ip-10-27-53-68.us-west-2.compute.internal (Addr: tcp/10.27.53.68:8300) (DC: vault-wei)"
2021-04-14T19:16:18.025Z [INFO]  agent.server: Adding LAN server: server="ip-10-27-52-37.us-west-2.compute.internal (Addr: tcp/10.27.52.37:8300) (DC: vault-wei)"
2021-04-14T19:16:18.028Z [INFO]  agent: (LAN) joined: number_of_nodes=3
2021-04-14T19:16:18.028Z [INFO]  agent: Join cluster completed. Synced with initial agents: cluster=LAN num_agents=3
2021-04-14T19:16:18.137Z [WARN]  agent.server.raft: failed to get previous log: previous-index=52389681 last-index=52387075 error="log not found"
2021-04-14T19:16:18.179Z [WARN]  agent: Node info update blocked by ACLs: node=f56d7f64-dbcf-506b-5fa3-115b6011226c accessorID=00000000-0000-0000-0000-000000000002
2021-04-14T19:16:18.180Z [INFO]  agent.server.serf.lan: serf: EventMemberUpdate: ip-10-27-54-6.us-west-2.compute.internal
2021-04-14T19:16:18.180Z [INFO]  agent.server: Updating LAN server: server="ip-10-27-54-6.us-west-2.compute.internal (Addr: tcp/10.27.54.6:8300) (DC: vault-wei)"
2021-04-14T19:16:18.186Z [INFO]  agent: Synced service: service=vault:10.27.54.216:8200
2021-04-14T19:16:18.198Z [INFO]  agent: Synced service: service=vault:10.27.54.6:8200
2021-04-14T19:16:18.425Z [INFO]  agent.server.serf.wan: serf: EventMemberUpdate: ip-10-27-54-6.us-west-2.compute.internal.vault-wei
2021-04-14T19:16:18.425Z [INFO]  agent.server: Handled event for server in area: event=member-update server=ip-10-27-54-6.us-west-2.compute.internal.vault-wei area=wan
2021-04-14T19:16:18.491Z [WARN]  agent: Node info update blocked by ACLs: node=f56d7f64-dbcf-506b-5fa3-115b6011226c accessorID=00000000-0000-0000-0000-000000000002
2021-04-14T19:16:18.497Z [INFO]  agent: Synced service: service=vault:10.27.54.216:8200
2021-04-14T19:16:18.502Z [INFO]  agent: Synced service: service=vault:10.27.54.6:8200
2021-04-14T19:16:20.889Z [INFO]  agent.server.serf.wan: serf: Attempting re-join to previously known node: ip-10-27-54-109.us-west-2.compute.internal.vault-wei: 10.27.54.109:8302
2021-04-14T19:16:20.889Z [INFO]  agent.server.serf.lan: serf: Attempting re-join to previously known node: ip-10-27-54-216.us-west-2.compute.internal: 10.27.54.216:8301
2021-04-14T19:16:22.826Z [WARN]  agent: Check missed TTL, is now critical: check=vault:10.27.54.216:8200:vault-sealed-check
2021-04-14T19:16:23.961Z [INFO]  agent.server.serf.lan: serf: Attempting re-join to previously known node: ip-10-27-54-149.us-west-2.compute.internal: 10.27.54.149:8301
2021-04-14T19:16:23.961Z [INFO]  agent.server.serf.wan: serf: Attempting re-join to previously known node: ip-10-27-54-149.us-west-2.compute.internal.vault-wei: 10.27.54.149:8302
2021-04-14T19:16:27.033Z [INFO]  agent.server.serf.lan: serf: Attempting re-join to previously known node: ip-10-27-52-37.us-west-2.compute.internal: 10.27.52.37:8301
2021-04-14T19:16:27.033Z [INFO]  agent.server.serf.wan: serf: Attempting re-join to previously known node: ip-10-27-54-216.us-west-2.compute.internal.vault-wei: 10.27.54.216:8302
2021-04-14T19:16:27.036Z [INFO]  agent.server.serf.lan: serf: Re-joined to previously known node: ip-10-27-52-37.us-west-2.compute.internal: 10.27.52.37:8301
2021-04-14T19:16:30.105Z [INFO]  agent.server.serf.wan: serf: Attempting re-join to previously known node: ip-10-27-54-64.us-west-2.compute.internal.vault-wei: 10.27.54.64:8302
2021-04-14T19:16:33.177Z [INFO]  agent.server.serf.wan: serf: Attempting re-join to previously known node: ip-10-27-54-104.us-west-2.compute.internal.vault-wei: 10.27.54.104:8302
2021-04-14T19:16:36.249Z [WARN]  agent.server.serf.wan: serf: Failed to re-join any previously known node
2021-04-14T19:16:40.431Z [WARN]  agent: Coordinate update blocked by ACLs: accessorID=00000000-0000-0000-0000-000000000002
2021-04-14T19:16:44.095Z [INFO]  agent: Newer Consul version available: new_version=1.9.4 current_version=1.9.3
2021-04-14T19:16:55.722Z [WARN]  agent: Coordinate update blocked by ACLs: accessorID=00000000-0000-0000-0000-000000000002
2021-04-14T19:17:00.901Z [WARN]  agent.server.raft: heartbeat timeout reached, starting election: last-leader=10.27.52.37:8300
2021-04-14T19:17:00.901Z [INFO]  agent.server.raft: entering candidate state: node="Node at 10.27.54.6:8300 [Candidate]" term=8502
2021-04-14T19:17:08.393Z [WARN]  agent.server.raft: Election timeout reached, restarting election
2021-04-14T19:17:08.393Z [INFO]  agent.server.raft: entering candidate state: node="Node at 10.27.54.6:8300 [Candidate]" term=8503
2021-04-14T19:17:09.921Z [ERROR] agent.http: Request error: method=GET url=/v1/kv/vault/core/lock from=127.0.0.1:58578 error="No cluster leader"
2021-04-14T19:17:09.995Z [ERROR] agent.http: Request error: method=PUT url=/v1/session/renew/e62f764e-670c-40ef-d91e-2bd8125b7417 from=127.0.0.1:58592 error="No cluster leader"
2021-04-14T19:17:10.464Z [ERROR] agent.http: Request error: method=GET url=/v1/kv/vault/core/lock from=127.0.0.1:58626 error="No cluster leader"
2021-04-14T19:17:10.547Z [ERROR] agent.http: Request error: method=GET url=/v1/kv/vault/core/lock from=127.0.0.1:58628 error="No cluster leader"
2021-04-14T19:17:10.904Z [ERROR] agent.http: Request error: method=GET url=/v1/kv/vault/core/lock from=127.0.0.1:58630 error="No cluster leader"

To provide a bit more context: we are running Consul v1.9.3 as the backend for our Vault.
Here's the configuration that we use for the Consul cluster:

{
  "server": true,
  "bootstrap_expect": 3,
  "leave_on_terminate": true,
  "datacenter": "vault-wei",
  "acl_datacenter": "vault-wei",
  "data_dir": "/mnt/apps/consul/data",
  "acl_default_policy": "deny",
  "acl_master_token": "...",
  "encrypt": "...",
  "telemetry": {
    "dogstatsd_addr": "127.0.0.1:8125",
    "dogstatsd_tags": ["consul_datacenter:vault-wei"]
  },
  "retry_join": [ "provider=aws tag_key=ReplicaResourceId tag_value=vault-wei" ],
  "ui": false,
  "disable_remote_exec": true,
  "addresses": {
    "http": "127.0.0.1",
    "https": "127.0.0.1"
  },
  "advertise_addr": "10.27.53.68",
  "ports": {
    "http": 8500,
    "https": 8501
  },
  "key_file": "/mnt/apps/consul/config/consul.key",
  "cert_file": "/mnt/apps/consul/config/consul.crt",
  "ca_file": "/mnt/apps/consul/config/consul.ca",
  "verify_incoming_rpc": true,
  "verify_incoming_https": false,
  "verify_outgoing": true
}

After a few restarts on a couple of nodes, all 3 nodes eventually join the cluster. Curling /v1/status/peers or /v1/status/leader shows 3 peers and 1 leader. However, on one follower node, I see the following log:

agent: Join cluster failed, will retry: cluster=LAN retry_interval=30s error="No servers to join"

So I restarted the Consul process on this node. Then another follower node went into No cluster leader. It was eventually able to join the cluster after about 30 minutes, and the cluster went back to having 3 nodes.

All this behavior is pretty puzzling to me. I suspect we either have something configured incorrectly or Consul has a bug.

Hi @wfeng-fsde,

agent: Join cluster failed, will retry: cluster=LAN retry_interval=30s error=“No servers to join”

Are you seeing the above error after the cluster is bootstrapped and has an active leader? Or is it just when Consul is starting up? If it is just at startup, check whether the systemd unit file is configured to start Consul after the network stack is up (I am making an assumption here that you are using systemd).
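
For example, the ordering directives would look something like this (purely illustrative; the binary path below is a guess on my part, and the config dir is taken from your logs):

[Unit]
Description=Consul agent
Wants=network-online.target
After=network-online.target

[Service]
ExecStart=/usr/local/bin/consul agent -config-dir=/mnt/apps/consul/config
Restart=on-failure

[Install]
WantedBy=multi-user.target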

What made you restart Consul manually? Was the above error message repeating indefinitely, and never able to join the cluster?

From the errors you shared in your first message, it mentions a heartbeat timeout. This happens when the active leader fails to send heartbeats to this node (probably due to a network issue). To find the reason, we would need the logs from the then-leader (10.27.52.37).

Finally, I hope you have all the necessary ports open between nodes, as per the Consul requirements described here: Consul reference architecture | Consul | HashiCorp Developer

Thanks @Ranjandas for the reply.

Are you seeing the above error after the cluster is bootstrapped and has an active leader? Or is it just when Consul is starting up? If it is just at startup, check whether the systemd unit file is configured to start Consul after the network stack is up (I am making an assumption here that you are using systemd).

The cluster has been bootstrapped.

What made you restart Consul manually? Was the above error message repeating indefinitely, and never able to join the cluster?

Our use case here is actually automated server rotation (for patching); we constantly rotate all our servers. Ideally, our server-rotation tooling would drain a Consul node (something like consul leave) before killing the underlying EC2 instance. However, I'm not sure that's possible from our tooling, and even if it is, we don't have that logic today.
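
If we do add a drain step, I imagine it would be a small termination hook along these lines (purely a sketch of the idea; the lifecycle hook name and ASG name below are made up):

#!/bin/sh
# Sketch of a pre-termination drain hook (names below are placeholders).
export CONSUL_HTTP_ADDR=http://127.0.0.1:8500

# Gracefully leave the cluster so the remaining servers see an intentional
# departure instead of a failed node.
consul leave

# Then let the ASG proceed with terminating the instance.
aws autoscaling complete-lifecycle-action \
  --lifecycle-hook-name consul-drain \
  --auto-scaling-group-name vault-wei-consul \
  --lifecycle-action-result CONTINUE \
  --instance-id "$(curl -s http://169.254.169.254/latest/meta-data/instance-id)"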

That error message was repeating indefinitely on one server. However, that server was actually in the cluster; I was watching the cluster members by curling /v1/status/peers and /v1/status/leader. So it was surprising to see that error message.
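
For reference, this is roughly how I was checking, against the local agent's HTTP API on 127.0.0.1:8500 (per our config above):

curl -s http://127.0.0.1:8500/v1/status/leader
curl -s http://127.0.0.1:8500/v1/status/peers
# with ACLs enabled, add: -H "X-Consul-Token: $CONSUL_HTTP_TOKEN"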

From the errors you shared in your first message, it mentions a heartbeat timeout. This happens when the active leader fails to send heartbeats to this node (probably due to a network issue). To find the reason, we would need the logs from the then-leader (10.27.52.37).

That is quite interesting. Is it possible that our heartbeat timeout setting is too small? From what I've seen, that server can communicate with the other 2 servers. When the new server was throwing the No cluster leader error, I actually saw the other 2 servers logging that they rejected its vote requests because there was already a leader.

Finally, I hope you have all the necessary ports open between nodes, as per the Consul requirements

Yes, I'm pretty sure we have all the necessary ports open. We have more than 5 Consul cluster deployments across our infra with the same settings and have been operating them for more than a year. This issue of No cluster leader during server rotation has been around for a long time; it is normally resolved by a rolling restart of the servers, or at worst a manual recovery. However, I'd like to figure out why we are having this issue.

I don't think your tooling is issuing a consul leave. If it were, then by default Consul would treat the leave as a permanent intent to leave the cluster and would not attempt to join it again on startup. But your server logs show that it is trying to re-join previously known nodes. ref: Agents - Configuration File Reference | Consul | HashiCorp Developer

Could you share the logs from the node itself (like in the first message) and also from the active leader when the node is rotated? This will help us understand the leader's view of the new node. Also, please consider sharing the output of the following commands at various stages instead of a direct API call:

  • consul members
  • consul operator raft list-peers
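
For example, run on any of the servers (the -stale flag lets the raft peer listing be answered by a non-leader, which is handy while the cluster is reporting No cluster leader):

consul members
consul operator raft list-peers -stale=true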

You can read more about the raft heartbeat timeout settings here: Server Performance Requirements | Consul | HashiCorp Developer
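
To be clear, there is no single heartbeat-timeout knob to tune directly; the Raft timings scale with the performance.raft_multiplier setting (default 5, with 1 recommended for production-grade hardware per that page). If you wanted to experiment, it goes in the server config like this:

{
  "performance": {
    "raft_multiplier": 1
  }
}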

Sorry for the miscommunication; I could have stated it better. What I meant is exactly that: our tooling does not issue consul leave at all. It just terminates the server.

Unfortunately, the cluster has recovered and I didn't save the logs. I've since tried to rotate more servers this way and the problem is not showing up, which is consistent with what we've seen before. My impression is that it has to do with how long the cluster has gone untouched: after a fresh rolling restart, which is effectively what happened here, rotating servers won't trigger this issue, but it might start happening again after a while (say a week) without rotating the servers.

I was able to reproduce the same issue on another Consul cluster that had been running untouched for 13 days. It's basically the same setup: a 3-node cluster (A, B, C). I terminated one node (C); a new EC2 instance D came up, mounted the same persistent volume, and tried to join the cluster, but ran into No cluster leader. I then terminated D and another server E came up, which ended up in the same No cluster leader state.

Here are the IP addresses of each server:

  • A (10.27.52.42)
  • B (10.27.53.99)
  • C (10.27.54.207)
  • D (10.27.54.17)
  • E (10.27.54.188)

I did not capture the log from server D; however, it should be pretty much the same as E's.

I’ve attached logs from A, B and E here.

Currently the cluster only has A and B operational.

consul_A.txt (70.4 KB) consul_E.txt (11.7 KB) consul_B.txt (9.8 KB)

The root cause of the issue you are facing is that the new servers (D & E) are re-using the raft node-id of server C (0bd86151-612d-47e6-a5c7-08287c3515e5) but with a different IP address (D -> 10.27.54.17, E -> 10.27.54.188). The node-id is persisted in the data directory, so it gets re-used when the server is rotated.

Here is what’s happening to your cluster.

Initially, when you terminate the node without issuing consul leave, the server transitions to the failed state. When a server node fails, the autopilot system removes it from the Raft peer set:

# Leader Log
2021-04-16T16:02:56.400Z [WARN]  agent: error getting server health from server: server=consul-eap-ip-10-27-54-207-ip-10-27-54-207.us-west-2.compute.internal error="context deadline exceeded"
2021-04-16T16:02:57.385Z [WARN]  agent.server.raft: failed to contact: server-id=0bd86151-612d-47e6-a5c7-08287c3515e5 time=4.943136554s
2021-04-16T16:02:57.400Z [INFO]  agent.server.autopilot: Attempting removal of failed server node: id=0bd86151-612d-47e6-a5c7-08287c3515e5 name=consul-eap-ip-10-27-54-207-ip-10-27-54-207.us-west-2.compute.internal address=10.27.54.207:8300
2021-04-16T16:02:57.400Z [INFO]  agent.server.serf.lan: serf: EventMemberLeave (forced): consul-eap-ip-10-27-54-207-ip-10-27-54-207.us-west-2.compute.internal 10.27.54.207
2021-04-16T16:02:57.400Z [INFO]  agent.server: Removing LAN server: server="consul-eap-ip-10-27-54-207-ip-10-27-54-207.us-west-2.compute.internal (Addr: tcp/10.27.54.207:8300) (DC: apiq-dev-us1-wei-eap)"
2021-04-16T16:02:57.400Z [INFO]  agent.server.raft: updating configuration: command=RemoveServer server-id=0bd86151-612d-47e6-a5c7-08287c3515e5 server-addr= servers="[{Suffrage:Voter ID:659cafd8-5944-ecc0-39f4-9639d6f91395 Address:10.27.52.42:8300} {Suffrage:Voter ID:607ffdfd-64be-90b8-2ee1-f65c801aef7a Address:10.27.53.99:8300}]"
2021-04-16T16:02:57.402Z [INFO]  agent.server.raft: removed peer, stopping replication: peer=0bd86151-612d-47e6-a5c7-08287c3515e5 last-index=244887
2021-04-16T16:02:57.406Z [INFO]  agent.server.autopilot: removed server: id=0bd86151-612d-47e6-a5c7-08287c3515e5
2021-04-16T16:02:57.406Z [INFO]  agent.server: deregistering member: member=consul-eap-ip-10-27-54-207-ip-10-27-54-207.us-west-2.compute.internal reason=left

Now the AWS ASG replaces the terminated server with a new one, but with the same data directory mounted. The new server reads the existing peer list from the data directory, and you can see that C's node-id is still in its raft member list:

# Logs from E
2021-04-16T16:16:55.004Z [INFO]  agent.server.raft: restored from snapshot: id=2-229447-1618519891672
2021-04-16T16:16:58.942Z [INFO]  agent.server.raft: initial configuration: index=245040 servers="[{Suffrage:Voter ID:659cafd8-5944-ecc0-39f4-9639d6f91395 Address:10.27.52.42:8300} {Suffrage:Voter ID:607ffdfd-64be-90b8-2ee1-f65c801aef7a Address:10.27.53.99:8300} {Suffrage:Voter ID:0bd86151-612d-47e6-a5c7-08287c3515e5 Address:10.27.54.17:8300}]"

After restoring the snapshot, E joins the existing cluster, and the leader adds the new node (with the same node ID but a different IP address):

# Leader Log
2021-04-16T16:16:59.167Z [INFO]  agent.server.serf.lan: serf: EventMemberJoin: consul-eap-ip-10-27-54-188-ip-10-27-54-188.us-west-2.compute.internal 10.27.54.188
2021-04-16T16:16:59.167Z [INFO]  agent.server: Adding LAN server: server="consul-eap-ip-10-27-54-188-ip-10-27-54-188.us-west-2.compute.internal (Addr: tcp/10.27.54.188:8300) (DC: apiq-dev-us1-wei-eap)"
2021-04-16T16:16:59.168Z [INFO]  agent.server.raft: updating configuration: command=AddNonvoter server-id=0bd86151-612d-47e6-a5c7-08287c3515e5 server-addr=10.27.54.188:8300 servers="[{Suffrage:Voter ID:659cafd8-5944-ecc0-39f4-9639d6f91395 Address:10.27.52.42:8300} {Suffrage:Voter ID:607ffdfd-64be-90b8-2ee1-f65c801aef7a Address:10.27.53.99:8300} {Suffrage:Nonvoter ID:0bd86151-612d-47e6-a5c7-08287c3515e5 Address:10.27.54.188:8300}]"

Raft starts replicating to the server; the first appendEntries is rejected, so the leader sends older logs to compensate:

# Leader Log
2021-04-16T16:16:59.170Z [INFO]  agent.server.raft: added peer, starting replication: peer=0bd86151-612d-47e6-a5c7-08287c3515e5
2021-04-16T16:16:59.176Z [WARN]  agent.server.raft: appendEntries rejected, sending older logs: peer="{Nonvoter 0bd86151-612d-47e6-a5c7-08287c3515e5 10.27.54.188:8300}" next=245049

The next stage is where things go terribly wrong. The autopilot system tries to remove the server with node-id 0bd86151-612d-47e6-a5c7-08287c3515e5 (with the old IP) while at the same time trying to promote the server with the same node-id (with the new IP) to a voter:

# Leader Log
2021-04-16T16:17:17.316Z [INFO]  agent.server.autopilot: Attempting removal of failed server node: id=0bd86151-612d-47e6-a5c7-08287c3515e5 name=consul-eap-ip-10-27-54-17-ip-10-27-54-17.us-west-2.compute.internal address=10.27.54.17:8300
2021-04-16T16:17:27.316Z [INFO]  agent.server.autopilot: Promoting server: id=0bd86151-612d-47e6-a5c7-08287c3515e5 address=10.27.54.188:8300 name=consul-eap-ip-10-27-54-188-ip-10-27-54-188.us-west-2.compute.internal
2021-04-16T16:17:27.316Z [INFO]  agent.server.raft: updating configuration: command=AddStaging server-id=0bd86151-612d-47e6-a5c7-08287c3515e5 server-addr=10.27.54.188:8300 servers="[{Suffrage:Voter ID:659cafd8-5944-ecc0-39f4-9639d6f91395 Address:10.27.52.42:8300} {Suffrage:Voter ID:607ffdfd-64be-90b8-2ee1-f65c801aef7a Address:10.27.53.99:8300} {Suffrage:Voter ID:0bd86151-612d-47e6-a5c7-08287c3515e5 Address:10.27.54.188:8300}]"
2021-04-16T16:17:27.321Z [INFO]  agent.server.autopilot: Attempting removal of failed server node: id=0bd86151-612d-47e6-a5c7-08287c3515e5 name=consul-eap-ip-10-27-54-207-ip-10-27-54-207.us-west-2.compute.internal address=10.27.54.207:8300
2021-04-16T16:17:37.313Z [INFO]  agent.server.raft: updating configuration: command=RemoveServer server-id=0bd86151-612d-47e6-a5c7-08287c3515e5 server-addr= servers="[{Suffrage:Voter ID:659cafd8-5944-ecc0-39f4-9639d6f91395 Address:10.27.52.42:8300} {Suffrage:Voter ID:607ffdfd-64be-90b8-2ee1-f65c801aef7a Address:10.27.53.99:8300}]"
2021-04-16T16:17:37.314Z [INFO]  agent.server.raft: removed peer, stopping replication: peer=0bd86151-612d-47e6-a5c7-08287c3515e5 last-index=245127
2021-04-16T16:17:37.315Z [INFO]  agent.server.raft: aborting pipeline replication: peer="{Nonvoter 0bd86151-612d-47e6-a5c7-08287c3515e5 10.27.54.188:8300}"
2021-04-16T16:17:37.317Z [INFO]  agent.server.autopilot: removed server: id=0bd86151-612d-47e6-a5c7-08287c3515e5

Finally, autopilot succeeds in removing the node (based on node-id), but it's the new server that gets removed in this case.

# Log on E
2021-04-16T16:17:50.070Z [WARN]  agent.server.raft: heartbeat timeout reached, starting election: last-leader=10.27.52.42:8300
2021-04-16T16:17:50.070Z [INFO]  agent.server.raft: entering candidate state: node="Node at 10.27.54.188:8300 [Candidate]" term=41
2021-04-16T16:17:56.509Z [WARN]  agent.server.raft: Election timeout reached, restarting election
2021-04-16T16:17:56.509Z [INFO]  agent.server.raft: entering candidate state: node="Node at 10.27.54.188:8300 [Candidate]" term=42
2021-04-16T16:18:06.007Z [WARN]  agent.server.raft: Election timeout reached, restarting election
2021-04-16T16:18:06.007Z [INFO]  agent.server.raft: entering candidate state: node="Node at 10.27.54.188:8300 [Candidate]" term=43
2021-04-16T16:18:08.188Z [ERROR] agent: Coordinate update error: error="No cluster leader"

Once the node is removed from the cluster, the Raft leader stops sending heartbeats to it (this is why we saw the heartbeat timeout in your case). This triggers an election on E, but it will never get a majority, as there is already a cluster leader. So this keeps repeating:

# Leader Log
2021-04-16T16:17:50.078Z [WARN]  agent.server.raft: rejecting vote request since we have a leader: from=10.27.54.188:8300 leader=10.27.52.42:8300
2021-04-16T16:17:56.511Z [WARN]  agent.server.raft: rejecting vote request since we have a leader: from=10.27.54.188:8300 leader=10.27.52.42:8300

In between, E tries to join the cluster again. The leader adds the node to the raft pool and starts replication:

# Leader Log
2021-04-16T16:18:37.313Z [INFO]  agent.server.raft: updating configuration: command=AddNonvoter server-id=0bd86151-612d-47e6-a5c7-08287c3515e5 server-addr=10.27.54.188:8300 servers="[{Suffrage:Voter ID:659cafd8-5944-ecc0-39f4-9639d6f91395 Address:10.27.52.42:8300} {Suffrage:Voter ID:607ffdfd-64be-90b8-2ee1-f65c801aef7a Address:10.27.53.99:8300} {Suffrage:Nonvoter ID:0bd86151-612d-47e6-a5c7-08287c3515e5 Address:10.27.54.188:8300}]"
2021-04-16T16:18:37.315Z [INFO]  agent.server.raft: added peer, starting replication: peer=0bd86151-612d-47e6-a5c7-08287c3515e5

But this time (because of those unwanted elections) E's election term has grown higher than the current leader's. The leader sees the newer term and steps down into follower state:

# Leader Log
2021-04-16T16:18:37.316Z [ERROR] agent.server.raft: peer has newer term, stopping replication: peer="{Nonvoter 0bd86151-612d-47e6-a5c7-08287c3515e5 10.27.54.188:8300}"
2021-04-16T16:18:37.316Z [INFO]  agent.server.raft: entering follower state: follower="Node at 10.27.52.42:8300 [Follower]" leader=
2021-04-16T16:18:37.316Z [INFO]  agent.server.raft: aborting pipeline replication: peer="{Voter 607ffdfd-64be-90b8-2ee1-f65c801aef7a 10.27.53.99:8300}"
2021-04-16T16:18:37.316Z [ERROR] agent.server.autopilot: failed to add raft non-voting peer: id=0bd86151-612d-47e6-a5c7-08287c3515e5 address=10.27.54.188:8300 error="leadership lost while committing log"

Now, with a higher term, E starts a new election. But every other node in the cluster finds that E's log index is older than their own (because the other nodes stayed active and their index kept incrementing), so they do not grant it their votes.

Because there is no active leader at this stage, every node eventually hits a heartbeat timeout; one of them enters candidate state, wins the election, and acquires leadership:

# Leader Log
2021-04-16T16:18:43.367Z [WARN]  agent.server.raft: heartbeat timeout reached, starting election: last-leader=
2021-04-16T16:18:43.367Z [INFO]  agent.server.raft: entering candidate state: node="Node at 10.27.52.42:8300 [Candidate]" term=49
2021-04-16T16:18:43.378Z [INFO]  agent.server.raft: election won: tally=2
2021-04-16T16:18:43.378Z [INFO]  agent.server.raft: entering leader state: leader="Node at 10.27.52.42:8300 [Leader]"
2021-04-16T16:18:43.378Z [INFO]  agent.server.raft: added peer, starting replication: peer=607ffdfd-64be-90b8-2ee1-f65c801aef7a
2021-04-16T16:18:43.378Z [INFO]  agent.server: cluster leadership acquired

Now autopilot kicks in again, removes E, and the story repeats.
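
For reference, the removals above come from autopilot's dead-server cleanup. You can inspect the relevant setting on the leader like this (I'm not suggesting you change it at this point, just pointing out where the knob lives):

consul operator autopilot get-config
# look for: CleanupDeadServers = true
# it can be toggled with: consul operator autopilot set-config -cleanup-dead-servers=false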

The reason you are not seeing this issue on a new cluster would be that the new server re-uses the IP of the old node, so Raft is happy with it. (I'd request you to verify this at your end.)

Let me know if you have any questions on this.


Thanks @Ranjandas. That was a fantastic write-up and I learned a lot about the process.

It seems to me that the root cause here is that we re-mount the same data directory for Consul after server rotation, which preserves the node id persisted in the data directory; this conflicts with Consul's cluster membership management logic, which uses both the IP and the node id to uniquely identify a node.
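
As a quick sanity check on a rotated server, I can compare the node id persisted on the volume with what raft is tracking (data_dir path from our config above):

cat /mnt/apps/consul/data/node-id
consul operator raft list-peers   # shows each server's ID and address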

However, I'm still not sure about the mechanism that makes the cluster right after a rolling restart (no server rotation) once it gets into this state. Why wouldn't Consul autopilot interfere in that case? In other words, how does a rolling restart break this bad cycle?

I attached the logs from my cluster after a rolling restart. I don’t see any logs like the following anymore:
agent.server.autopilot: Attempting removal of failed server node: id=0bd86151-612d-47e6-a5c7-08287c3515e5 name=...

And a further follow-up question: should we ditch the persistent volume to make this process robust? That could actually be dangerous for us, since if we ever lose 2 servers we might lose all the data. At least with the persistent volume we can still fix the cluster by intervening with either a rolling restart or a manual recovery.

Or is there a way to avoid the conflict with Consul's node id management? I see that under the Consul data directory we have a node-id file storing that node id. What if we remove that file (if it exists) during Consul bootstrap?

rolling_A.txt (15.2 KB) rolling_B.txt (15.5 KB) rolling_E.txt (12.9 KB)

If anyone knows whether Consul relies on both node id and IP to uniquely identify a node, and whether our practice of effectively keeping the same node id via persistent volume mounting conflicts with Consul's membership management logic, please comment.

In the meantime, I've experimented a bit with removing the node-id file from the Consul data directory when servers are rotated. The new server detects that it has mounted a volume that was previously attached to a different server, and it removes the node-id file before starting Consul. Consul then generates a new node id, and according to this PR:

changing the node id should not cause any issues. This is consistent with my observation: after making this change, rotating out a single server no longer causes the No cluster leader problem.
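
Roughly, the boot-time check is something like the following (a simplified sketch, not exactly what we run; the marker file is our own convention, not anything Consul-specific):

#!/bin/sh
# Simplified sketch: drop the persisted node-id when the volume was last
# attached to a different EC2 instance (marker file is illustrative).
DATA_DIR=/mnt/apps/consul/data
MARKER="$DATA_DIR/.last-instance-id"
CURRENT_ID="$(curl -s http://169.254.169.254/latest/meta-data/instance-id)"

if [ -f "$MARKER" ] && [ "$(cat "$MARKER")" != "$CURRENT_ID" ]; then
  rm -f "$DATA_DIR/node-id"   # Consul generates a fresh node id on next start
fi
echo "$CURRENT_ID" > "$MARKER"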

However, I do wonder whether changing the node id causes all the data under the Consul data directory to be discarded and re-synced from the leader, or whether Consul is smart enough to recognize that the data is still valid up to a point and only syncs incremental changes from the leader.

The reason a rolling restart fixes your issue is that when you restart the leader node, autopilot starts over fresh. I am wondering whether the behavior in your case, where a healthy node gets removed, should be filed as a bug or not :thinking:

Regarding the follow-up question: even if you lose 2 servers, you will still have data on the third; you just won't have an active cluster. You can still recover the cluster from the data on the third node (to the extent that data was synced to it). You could also increase the number of servers to 5 to improve availability. But as you rightly said, it won't be as easy as having a persistent volume.

You can also use Consul's snapshot feature to back up the state of your cluster, which you can restore in the event of a disaster.
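
For example:

# taken via the current leader; captures KV, catalog, ACLs, etc.
consul snapshot save backup.snap

# later, restore into a (re)built cluster
consul snapshot restore backup.snap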

At this stage I am not making any recommendations; I'll wait and see what other members have to say about this issue.

We already take snapshots of the persistent volumes for all our other stateful services, but have not turned it on for Consul. Our main use cases are service discovery and the KV store, and for the KV store we keep static values populated from GitHub, so overall we can rebuild Consul from scratch if disaster hits and we lose the persistent volumes or they get corrupted. However, I'm having second thoughts about this decision. Maybe we should turn on snapshots for Consul to cover the case of a corrupted persistent volume; that may speed up our recovery compared to rebuilding the cluster from zero.

Even if we do take snapshots, in that scenario our new servers will have different IPs and the node id conflict will still be there. Manual outage recovery should solve that, so maybe it's not an issue there.
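
(By manual outage recovery I mean the documented peers.json procedure: stop the servers, write raft/peers.json under the data directory listing the servers that should form the cluster, and start them again. For this cluster it would look roughly like this, using A's and B's ids and addresses from the logs above:)

# /mnt/apps/consul/data/raft/peers.json  (raft protocol 3 format)
[
  { "id": "659cafd8-5944-ecc0-39f4-9639d6f91395", "address": "10.27.52.42:8300", "non_voter": false },
  { "id": "607ffdfd-64be-90b8-2ee1-f65c801aef7a", "address": "10.27.53.99:8300", "non_voter": false }
]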

For the issue of server rotation, I've done extensive destructive testing on 2 clusters, terminating a random server every 7 minutes, after introducing the logic that removes the node-id file on new servers. One cluster remained operational after 17 hours and over 150 terminated servers. The other did get into a No cluster leader state after 15 hours; I think we might have hit some other edge case there due to the aggressive rotation. We are going to stick with this solution for now until we can improve it further.