Hi everyone,
I’m facing some issues with failover and outage recovery of a three-node Consul cluster.
Every node in the cluster runs Ubuntu 18.04 with Consul agent v1.8.0. The cluster was built following the deployment guide.
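For reference, each server agent is configured roughly like this (a sketch with placeholder values, not my exact files):
{
  "server": true,
  "bootstrap_expect": 3,
  "node_name": "server-A",
  "data_dir": "/opt/consul",
  "bind_addr": "<ipaddr-server-A>",
  "retry_join": ["<ipaddr-server-B>", "<ipaddr-server-C>"]
}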
Scenario
Server A (leader)
Server B
Server C
Steps to reproduce the failover and outage (I’m marking the current leader in bold; the exact commands are sketched right after the list):
- Consul cluster up & running with 3 servers [**A**, B, C]
- Kill Server A, B becomes leader [**B**, C]
- Kill Server B, C becomes leader [**C**]
- Start Server A, C remains leader [**C**, A]
- Kill Server C, A becomes leader [**A**]
- Start Server B → No cluster leader [?]
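In my tests, “killing” a server just means stopping its Consul service, and I check leadership from one of the surviving servers:
$ systemctl stop consul               # on the node being killed
$ consul operator raft list-peers     # on a surviving node, shows the current leader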
Question 1: is it expected to have no cluster leader at this point?
According to the deployment table, a 3-server cluster should tolerate the loss of 1 server, i.e. it should still elect a leader with 2 of 3 servers running, or am I missing something?
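(My understanding of the quorum arithmetic: Raft needs floor(N/2) + 1 voters, so with N = 3 that is floor(3/2) + 1 = 2, and the two live servers A and B should be enough to elect a leader.)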
At this point we have Server A and Server B running, but no cluster leader, so we need to perform a manual recovery of the cluster (I followed the outage recovery steps described here).
Outage Recovery Phase
- Stop the Consul service on both servers:
$ systemctl stop consul
- Add a peers.json file under the raft/ subdirectory of each server’s Consul data-dir:
[
  {
    "id": "<node-id-server-A>",
    "address": "<ipaddr-server-A>:8300",
    "non_voter": false
  },
  {
    "id": "<node-id-server-B>",
    "address": "<ipaddr-server-B>:8300",
    "non_voter": false
  }
]
where <node-id-server-X> can be read from <consul-data-dir>/node-id on each server.
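A quick sketch of how I put the file together (assuming /opt/consul as data_dir; substitute your own path):
$ cat /opt/consul/node-id          # prints the node’s Raft ID, run on each server
$ vi /opt/consul/raft/peers.json   # paste the JSON above with the real IDs and addresses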
- Start the Consul service on both machines:
$ systemctl start consul
As soon as the services are up and running, a new leader will be elected.
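You can follow the election in the logs (assuming Consul runs under a systemd unit named consul, as above); the new leader should log a “cluster leadership acquired” message:
$ journalctl -u consul -f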
Be aware that if one of the servers is still down, we need to force-leave the failed node:
$ consul force-leave <node-name>
This marks the node as left instead of failed, so it no longer counts toward the quorum and the cluster is no longer considered faulty (it doesn’t mean it’s fully healthy, but it’s healthy enough to recover).
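For example, in the scenario above Server C is still down at this point, so (assuming its node name is server-C):
$ consul force-leave server-C
$ consul members   # server-C should now show Status “left” instead of “failed”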
- Raft leadership can be inspected with the operator command:
$ consul operator raft list-peers
Node      ID                                    Address             State     Voter  RaftProtocol
server-A  01123ade-c12b-0d13-007f-5a0f11b30819  172.16.254.37:8300  follower  true   3
server-B  49aa0380-4fb1-48ea-9c07-573dc9584675  172.16.2.82:8300    leader    true   3
- We can also check the cluster members:
$ consul members
Node      Address             Status  Type    Build  Protocol  DC   Segment
server-A  172.16.254.37:8301  alive   server  1.8.0  2         dc1  <all>
server-B  172.16.2.82:8301    alive   server  1.8.0  2         dc1  <all>
At this point the cluster is fully healthy.
Now let’s suppose we also have client agents in the cluster (successfully joined before the outage). I followed the same steps described above, and after the recovery phase the client agents constantly log:
...
2020-07-06T12:16:41.990Z [ERROR] agent.dns: rpc error: error="No known Consul servers"
2020-07-06T12:16:41.991Z [WARN] agent.client.manager: No servers available
2020-07-06T12:16:41.991Z [ERROR] agent.dns: rpc error: error="No known Consul servers"
2020-07-06T12:16:41.992Z [WARN] agent.client.manager: No servers available
2020-07-06T12:16:41.992Z [ERROR] agent.dns: rpc error: error="No known Consul servers"
2020-07-06T12:16:41.993Z [WARN] agent.client.manager: No servers available
2020-07-06T12:16:41.993Z [ERROR] agent.dns: rpc error: error="No known Consul servers"
2020-07-06T12:16:43.348Z [WARN] agent.client.manager: No servers available
2020-07-06T12:16:43.348Z [ERROR] agent.dns: rpc error: error="No known Consul servers"
...
I think they are still trying to reach the old, unhealthy server set. Restarting the Consul agent on the clients does the trick.
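That is, on each client node (assuming the clients also run Consul under systemd):
$ systemctl restart consul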
Question 2: Is there a way to auto-recover the clients as well?
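For instance, would a retry_join stanza in the client configuration (a sketch with placeholder addresses, not my actual config) make the agents re-join the servers on their own, or does it only apply at agent startup?
{
  "retry_join": ["<ipaddr-server-A>", "<ipaddr-server-B>", "<ipaddr-server-C>"],
  "retry_interval": "30s"
}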
Thank you in advance for any answer!
