Hi everyone,
I’m facing some issues with failover and outage recovery on a three-node Consul cluster.
I’m running Ubuntu 18.04 and Consul agent v1.8.0 on every node of the cluster. The cluster was built following the deployment guide.
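For context, each server agent runs with a minimal configuration roughly like the one below (just a sketch of a plain deployment-guide setup; the data_dir path and the join addresses are placeholders, not copied from my machines):
{
  "server": true,
  "bootstrap_expect": 3,
  "datacenter": "dc1",
  "data_dir": "/opt/consul",
  "retry_join": ["<ipaddr-server-A>", "<ipaddr-server-B>", "<ipaddr-server-C>"]
}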
Scenario
Server A (leader)
Server B
Server C
Steps to reproduce the failover and outage (the servers still running are listed in brackets, and the current leader is named at each step):
- Consul Cluster up & running with 3 servers [A, B, C]
- Kill Server A, B becomes leader [B, C]
- Kill Server B, C becomes leader [C]
- Start Server A, C remains leader [C, A]
- Kill Server C, A becomes leader [A]
- Start Server B → No cluster leader [?]
Question 1: Is it correct that there is no cluster leader at this point?
According to the deployment table, the cluster should be fine with 2 nodes out of 3, or am I missing something?
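For reference, this is the quorum math behind the deployment table:
quorum = floor(servers / 2) + 1
3 servers → quorum 2 → tolerates 1 failed server
2 servers → quorum 2 → tolerates 0 failed servers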
At this point we have Server A and Server B running, but no cluster leader, so we need to perform a manual recovery of the cluster (I followed the outage recovery steps described here).
Outage Recovery Phase
- Stop Consul service on both servers:
$ systemctl stop consul
- Add a peers.json file in the Consul data-dir on both servers:
[
{
"id": "<node-id-server-A>",
"address": "<ipaddr-server-A>:8300",
"non_voter": false
},
{
"id": "<node-id-server-B>",
"address": "<ipaddr-server-B>:8300",
"non_voter": false
}
]
where node-id-server-X can be found in <consul-data-dir>/node-id.
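For example, this is how I read the id for Server B (the data-dir path here is just illustrative):
$ cat /opt/consul/node-id
49aa0380-4fb1-48ea-9c07-573dc9584675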
- Start Consul service on both machines:
$ systemctl start consul
As soon as the services are up and running, a new leader will be elected.
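To watch the election happen, I tail the agent logs on one of the servers (assuming the agent runs under systemd as above):
$ journalctl -u consul -f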
Be aware that in case of a failed server, we need to force-leave the failed node:
consul force-leave <node-name>
In this case the node will be marked as left instead of failed. In this way the quorum size is updated and the cluster is no longer considered faulty (it doesn’t mean that it’s fully healthy, but at least it’s healthy enough for recovery).
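After the force-leave, consul members should report something like this (Server C’s address is a placeholder here):
$ consul members
Node      Address                 Status  Type    Build  Protocol  DC   Segment
server-A  172.16.254.37:8301      alive   server  1.8.0  2         dc1  <all>
server-B  172.16.2.82:8301        alive   server  1.8.0  2         dc1  <all>
server-C  <ipaddr-server-C>:8301  left    server  1.8.0  2         dc1  <all>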
- Raft leadership can be inspected with the operator command:
$ consul operator raft list-peers
Node      ID                                    Address             State     Voter  RaftProtocol
server-A  01123ade-c12b-0d13-007f-5a0f11b30819  172.16.254.37:8300  follower  true   3
server-B  49aa0380-4fb1-48ea-9c07-573dc9584675  172.16.2.82:8300    leader    true   3
- We can also check the cluster members:
$ consul members
Node      Address             Status  Type    Build  Protocol  DC   Segment
server-A  172.16.254.37:8301  alive   server  1.8.0  2         dc1  <all>
server-B  172.16.2.82:8301    alive   server  1.8.0  2         dc1  <all>
At this point the cluster is fully healthy.
Now let’s suppose we also have client agents in the cluster (all correctly joined before the outage). I followed the same steps described above, and after the recovery phase the client agents are constantly logging:
...
2020-07-06T12:16:41.990Z [ERROR] agent.dns: rpc error: error="No known Consul servers"
2020-07-06T12:16:41.991Z [WARN] agent.client.manager: No servers available
2020-07-06T12:16:41.991Z [ERROR] agent.dns: rpc error: error="No known Consul servers"
2020-07-06T12:16:41.992Z [WARN] agent.client.manager: No servers available
2020-07-06T12:16:41.992Z [ERROR] agent.dns: rpc error: error="No known Consul servers"
2020-07-06T12:16:41.993Z [WARN] agent.client.manager: No servers available
2020-07-06T12:16:41.993Z [ERROR] agent.dns: rpc error: error="No known Consul servers"
2020-07-06T12:16:43.348Z [WARN] agent.client.manager: No servers available
2020-07-06T12:16:43.348Z [ERROR] agent.dns: rpc error: error="No known Consul servers"
...
I think they are still trying to connect to the unhealthy cluster. Restarting the Consul agent on each client node does the job.
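What I do today as a manual workaround, on every client node (assuming the client agent also runs as a systemd service named consul):
# restart the local client agent so it picks up the recovered servers again
$ systemctl restart consul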
Question 2: Is there a way to also auto-recover the clients?
Thank you in advance for any answer!