3-node Cluster - Failover and Outage Recovery

Hi everyone,

I’m facing some issues with failover and outage recovery on a three-node Consul cluster.
Every node runs Ubuntu 18.04 with Consul agent v1.8.0. The cluster was built following the deployment guide.

Scenario

Server A (leader)
Server B
Server C

Steps to reproduce the failover and outage (the servers still up at each step are listed in brackets):

  1. Consul Cluster up & running with 3 servers [A, B, C]
  2. Kill Server A, B becomes leader [B, C]
  3. Kill Server B, C becomes leader [C]
  4. Start Server A, C remains leader [C, A]
  5. Kill Server C, A becomes leader [A]
  6. Start Server B → No cluster leader [?]
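
For reference, a quick way to check which server currently holds Raft leadership at each step (run against any live server):

$ consul operator raft list-peers

or, on a single agent, to check whether it knows about a leader at all:

$ consul info | grep leader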

Question 1: Is it correct that there is no cluster leader at this point?

According to the deployment table, the cluster should be fine with 2 nodes out of 3, or am I missing something?
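
For reference, the quorum math behind that table (quorum = floor(N/2) + 1):

Servers  Quorum  Failure Tolerance
1        1       0
3        2       1
5        3       2
7        4       3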

At this point we have Server A and Server B running, but no cluster leader, so we need to perform a manual recovery of the cluster (I followed the outage recovery steps described here).

Outage Recovery Phase

  • Stop Consul service on both servers:
$ systemctl stop consul
  • Add a peers.json file in each server’s <consul-data-dir>/raft/ directory:
[
  {
    "id": "<node-id-server-A>",
    "address": "<ipaddr-server-A>:8300",
    "non_voter": false
  },
  {
    "id": "<node-id-server-B>",
    "address": "<ipaddr-server-B>:8300",
    "non_voter": false
  }
]

where node-id-server-X can be found in <consul-data-dir>/node-id (a scripted sketch of this whole recovery sequence is shown after these steps).

  • Start Consul service on both machines:
$ systemctl start consul

As soon as the services are up and running, a new leader will be elected.

:warning: Be aware that if one of the servers has actually failed, we need to force-leave the failed node:

consul force-leave <node-name>

This way the node will be marked as left instead of failed, so it no longer counts toward the quorum size and the cluster is not considered faulty (that doesn’t mean it’s fully healthy, but at least it’s healthy enough to recover :slight_smile: ).

  • Raft leadership can be inspected with the operator command:
$ consul operator raft list-peers

Node      ID                                    Address             State     Voter  RaftProtocol
server-A  01123ade-c12b-0d13-007f-5a0f11b30819  172.16.254.37:8300  follower  true   3
server-B  49aa0380-4fb1-48ea-9c07-573dc9584675  172.16.2.82:8300    leader    true   3
  • We can also check the cluster members:
$ consul members

Node      Address              Status  Type    Build  Protocol  DC   Segment
server-A  172.16.254.37:8301   alive   server  1.8.0  2         dc1  <all>
server-B  172.16.2.82:8301     alive   server  1.8.0  2         dc1  <all>

At this point the cluster is fully healthy.
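
As mentioned above, here is a scripted sketch of the same recovery sequence. The data-dir path /opt/consul and the placeholders are assumptions you would adapt to your own setup, so treat it as an illustration of the steps rather than a drop-in script:

#!/usr/bin/env bash
# Sketch only: run on each surviving server, replacing the placeholders.
CONSUL_DATA_DIR=/opt/consul   # assumption: set this to your actual data_dir

systemctl stop consul

# Write raft/peers.json listing every surviving server as a voter.
cat > "${CONSUL_DATA_DIR}/raft/peers.json" <<EOF
[
  { "id": "$(cat ${CONSUL_DATA_DIR}/node-id)", "address": "<this-server-ip>:8300", "non_voter": false },
  { "id": "<node-id-other-server>", "address": "<ipaddr-other-server>:8300", "non_voter": false }
]
EOF

systemctl start consul

# Once both servers are up again, verify that a leader has been elected:
consul operator raft list-peers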

Now let’s suppose that we also have client agents in the cluster (properly joined before the outage). I followed the steps described above again, and after the recovery phase the client agents constantly log:

...
2020-07-06T12:16:41.990Z [ERROR] agent.dns: rpc error: error="No known Consul servers"
2020-07-06T12:16:41.991Z [WARN]  agent.client.manager: No servers available
2020-07-06T12:16:41.991Z [ERROR] agent.dns: rpc error: error="No known Consul servers"
2020-07-06T12:16:41.992Z [WARN]  agent.client.manager: No servers available
2020-07-06T12:16:41.992Z [ERROR] agent.dns: rpc error: error="No known Consul servers"
2020-07-06T12:16:41.993Z [WARN]  agent.client.manager: No servers available
2020-07-06T12:16:41.993Z [ERROR] agent.dns: rpc error: error="No known Consul servers"
2020-07-06T12:16:43.348Z [WARN]  agent.client.manager: No servers available
2020-07-06T12:16:43.348Z [ERROR] agent.dns: rpc error: error="No known Consul servers"
...

I think they are still trying to reach the old, no-longer-healthy cluster. Restarting the Consul agents on the clients does the job.
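
For reference, getting a client back can be as simple as the following (assuming the agent is managed by systemd; the consul join alternative simply re-points a running client at one of the recovered servers):

$ systemctl restart consul

or, without restarting the agent:

$ consul join <ipaddr-server-A>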

Question 2: Is there a way to auto-recover the clients as well?

Thank you in advance for any answer!


Failure tolerance with a 3-node cluster is 1. You have exceeded this by stopping 2 of your nodes. In a 3-node cluster, replacement of any failed node is critical.
I can’t find a link to an HA deployment with just Consul, but the Vault reference architecture shows the Consul backend architecture for more resiliency: Vault with Consul storage reference architecture | Vault | HashiCorp Developer

If you need to lose 2 nodes in a cluster, you’ll need 5 nodes total.

Thank you for your answer, Mike :slight_smile:

What I’m trying to ask is this: according to the deployment table (linked in my previous post), if a 3-node cluster can tolerate the failure of just 1 node, I’d expect it to work with 2 nodes, no matter whether the cluster went down to 1 node and then back up to 2, or straight down from 3 to 2… either way there are still 2 nodes up and running, right?

Now let’s suppose that we have a 5-node cluster. Still according to the table, we can tolerate the failure of 2 nodes. Then, what if we go down from 5 nodes to 3, then to 2, and then back up to 3? Does that mean we no longer have a reliable cluster?
Does it mean we need to perform the outage recovery in order to make the cluster healthy again? If for some reason a network partition takes out 3 nodes out of 5, I could lose the cluster forever, and in my opinion that’s absolutely not tolerable in an HA solution.

You can see that this reasoning easily scales up to clusters of any size.

One more thing: once the outage recovery is successfully done, every client agent keeps polling a cluster that no longer exists. Wouldn’t it be more useful to restart the client agent automatically after a timeout and have it auto-rejoin the newly recovered cluster?

Thank you in advance for your answers!

In your scenario, the first node that went down has to re-join the cluster to establish quorum again before you take down any other node.

In a 5-node cluster you can bring down 2 nodes, but you want to re-establish quorum before bringing down any other nodes.

Once you’ve lost quorum, bad things happen. In your scenario of killing servers at various times, there’s a gap where a leader can’t be elected.

Take a look at:

https://sitano.github.io/2015/10/06/abt-consul-outage/

@hawk87,

Take a look at the conversation in this issue on GitHub regarding the min_quorum configuration setting. I believe this is relevant to the failure scenario you described.
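
If it helps, a minimal sketch of that setting in a server’s configuration file (JSON form; the value shown is an assumption, check the autopilot documentation for what fits your cluster size):

{
  "autopilot": {
    "min_quorum": 3
  }
}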

This GitHub comment on hashicorp/consul#6672 describes the behavior of the retry_join parameter. If all of the servers become simultaneously unavailable, you will need to restart or manually re-join the clients to the servers once they return.
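
For what it’s worth, a minimal sketch of the client-side join settings being discussed (the server addresses are placeholders; as the comment explains, the retried join only happens while the agent is starting up, which is why a restart or a manual re-join is needed once the servers return):

{
  "retry_join": ["<ipaddr-server-A>", "<ipaddr-server-B>", "<ipaddr-server-C>"],
  "retry_interval": "30s"
}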