3-node cluster unhealthy after leader lost network connection

Hi, I have a 3-server cluster. When I block the network connection of the leader, I expected the remaining 2 servers to elect a new leader, like they do when I terminate the leader. For some reason this does not happen: the leader is marked as failed, but the followers stay followers.

# on the leader
 $ consul members list | grep server
ip-10-140-32-151  10.140.32.151:8301  alive   server  1.6.1  2         eu-west-1  <all>
ip-10-140-37-156  10.140.37.156:8301  alive   server  1.6.1  2         eu-west-1  <all>
ip-10-140-41-175  10.140.41.175:8301  alive   server  1.6.1  2         eu-west-1  <all>
 $ consul operator raft list-peers
Node              ID                                    Address             State     Voter  RaftProtocol
ip-10-140-32-151  44ab630a-1258-78d3-d979-2f85b149c358  10.140.32.151:8300  leader    true   3
ip-10-140-37-156  cab982af-0496-5f27-c29e-7118270c633f  10.140.37.156:8300  follower  true   3
ip-10-140-41-175  25dc9732-aaac-3645-0168-a3e7b104cc3f  10.140.41.175:8300  follower  true   3

# on the followers
 $ consul members list | grep server
ip-10-140-32-151  10.140.32.151:8301  failed  server  1.6.1  2         eu-west-1  <all>
ip-10-140-37-156  10.140.37.156:8301  alive   server  1.6.1  2         eu-west-1  <all>
ip-10-140-41-175  10.140.41.175:8301  alive   server  1.6.1  2         eu-west-1  <all>
 $ consul operator raft list-peers
Error getting peers: Failed to retrieve raft configuration: Unexpected response code: 500 (No cluster leader)

I even tried to force a leader election by calling

$ consul force-leave ip-10-140-32-151

But the only change was that the status of the leader in the members list went from failed to left.

In any case, restoring the network connection restores the cluster. But during the network downtime the cluster is basically not functional.

I’d be grateful for any hints.

(Yes, you are seeing correctly in the outputs above: we still use version 1.6.1.)

Hi Harald,

It would depend on how you have blocked the network on the leader. If you only blocked the inbound traffic to the existing leader (8300/tcp, 8301/tcp+udp & 8302/tcp+udp), the node is still communicating with the cluster outbound. This means it is still part of the Serf pool and will flap between the failed and alive states:

  • failed, because the other nodes are not able to talk to the node, which results in heartbeat failures
  • alive, because the node can still talk to its peers on the Serf ports

You will be able to see this if you watch the consul members output from a different node:

watch -d consul members

You will also see from the logs that the node is constantly being removed and added back:

[INFO] memberlist: Suspect c2 has failed, no acks received
[INFO] memberlist: Marking c2 as failed, suspect timeout reached (0 peer confirmations)
[INFO] serf: EventMemberFailed: c2 192.168.64.46
[INFO] consul: Removing LAN server c2 (Addr: tcp/192.168.64.46:8300) (DC: dc1)
[INFO] serf: attempting reconnect to c2 192.168.64.46:8301
[INFO] memberlist: Suspect c2 has failed, no acks received
[INFO] serf: EventMemberJoin: c2 192.168.64.46
[INFO] consul: Adding LAN server c2 (Addr: tcp/192.168.64.46:8300) (DC: dc1)
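
If you don't have direct access to the log files, you can also stream the logs of a running agent; a minimal sketch, assuming default agent settings:

# stream the local agent's logs over its HTTP API (Ctrl-C to stop)
consul monitor -log-level=info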

In this situation, Raft still has a leader, but the follower nodes are not in a position to talk to the leader that Raft reports. The leader here is still able to talk to a quorum of nodes.
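
One quick way to see this difference is to ask each agent which leader it currently knows about; a minimal sketch, assuming the default HTTP API address of 127.0.0.1:8500:

# run on every server; prints the Raft leader this agent knows about
# (an empty string means the agent currently sees no leader)
curl -s http://127.0.0.1:8500/v1/status/leader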

Testing this on 1.9.3 shows this error message in such a scenario. (I am not sure in which version this message was added.)

[WARN]  agent.server: Raft has a leader but other tracking of the node would indicate that the node is unhealthy or does not exist. The network may be misconfigured.: leader=192.168.64.47:8300

force-leave is used when you want to transition a node from the failed state to the left state in the members list. This is helpful when your node has actually failed but the cluster still tries to contact it. In this case, force-leave would change the status, but when the node joins back (as part of the flapping) the state would change back to alive.
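
A minimal sketch of that behaviour, using the node name from your outputs:

# transition the failed node to left
consul force-leave ip-10-140-32-151

# the node is now reported as left ...
consul members | grep ip-10-140-32-151

# ... until it flaps back in, at which point it is reported as alive again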

If you want to simulate the exact scenario of an ungraceful leader termination, make sure that you block both inbound and outbound traffic.

iptables rules like the ones below would help.

# block inbound RPC
iptables -I INPUT -p tcp --dport 8300 -j DROP

# block inbound Serf LAN & WAN
iptables -I INPUT -p tcp --dport 8301 -j DROP
iptables -I INPUT -p udp --dport 8301 -j DROP
iptables -I INPUT -p tcp --dport 8302 -j DROP
iptables -I INPUT -p udp --dport 8302 -j DROP

# block outbound RPC
iptables -I OUTPUT -p tcp --dport 8300 -j DROP

# block outbound Serf LAN & WAN
iptables -I OUTPUT -p tcp --dport 8301 -j DROP
iptables -I OUTPUT -p udp --dport 8301 -j DROP
iptables -I OUTPUT -p tcp --dport 8302 -j DROP
iptables -I OUTPUT -p udp --dport 8302 -j DROP

Doing this would result in logs like the following

[WARN]  agent.server.raft: failed to contact: server-id=d9f4ead3-ce7f-e114-ca17-6c64854aa7b5 time=2.500070665s
[WARN]  agent.server.raft: failed to contact: server-id=dc6fbf3d-ec22-f832-63ea-00293a14f1ea time=2.50019952s
[WARN]  agent.server.raft: failed to contact quorum of nodes, stepping down
[INFO]  agent.server.raft: entering follower state: follower="Node at 192.168.64.47:8300 [Follower]" leader=
[INFO]  agent.server.raft: aborting pipeline replication: peer="{Voter d9f4ead3-ce7f-e114-ca17-6c64854aa7b5 192.168.64.48:8300}"
[INFO]  agent.server.raft: aborting pipeline replication: peer="{Voter dc6fbf3d-ec22-f832-63ea-00293a14f1ea 192.168.64.46:8300}"
[WARN]  agent.server.coordinate: Batch update failed: error="leadership lost while committing log"
[INFO]  agent.server: cluster leadership lost
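
When you are done testing, you can restore connectivity by deleting the rules again; a sketch, assuming you added only the rules above (-D removes the first rule matching the given specification):

# remove the inbound blocks
iptables -D INPUT -p tcp --dport 8300 -j DROP
iptables -D INPUT -p tcp --dport 8301 -j DROP
iptables -D INPUT -p udp --dport 8301 -j DROP
iptables -D INPUT -p tcp --dport 8302 -j DROP
iptables -D INPUT -p udp --dport 8302 -j DROP

# remove the outbound blocks
iptables -D OUTPUT -p tcp --dport 8300 -j DROP
iptables -D OUTPUT -p tcp --dport 8301 -j DROP
iptables -D OUTPUT -p udp --dport 8301 -j DROP
iptables -D OUTPUT -p tcp --dport 8302 -j DROP
iptables -D OUTPUT -p udp --dport 8302 -j DROP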

Hope this helps.


Thank you very much @Ranjandas for this great description.

I didn’t notice any flapping, and I am pretty sure that the leader could not reach the followers and vice versa; I checked with netcat whether I could establish a TCP connection. (I am aware that this still leaves the possibility of UDP traffic getting through, but since all ports and protocols are controlled by the same SecurityGroup, I would be surprised by a different behaviour.)
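
For reference, the checks were along these lines (a sketch, using the leader's address from the outputs above; nc -z only probes whether a connection can be opened):

# from a follower, probe the leader's RPC and Serf TCP ports
nc -vz 10.140.32.151 8300
nc -vz 10.140.32.151 8301

# a UDP probe (-u) is unreliable: a DROPped packet simply gets no reply
nc -vzu 10.140.32.151 8301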

Anyway, it is worth checking this again and even applying the iptables rules you suggested.

Hi @harald.svab,

Do you mind sharing your SecurityGroup rules? To reproduce this with SGs alone, you would need a separate SG attached to each instance. I hope this is how you have set up the SGs as well.
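
If it helps, the rules can be dumped with the AWS CLI; a minimal sketch with placeholder IDs:

# list the SGs attached to an instance (instance ID is a placeholder)
aws ec2 describe-instances --instance-ids i-0123456789abcdef0 \
  --query 'Reservations[].Instances[].SecurityGroups'

# show the rules of one of those SGs (group ID is a placeholder)
aws ec2 describe-security-groups --group-ids sg-0123456789abcdef0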

So in your case, when you say you blocked the network connection of the leader, I would expect that both the inbound and outbound rules are empty (or contain only the non-Consul ones) for this individual instance. Was that the case in your setup?

Anyway, I will wait for your findings with iptables.