Consul not electing new leader when blocking network traffic

bwmetcalf · December 15, 2021, 3:06pm

We are using k8s 1.21 in EKS and consul 0.37.0. To test Consul cluster resiliency, one of our tests was to block all traffic on an EKS worker with iptables. After doing so, the logs for one of the other consul-server pods show:

{"@level":"error","@message":"memberlist: Failed to send ping: read tcp 10.103.144.99:60580-\u003e10.103.163.214:443: read: connection reset by peer","@module":"agent.server.memberlist.wan","@timestamp":"2021-12-14T21:20:45.721103Z"}
{"@level":"info","@message":"memberlist: Suspect consul-server-1.dev-test-us-west-2 has failed, no acks received","@module":"agent.server.memberlist.wan","@timestamp":"2021-12-14T21:21:00.715708Z"}
{"@level":"error","@message":"memberlist: Push/Pull with consul-server-2 failed: dial tcp 10.103.128.221:8301: i/o timeout","@module":"agent.server.memberlist.lan","@timestamp":"2021-12-14T21:21:17.139782Z"}

but Consul never marks the node as failed or elects a new leader. As soon as we terminate the worker, Consul immediately logs that the node has failed and elects a new leader.

My guess is that Consul has some kind of hook in place to detect that a node has been terminated, but it seems a node that has lost networking should count as failed as well. Perhaps I didn’t wait long enough for this to actually happen?
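For anyone reading the excerpt above, note that it mixes messages from two gossip pools (the `@module` field ends in `.wan` or `.lan`). The following is a minimal Python sketch that groups the three quoted lines by pool. The function name `group_by_pool` is only illustrative and not part of any Consul tooling.

```python
import json
from collections import defaultdict

# The three log lines quoted in the post, verbatim (raw strings keep \u003e intact).
LOG_LINES = [
    r'{"@level":"error","@message":"memberlist: Failed to send ping: read tcp 10.103.144.99:60580-\u003e10.103.163.214:443: read: connection reset by peer","@module":"agent.server.memberlist.wan","@timestamp":"2021-12-14T21:20:45.721103Z"}',
    r'{"@level":"info","@message":"memberlist: Suspect consul-server-1.dev-test-us-west-2 has failed, no acks received","@module":"agent.server.memberlist.wan","@timestamp":"2021-12-14T21:21:00.715708Z"}',
    r'{"@level":"error","@message":"memberlist: Push/Pull with consul-server-2 failed: dial tcp 10.103.128.221:8301: i/o timeout","@module":"agent.server.memberlist.lan","@timestamp":"2021-12-14T21:21:17.139782Z"}',
]


def group_by_pool(lines):
    """Group memberlist log entries by the last segment of @module (wan or lan)."""
    pools = defaultdict(list)
    for line in lines:
        entry = json.loads(line)
        module = entry["@module"]
        if ".memberlist." not in module:
            continue  # not a gossip message
        pool = module.rsplit(".", 1)[-1]
        pools[pool].append((entry["@timestamp"], entry["@level"], entry["@message"]))
    return dict(pools)


if __name__ == "__main__":
    for pool, entries in group_by_pool(LOG_LINES).items():
        print(pool)
        for timestamp, level, message in entries:
            print(f"  {timestamp} [{level}] {message}")
```

In this excerpt only the WAN pool reports a suspected member, while the LAN pool reports a failed push/pull, which may help when comparing against logs from the other servers.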

aram · December 16, 2021, 8:26pm

Speaking only for myself, that version is too old for me to even guess at which commands and options were available. 1.11 has been released. If you’re setting up a new cluster, why not use something more recent?

lkysow · December 19, 2021, 12:53am

@aram those version numbers refer to consul-k8s, so I think they’re on a recent Consul version.

@bwmetcalf how long did you wait? I’ve tested leader election myself recently and it does work as expected after the requisite timeout (which I don’t remember off the top of my head).

Also, you’ll need to include the logs from the other server pods as well.
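One way to answer the "how long did you wait?" question is to measure it. Below is a minimal Python sketch that polls Consul's `/v1/status/leader` HTTP endpoint until the reported leader changes. The base URL (`http://127.0.0.1:8500`, for example via a port-forward to a surviving server), the 300-second timeout, and the function names are assumptions for illustration, not values from this thread.

```python
import json
import time
import urllib.request


def fetch_leader(base_url="http://127.0.0.1:8500"):
    """Return the Raft leader address reported by the agent (may be empty)."""
    with urllib.request.urlopen(f"{base_url}/v1/status/leader", timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))


def time_until_leader_change(old_leader, fetch=fetch_leader, timeout=300.0,
                             interval=1.0, clock=time.monotonic, sleep=time.sleep):
    """Poll until a leader other than old_leader is reported.

    Returns (elapsed_seconds, new_leader), or None if the timeout expires.
    """
    start = clock()
    while clock() - start < timeout:
        try:
            leader = fetch()
        except OSError:
            leader = ""  # agent unreachable or HTTP error; treat as "no leader known"
        if leader and leader != old_leader:
            return clock() - start, leader
        sleep(interval)
    return None


if __name__ == "__main__":
    before = fetch_leader()
    print(f"current leader: {before}; block traffic on its node now")
    print(time_until_leader_change(before))
```

Run it against a server that is not on the blocked worker; otherwise the poll itself will only ever see connection errors.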

Ranjandas · December 20, 2021, 12:56pm

@bwmetcalf, did you block both inbound and outbound traffic, and both TCP and UDP, using iptables? If you:

Only blocked inbound: the Consul agent can still talk to the other nodes in the cluster (8301/udp/tcp), so they may conclude that there has been a network partition and that the node is still alive.
Only blocked outbound: the node can still receive pings on port 8301 and will ack them, so it will be considered healthy.
Only blocked TCP: Serf uses UDP for pings by default and TCP only as a fallback, so blocking TCP alone won’t simulate the scenario.
Only blocked UDP: Serf will try UDP and, because UDP is blocked, fall back to TCP. You will see errors that suggest a network misconfiguration, but the health checks will still pass.

Also, if the worker node whose traffic you are blocking runs a Consul follower server, leader election won’t be triggered.
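To rule out the partial-block cases listed above, a test would need rules that drop every protocol in both directions. The following Python sketch only builds and prints the iptables commands and does not run them. The helper names are hypothetical, and the choice of the INPUT and OUTPUT chains is an assumption: depending on the CNI, pod traffic on a Kubernetes worker may traverse FORWARD instead, so verify that for your setup.

```python
import shlex


def full_block_rules(chains=("INPUT", "OUTPUT")):
    """Build iptables argv lists that drop all packets in the given chains.

    No -p option is passed, so the rules match TCP, UDP, and everything else.
    """
    return [["iptables", "-I", chain, "1", "-j", "DROP"] for chain in chains]


def undo_rules(rules):
    """Build the argv lists that delete the rules from full_block_rules."""
    return [[rule[0], "-D", rule[2]] + rule[4:] for rule in rules]


if __name__ == "__main__":
    rules = full_block_rules()
    print("# apply (this also cuts SSH to the node):")
    for rule in rules:
        print(shlex.join(rule))
    print("# revert:")
    for rule in undo_rules(rules):
        print(shlex.join(rule))
```

Passing `chains=("INPUT", "OUTPUT", "FORWARD")` extends the same rules to forwarded traffic if that turns out to be where the pod's packets flow.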
