Hi,
I have four Kubernetes nodes:
kubectl get nodes
NAME STATUS ROLES AGE VERSION
192.168.1.116 Ready controlplane,etcd,worker 12h v1.18.9-6+3d5bea383c722a
192.168.1.172 Ready worker 12h v1.18.9-6+3d5bea383c722a
192.168.1.20 Ready controlplane,etcd,worker 12h v1.18.9-6+3d5bea383c722a
192.168.1.88 Ready controlplane,etcd,worker 12h v1.18.9-6+3d5bea383c722a
kubectl get pods -n consul -o wide|grep server
consul-server-0 1/1 Running 0 45m 10.42.3.45 192.168.1.172 <none> <none>
consul-server-1 1/1 Running 0 45m 10.42.0.16 192.168.1.20 <none> <none>
consul-server-2 1/1 Running 0 46m 10.42.2.125 192.168.1.116 <none> <none>
I do the following steps to reproduce this issue:
- Run sudo systemctl stop docker on 192.168.1.116. After that, the consul cluster still works normally.
- Run sudo systemctl start docker on 192.168.1.116 and, at the same time, sudo systemctl stop docker on 192.168.1.172:
kubectl get pods -n consul -o wide|grep server
consul-server-0 1/1 Terminating 0 45m 10.42.3.45 192.168.1.172 <none> <none>
consul-server-1 0/1 Running 0 45m 10.42.0.16 192.168.1.20 <none> <none>
consul-server-2 0/1 Running 0 46m 10.42.2.125 192.168.1.116 <none> <none>
The consul cluster no longer works.
To get more detail on what is happening, I added some log statements to hashicorp/raft (the Golang implementation of the Raft consensus protocol) and rebuilt consul-1.9.1.
In consul-server-2, consul members shows:
/ # consul members
Node Address Status Type Build Protocol DC Segment
consul-server-1 10.42.0.16:8301 alive server 1.9.1dev 2 pri <all>
consul-server-2 10.42.2.125:8301 alive server 1.9.1dev 2 pri <all>
And consul-server-2 keeps restarting the election:
2021-09-29T02:44:30.385Z [WARN] agent.server.raft: Election timeout reached, restarting election
2021-09-29T02:44:30.385Z [INFO] agent.server.raft: entering candidate state: node="Node at 10.42.2.125:8300 [Candidate]" term=92
2021-09-29T02:44:30.385Z [INFO] agent.server.raft: current server:: id->=697db73f-6bb9-f717-1c24-81139c980650 address->=10.42.3.45:8300
2021-09-29T02:44:30.385Z [INFO] agent.server.raft: current server:: id->=697db73f-6bb9-f717-1c24-81139c980651 address->=10.42.0.16:8300
2021-09-29T02:44:30.385Z [INFO] agent.server.raft: current server:: id->=697db73f-6bb9-f717-1c24-81139c980652 address->=10.42.2.88:8300
2021-09-29T02:44:30.385Z [INFO] agent.server.raft: ask peer to vote:: id->=697db73f-6bb9-f717-1c24-81139c980650 address->=10.42.3.45:8300
Could not find address for server id 697db73f-6bb9-f717-1c24-81139c980650
ID: 697db73f-6bb9-f717-1c24-81139c980652 Name: consul-server-2 Addr: 10.42.2.125:8300
ID: 697db73f-6bb9-f717-1c24-81139c980651 Name: consul-server-1 Addr: 10.42.0.16:8300
2021-09-29T02:44:30.386Z [WARN] agent.server.raft: unable to get address for server, using fallback address: id=697db73f-6bb9-f717-1c24-81139c980650 fallback=10.42.3.45:8300 error="Could not find address for server id 697db73f-6bb9-f717-1c24-81139c980650"
2021-09-29T02:44:30.386Z [INFO] agent.server.raft: ask peer to vote:: id->=697db73f-6bb9-f717-1c24-81139c980651 address->=10.42.0.16:8300
2021-09-29T02:44:30.973Z [ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter 697db73f-6bb9-f717-1c24-81139c980650 10.42.3.45:8300}" error="dial tcp <nil>->10.42.3.45:8300: i/o timeout"
2021-09-29T02:44:31.986Z [WARN] agent: Syncing node info failed.: error="No cluster leader"
2021-09-29T02:44:31.986Z [ERROR] agent: failed to sync changes: error="No cluster leader"
2021-09-29T02:44:36.743Z [WARN] agent.server.raft: Election timeout reached, restarting election
2021-09-29T02:44:36.743Z [INFO] agent.server.raft: entering candidate state: node="Node at 10.42.2.125:8300 [Candidate]" term=93
and in consul-server-1
/ # consul members
Node Address Status Type Build Protocol DC Segment
consul-server-0 10.42.3.45:8301 failed server 1.9.1dev 2 pri <all>
consul-server-1 10.42.0.16:8301 alive server 1.9.1dev 2 pri <all>
consul-server-2 10.42.2.125:8301 alive server 1.9.1dev 2 pri <all>
and logs
2021-09-29T02:44:22.433Z [INFO] agent.server.raft: entering candidate state: node="Node at 10.42.0.16:8300 [Candidate]" term=92
2021-09-29T02:44:22.433Z [INFO] agent.server.raft: current server:: id->=697db73f-6bb9-f717-1c24-81139c980650 address->=10.42.3.45:8300
2021-09-29T02:44:22.433Z [INFO] agent.server.raft: current server:: id->=697db73f-6bb9-f717-1c24-81139c980651 address->=10.42.0.16:8300
2021-09-29T02:44:22.433Z [INFO] agent.server.raft: ask peer to vote:: id->=697db73f-6bb9-f717-1c24-81139c980650 address->=10.42.3.45:8300
Could not find address for server id 697db73f-6bb9-f717-1c24-81139c980650
ID: 697db73f-6bb9-f717-1c24-81139c980651 Name: consul-server-1 Addr: 10.42.0.16:8300
ID: 697db73f-6bb9-f717-1c24-81139c980652 Name: consul-server-2 Addr: 10.42.2.125:8300
2021-09-29T02:44:22.433Z [WARN] agent.server.raft: unable to get address for server, using fallback address: id=697db73f-6bb9-f717-1c24-81139c980650 fallback=10.42.3.45:8300 error="Could not find address for server id 697db73f-6bb9-f717-1c24-81139c980650"
2021-09-29T02:44:22.920Z [INFO] agent.server.serf.lan: serf: attempting reconnect to consul-server-0 10.42.3.45:8301
2021-09-29T02:44:24.967Z [ERROR] agent: failed to sync changes: error="No cluster leader"
2021-09-29T02:44:25.921Z [ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter 697db73f-6bb9-f717-1c24-81139c980650 10.42.3.45:8300}" error="dial tcp <nil>->10.42.3.45:8300: i/o timeout"
2021-09-29T02:44:30.386Z [INFO] agent.server.raft: requestVote: from : 10.42.2.125:8300="req term" %!s(uint64=92)="current term" EXTRA_VALUE_AT_END=92
2021-09-29T02:44:30.386Z [INFO] agent.server.raft: duplicate requestVote for same term: term=92
2021-09-29T02:44:32.079Z [ERROR] agent: failed to sync changes: error="No cluster leader"
2021-09-29T02:44:32.220Z [WARN] agent.server.raft: Election timeout reached, restarting election
2021-09-29T02:44:32.220Z [INFO] agent.server.raft: entering candidate state: node="Node at 10.42.0.16:8300 [Candidate]" term=93
consul-server-1 also logs this WARN message: agent.server.raft: rejecting vote request since our last index is greater: candidate=10.42.2.125:8300 last-index=8174 last-candidate-index=8033
2021-09-29T02:55:48.722Z [WARN] agent.server.raft: Election timeout reached, restarting election
2021-09-29T02:55:48.722Z [INFO] agent.server.raft: entering candidate state: node="Node at 10.42.0.16:8300 [Candidate]" term=189
2021-09-29T02:55:48.722Z [INFO] agent.server.raft: current server:: id->=697db73f-6bb9-f717-1c24-81139c980650 address->=10.42.3.45:8300
2021-09-29T02:55:48.722Z [INFO] agent.server.raft: current server:: id->=697db73f-6bb9-f717-1c24-81139c980651 address->=10.42.0.16:8300
2021-09-29T02:55:48.722Z [INFO] agent.server.raft: ask peer to vote:: id->=697db73f-6bb9-f717-1c24-81139c980650 address->=10.42.3.45:8300
Could not find address for server id 697db73f-6bb9-f717-1c24-81139c980650
ID: 697db73f-6bb9-f717-1c24-81139c980651 Name: consul-server-1 Addr: 10.42.0.16:8300
ID: 697db73f-6bb9-f717-1c24-81139c980652 Name: consul-server-2 Addr: 10.42.2.125:8300
2021-09-29T02:55:48.723Z [WARN] agent.server.raft: unable to get address for server, using fallback address: id=697db73f-6bb9-f717-1c24-81139c980650 fallback=10.42.3.45:8300 error="Could not find address for server id 697db73f-6bb9-f717-1c24-81139c980650"
2021-09-29T02:55:48.800Z [INFO] agent.server.raft: requestVote: from : 10.42.2.125:8300="req term" %!s(uint64=188)="current term" EXTRA_VALUE_AT_END=189
2021-09-29T02:55:48.800Z [INFO] agent.server.raft: req term: %!s(uint64=188)="is less than" EXTRA_VALUE_AT_END=189
2021-09-29T02:55:52.546Z [ERROR] agent: failed to sync changes: error="No cluster leader"
2021-09-29T02:55:53.212Z [ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter 697db73f-6bb9-f717-1c24-81139c980650 10.42.3.45:8300}" error="dial tcp <nil>->10.42.3.45:8300: i/o timeout"
2021-09-29T02:55:56.134Z [INFO] agent.server.raft: requestVote: from : 10.42.2.125:8300="req term" %!s(uint64=190)="current term" EXTRA_VALUE_AT_END=189
2021-09-29T02:55:56.135Z [WARN] agent.server.raft: rejecting vote request since our last index is greater: candidate=10.42.2.125:8300 last-index=8174 last-candidate-index=8033
2021-09-29T02:55:56.135Z [INFO] agent.server.raft: entering follower state: follower="Node at 10.42.0.16:8300 [Follower]" leader=
2021-09-29T02:55:58.723Z [ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter 697db73f-6bb9-f717-1c24-81139c980650 10.42.3.45:8300}" error="dial tcp <nil>->10.42.3.45:8300: i/o timeout"
2021-09-29T02:55:59.420Z [ERROR] agent: Coordinate update error: error="No cluster leader"
2021-09-29T02:55:59.701Z [ERROR] agent: failed to sync changes: error="No cluster leader"
2021-09-29T02:56:01.818Z [INFO] agent.server.raft: heartbeat timeout: lastContact: %s HeartbeatTimeout: %s: 2021-09-29 02:39:09.13070157 +0000 UTC m=+1186.321000621=5s
2021-09-29T02:56:01.818Z [WARN] agent.server.raft: heartbeat timeout reached, starting election: last-leader=
2021-09-29T02:56:01.818Z [INFO] agent.server.raft: entering candidate state: node="Node at 10.42.0.16:8300 [Candidate]" term=191
From the log, consul-server-1's last index is greater than consul-server-2's, so consul-server-1 should be the leader, not consul-server-2.
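The rejection in the WARN line matches Raft's log up-to-date check. Below is a minimal Python sketch of that rule, not the hashicorp/raft code itself. The function name is hypothetical, and it assumes both servers have the same last log term, since the logs only show the indexes.

```python
def grants_vote(voter_last_term, voter_last_index, cand_last_term, cand_last_index):
    """Return True if the voter's log does not rule out the candidate.

    Sketch of Raft's log up-to-date rule only; the request-term check and
    the "already voted in this term" check are left out.
    """
    if voter_last_term > cand_last_term:
        return False  # voter's last entry is from a newer term
    if voter_last_term == cand_last_term and voter_last_index > cand_last_index:
        return False  # same term, but the voter's log is longer
    return True


# Indexes from the WARN line: consul-server-1 has 8174, consul-server-2 has 8033.
# LAST_TERM is a placeholder (assumption); the real last log term is not in the logs.
LAST_TERM = 1
server1_votes_for_server2 = grants_vote(LAST_TERM, 8174, LAST_TERM, 8033)
server2_votes_for_server1 = grants_vote(LAST_TERM, 8033, LAST_TERM, 8174)
print(server1_votes_for_server2, server2_votes_for_server1)
```

Under that assumption, consul-server-1 refuses consul-server-2, while consul-server-2 would accept consul-server-1 if it ever received a vote request from it.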
But the raft log shows only two servers in consul-server-1's configurations.latest.Servers, and consul-server-2 is not one of them. So vote requests are never sent to consul-server-2, and consul-server-1 can never become leader because consul-server-0 is down.
consul-server-2, on the other hand, has all servers in configurations.latest.Servers and can start a vote normally, but consul-server-2 cannot become leader.
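To make the deadlock concrete, here is a small Python sketch of the vote counting as I understand it from the logs. It is a simplified model, not the hashicorp/raft implementation: the function names are hypothetical, and each candidate only asks the servers in its own configuration.

```python
def quorum(voters):
    """Votes needed to win: a strict majority of the configured voters."""
    return len(voters) // 2 + 1


def can_win(candidate, voters, reachable, refuses):
    """Count the votes a candidate can collect from its own view of the cluster.

    voters:    servers in the candidate's configurations.latest.Servers
    reachable: servers that answer RPCs
    refuses:   servers that reject this candidate (e.g. their log is ahead)
    """
    votes = 0
    for server in voters:
        if server == candidate:
            votes += 1  # a candidate votes for itself
        elif server in reachable and server not in refuses:
            votes += 1
    return votes >= quorum(voters)


reachable = {"consul-server-1", "consul-server-2"}  # consul-server-0 is down

# consul-server-1 only knows about consul-server-0 and itself.
s1_wins = can_win("consul-server-1",
                  ["consul-server-0", "consul-server-1"],
                  reachable, refuses=set())

# consul-server-2 knows all three, but consul-server-1 rejects its requests.
s2_wins = can_win("consul-server-2",
                  ["consul-server-0", "consul-server-1", "consul-server-2"],
                  reachable, refuses={"consul-server-1"})
print(s1_wins, s2_wins)
```

In this model both candidates collect only their own vote while needing two, which matches the endless election restarts in the logs.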
Why is consul-server-2 in consul-server-1's members but not in its configurations.latest.Servers?
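The mismatch can be checked mechanically. The Python sketch below compares the consul members output from consul-server-1 with the server addresses from its "current server::" log lines. The helper names are hypothetical. It compares IPs only, because the serf port (8301) differs from the raft port (8300). On an unmodified build, the raft side could presumably be taken from consul operator raft list-peers instead of my custom log lines.

```python
# `consul members` output as seen on consul-server-1 (copied from above).
MEMBERS_OUTPUT = """\
Node Address Status Type Build Protocol DC Segment
consul-server-0 10.42.3.45:8301 failed server 1.9.1dev 2 pri <all>
consul-server-1 10.42.0.16:8301 alive server 1.9.1dev 2 pri <all>
consul-server-2 10.42.2.125:8301 alive server 1.9.1dev 2 pri <all>
"""

# Addresses from the "current server::" log lines on consul-server-1.
RAFT_SERVER_ADDRESSES = ["10.42.3.45:8300", "10.42.0.16:8300"]


def member_ips(members_output):
    """Map node name -> IP for every server row of `consul members`."""
    ips = {}
    for line in members_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 4 and fields[3] == "server":
            name, address = fields[0], fields[1]
            ips[name] = address.rsplit(":", 1)[0]  # drop the port
    return ips


def missing_from_raft(members_output, raft_addresses):
    """Names of serf members whose IP is absent from the raft configuration."""
    raft_ips = {addr.rsplit(":", 1)[0] for addr in raft_addresses}
    return sorted(name for name, ip in member_ips(members_output).items()
                  if ip not in raft_ips)


missing = missing_from_raft(MEMBERS_OUTPUT, RAFT_SERVER_ADDRESSES)
print(missing)
```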
This issue can be reproduced by following the steps described above.