Agent.server.raft: failed to make requestVote RPC

hTangle · September 18, 2021, 8:50am

I have deployed three consul-server pods in Kubernetes, but one node is not working. The pods currently look like this:

kubectl get pods -n consul -o wide |grep consul-server
consul-server-0                  0/1     Running       0          37m    10.42.0.49   192.168.1.195   <none>           <none>
consul-server-1                  1/1     Terminating   0          121m   10.42.1.17   192.168.1.50    <none>           <none>
consul-server-2                  0/1     Running       0          37m    10.42.2.60   192.168.1.104   <none>           <none>

On consul-server-0 and consul-server-2, consul members shows:

/ # consul  members
Node             Address          Status  Type    Build  Protocol  DC   Segment
consul-server-0  10.42.0.48:8301  alive   server  1.9.1  2         pri  <all>
consul-server-2  10.42.2.53:8301  alive   server  1.9.1  2         pri  <all>

consul-server-1 is not listed here, yet the logs show "agent.server.raft: failed to make requestVote RPC": the Consul servers keep sending requests to the dead member.

2021-09-18T08:03:14.954Z [ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter 30db0865-82d5-9394-6471-ec548671d923 10.42.1.17:8300}" error="dial tcp <nil>->10.42.1.17:8300: i/o timeout"
2021-09-18T08:03:22.890Z [ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter 30db0865-82d5-9394-6471-ec548671d923 10.42.1.17:8300}" error="dial tcp <nil>->10.42.1.17:8300: i/o timeout"
2021-09-18T08:03:30.066Z [ERROR] agent: Coordinate update error: error="No cluster leader"

The output of consul info is:

agent:
	check_monitors = 0
	check_ttls = 0
	checks = 59
	services = 84
build:
	prerelease =
	revision = ca5c3894
	version = 1.9.1
consul:
	acl = disabled
	bootstrap = false
	known_datacenters = 1
	leader = false
	leader_addr =
	server = true
raft:
	applied_index = 0
	commit_index = 0
	fsm_pending = 0
	last_contact = never
	last_log_index = 889
	last_log_term = 2
	last_snapshot_index = 0
	last_snapshot_term = 0
	latest_configuration = [{Suffrage:Voter ID:bc794178-e391-2f87-95e5-ba952d50284c Address:10.42.2.17:8300} {Suffrage:Voter ID:25f50769-7664-321e-4595-0efa5b256d62 Address:10.42.0.15:8300} {Suffrage:Voter ID:30db0865-82d5-9394-6471-ec548671d923 Address:10.42.1.17:8300}]
	latest_configuration_index = 0
	num_peers = 2
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Candidate
	term = 424
runtime:
	arch = amd64
	cpu_count = 16
	goroutines = 139
	max_procs = 16
	os = linux
	version = go1.15.6
serf_lan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 2
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 12
	members = 2
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 9
	members = 2
	query_queue = 0
	query_time = 1
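
One way to cross-check the two outputs above is to compare the addresses in latest_configuration with those reported by consul members. The following is a minimal sketch only: the function names are hypothetical helpers, not part of Consul, and it assumes the output formats look exactly like the ones pasted in this post.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Consul): list Raft peer IPs that do
# not appear in the Serf member list.

raft_peer_ips() {
    # $1: the latest_configuration string from `consul info`.
    # Pull the IP out of each "Address:IP:PORT" entry.
    printf '%s\n' "$1" | grep -o 'Address:[0-9.]*' | cut -d: -f2
}

serf_member_ips() {
    # $1: the output of `consul members`.
    # Skip the header line, take the Address column, drop the port.
    printf '%s\n' "$1" | awk 'NR > 1 { split($2, a, ":"); print a[1] }'
}

stale_raft_peers() {
    # $1: latest_configuration string, $2: `consul members` output.
    members=$(serf_member_ips "$2")
    for ip in $(raft_peer_ips "$1"); do
        # Print the Raft peer IP only if no member line matches it exactly.
        printf '%s\n' "$members" | grep -qxF "$ip" || printf '%s\n' "$ip"
    done
}
```

Fed the latest_configuration line and the consul members output from this post, stale_raft_peers would print all three Raft addresses (10.42.2.17, 10.42.0.15 and 10.42.1.17), because consul members reports 10.42.0.48 and 10.42.2.53. That may suggest the Raft configuration still holds old pod IPs for every server, not only for consul-server-1, though the two outputs could also have been captured at different times.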

The Consul start command:

/bin/sh
      -ec
      CONSUL_FULLNAME="consul"

      exec /bin/consul agent \
        -advertise="${ADVERTISE_IP}" \
        -bind=0.0.0.0 \
        -bootstrap-expect=3 \
        -client=0.0.0.0 \
        -config-dir=/consul/config \
        -datacenter=pri \
        -data-dir=/consul/data \
        -config-file=/consul/config/telemetry.hcl \
        -domain=consul \
        -hcl="connect { enabled = true }" \
        -ui \
        -retry-join="${CONSUL_FULLNAME}-server-0.${CONSUL_FULLNAME}-server.${NAMESPACE}.svc:8301" \
        -retry-join="${CONSUL_FULLNAME}-server-1.${CONSUL_FULLNAME}-server.${NAMESPACE}.svc:8301" \
        -retry-join="${CONSUL_FULLNAME}-server-2.${CONSUL_FULLNAME}-server.${NAMESPACE}.svc:8301" \
        -serf-lan-port=8301 \
        -server

I do not know how to solve this. Can anyone tell me why it happens?
Thanks

lkysow · September 20, 2021, 4:11pm

You need to do a kubectl describe and kubectl logs for the failed pod and possibly a kubectl describe for the statefulset itself.
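
The three commands mentioned here could be wrapped roughly as follows. This is a sketch under assumptions: collect_diagnostics is a hypothetical helper name, and the StatefulSet name consul-server is inferred from the pod names in the first post rather than confirmed.

```shell
#!/bin/sh
# Hypothetical helper: gather basic diagnostics for one StatefulSet pod.
# Arguments: namespace, pod name, StatefulSet name.
collect_diagnostics() {
    ns=$1
    pod=$2
    sts=$3
    # Events and status for the failing pod.
    kubectl describe pod "$pod" -n "$ns"
    # Container logs (only available while the node is reachable).
    kubectl logs "$pod" -n "$ns"
    # Rollout state and events of the owning StatefulSet.
    kubectl describe statefulset "$sts" -n "$ns"
}
```

For the pod stuck in Terminating this would be called as: collect_diagnostics consul consul-server-1 consul-server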

hTangle · September 24, 2021, 2:27am

The node hosting the failed pod is dead (shut down permanently), so its logs cannot be retrieved with kubectl.
The Consul documentation at https://www.consul.io/docs/architecture/consensus says: "A Raft cluster of 3 nodes can tolerate a single node failure while a cluster of 5 can tolerate 2 node failures."
I have deployed 3 nodes. When the node running the leader died unexpectedly, the cluster did not keep working as expected (if the dead node comes back, the cluster also returns to normal), whereas if either of the other two nodes dies, the cluster keeps working as before.
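
The figures quoted from the documentation follow from majority arithmetic: a Raft cluster of N voters needs floor(N/2) + 1 of them to agree. A minimal sketch (the function names are illustrative only):

```shell
#!/bin/sh
# Quorum arithmetic for a Raft cluster of N voting servers.

quorum() {
    # Majority needed to elect a leader or commit an entry: floor(N/2) + 1.
    echo $(( $1 / 2 + 1 ))
}

failure_tolerance() {
    # Servers that can be lost while a majority remains: N - quorum(N).
    echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 3 5; do
    echo "servers=$n quorum=$(quorum "$n") tolerates=$(failure_tolerance "$n")"
done
# prints:
# servers=3 quorum=2 tolerates=1
# servers=5 quorum=3 tolerates=2
```

With three voters the quorum is 2, so two surviving servers should in principle be able to elect a leader, provided each can reach the other at the address recorded in the Raft configuration.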

Should I change the start command?
