Hi @benvanstaveren, and sorry you have been having issues. It is very hard to understand exactly what is happening here without more information, such as server logs and Raft peer listings, and more context around “network issue from time to time”.
> I expect the leader election to take place if one or more nodes had network issues that got resolved.
This is a broad statement that hides complexity and context. If, for example, one server was partitioned from the rest and it was a follower, you would not expect to see a leadership election. If it was the leader, then yes, I would expect to see a leadership election. Seeing that you are running 5 servers, you could see 2 non-leader servers become partitioned without any impact on your cluster, since the remaining 3 still form a Raft quorum. These situations do not seem to be what is happening here, though, and ultimately I would expect your server cluster to recover.
> “Not ready to serve consistent reads”
This is a really interesting message to receive, and I'll take a little time to look into exactly what it means.
It originates from the RPC layer, which handles forwarding requests either to a remote region or to the leader of the local region. I assume the RPC request is meant for the leader in the local region, as there is no mention of federated regions.
When a write request, or a read request that doesn't allow a stale lookup, is received by a server in the cluster, it will look up the leader and forward the request to it. In the event no leader is found, the RPC will return an error that includes the message “No cluster leader”, which is not the case here. It therefore seems that the servers where the requests are being made can identify a leader.
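As a quick sanity check, you can ask a specific server which leader it currently knows about via the status API; the address below is a placeholder for one of your own servers:

```sh
# Ask a specific server which leader it currently knows about; the
# address is a placeholder for one of your servers.
curl -s "http://server-1:4646/v1/status/leader"
```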
Once a leader has been identified, the request is forwarded to that server. The leader that receives the request then performs a check to ensure its local state and leadership sub-processes are all in a state acceptable for serving read/write requests. This is where the RPC is failing in your case, and this readiness state is toggled only during leadership revocation or leadership establishment on a server.
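To illustrate the difference with a placeholder address: a default read must pass this leader readiness check, while a read with the stale flag set can be answered by any server from its local state, which is also why the stale flag becomes useful when collecting data during an outage:

```sh
# A default (consistent) read is forwarded to the leader, which must
# pass its readiness check first; this is where the error surfaces.
curl -s "http://127.0.0.1:4646/v1/jobs"

# A stale read may be answered by the server that receives it, from its
# own local state, bypassing the leader readiness check.
curl -s "http://127.0.0.1:4646/v1/jobs?stale=true"
```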
I am interested in figuring out exactly what is happening here, and you mention that it has happened a number of times. If/when this happens again, there are some useful pieces of information to collect that would help identify potential problems (example commands follow the list below):
- Gather an operator debug bundle during and after the server outage; during the outage you will likely need to set the stale flag to `true`. This can be sent to firstname.lastname@example.org, but please be aware this will contain a large amount of data. If you do not wish to send it all, the server `goroutine-debug*` files will be extremely useful along with the server logs.
- Check your monitoring for metrics such as `nomad.leader.establish_leadership`, which would help identify any elections occurring.
- The output of the `nomad operator raft list-peers` command from the perspective of each server in the cluster.
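As a rough sketch of what gathering this could look like (the server addresses below are placeholders, and flag details can vary between Nomad versions, so please check `nomad operator debug -h` and `nomad operator raft list-peers -h`):

```sh
# 1. Capture a debug bundle during the outage; -stale allows the capture
#    to proceed while consistent reads are failing.
nomad operator debug -stale=true -duration=2m -output=.

# 2. Pull current metrics from a server and look for leadership-related
#    entries such as nomad.leader.establish_leadership.
curl -s "http://127.0.0.1:4646/v1/metrics" | jq '.Samples[] | select(.Name | contains("leader"))'

# 3. List the Raft peers from the perspective of each server.
for addr in http://server-1:4646 http://server-2:4646 http://server-3:4646; do
  echo "== ${addr} =="
  NOMAD_ADDR="${addr}" nomad operator raft list-peers -stale=true
done
```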
jrasell and the Nomad team