Not ready to serve consistent reads - consistently showing up

Hi folks,

Over the past month or so, the above error has been occurring very frequently. I run a cluster with 5 servers; I’ve checked disk space (>500 GB available) and the logs. The logs show nothing particularly untoward, just that there appears to be a network issue from time to time. However, once that network issue clears, the cluster is left in a state where I can see 5 servers in the Raft peer list, but none of them is a leader. It also seems that no leader election is taking place.

The only way to restore things to normal is to restart every server, at which point leader election kicks off and things go back to normal.

This seems… not the way things are supposed to work. I expect the leader election to take place if one or more nodes had network issues that got resolved. Or am I mistaken here?

Hi @benvanstaveren and sorry you have been having issues. It is very hard to understand exactly what is happening here without more information such as server logs, Raft peer listings, and more context around “network issue from time to time”.

I expect the leader election to take place if one or more nodes had network issues that got resolved.

This is a broad statement that hides some complexity and context. If, for example, one server was partitioned from the rest and it was a follower, you would not expect to see a leadership election; if it was the leader, then yes, I would expect to see one. Since you are running 5 servers, up to 2 non-leader servers could become partitioned without any impact on your cluster, because the remaining 3 still form a quorum. Neither situation seems to be what is happening here, though, and ultimately I would expect your server cluster to recover on its own.

“Not ready to serve consistent reads”

This is a really interesting message to receive and I’ll take a little time looking into exactly what this means.

It originates from the RPC layer, which handles forwarding requests either to a remote region or to the local region’s leader. I assume the RPC request is meant for the leader in the local region, as there is no mention of a federated region.

When a server in the cluster receives a write request, or a read request that doesn’t allow a stale lookup, it will look up the leader and forward the request to it. If no leader is found, the RPC returns an error which includes the message “No cluster leader”, which is not the case here. It therefore seems that the servers where the requests are being made can identify a leader.

Once a leader has been identified, the request is forwarded to that server. The leader that receives the request then performs a check to ensure its local state and leadership sub-processes are all in a state where it is acceptable to start accepting read/write requests. This is where the RPC is failing; the flag behind this check is toggled only during leadership establishment or revocation on a server.
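
To make the distinction concrete from the client side, a rough sketch (assuming the default local HTTP address on port 4646 and no ACL token in play; the stale query parameter is what the HTTP API uses to permit stale reads):

# A consistent read (the default): the server that receives this forwards it to
# the leader, which then runs the readiness check described above.
curl -s http://127.0.0.1:4646/v1/jobs

# A stale read: any server may answer from its local state, skipping that check.
curl -s "http://127.0.0.1:4646/v1/jobs?stale=true"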

I am interested in figuring out exactly what is happening here. Since you mention it has happened a number of times, if/when it happens again there are some useful pieces of information to collect that would help identify potential problems:

  • Gather an operator debug bundle during and after the server outage; during the outage you’ll likely need to set the stale flag to true (see the sketch after this list). The bundle can be sent to nomad-oss-debug@hashicorp.com, but please be aware it will contain a large amount of data. If you do not wish to send it all, the server goroutine-debug* files will be extremely useful, along with the server monitor.log files.

  • Check your monitoring for metrics such as nomad.leader.revoke_leadership and nomad.leader.establish_leadership which would help identify any elections occurring.

  • The output of the nomad operator raft list-peers command from the perspective of each server in the cluster.
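
For reference, the invocations might look something like the following; the duration, interval, and server selection here are illustrative values rather than anything prescribed:

# Capture a debug bundle while the outage is ongoing; -stale=true lets the
# command query servers even though there is no leader to answer consistently.
nomad operator debug -stale=true -duration=2m -interval=30s -server-id=all

# Raft peer listing, again allowing a stale read; run it against each server in turn.
nomad operator raft list-peers -stale=true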

Thanks,
jrasell and the Nomad team

Just to add, the failure mode is interesting in the sense that at the point this error occurs there are 5 servers in the Raft peer list, and nomad server members will show them (plus the servers from the other clusters it’s federated with). But they are all marked as followers; none of them is marked as leader. The logs also don’t show any leader election taking place. To make sure I wasn’t just being impatient, I let it sit like that for an hour (at night, luckily), but nothing happened: no log entries for leader election, no voting, absolutely nothing. It was as if none of the servers was aware that there was no leader.

Since the failure happens very randomly (and appears to be triggered by something on the network end), it may be a while before I can get you the details. It seems generating a debug bundle also requires me to enable debug on the servers?

That is a very interesting set of circumstances; I am not sure I would have had the patience you showed there.

The debug bundle uses the /v1/agent/pprof endpoints, which do not require the debug flag to be set on the servers, but do require the ACL sub-system to be enabled and bootstrapped, with an appropriately powerful token used when running the command.
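
As a rough sketch of what that looks like once ACLs are bootstrapped (the token value is a placeholder, and the curl call assumes the default local HTTP address):

# A management (or otherwise suitably powerful) token, picked up by both the CLI
# and the curl call below. The value here is a placeholder.
export NOMAD_TOKEN="<your management token>"

# The debug command reads the token from the environment.
nomad operator debug -stale=true -duration=2m

# Or hit a pprof endpoint directly, e.g. a goroutine dump from the local agent;
# debug=1 asks for human-readable output.
curl -H "X-Nomad-Token: $NOMAD_TOKEN" \
  "http://127.0.0.1:4646/v1/agent/pprof/goroutine?debug=1"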

Just wanted to mention this error has not occurred in the last 12 days. I did upgrade to Nomad 1.3.5 around that time, and maybe doing an upgrade and full restart of all servers made it work better, or something, somewhere, changed. I’m going to go out on a limb and call it a Heisenbug; if I see it happen again I’ll hit the pprof endpoints, at least once I set up ACLs as well.


Alright, it’s been happening again. I haven’t managed to get a lot of logs sorted out but I did get a better idea of what happens.

So, first, we start with some networking errors where the 5 servers can’t seem to reach each other at all. This kicks off a round of voting, which quickly fills the log with entries like this:

2022-10-10T14:43:49.536Z [ERROR] nomad.raft: failed to make requestVote RPC: target="{Voter 60543843-64f2-aa3a-5f77-dfd71ee068bd xxx.yyy.xxx.yyy:4647}" error="dial tcp xxx.yyy.xxx.yyy:4647: i/o timeout"

Eventually, this message appears:

nomad.raft: entering follower state: follower="Node at xxx.yyy.xxx.yyy:4647 [Follower]" leader-address= leader-id=

Which seems to indicate the node parks itself in a follower state without a leader (which seems odd, considering…)

An election is then restarted, but the underlying network issue is still there and causes the election to fail with the timeouts shown above.

However, restarting the Nomad server processes clears all of this right up. If you leave the cluster untouched, it will rinse and repeat this sequence for as long as you have the patience to wait; if you restart the server processes, the election is near instant and everything goes back to normal.

So something, somewhere, is going pear-shaped. Whether it’s the actual network the servers are on (cloud instances), or something running out of resources with I/O timeouts as a symptom (I can’t think of anything right this second), is the next question, I guess…

Still, it’s quite an interesting problem :smiley:

As an outsider/observer, I am very curious about the infrastructure setup. Cloud? On-premise? Network? Network latencies? :thinking:

Actually it seemed to be related to an issue in Raft that was fixed in 1.4.2 :slight_smile:

I have been seeing a gradual rise in memory consumption on the Nomad servers due to blocked and/or pending evaluations.

And, yes, eval-related issues seem to be in the 1.4.3 milestone, which I am looking forward to. :slight_smile: :slight_smile: