Failed to make requestVote results in vault server going down

Hey there,

We have an EKS cluster running across three AWS AZs.

Despite having vault-server-{0,1,2} spread across AZs, we noticed that if a single node goes down, our Vault goes down and causes us a lot of problems!

"log": "2022-11-09T10:16:28.561Z [ERROR] storage.raft: failed to make requestVote RPC: target=\"{Voter vault-server-2 vault-server-2.vault-server-internal:8201}\" error=\"dial tcp: i/o timeout\"\n",

Perhaps we have a misconfiguration? How can we ensure our Vault service works reliably even if one node goes down?

This one log line is not enough to understand the issue.

What do I need to provide so you can better help?

To start with:

  • What does “Vault goes down” mean? What are the actual observed symptoms?

  • Please show complete logs from whichever node is currently the leader, covering a period showing it going from working to not working, when another node goes down.

I created a query on my Vault CloudWatch log, which is exported by Fluent Bit from the EKS cluster. Tbh I’m not sure why it’s not JSON structured, though here are the errors I see:

fields @timestamp, @message
| sort @timestamp desc
| filter @message like "ERROR"
| display log

Just to reiterate, when the third node goes down I expect Vault to still work.

It looks like you’ve supplied intermingled logs from multiple Vault nodes. I can’t make sense of that.

Also you’ve filtered the logging to only include errors, potentially hiding important hints.

Ok, I’ll filter by one stream (the one that fails on the node, which causes the entire vault service to fail) and include all log levels.

fields @timestamp, @message
| sort @timestamp desc
| filter @logStream="vault-server-2"
| display log

Thank you for taking a look @maxb!

Interesting. Unfortunately I don’t know much about AWS, EKS and AZs to know how it’s supposed to work, but …

This message is repeated well over 100 times throughout the log:

[ERROR] storage.raft: failed to appendEntries to: peer="{Voter vault-server-0 vault-server-0.vault-server-internal:8201}" error="dial tcp: lookup vault-server-0.vault-server-internal on 172.20.0.10:53: no such host"

That seems very wrong to me … if the pod had just gone down and was restarting, it ought to have come back pretty quickly, but instead Kubernetes is effectively claiming there’s no such pod for an extended period of time.

And meanwhile, there’s also a large number of:

[ERROR] storage.raft: failed to make requestVote RPC: target="{Voter vault-server-1 vault-server-1.vault-server-internal:8201}" error="context deadline exceeded"

Which to me hints at there being network issues reaching the vault-server-1 pod or it being somehow under so much load that it can’t even respond to incoming network connections in a timely fashion.
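
To see how those two failure modes split across peers, the raft errors can be tallied per peer and error kind. Below is a rough sketch in Python. It assumes the log lines look like the two quoted above (with either single or doubled quotes around the values); the `summarize` helper and the regex are purely illustrative and not part of any Vault tooling.

```python
import re
from collections import Counter

# Pattern inferred only from the two sample lines quoted in this thread;
# it is an assumption, not an official Vault log grammar.
RAFT_ERROR = re.compile(
    r'\[ERROR\] storage\.raft: (?P<op>[^:]+): '
    r'(?:peer|target)="+\{Voter (?P<peer>\S+) [^}]*\}"+ '
    r'error="+(?P<error>.*?)"+\s*$'
)


def summarize(lines):
    """Count raft errors per (peer, operation, error kind)."""
    counts = Counter()
    for line in lines:
        match = RAFT_ERROR.search(line)
        if not match:
            continue  # not a raft peer error, skip it
        error = match.group("error")
        # DNS failures embed the hostname and resolver address;
        # collapse them so they count as one kind of error.
        kind = "no such host" if error.endswith("no such host") else error
        counts[(match.group("peer"), match.group("op"), kind)] += 1
    return counts
```

Feeding it the lines of a single node's log stream and printing `summarize(lines).most_common()` gives a quick picture of which peer is failing in which way, and roughly how often.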

A Raft cluster of 3 nodes can tolerate a single node failure

According to https://developer.hashicorp.com/vault/docs/internals/integrated-storage#deployment-table

But in my case it doesn’t appear to…
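
The arithmetic behind that table: Raft needs a strict majority of voters to elect a leader and commit writes. A minimal sketch of the calculation (the function names are made up for illustration):

```python
def quorum(voters: int) -> int:
    """Smallest strict majority of the voter count."""
    return voters // 2 + 1


def failure_tolerance(voters: int) -> int:
    """How many voters can be lost while a majority remains."""
    return voters - quorum(voters)


def has_quorum(voters: int, reachable: int) -> bool:
    """True if the reachable voters (including self) form a majority."""
    return reachable >= quorum(voters)
```

For 3 voters, `quorum(3)` is 2 and `failure_tolerance(3)` is 1. If the vault-server-2 log above reflects both errors happening at the same time (vault-server-0 not resolving and vault-server-1 timing out), then from that node's point of view only 1 of 3 voters is reachable, and `has_quorum(3, 1)` is false. That would be consistent with more than one node being unavailable, not a single node failure.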

Something is wrong with your setup. I’ve already made my best guess based on the logs shown so far:

Maybe you should try moving all the nodes to one AZ, to confirm or eliminate inter-AZ networking in this environment being flaky.
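
A quick probe can also help tell a DNS failure apart from a connection timeout before moving anything. This is only a sketch, assuming it is run from somewhere inside the cluster that has Python available (for example a debug pod in the same namespace); `check_peer` is a hypothetical helper, and port 8201 is taken from the log lines above.

```python
import socket


def check_peer(host: str, port: int = 8201, timeout: float = 3.0) -> str:
    """Classify reachability of one raft peer from where this runs."""
    try:
        # Step 1: does the name resolve at all?
        socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns_failure"
    try:
        # Step 2: can a TCP connection be opened within the timeout?
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except socket.timeout:
        return "connect_timeout"
    except OSError:
        # Refused, unreachable, reset, and similar errors.
        return "connect_error"
```

Calling `check_peer("vault-server-0.vault-server-internal")`, and the same for `vault-server-1` and `vault-server-2`, from a pod in each AZ would show whether the "no such host" and timeout errors depend on which AZ the caller is in. Note that it only tests name resolution and the TCP handshake, not whether Vault itself is healthy.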