We have an EKS cluster running across three AWS AZs.
Despite having vault-server-{0,1,2} pods spread across the AZs, we noticed that if a single node goes down, our Vault goes down with it and causes us a lot of problems!
"log": "2022-11-09T10:16:28.561Z [ERROR] storage.raft: failed to make requestVote RPC: target=\"{Voter vault-server-2 vault-server-2.vault-server-internal:8201}\" error=\"dial tcp: i/o timeout\"\n",
Perhaps we have a misconfiguration? How can we ensure our Vault service keeps working reliably even if one node goes down?
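For context: with three raft voters the quorum is two, so losing a single node should still leave a writable cluster; if it doesn't, something else is failing at the same time. Below is a minimal sketch of how one might probe each node while the failure is in progress. The pod DNS names, the http scheme, port 8200, and the VAULT_TOKEN environment variable are assumptions about this setup, not confirmed details.

```python
import os
import requests

# Per-pod addresses assumed from the hostnames in the log lines; adjust
# scheme, port, and service suffix to the actual deployment.
NODES = [
    "http://vault-server-0.vault-server-internal:8200",
    "http://vault-server-1.vault-server-internal:8200",
    "http://vault-server-2.vault-server-internal:8200",
]

for node in NODES:
    try:
        # /v1/sys/health: 200 = unsealed active, 429 = unsealed standby,
        # 503 = sealed, 501 = not initialized.
        resp = requests.get(f"{node}/v1/sys/health", timeout=3)
        body = resp.json()
        print(node, resp.status_code,
              "sealed:", body.get("sealed"), "standby:", body.get("standby"))
    except requests.RequestException as exc:
        print(node, "unreachable:", exc)

# Ask one of the nodes (pick one that is reachable) for the raft peer set.
headers = {"X-Vault-Token": os.environ["VAULT_TOKEN"]}
resp = requests.get(f"{NODES[0]}/v1/sys/storage/raft/configuration",
                    headers=headers, timeout=3)
for server in resp.json()["data"]["config"]["servers"]:
    print(server["node_id"], server["address"],
          "leader:", server["leader"], "voter:", server["voter"])
```

If all three servers still appear as voters but one is unreachable, writes should keep succeeding; the interesting case is when the two remaining nodes cannot see each other either.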
What does “Vault goes down” mean? What are the actual observed symptoms?
Please show complete logs from whichever node is currently the leader, covering the period when another node goes down and the leader goes from working to not working.
I created a query on my Vault CloudWatch log group, which is exported by Fluent Bit from the EKS cluster. To be honest, I'm not sure why it's not JSON-structured, but here are the errors I see:
Interesting. Unfortunately I don't know enough about AWS, EKS, and AZs to say how this is supposed to work, but …
This message is repeated well over 100 times throughout the log:
[ERROR] storage.raft: failed to appendEntries to: peer="{Voter vault-server-0 vault-server-0.vault-server-internal:8201}" error="dial tcp: lookup vault-server-0.vault-server-internal on 172.20.0.10:53: no such host"
That seems very wrong to me … if the pod had just gone down and was restarting, it ought to have come back pretty quickly, but instead Kubernetes is effectively claiming there’s no such pod for an extended period of time.
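One plausible explanation, hedged because the Service and namespace names below are guesses from the hostnames in the logs: per-pod names like vault-server-0.vault-server-internal are served by the StatefulSet's governing headless Service, and a pod's DNS record only exists while its endpoint is published. Unless that Service sets publishNotReadyAddresses: true, the record vanishes for as long as the pod is not Ready, which would produce exactly this prolonged "no such host" error. A small sketch to check the flag and the currently published endpoints:

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

# Assumed names: headless Service "vault-server-internal" in namespace "vault".
namespace, service = "vault", "vault-server-internal"

svc = v1.read_namespaced_service(service, namespace)
print("publishNotReadyAddresses:", svc.spec.publish_not_ready_addresses)

eps = v1.read_namespaced_endpoints(service, namespace)
for subset in eps.subsets or []:
    ready = [a.hostname or a.ip for a in (subset.addresses or [])]
    not_ready = [a.hostname or a.ip for a in (subset.not_ready_addresses or [])]
    print("ready:", ready, "not ready:", not_ready)
```

If I recall correctly, the official Vault Helm chart sets publishNotReadyAddresses: true on its internal headless service for exactly this reason, so raft peers stay resolvable while a pod restarts; it's worth confirming this deployment's Service does the same.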
And meanwhile, there’s also a large number of:
[ERROR] storage.raft: failed to make requestVote RPC: target="{Voter vault-server-1 vault-server-1.vault-server-internal:8201}" error="context deadline exceeded"
To me, that hints at network issues reaching the vault-server-1 pod, or at the pod being under so much load that it can't respond to incoming network connections in a timely fashion.
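Those two failure modes, "no such host" versus "context deadline exceeded", can be told apart mechanically from any other pod in the cluster. A small sketch, with the hostnames and raft port 8201 taken from the log lines above:

```python
import socket

HOSTS = [
    ("vault-server-0.vault-server-internal", 8201),
    ("vault-server-1.vault-server-internal", 8201),
    ("vault-server-2.vault-server-internal", 8201),
]

for host, port in HOSTS:
    try:
        addrs = {info[4][0] for info in socket.getaddrinfo(host, port)}
        print(host, "resolves to", sorted(addrs), end=" ")
    except socket.gaierror as exc:
        # Matches the "no such host" appendEntries error: DNS itself is failing.
        print(host, "does not resolve:", exc)
        continue
    try:
        with socket.create_connection((host, port), timeout=2):
            print("- raft port reachable")
    except OSError as exc:
        # Matches the "context deadline exceeded" / "i/o timeout" errors:
        # the name resolves but the pod cannot be reached in time.
        print("- raft port unreachable:", exc)
```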