Vault throwing 503s under load

Hello,

I’ve run into an issue where Vault starts returning 503s when several clients request the same secret simultaneously. I see this when our CI/CD pipeline runs an Ansible task that looks up a secret and 6+ build agents make the same call at once. We are currently running Vault 1.6.2 in Kubernetes with 3 replica pods, using Raft integrated storage. Please let me know what other information I should provide. Thank you!

Here are some logs from when this happens:

storage.raft: failed to contact: server-id=1edfbae2-bf3b-3546-9125-9ba155e7cfc4 time=2.500146397s
2021-08-09T12:30:52-05:00 2021-08-09T17:30:52.638Z [WARN]  storage.raft: failed to contact: server-id=1edfbae2-bf3b-3546-9125-9ba155e7cfc4 time=2.500141441s
2021-08-09T12:30:58-05:00 2021-08-09T17:30:58.010Z [WARN]  storage.raft: failed to contact: server-id=1edfbae2-bf3b-3546-9125-9ba155e7cfc4 time=2.50018842s
2021-08-09T12:31:07-05:00 2021-08-09T17:31:07.475Z [WARN]  storage.raft: failed to contact: server-id=1edfbae2-bf3b-3546-9125-9ba155e7cfc4 time=2.500121306s
2021-08-09T12:31:12-05:00 2021-08-09T17:31:12.946Z [WARN]  storage.raft: failed to contact: server-id=1edfbae2-bf3b-3546-9125-9ba155e7cfc4 time=2.500152207s
2021-08-09T12:31:17-05:00 2021-08-09T17:31:17.706Z [WARN]  storage.raft: failed to contact: server-id=1edfbae2-bf3b-3546-9125-9ba155e7cfc4 time=2.500122195s
2021-08-09T12:31:20-05:00 2021-08-09T17:31:20.510Z [WARN]  storage.raft: failed to contact: server-id=1edfbae2-bf3b-3546-9125-9ba155e7cfc4 time=2.500135703s
2021-08-09T12:31:26-05:00 2021-08-09T17:31:26.330Z [WARN]  storage.raft: failed to contact: server-id=1edfbae2-bf3b-3546-9125-9ba155e7cfc4 time=2.500138365s
2021-08-09T12:31:28-05:00 2021-08-09T17:31:28.749Z [WARN]  storage.raft: failed to contact: server-id=1edfbae2-bf3b-3546-9125-9ba155e7cfc4 time=4.919292744s
2021-08-09T12:31:31-05:00 2021-08-09T17:31:31.236Z [WARN]  storage.raft: failed to contact: server-id=1edfbae2-bf3b-3546-9125-9ba155e7cfc4 time=7.405625211s
2021-08-09T12:31:32-05:00 2021-08-09T17:31:32.508Z [INFO]  storage.raft: aborting pipeline replication: peer="{Voter 1edfbae2-bf3b-3546-9125-9ba155e7cfc4 test-vault-1.test-vault-internal:8201}"
2021-08-09T12:31:34-05:00 2021-08-09T17:31:34.482Z [ERROR] storage.raft: failed to heartbeat to: peer=test-vault-1.test-vault-internal:8201 error="read tcp 100.96.3.29:36804->100.96.6.121:8201: i/o timeout"
2021-08-09T12:31:35-05:00 2021-08-09T17:31:35.009Z [WARN]  storage.raft: failed to contact: server-id=1edfbae2-bf3b-3546-9125-9ba155e7cfc4 time=2.5010235s
2021-08-09T12:31:37-05:00 2021-08-09T17:31:37.488Z [WARN]  storage.raft: failed to contact: server-id=1edfbae2-bf3b-3546-9125-9ba155e7cfc4 time=4.979507813s
2021-08-09T12:31:39-05:00 2021-08-09T17:31:39.946Z [WARN]  storage.raft: failed to contact: server-id=1edfbae2-bf3b-3546-9125-9ba155e7cfc4 time=7.437628953s
2021-08-09T12:31:42-05:00 2021-08-09T17:31:42.599Z [ERROR] storage.raft: failed to appendEntries to: peer="{Voter 1edfbae2-bf3b-3546-9125-9ba155e7cfc4 test-vault-1.test-vault-internal:8201}" error="tls: DialWithDialer timed out"
2021-08-09T12:31:45-05:00 2021-08-09T17:31:45.263Z [ERROR] storage.raft: failed to heartbeat to: peer=test-vault-1.test-vault-internal:8201 error="tls: DialWithDialer timed out"
2021-08-09T12:31:52-05:00 2021-08-09T17:31:52.662Z [ERROR] storage.raft: failed to appendEntries to: peer="{Voter 1edfbae2-bf3b-3546-9125-9ba155e7cfc4 test-vault-1.test-vault-internal:8201}" error="tls: DialWithDialer timed out"
2021-08-09T12:31:55-05:00 2021-08-09T17:31:55.943Z [ERROR] storage.raft: failed to heartbeat to: peer=test-vault-1.test-vault-internal:8201 error="tls: DialWithDialer timed out"
2021-08-09T12:32:02-05:00 2021-08-09T17:32:02.672Z [ERROR] storage.raft: failed to appendEntries to: peer="{Voter 1edfbae2-bf3b-3546-9125-9ba155e7cfc4 test-vault-1.test-vault-internal:8201}" error="tls: DialWithDialer timed out"
2021-08-09T12:32:06-05:00 2021-08-09T17:32:06.193Z [WARN]  storage.raft: rejecting vote request since we have a leader: from=test-vault-1.test-vault-internal:8201 leader=test-vault-0.test-vault-internal:8201
2021-08-09T12:32:06-05:00 2021-08-09T17:32:06.805Z [ERROR] storage.raft: failed to heartbeat to: peer=test-vault-1.test-vault-internal:8201 error="tls: DialWithDialer timed out"
2021-08-09T12:32:12-05:00 2021-08-09T17:32:12.791Z [ERROR] storage.raft: failed to appendEntries to: peer="{Voter 1edfbae2-bf3b-3546-9125-9ba155e7cfc4 test-vault-1.test-vault-internal:8201}" error="tls: DialWithDialer timed out"
2021-08-09T12:32:15-05:00 2021-08-09T17:32:15.496Z [WARN]  storage.raft: rejecting vote request since we have a leader: from=test-vault-1.test-vault-internal:8201 leader=test-vault-0.test-vault-internal:8201
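
To make the pattern in these logs easier to see, here is a minimal Python sketch that pulls the `time=` value out of each `failed to contact` warning and reports the longest gap per server ID. The regular expression and the function name `max_unreachable_seconds` are my own assumptions based only on the lines pasted above; it handles durations printed in `s` or `ms` and nothing else.

```python
import re
from collections import defaultdict

# Assumed pattern, derived only from the "failed to contact" lines pasted above.
FAILED_CONTACT = re.compile(
    r"storage\.raft: failed to contact: "
    r"server-id=(?P<server_id>[0-9a-f-]+) "
    r"time=(?P<value>[0-9.]+)(?P<unit>ms|s)\b"
)


def max_unreachable_seconds(lines):
    """Return {server_id: longest reported 'time=' in seconds} for matching lines."""
    worst = defaultdict(float)
    for line in lines:
        match = FAILED_CONTACT.search(line)
        if match is None:
            continue  # not a "failed to contact" warning
        seconds = float(match.group("value"))
        if match.group("unit") == "ms":
            seconds /= 1000.0
        server_id = match.group("server_id")
        worst[server_id] = max(worst[server_id], seconds)
    return dict(worst)


if __name__ == "__main__":
    sample = [
        '2021-08-09T17:31:28.749Z [WARN]  storage.raft: failed to contact: '
        'server-id=1edfbae2-bf3b-3546-9125-9ba155e7cfc4 time=4.919292744s',
        '2021-08-09T17:31:31.236Z [WARN]  storage.raft: failed to contact: '
        'server-id=1edfbae2-bf3b-3546-9125-9ba155e7cfc4 time=7.405625211s',
    ]
    for server_id, seconds in max_unreachable_seconds(sample).items():
        print(f"{server_id}: up to {seconds:.3f}s without contact")
```

Running it over the full log would show whether only one server ID (here the one that maps to `test-vault-1`) is ever reported as unreachable, and how long the gaps get.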

Your active node is under load that exceeds its capacity to serve requests.
What do the usual load metrics look like for the cluster’s nodes?
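
If it helps with gathering those numbers, here is a rough Python sketch for pulling Vault's own telemetry from the `/v1/sys/metrics` endpoint. The address and token are placeholders you would supply, the helper names are made up for illustration, and I am not assuming anything about the layout of the response beyond it being JSON. Pod-level CPU, memory, and disk I/O would still have to come from your Kubernetes monitoring.

```python
import json
import urllib.request


def build_metrics_request(vault_addr, token):
    """Build a GET request for Vault's sys/metrics endpoint (hypothetical helper)."""
    url = vault_addr.rstrip("/") + "/v1/sys/metrics"
    return urllib.request.Request(url, headers={"X-Vault-Token": token})


def fetch_metrics(vault_addr, token, timeout=5.0):
    """Fetch and decode the metrics payload; requires network access to Vault."""
    request = build_metrics_request(vault_addr, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Example call (placeholder values, not taken from this thread):
# metrics = fetch_metrics("https://vault.example.internal:8200", "<token>")
# print(json.dumps(metrics, indent=2)[:2000])
```

If the cluster uses certificates from an internal CA, `urlopen` will also need an SSL context that trusts that CA. Capturing this output on the active node once while idle and once during a pipeline run should make the comparison easier.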