Hi,
We are using a 2x Vault and 3x Consul architecture.
Recently we had downtime on our service, so we are trying to understand what exactly happened.
Vault was responding with 500 to incoming requests (around 10000 requests), but in the audit log we see only ~300 errors, all of them related to “failed to read … context canceled”.
From the Consul documentation it seems this error can be related to IOPS, but in our case we are far from the limit.
The issue resolved itself within 15 minutes.
None of the Vault/Consul processes failed.
There was no spike in traffic.
There seems to be no fault on the AWS-attached disks.
We are running on AWS EC2 VMs and 2 ELBs.
Questions:
What could cause this issue?
Why, if there was an issue with one of the Consul/Vault nodes, was there no fallback to the standby one?
Is there any way to avoid such an issue in the future?
Are you able to get the full error message? The “failed to read … context canceled” seems to be missing some data in the middle that could be useful for assisting with your issue.
However, my first thought would be to see if there was any disruption in the ability to write to your audit devices. When audit devices are enabled, Vault must be able to write each request to at least one of them; if none of them can record it, Vault will not process the operation. More info on that here: Audit Devices | Vault by HashiCorp
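If it helps with getting the full error text, here is a minimal Python sketch that groups the error strings found in a file audit device log. It assumes the log is JSON lines with a top-level "error" field on failed entries; the function name and the sample entries are made up for illustration and are not taken from your environment.

```python
import json
from collections import Counter


def summarize_audit_errors(lines):
    """Count each distinct non-empty "error" string in JSON-lines audit entries."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if not isinstance(entry, dict):
            continue
        error = entry.get("error")  # assumed field name, check against your log
        if error:
            counts[error] += 1
    return counts


# Made-up entries, only to show the shape of the input and output.
sample_entries = [
    '{"type": "response", "error": "example error A"}',
    '{"type": "response", "error": ""}',
    '{"type": "response", "error": "example error A"}',
]
for message, count in summarize_audit_errors(sample_entries).most_common():
    print(count, message)
```

To run it against a real log, pass an open file handle instead of the sample list, for example `summarize_audit_errors(open(path))` with `path` pointing at your audit file. The untruncated messages it prints are what would be most useful to post here.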
It seems we are writing to a local disk.
Syslog then sends this data to Splunk, so in case of a network issue Vault should just continue writing to disk (unless there is an issue with the disk itself), right?
The errors look to be related to an issue communicating with your Consul storage backend. I checked the AWS Status page for any outages in us-west-2 EC2 or ELB (https://status.aws.amazon.com/) but didn’t see any documented outages around the time you had trouble.
Are you able to pull logs out of Consul for around the same time period to see if anything was happening within your storage environment?
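For narrowing the Consul logs down to the incident, a rough Python sketch like the one below could work. It assumes each log line starts with an ISO-8601 UTC timestamp (for example ending in "Z"); if your Consul version or log shipper uses a different format, the parsing needs adjusting. The function names and the sample lines are hypothetical.

```python
from datetime import datetime, timezone


def parse_timestamp(line):
    """Return the leading ISO-8601 timestamp of a log line, or None if absent."""
    token = line.split(" ", 1)[0]
    try:
        stamp = datetime.fromisoformat(token.replace("Z", "+00:00"))
    except ValueError:
        return None
    if stamp.tzinfo is None:
        # Assumption: timestamps without an offset are UTC.
        stamp = stamp.replace(tzinfo=timezone.utc)
    return stamp


def lines_in_window(lines, start, end):
    """Keep lines stamped within [start, end], plus their continuation lines."""
    keep = False
    selected = []
    for line in lines:
        stamp = parse_timestamp(line)
        if stamp is not None:
            keep = start <= stamp <= end
        if keep:
            selected.append(line.rstrip("\n"))
    return selected


# Made-up log lines, only to illustrate the filtering.
sample_log = [
    "2021-03-01T11:59:59.000Z [INFO] before the window",
    "2021-03-01T12:00:05.123Z [ERROR] inside the window",
    "    continuation of the previous entry",
    "2021-03-01T12:20:00.000Z [INFO] after the window",
]
window_start = datetime(2021, 3, 1, 12, 0, tzinfo=timezone.utc)
window_end = datetime(2021, 3, 1, 12, 15, tzinfo=timezone.utc)
for entry in lines_in_window(sample_log, window_start, window_end):
    print(entry)
```

Running the same window over the logs from all three Consul servers makes it easier to spot leader elections, timeouts, or slow writes that line up with the Vault errors.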
If you’re using Vault Enterprise then I’d suggest opening a ticket in the support portal as you’re guaranteed a response there. Otherwise one of the HashiCorp staff may happen upon this thread and respond as they have time.