I have one simple Vault setup (1.8.2) using an S3 storage backend and a local file audit device. I'm polling the /sys/health endpoint to check whether my Vault is running OK (so I can start a new instance if needed). With a lot of unexpected Vault usage, the audit log file can fill up the whole filesystem. Obviously Vault cannot work properly in this situation, since it cannot write to the audit log. I was expecting to get some indication of this from the /sys/health request, but it looks like that API endpoint reports all OK, even though the disk is full and Vault is not operational. Is this by design, or could /sys/health report this with some status code other than 200?
Vault's health check is simply that: Vault's health check. What you're asking about is a system health check.
For AWS you can use the built-in monitoring (no alerting) or use CloudWatch to set watermarks and get alerts.
I'm not using AWS, so I cannot use CloudWatch. What system health check are you referring to, and what Vault API endpoint? The /sys/health endpoint already returns different status codes based on seal status etc., so I would expect an error status code when the audit device is not working (and hence Vault is not working). I only tested the full-disk scenario, but I assume the situation is the same with any other audit device failure. I do not consider Vault healthy if an audit device problem has stopped it from working.
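For reference, the status codes /sys/health does document can be sketched as a small lookup (values are from the Vault API docs; note that none of them covers an audit device failure, which is exactly the gap discussed in this thread):

```python
# Documented HTTP status codes for GET /v1/sys/health (Vault API docs).
# An audit-device failure is NOT represented here: Vault instead returns
# 500 on the audited requests themselves, while /sys/health stays 200.
HEALTH_CODES = {
    200: "initialized, unsealed, active",
    429: "unsealed, standby",
    472: "disaster recovery replication secondary, active",
    473: "performance standby",
    501: "not initialized",
    503: "sealed",
}

def classify_health(status_code: int) -> str:
    """Map an HTTP status from GET /v1/sys/health to its documented meaning."""
    return HEALTH_CODES.get(status_code, "unexpected status")
```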
Monitoring CPU, memory, and disk is not the job of the service running on a VM/server/container; it should be part of your infrastructure monitoring, such as Datadog/Splunk/CloudWatch/etc.
You should be logging to multiple audit devices, rotating logs (via cron or similar) so you can handle a disk-full event, and running multiple nodes behind a load balancer so you keep uptime if one node stops serving requests (for this or other purposes).
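A minimal sketch of that setup, assuming a systemd-managed Vault and illustrative file paths (Vault only fails a request if no enabled audit device can record it, so a second device on a separate filesystem avoids the single-disk failure mode):

```shell
# Enable two file audit devices on different filesystems.
# Paths and mount points here are illustrative assumptions.
vault audit enable -path=file_primary file file_path=/var/log/vault/audit.log
vault audit enable -path=file_secondary file file_path=/mnt/altlog/vault/audit.log

# Example logrotate entry (e.g. /etc/logrotate.d/vault-audit) so the
# audit log cannot fill the filesystem. Vault reopens its audit log
# files on SIGHUP, so signal it after rotation; adjust retention to taste.
cat > /etc/logrotate.d/vault-audit <<'EOF'
/var/log/vault/audit.log {
    daily
    rotate 7
    compress
    missingok
    postrotate
        systemctl kill -s HUP vault.service
    endscript
}
EOF
```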
Naturally there are multiple ways to make Vault HA, but my point was that Vault could easily reply with an error code when it cannot use the defined audit devices. As it stands, I'd say it is misleading for /sys/health to reply "all OK" when Vault cannot serve requests in that situation.
I don't disagree overall, but an audit log error can be ephemeral from request to request, whereas /sys/health is meant to be a durable response based on the node's condition, not on a given request's ability to be satisfied.
If you believe it is indeed easy for Vault to do so, I'm sure a PR to do that would be welcome.
You can see background in the related GitHub issues.
Might be best to thumbs-up those for traction.
Thumbs-up done for those. I will also add some additional KV fetching to my Ansible playbook's Vault health check, since I cannot fully rely on /sys/health as I expected.
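The combined check described above can be sketched as pure decision logic (the set of "OK" codes is an assumption you'd tune to your setup, e.g. whether standbys count as healthy). The point is that a real KV read exercises the audit pipeline, so a full audit disk shows up as a failed read even while /sys/health still returns 200:

```python
# Sketch: a node counts as healthy only if /sys/health returns an
# acceptable status code AND a real audited operation (e.g. a KV read)
# succeeds. Which codes count as "OK" is a deployment-specific choice.
OK_HEALTH_CODES = {200, 429, 473}  # active, standby, performance standby

def node_is_healthy(health_status: int, kv_read_ok: bool) -> bool:
    """Combine the /sys/health status with the result of a test KV read."""
    return health_status in OK_HEALTH_CODES and kv_read_ok
```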