I have one simple Vault setup (1.8.2) using an S3 storage backend and a local file audit device. I'm polling the /sys/health endpoint to check whether my Vault is running OK (so I can start a new instance if needed). With a lot of unexpected Vault usage, the audit log file can fill up the whole filesystem. Obviously Vault cannot work properly in this situation, since it cannot write to the audit log. I was expecting to get some indication of this from the /sys/health request, but it looks like that API endpoint reports all OK, even though the disk is full and Vault is not operational. Is this by design, or could /sys/health report this with some status code other than 200?
Vault's health check is simply that: Vault's health check. What you're asking about is a system health check.
For AWS you can use the built-in monitoring (no alerting) or use CloudWatch to set watermarks and get alerts.
I'm not using AWS, so I cannot use CloudWatch. What system health check are you referring to, and what Vault API endpoint? The /sys/health endpoint already returns different status codes based on seal status etc., so I would expect an error status code when the audit device is not working (and hence Vault is not working). I only tested the full-disk scenario, but I assume the situation is the same with any other audit device failure. I do not consider Vault healthy if an audit device problem has stopped it from working.
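For reference, the status codes /sys/health does document can be sketched as a small lookup (values are from the Vault API docs; note that none of them covers an audit device failure, which is exactly the gap discussed in this thread):

```python
# Documented HTTP status codes for GET /v1/sys/health (Vault API docs).
# An audit-device failure is NOT represented here: Vault instead returns
# 500 on the audited requests themselves, while /sys/health stays 200.
HEALTH_CODES = {
    200: "initialized, unsealed, active",
    429: "unsealed, standby",
    472: "disaster recovery replication secondary, active",
    473: "performance standby",
    501: "not initialized",
    503: "sealed",
}

def classify_health(status_code: int) -> str:
    """Map an HTTP status from GET /v1/sys/health to its documented meaning."""
    return HEALTH_CODES.get(status_code, "unexpected status")
```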
Monitoring CPU, memory, and disk is not the job of the service running on a VM/server/container; it should be part of your infrastructure monitoring, such as Datadog/Splunk/CloudWatch/etc.
You should be logging to multiple audit devices, rotating logs (via cron or similar) so you can handle a disk-full event, and running multiple nodes behind a load balancer so you keep uptime if one node stops serving requests (for this or other purposes).
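A minimal sketch of that setup, assuming a systemd-managed Vault and illustrative file paths (Vault only fails a request if no enabled audit device can record it, so a second device on a separate filesystem avoids the single-disk failure mode):

```shell
# Enable two file audit devices on different filesystems.
# Paths and mount points here are illustrative assumptions.
vault audit enable -path=file_primary file file_path=/var/log/vault/audit.log
vault audit enable -path=file_secondary file file_path=/mnt/altlog/vault/audit.log

# Example logrotate entry (e.g. /etc/logrotate.d/vault-audit) so the
# audit log cannot fill the filesystem. Vault reopens its audit log
# files on SIGHUP, so signal it after rotation; adjust retention to taste.
cat > /etc/logrotate.d/vault-audit <<'EOF'
/var/log/vault/audit.log {
    daily
    rotate 7
    compress
    missingok
    postrotate
        systemctl kill -s HUP vault.service
    endscript
}
EOF
```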
Naturally there are multiple ways to make Vault HA, but my point was that Vault could easily reply with an error code when it cannot use the defined audit devices. As it stands, I'd say it is misleading for /sys/health to reply "all OK" when Vault cannot serve requests in that situation.
I don't disagree overall, but an audit log error can be ephemeral from request to request, whereas /sys/health is meant to be a durable response based on the node's condition, not on a given request's ability to be satisfied.
If you believe it is indeed easy for Vault to do so, I'm sure a PR to do that would be welcome.
You can see background in the related GitHub issues.
Might be best to thumbs-up those for traction.
Thumbs-up done for those. I will also add some additional KV fetching to my Ansible playbook's Vault health check, since I cannot fully rely on /sys/health as I expected.
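The combined check described above can be sketched as pure decision logic (the set of "OK" codes is an assumption you'd tune to your setup, e.g. whether standbys count as healthy). The point is that a real KV read exercises the audit pipeline, so a full audit disk shows up as a failed read even while /sys/health still returns 200:

```python
# Sketch: a node counts as healthy only if /sys/health returns an
# acceptable status code AND a real audited operation (e.g. a KV read)
# succeeds. Which codes count as "OK" is a deployment-specific choice.
OK_HEALTH_CODES = {200, 429, 473}  # active, standby, performance standby

def node_is_healthy(health_status: int, kv_read_ok: bool) -> bool:
    """Combine the /sys/health status with the result of a test KV read."""
    return health_status in OK_HEALTH_CODES and kv_read_ok
```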