Monitoring Vault seal status

Hello,

I am struggling with what should be a pretty simple task: monitoring Vault’s seal status.

We use Prometheus to scrape Vault metrics by polling the /sys/metrics endpoint, which returns the vault.core.unsealed gauge. That looks like what we need: it returns 1 when Vault is unsealed.

But whenever Vault is sealed, the metrics endpoint returns a 503 error, “Vault is sealed”, which I guess is expected if Vault can’t read the storage.

But what is the point of the vault.core.unsealed metric if it can never return 0 to indicate that Vault is sealed, since no API calls can be made to a sealed Vault instance?

I am aware of the /sys/seal-status API, but it does not return Prometheus-format metrics, so to use it we would need to write an exporter.
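
For illustration, here is a rough sketch of what such an exporter's core could look like. It only converts the JSON from /v1/sys/seal-status into Prometheus text format. The metric names (vault_seal_status_*) are made up for this example and are not emitted by Vault, and the fetch helper assumes the node's TLS certificate is trusted by the machine running the script.

```python
import json
import urllib.request

# Hypothetical metric names chosen for this sketch; Vault does not emit them.
GAUGES = {
    "sealed": "vault_seal_status_sealed",
    "initialized": "vault_seal_status_initialized",
    "progress": "vault_seal_status_unseal_progress",
}


def seal_status_to_prometheus(status: dict) -> str:
    """Render selected seal-status fields as Prometheus text exposition."""
    lines = []
    for field, metric in GAUGES.items():
        if field not in status:
            continue  # skip fields this Vault version does not report
        lines.append(f"# TYPE {metric} gauge")
        # Booleans become 1/0; integer fields pass through unchanged.
        lines.append(f"{metric} {int(status[field])}")
    return "\n".join(lines) + "\n"


def fetch_seal_status(vault_addr: str, timeout: float = 5.0) -> dict:
    """Fetch /v1/sys/seal-status, which is unauthenticated and answers while sealed."""
    url = f"{vault_addr}/v1/sys/seal-status"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

Serving the rendered text over HTTP, or writing it to a file for the node_exporter textfile collector, is left out here.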

How do you guys monitor Vault seal status? Do you use /sys/seal-status, or some custom logic when the metrics API is returning specific errors?

Any help is appreciated,
Thanks

Check this, it may help: the /sys/health page in the Vault HTTP API docs on HashiCorp Developer.

If you turn on unauthenticated_metrics_access in your Vault configuration file, the metrics endpoint should be responsive whilst Vault is sealed.

Thanks, but /sys/health doesn’t return Prometheus-format metrics either, same as /sys/seal-status.

Thanks! That makes the metrics endpoint responsive when Vault is sealed. But now it’s only returning very few gauges, not including the vault.core.unsealed one. I am only getting back these in the response:

“vault.runtime.alloc_bytes”
“vault.runtime.free_count”
“vault.runtime.heap_objects”
“vault.runtime.malloc_count”
“vault.runtime.num_goroutines”
“vault.runtime.sys_bytes”
“vault.runtime.total_gc_pause_ns”
“vault.runtime.total_gc_runs”

Would you know if that’s expected or am I missing something in my configuration?

api_addr = "http://x.x.x.x:8200"
cluster_addr = "http://x.x.x.x:8201"
ui = true
disable_mlock = true

storage "raft" {
  path = "opt/vault/data"
  node_id = "node1"
}

listener "tcp" {
  address       = "x.x.x.x:8200"
  tls_cert_file = "/opt/vault/tls/x.crt"
  tls_key_file  = "/opt/vault/tls/x.key"
  tls_disable_client_certs = true
  telemetry {
    unauthenticated_metrics_access = true
  }
}

telemetry {
  disable_hostname = true
  prometheus_retention_time = "30s"
}

I would expect a sealed vault to return significantly fewer metrics.

But for vault.core.unsealed to be missing too? That seems like a terribly ironic, unhelpful design if so.

I haven’t tested this myself (we use an external custom prober that calls /sys/health, amongst other things, and exports its own metrics).
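
To give an idea of what such a prober does, here is a minimal sketch (not our actual code). It relies on the default HTTP status codes documented for /v1/sys/health, where 503 means sealed. The state names and the probe helper are made up for this example, and it assumes the node's TLS certificate is trusted by the probing host.

```python
import urllib.error
import urllib.request

# Default status codes documented for /v1/sys/health.
HEALTH_STATES = {
    200: "active",
    429: "standby",
    472: "dr_secondary",
    473: "performance_standby",
    501: "uninitialized",
    503: "sealed",
}


def classify_health(status_code: int) -> str:
    """Map a /v1/sys/health status code to a state name (names are this sketch's own)."""
    return HEALTH_STATES.get(status_code, "unknown")


def probe(vault_addr: str, timeout: float = 5.0) -> str:
    """Call /v1/sys/health and classify the node, treating network errors as unreachable."""
    url = f"{vault_addr}/v1/sys/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify_health(resp.status)
    except urllib.error.HTTPError as err:
        # urllib raises for non-2xx codes, which is how the sealed case (503) arrives.
        return classify_health(err.code)
    except OSError:
        # Covers URLError, connection refused, and timeouts.
        return "unreachable"
```

The prober can then export a gauge of its own, for example 1 when the state is "sealed" and 0 otherwise, so a sealed node shows up as a real value rather than a failed scrape.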

I’d have to search the Vault source code to be sure - I’ll have a look later, when I’m at a real computer rather than a phone.

Managed to find out why it wasn’t returning the unseal metric.

It turns out that, because Vault uses the go-metrics library, a metric that has seen no activity disappears after prometheus_retention_time.

My Vault instance had been sealed for longer than my configured prometheus_retention_time of 30 seconds, so the metric was no longer being returned because its value had not changed in that time.
I just had to increase the prometheus_retention_time setting to keep the metric in Vault’s memory for longer.
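
In case it helps anyone else: since the gauge can expire like this, a check on the scraping side probably should not treat a missing metric as "sealed" or "unsealed". Here is a small sketch of that idea. It assumes the Prometheus-format name is vault_core_unsealed (dots replaced by underscores), and the helper name is made up.

```python
from typing import Optional


def unsealed_from_metrics(text: str, name: str = "vault_core_unsealed") -> Optional[bool]:
    """Return True/False from the gauge in Prometheus text output, or None if it is absent."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Match the exact metric name, with or without a label set.
        if not (line.startswith(name + " ") or line.startswith(name + "{")):
            continue
        # Drop the name and any labels, keeping "value [timestamp]".
        rest = line[line.rfind("}") + 1:] if "{" in line else line[len(name):]
        return float(rest.split()[0]) == 1.0
    # Metric missing, for example because it aged out of the retention window.
    return None
```

A None result can then be alerted on separately from a real 0.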

Thanks again for your help