Vault monitoring

Hello,

I’m trying to set up monitoring for Vault. Is there a list of events that can be logged by the Vault service? Such a list would help us decide what should be escalated as an incident ticket.

Thanks for your help.

We have a page here that has a downloadable monitoring guide PDF: https://learn.hashicorp.com/vault/operations/monitoring. There’s also a raw list of metrics here: https://www.vaultproject.io/docs/internals/telemetry.html.

Hello, thank you for your answer. I’m planning to run a single Vault server, so I’m mostly interested in monitoring the availability of the Vault service. The plan is to collect syslog and, based on the messages logged there, raise an alert if the Vault service returned an error, is sealed, or is not running. Is there a list of messages/errors that the Vault service can generate?

Thanks for your help.
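
For the availability part of this question (sealed or not running), one option besides parsing syslog is to probe Vault's `/v1/sys/health` endpoint, which reports state through its HTTP status code. The following is only a minimal sketch: the address, the timeout, the helper names `classify` and `check_vault`, and the rule that only a 200 counts as healthy on a single server are all assumptions to adapt.

```python
import urllib.error
import urllib.request

# Default status codes documented for GET /v1/sys/health.
HEALTH_STATES = {
    200: "initialized, unsealed, active",
    429: "unsealed, standby",
    472: "disaster recovery secondary",
    473: "performance standby",
    501: "not initialized",
    503: "sealed",
}


def classify(status_code):
    """Return (healthy, description) for a /v1/sys/health status code."""
    description = HEALTH_STATES.get(status_code, "unexpected status")
    # Assumption: on a single server, only an active, unsealed node is healthy.
    return status_code == 200, description


def check_vault(addr="http://127.0.0.1:8200", timeout=5.0):
    """Probe Vault once and return (healthy, description)."""
    url = addr.rstrip("/") + "/v1/sys/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify(resp.status)
    except urllib.error.HTTPError as err:
        # Non-2xx codes (429, 501, 503, ...) are raised as HTTPError.
        return classify(err.code)
    except OSError as err:
        # Connection refused or timed out: the service is likely not running.
        return False, f"unreachable: {err}"
```

Calling `check_vault()` from a cron job or monitoring agent and alerting whenever the first element is `False` would cover the sealed and not-running cases without depending on exact log wording.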

Hello @tyrannosaurus-becks !

Do you know what the default value of vault_core_handle_request is for a Vault request?
I mean, I know it’s in ms, but how do I know whether a request is taking too long to be handled, and so on? Is there any document that describes the default or acceptable values for each metric?

Regards,
Orlando B.

Hi @rolansB,

That’s kind of up to you, i.e. what you consider acceptable. Under normal circumstances it will largely be driven by the storage you’ve provisioned for Vault, the main exception being requests involving dynamic credentials, which are at the mercy of the upstream system for creation/revocation. If it were my Vault server I’d hope that most requests would complete in some fairly small number of milliseconds, but you have to work out what makes sense for your situation, i.e. how slow does Vault have to get before it causes problems for the things that depend on it? Then configure your alerting so that you get notified before it reaches that point.

The PDF is from 2018. Is it still valid after the version upgrades since then?

@rolansB, the easiest way to determine this is to put the server under load in an environment that matches production, and establish a baseline.

Then, in the future, if the response for vault.core.handle_request goes above that baseline, you’ll be able to alert on that anomaly.
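
A rough sketch of that baseline-then-alert idea is below. It assumes you have already collected vault.core.handle_request latencies in milliseconds during the load test. The function names, the sample numbers, and the "mean plus three standard deviations" rule are illustrative choices, not anything Vault prescribes; a percentile-based threshold would work just as well.

```python
import statistics


def baseline_threshold(samples_ms, k=3.0):
    """Return mean + k * sample standard deviation of baseline latencies."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples to establish a baseline")
    return statistics.mean(samples_ms) + k * statistics.stdev(samples_ms)


def is_anomalous(latency_ms, threshold_ms):
    """True when an observed latency exceeds the baseline threshold."""
    return latency_ms > threshold_ms


# Made-up latencies (ms) standing in for measurements taken under load.
baseline_samples = [10.0, 12.0, 11.0, 13.0, 9.0]
threshold = baseline_threshold(baseline_samples)

# Compare a later observation against the baseline.
if is_anomalous(200.0, threshold):
    print(f"handle_request latency above baseline threshold of {threshold:.1f} ms")
```

Re-running the load test after upgrades or storage changes keeps the threshold meaningful, since the baseline is tied to the environment it was measured in.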

Knowing nothing about your cluster, if I had to play a game of “guess the number of jelly beans” here, I’d start with 200ms. But as @ncabatoff said, it can go much higher or lower depending on the host environment, dependent services, and your SLOs. May I ask what your SLO is for a response to each of the following requests:

  • Authentication Response
  • Secrets Engine Response (No External services like a DB or Cloud Provider)
  • Secrets Engine Response (With External Services like a DB or Cloud Provider)

@ncabatoff, for situations where you have used this, what range have you seen for these?