How to check that an auth plugin is healthy?

We have our own auth plugin. It makes a few external calls when it starts up.
Sometimes these requests fail and the plugin can't start. (We are looking into how to prevent the external request failures; that's not the subject of my question.)
When this happens, nothing works and we get the error {"errors":["no handler for route 'auth/ourplugin'"]}.
I want to add a Kubernetes liveness probe to the Vault pods to check that the auth plugin is healthy and running.
It would restart a pod if the check fails, which would fix its state.

  1. In this check I don't have any token available
  2. I need to check only the active pod, not the standbys

Please advise how I can implement this health check for the auth plugin.

You should not make external calls from a Vault plugin as it initializes. Instead, you should let the initialization complete successfully, and only refer to external dependencies whilst handling individual user requests. This is because Vault treats a failed plugin initialization as unrecoverable, whilst the health of external services can recover.

If you instead write your plugin to tolerate external dependencies failing and returning to service, you won't have to restart entire Vault pods to kick the plugin into recovering. Use lazy initialization guarded by a mutex, if you need to, to handle setup tasks that must happen before the plugin can serve requests.
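As a rough illustration, mutex-guarded lazy initialization in a plugin built on the Vault SDK's framework package could look something like this - a sketch only, where the package name, the externalSetup helper and the pathLogin handler are placeholders, not your actual code:

```go
package ourplugin

import (
	"context"
	"fmt"
	"sync"

	"github.com/hashicorp/vault/sdk/framework"
	"github.com/hashicorp/vault/sdk/logical"
)

type backend struct {
	*framework.Backend

	initMu      sync.Mutex
	initialized bool
}

// ensureInitialized runs the expensive external setup on first use rather
// than at plugin startup, so a transient outage of an external system does
// not permanently break the plugin.
func (b *backend) ensureInitialized(ctx context.Context) error {
	b.initMu.Lock()
	defer b.initMu.Unlock()

	if b.initialized {
		return nil
	}
	// externalSetup is a stand-in for your real setup calls.
	if err := externalSetup(ctx); err != nil {
		return fmt.Errorf("external setup failed, will retry on next request: %w", err)
	}
	b.initialized = true
	return nil
}

// pathLogin retries the lazy init on every login attempt until it succeeds.
func (b *backend) pathLogin(ctx context.Context, req *logical.Request, d *framework.FieldData) (*logical.Response, error) {
	if err := b.ensureInitialized(ctx); err != nil {
		return nil, logical.CodedError(503, err.Error())
	}
	// ... normal login handling ...
	return nil, nil
}

// externalSetup stands in for the calls to your external systems.
func externalSetup(ctx context.Context) error { return nil }
```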

But if you really, absolutely have to implement a healthcheck, you can mark specific paths as not requiring authentication - a feature you must already be using in your plugin code, since the login paths of auth methods need to be marked as unauthenticated.
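For reference, those unauthenticated paths are declared via PathsSpecial in the backend factory - a sketch, assuming the backend struct from the previous example and a login path literally named "login":

```go
// newBackend builds the framework.Backend and declares which paths may be
// called without a Vault token.
func newBackend() *backend {
	b := &backend{}
	b.Backend = &framework.Backend{
		BackendType: logical.TypeCredential,
		PathsSpecial: &logical.Paths{
			// Requests to these paths do not require a client token.
			Unauthenticated: []string{"login"},
		},
		Paths: []*framework.Path{
			// ... the login path and the plugin's other paths ...
		},
	}
	return b
}
```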

Thank you for your answer.
We will consider your suggestion about lazy init. The plugin can't work without this init, though - what should we do if it fails? I would want another pod to take leadership and try. How can we do this?
About the health check - your suggestion is to call an unprotected endpoint in the auth plugin from the liveness probe, right?
Thank you again

Return an error to each user request, and keep trying to initialise - possibly with some delay / backoff logic, so that a flood of login requests to a broken plugin doesn’t generate a flood of requests to external systems.
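A sketch of what that throttling could look like, as a variant of the earlier ensureInitialized - it assumes an extra lastAttempt time.Time field on the backend struct and an arbitrary 30-second retry interval:

```go
// retryInterval limits how often a failed setup is retried, so a burst of
// login requests does not turn into a burst of calls to external systems.
const retryInterval = 30 * time.Second

func (b *backend) ensureInitialized(ctx context.Context) error {
	b.initMu.Lock()
	defer b.initMu.Unlock()

	if b.initialized {
		return nil
	}
	// Refuse to retry until the retry interval has elapsed.
	if time.Since(b.lastAttempt) < retryInterval {
		return fmt.Errorf("auth plugin initialization failed recently, retrying later")
	}
	b.lastAttempt = time.Now()

	if err := externalSetup(ctx); err != nil {
		return fmt.Errorf("external setup failed: %w", err)
	}
	b.initialized = true
	return nil
}
```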

Is it realistic that the external system is down for one pod but up for another?

This would get rather complicated. Actually, I'm not sure whether Vault even runs plugins at all on a non-performance standby - I'd have to test to confirm the behaviour in this case.

I was thinking of creating a new unprotected endpoint dedicated to returning the health status.
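Something along these lines, reusing the fields from the earlier sketches - the "health" path name is an assumption, and it would also need to be listed in PathsSpecial.Unauthenticated:

```go
// healthPath defines an unauthenticated, read-only endpoint that reports
// whether the plugin's lazy initialization has completed.
func healthPath(b *backend) *framework.Path {
	return &framework.Path{
		Pattern: "health",
		Operations: map[logical.Operation]framework.OperationHandler{
			logical.ReadOperation: &framework.PathOperation{
				Callback: b.pathHealth,
			},
		},
	}
}

func (b *backend) pathHealth(ctx context.Context, req *logical.Request, d *framework.FieldData) (*logical.Response, error) {
	b.initMu.Lock()
	ready := b.initialized
	b.initMu.Unlock()

	if !ready {
		// A coded error surfaces to the caller as a non-2xx HTTP status,
		// which is easy for a probe to detect.
		return nil, logical.CodedError(503, "auth plugin not initialized")
	}
	return &logical.Response{
		Data: map[string]interface{}{"healthy": true},
	}, nil
}
```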

We get client errors (timeouts, DNS resolution, etc.), so yes, another client can succeed.
I'm working on investigating and solving this issue.
When I manually delete the failing active pod, another pod takes leadership and in 99.99% of cases starts running successfully.

I want another pod to completely take over leadership and start running all the plugins.

I’ll try to create the health endpoint in auth.
Thank you for your help

I tested, and confirmed that ordinary standby Vault nodes do not even start their plugins - and even the active node does not start plugins until it needs them to process a request.

Because of this, I expect you will need a script-based liveness check - i.e. have Kubernetes run a script inside your pod that evaluates whether the pod is the active node and, only if it is, healthchecks the plugin.

I see, thank you
I added the endpoint to the auth plugin.
How can I check that the pod is active from a liveness script?
Thanks

You can use the status API (/sys/health - HTTP API | Vault | HashiCorp Developer), which returns a 200 status only if the node is the active (leader) node, initialized and unsealed - standby nodes return a non-200 status by default.
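As a sketch (the local listener address, TLS handling, and the auth/ourplugin mount path are assumptions about your deployment), a small helper binary run by an exec liveness probe could combine the two checks:

```go
// livecheck: exit 0 if this node is a standby (nothing to check) or an
// active node whose auth plugin health endpoint answers 200; exit 1 if the
// node is active but the plugin is unhealthy.
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	// Point this at the pod's local Vault listener; a real deployment may
	// also need TLS configuration on the HTTP client.
	addr := os.Getenv("VAULT_ADDR")
	client := &http.Client{Timeout: 3 * time.Second}

	// /sys/health returns 200 only on an initialized, unsealed, active node;
	// standbys return 429 (473 for performance standbys) by default.
	resp, err := client.Get(addr + "/v1/sys/health")
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot query /sys/health:", err)
		os.Exit(1)
	}
	resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		// Not the active node: report healthy so Kubernetes leaves it alone.
		os.Exit(0)
	}

	// Active node: check the plugin's unauthenticated health endpoint.
	resp, err = client.Get(addr + "/v1/auth/ourplugin/health")
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot query plugin health endpoint:", err)
		os.Exit(1)
	}
	resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		fmt.Fprintln(os.Stderr, "auth plugin unhealthy, status:", resp.StatusCode)
		os.Exit(1)
	}
	os.Exit(0)
}
```

You would then wire this up as an exec livenessProbe command in the pod spec rather than an httpGet probe, since the decision depends on two requests.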
