Vault agent telemetry: not as useful as it could be?

I am currently struggling to figure out how I am supposed to use the existing vault.agent.auth.success and vault.agent.auth.failure metrics to implement monitoring that answers the question “which vault agents are currently not authenticated to the vault?”.

This might be because the metrics system used by whoever requested and implemented this feature works differently from Prometheus.

The only way that I can see to do so is to figure out which of the two counters changed most recently: if the failure metric was incremented last, the vault agent is in an error state. But such a query, if it is even possible in PromQL, would be horribly expensive, because it would have to scan backwards across the time series of both metrics to find the point in time when the state flipped.
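To make the cost concrete, here is a minimal Python sketch of what that comparison has to do, outside of any real query engine. The sample format and the function names (`last_increment`, `agent_is_failing`) are my own assumptions for illustration, not anything Vault or Prometheus provides, and counter resets are ignored.

```python
from typing import List, Optional, Tuple

# Hypothetical sample format: (unix timestamp in seconds, counter value).
Sample = Tuple[int, float]


def last_increment(samples: List[Sample]) -> Optional[int]:
    """Scan backwards and return the timestamp of the newest sample whose
    value is higher than the sample before it, or None if it never rose."""
    for i in range(len(samples) - 1, 0, -1):
        if samples[i][1] > samples[i - 1][1]:
            return samples[i][0]
    return None


def agent_is_failing(success: List[Sample], failure: List[Sample]) -> bool:
    """True if the failure counter was incremented more recently than the
    success counter within the scraped history."""
    last_ok = last_increment(success)
    last_bad = last_increment(failure)
    if last_bad is None:
        return False  # no failure visible in the history at all
    if last_ok is None:
        return True  # failures seen, but never a success
    return last_bad > last_ok
```

The backwards loop is the expensive part: in the worst case it walks the whole retained history of both series for every agent.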

Another option is to ask “is the vault.agent.auth.failure counter currently higher than it was N minutes ago?” But choosing the correct N is problematic. If N is too small (i.e. shorter than the gap before the agent retries and fails again), vault.agent.auth.failure will be a flatline over N minutes, and the monitoring alert stops ringing even though the agent is still failing. If N is too large, then it will take that much longer for a transient error that has since resolved itself to stop ringing. Also, choosing the correct N requires the monitoring system to hard-code assumptions about the maximum retry backoff that an agent could use, and that is a parameter that could be configured differently for different agents.
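Both failure modes of the N-minute window can be reproduced with a small simulation. This is only a sketch under my own assumptions: time is in whole minutes, the failure timestamps are made up to resemble an exponential backoff, and `counter_at` and `alert_firing` are hypothetical helpers, not Vault or Prometheus functions.

```python
from bisect import bisect_right
from typing import List


def counter_at(failure_times: List[int], t: int) -> int:
    """Value of the failure counter at minute t: the number of failures
    recorded at or before t. failure_times must be sorted ascending."""
    return bisect_right(failure_times, t)


def alert_firing(failure_times: List[int], now: int, window: int) -> bool:
    """The windowed check: is the counter higher now than `window` minutes ago?"""
    return counter_at(failure_times, now) > counter_at(failure_times, now - window)


# Made-up data: an agent that keeps failing with a growing backoff ...
still_failing = [0, 1, 3, 7, 15, 31]
# ... and one whose last failure was at minute 15, after which it recovered.
recovered = [0, 1, 3, 7, 15]
```

With this data, `alert_firing(still_failing, now=20, window=5)` is false although the agent is still broken, because its next retry is not due until minute 31. Widening the window to 30 catches that case, but `alert_firing(recovered, now=40, window=30)` is then still true 25 minutes after the last failure.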

Much more useful would be a vault.agent.auth.status gauge that reports, as 0 or 1, whether the agent is currently authenticated.
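As a rough illustration of what I am asking for, here is a sketch of the bookkeeping in Python. Nothing here is existing Vault code: the `AuthTelemetry` class, the `unauthenticated_agents` helper, and the polarity (1 = authenticated, 0 = not authenticated) are all assumptions on my part.

```python
from typing import Dict, List


class AuthTelemetry:
    """Hypothetical per-agent telemetry: the two existing counters plus
    the proposed status gauge."""

    def __init__(self) -> None:
        self.success = 0  # stands in for vault.agent.auth.success
        self.failure = 0  # stands in for vault.agent.auth.failure
        self.status = 0   # proposed vault.agent.auth.status (assumed: 1 = authenticated)

    def record_success(self) -> None:
        self.success += 1
        self.status = 1

    def record_failure(self) -> None:
        self.failure += 1
        self.status = 0


def unauthenticated_agents(fleet: Dict[str, AuthTelemetry]) -> List[str]:
    """Answer the monitoring question from the current gauge values alone."""
    return sorted(name for name, t in fleet.items() if t.status == 0)
```

The point of the gauge is that the question is answered from the latest sample of one series per agent, with no history scan and no assumption about retry backoff.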

Or am I missing something here?