Vault leader unable to join raft cluster

Hi all,

We are seeing an issue where our Vault master pod is unable to come up.
We checked the pod logs and found this error:
core.raft: failed to activate TLS key: error="failed to read raft TLS keyring: context canceled

Ideally the master should be able to join the raft group, or a new leader election process should be triggered.

Please help with the following queries:

  1. How can we avoid this issue in future?
  2. Please suggest a metric we can alert on to notify us that the master pod is unable to join the cluster.

Setup information:

We are using the free version of Vault in HA mode with 3 nodes: 1 master and 2 standbys.

Vault version:-
1.12.0

Resource Limits:-
resources:
requests:
memory: 12Gi
cpu: 4000m
limits:
memory: 15Gi
cpu: 8000m

HA config:-
ha:
enabled: true
replicas: 3
raft:
enabled: true
setNodeId: true

      config: |
        ui = true
        disable_cache = true
        listener "tcp" {
          tls_disable = 1
          address = "[::]:8200"
          cluster_address = "[::]:8201"
          telemetry {
          unauthenticated_metrics_access = "true"
          }
        }

        storage "raft" {
          path = "/vault/data"
          retry_join {
            leader_api_addr = "http://vault-prod-0.vault-prod-internal:8200"
          }
          retry_join {
            leader_api_addr = "http://vault-prod-1.vault-prod-internal:8200"
          }

There’s a great deal to unpack in this message - let’s take it step by step…

If the pod isn’t up, then it’s not the master. Vault HA functions by having a number of identical deployments, and which one becomes active is determined by election at runtime. There’s no static assignment that one pod is more preferred than another.
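To see which pod is currently active, you can ask each node rather than assume. Here is a minimal Python sketch, assuming the nodes expose the unauthenticated `/v1/sys/leader` endpoint over plain HTTP as in the posted listener config. The helper names `fetch_leader_info` and `find_active` are hypothetical, made up for illustration.

```python
import json
import urllib.request


def fetch_leader_info(addr, timeout=5.0):
    """GET /v1/sys/leader on one node.

    Returns the parsed JSON body, or None if the node is unreachable
    or does not return valid JSON.
    """
    try:
        with urllib.request.urlopen(f"{addr}/v1/sys/leader", timeout=timeout) as resp:
            return json.load(resp)
    except (OSError, ValueError):
        # URLError/HTTPError and socket timeouts are OSError subclasses;
        # a JSON decode failure is a ValueError.
        return None


def find_active(responses):
    """Given {addr: leader_info_or_None}, return the addresses whose
    response has is_self set to true, i.e. nodes that consider
    themselves the active node."""
    return [addr for addr, info in responses.items() if info and info.get("is_self")]
```

You would call `fetch_leader_info` once per pod address (for example the two `leader_api_addr` values from the posted config), collect the results into a dict, and pass that to `find_active`. In a healthy cluster the result should contain exactly one address, and it can be a different pod after each election.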

This error is just a timeout - it is a symptom of a problem, yes, but it doesn’t really give any information about why this operation took so long that Vault gave up. There is not enough information in this one error line to speculate about the actual source of the problem.

On how to avoid this in future: as above, we don’t know what the issue is based on the information shared so far.

On alerting when the master pod can’t join: as above, if it can’t join the cluster, it can’t get elected, so it’s by definition not the master.
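Since any pod can end up active, an alert is better framed per pod than around "the master". Here is a rough Python sketch that polls `/v1/sys/health` on each pod and interprets Vault's default status codes. The alert rule (exactly one active pod, every other pod a standby) is my own assumption, not something Vault prescribes, and the function names are hypothetical. I also can't say how a pod stuck on the error above would respond to this endpoint.

```python
import urllib.error
import urllib.request

# Default /v1/sys/health status codes (without query-string overrides).
HEALTH_CODES = {
    200: "active",               # initialized, unsealed, active
    429: "standby",              # unsealed, standby
    472: "dr_secondary",         # Enterprise only
    473: "performance_standby",  # Enterprise only
    501: "uninitialized",
    503: "sealed",
}


def classify(status):
    """Map an HTTP status code (or None for 'no response') to a state name."""
    if status is None:
        return "unreachable"
    return HEALTH_CODES.get(status, "unknown")


def probe(addr, timeout=5.0):
    """Return the HTTP status of GET {addr}/v1/sys/health, or None if unreachable."""
    try:
        with urllib.request.urlopen(f"{addr}/v1/sys/health", timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises for non-2xx codes such as 429 or 503; the code is still useful.
        return err.code
    except OSError:
        return None


def should_alert(states):
    """Assumed rule: alert unless exactly one pod is active and all others are standby."""
    active = states.count("active")
    healthy = sum(s in ("active", "standby") for s in states)
    return active != 1 or healthy < len(states)
```

For example, `should_alert([classify(probe(a)) for a in pod_addresses])` would fire when one pod out of three stops answering, even though the remaining two still form a working cluster.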

I have no reason (yet) to suspect it is directly related, but as general guidance I would upgrade to bug-fix releases where possible - there have been five 1.12.x releases since this version.

When pasting configuration, please ensure proper indentation is maintained. This goes double when pasting languages like YAML, where indentation is semantically significant!

Where to go from here?

It is hard to say, as all you’ve really shared so far is that an internal operation, which sounds like it would only have been reading local storage, has somehow timed out. You would need to share a lot more information about your cluster, logs, and possibly other status information.

Also, you seem to be implying that only one pod is down, out of 3… in which case, I guess overall your Vault is up, though without redundancy?