Vault leader unable to join raft cluster

yashbharadwaj · April 15, 2023, 9:50am

Hi all,

We are seeing an issue where our Vault master pod is unable to come up.
We checked the pod logs and found this error:
core.raft: failed to activate TLS key: error="failed to read raft TLS keyring: context canceled

Ideally the master should be able to join the raft group, or a new leader election process should be triggered.

Please help with the following queries:

How can we avoid this issue in the future?
Can you suggest a metric we can alert on, so that we are notified when the master pod is unable to join the cluster?

Setup information:

We are using the free version of Vault in HA mode with 3 nodes: 1 master and 2 standbys.

Vault version:-
1.12.0

Resource Limits:-
resources:
requests:
memory: 12Gi
cpu: 4000m
limits:
memory: 15Gi
cpu: 8000m

HA config:-
ha:
enabled: true
replicas: 3
raft:
enabled: true
setNodeId: true

      config: |
        ui = true
        disable_cache = true
        listener "tcp" {
          tls_disable = 1
          address = "[::]:8200"
          cluster_address = "[::]:8201"
          telemetry {
          unauthenticated_metrics_access = "true"
          }
        }

        storage "raft" {
          path = "/vault/data"
          retry_join {
            leader_api_addr = "http://vault-prod-0.vault-prod-internal:8200"
          }
          retry_join {
            leader_api_addr = "http://vault-prod-1.vault-prod-internal:8200"
          }

maxb · April 15, 2023, 10:56am

There’s a great deal to unpack in this message - let’s take it step by step…

If the pod isn’t up, then it’s not the master. Vault HA functions by having a number of identical deployments, and which one becomes active is determined by election at runtime. There’s no static assignment that one pod is more preferred than another.

This error is just a timeout - it is a symptom of a problem, yes, but it doesn’t really give any information about why this operation took so long that Vault gave up. There is not enough information in this one error line to speculate about the actual source of the problem.

As above, we don’t know what the issue is based on the information shared so far.

As above, if it can’t join the cluster, it can’t get elected so it’s by definition not the master.
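On the alerting question, one option (a rough sketch, not something confirmed for this particular cluster) is to poll each pod’s `/v1/sys/health` endpoint and alert on the combined picture. By default that endpoint answers 200 on the active node, 429 on an unsealed standby, 501 if uninitialized and 503 if sealed, so a healthy 3-node cluster should show exactly one “active” and two “standby”. The helper names below (`classify`, `probe`, `should_alert`) are made up for illustration, and plain HTTP is assumed only because the posted listener has `tls_disable = 1`.

```python
import urllib.error
import urllib.request

# Default status codes returned by GET /v1/sys/health.
HEALTH_STATES = {
    200: "active",
    429: "standby",
    472: "dr-secondary",
    473: "performance-standby",
    501: "uninitialized",
    503: "sealed",
}


def classify(status_code):
    """Map a sys/health HTTP status code to a short state name."""
    return HEALTH_STATES.get(status_code, "unknown")


def probe(base_url, timeout=5.0):
    """Return the state of one Vault pod, or 'unreachable' if it does not answer."""
    url = base_url.rstrip("/") + "/v1/sys/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify(resp.status)
    except urllib.error.HTTPError as err:
        # urllib raises for every non-2xx answer, including 429 (standby).
        return classify(err.code)
    except (urllib.error.URLError, OSError):
        # DNS failure, connection refused, timeout, and similar.
        return "unreachable"


def should_alert(states):
    """Alert unless exactly one pod is active and every other pod is a standby."""
    active = sum(1 for state in states.values() if state == "active")
    all_ok = all(state in ("active", "standby") for state in states.values())
    return active != 1 or not all_ok
```

To use it, call `probe("http://vault-prod-0.vault-prod-internal:8200")` for each pod, collect the results in a dict keyed by pod name, and pass that to `should_alert`. The first two hostnames come from the `retry_join` stanzas in the original post; a third one such as `vault-prod-2` is a guess based on `replicas: 3`. A pod stuck at startup, as described here, would most likely show up as “unreachable” or “sealed” rather than “standby”.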

I have no reason (yet) to suspect it is directly related, but as general guidance, I would try to upgrade to bug fix releases where possible - and there have been five 1.12.x releases since this version.

When pasting configuration, please ensure proper indentation is maintained. This goes double when pasting languages like YAML, where indentation is semantically significant!

Where to go from here?

It is hard to say, as all you’ve really shared so far is that an internal operation, which sounds like it would only have been reading local storage, has somehow timed out. You would need to share a lot more information about your cluster, logs, and possibly other status information.

Also, you seem to be implying that only one pod is down, out of 3… in which case, I guess overall your Vault is up, though without redundancy?
