Vault Stuck after cluster restarts

We have deployed a Vault cluster with 3 nodes on OpenShift (Helm chart version 0.27.0, Docker image vault:1.15.2).

Today we had our OpenShift cluster upgrade, during which the Vault pods restarted (and possibly disconnected a couple of times from MySQL, which runs outside OCP).

Vault ended up in a state where no node was active.
Running “vault status” in all the pods gave us the following output:

(screenshot of the “vault status” output)

We tried restarting the pods and stepping down the nodes from the CLI, but nothing worked and some CLI commands just returned 500.
After some time (1-2 hours) it recovered automatically, without us doing anything.
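
For reference, these are roughly the commands involved (pod names as seen in the logs below; the exact invocations we used may have differed slightly):

# recreate a pod (the StatefulSet brings it back)
oc delete pod vault-0

# ask the current active node to give up leadership
oc exec -ti vault-0 -- vault operator step-down

# check the seal/HA state on each pod
oc exec -ti vault-0 -- vault status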

We were wondering what happened, what we could do next time to reduce the recovery time, and whether this can be avoided.

Here are some logs from a Vault pod.
vault-logs.txt (74.3 KB)

Vault Helm chart configuration

vault:
  global:
    openshift: true
    tlsDisable: true # TODO: enable https - certificate signed by unknown authority
  injector:
    enabled: false
    image:
      repository: "artifactory.mycompany.net/hashicorp/vault-k8s"
      #tag: "1.1.0" Automatically set in the helm chart
  ui:
    enabled: true
    annotations:
      service.beta.openshift.io/serving-cert-secret-name: vault-ui-certs
  server:
    image:
      repository: "artifactory.mycompany.net/hashicorp/vault"
      #tag: "1.13.1" Automatically set in the helm chart
    logLevel: info
    logFormat: json
    standalone:
      enabled: true
      config: ""
    ha:
      enabled: true
      replicas: 3
      config: "# this comment prevents default config and is needed" # Use ConfigMap instead
      disruptionBudget:
        enabled: true
        maxUnavailable: 1
    service:
      enabled: true
    route:
      enabled: true
      # host: "" Set in env # let network-operator define route host by default, see readme.md how to customize
      activeService: true
      labels:
        connection: "vault-ui" # Needed for ConnectionRequests and firewall
      tls:
        termination: edge # TODO: enable https - certificate signed by unknown authority
    extraVolumes:
      - name: vault-config-user
        type: configMap
        defaultMode: 424
      - name: vault-ext-storage
        type: secret
        defaultMode: 420
      - name: vault-seal-config
        type: secret
        defaultMode: 420
      - name: vault-ui-certs
        type: secret
        defaultMode: 420
      - name: ca-bundle
        type: configMap
        defaultMode: 420
      - name: vault-ext-storage-ca
        type: secret
        defaultMode: 420
    extraSecretEnvironmentVars:
      - envName: VAULT_TOKEN
        secretName: vault-root-token
        secretKey: VAULT_TOKEN
      - envName: LDAP_PASSWORD
        secretName: vault-ldap-password
        secretKey: LDAP_PASSWORD
    extraArgs: -config=/vault/userconfig/vault-ext-storage/config.hcl -config=/vault/userconfig/vault-config-user/config.hcl -config=/vault/userconfig/vault-seal-config/config.hcl
    #see readme.md for the correct settings of the below properties
    serviceAccount:
      serviceDiscovery:
        enabled: false
    authDelegator:
      enabled: false
    dataStorage:
      enabled: false
    networkPolicy:
      enabled: false
    autoUnseal:
      enabled: true
    topologySpreadConstraints:
      # Pods equally distributed between the clusters
      - maxSkew: 1
        topologyKey: network/location
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app.kubernetes.io/name: vault
      # Pod not scheduled if another pod with the same label is already on the same OCP node
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app.kubernetes.io/name: vault

config.hcl.tpl

{{- define "config.hcl" }}
disable_mlock = true
ui = true

service_registration "kubernetes" {}

listener "tcp" {
  {{- if .Values.vault.global.tlsDisable }}
  tls_disable = 1
  {{- else }}
  tls_cert_file = "/vault/userconfig/vault-ui-certs/tls.crt"
  tls_key_file = "/vault/userconfig/vault-ui-certs/tls.key"
  {{- end }}
  address = "[::]:8200"
  cluster_address = "[::]:8201"
}

{{- if eq (.Values.vault.server.dataStorage.enabled | toString) "true" }}
storage "file" {
  path = "/vault/data"
}
{{- end }}

{{- end }}
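
The storage backend itself lives in the external vault-ext-storage/config.hcl, which isn't shown here. As a rough, hypothetical sketch (the host, credentials, table names and CA path below are all placeholders), a MySQL HA backend for this setup would look something like:

storage "mysql" {
  address     = "mysql.example.net:3306"                        # placeholder host
  username    = "vault"                                         # placeholder credentials
  password    = "REPLACE_ME"
  database    = "vault"
  table       = "vault"
  tls_ca_file = "/vault/userconfig/vault-ext-storage-ca/ca.crt" # placeholder key name
  ha_enabled  = "true"                                          # enables the MySQL lock-based HA seen in the logs
}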

Unfortunately I do not have more logs at the moment (and this happened on 2 different Vault clusters, connected to 2 different MySQL databases but deployed on the same OCP cluster).

Thank you

Hi @fabry006 – welcome to Discuss! :slight_smile:

It’s likely that your cluster destabilised after losing quorum (i.e., too many voting nodes going down in a short span of time).

Have you customised the Helm chart? If so, could you share some details on this?

Autopilot provides server stabilisation by default, but nothing will save your cluster if too many nodes go down at once; even though it wasn’t Vault itself that you were upgrading, the SOP for Vault upgrades might be illustrative of the ‘rolling’ approach that is recommended for replacing nodes.

Thank you @jlj7
I’ve added the configuration in the main topic.
I am pretty sure that the nodes were stopped one by one (we have a PDB configured and a topologyConstraint), which should prevent more than one Vault from being down at the same time.

It is also true that, after seeing that it was not working, we tried multiple times to manually restart the pods, sometimes more than one at the same time.

Once you lose quorum, deliberately stopping all the pods and then bringing them up one by one to join the cluster will be the quickest way to get a stable cluster (like you’re standing it up for the first time). (As you noted in your first post, sometimes, eventually, the Raft protocol can sort things out on its own, but, again, as you noted, that isn’t a quick or ideal option.)
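
In OpenShift terms, that full stop followed by a sequential restart would look roughly like this (assuming the StatefulSet is named vault, matching the pod names in your logs):

# stop all Vault pods
oc scale statefulset vault --replicas=0

# bring up a single node and wait for it to unseal and become active
oc scale statefulset vault --replicas=1
oc exec -ti vault-0 -- vault status   # "HA Mode" should report "active"

# then add the remaining standbys one at a time
oc scale statefulset vault --replicas=2
oc scale statefulset vault --replicas=3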

One thing about your configuration that jumps out straight away: file backend storage; it doesn’t support HA, and is usually used for single Vault instances, testing, etc. (unless I’m missing something).

Edit: that said, HA is enabled, according to the status output you shared. :thinking:

Edit 2: yeah, I think that Helm configuration above is for standalone mode; or at least that’s what this article is suggesting. To be honest, I’m a bit out of my depth on this one; hope these pointers are helpful, though.

The file storage is disabled in our configuration so we are not using it.

We actually tried to stop all the pods and start them one by one… but nothing happened… it just recovered by itself after a long time (~1 hour).

Ah, I missed that. Apologies. Makes sense.

What do you mean by “nothing happened?” Also, changing the time required before being promoted to a voting member of the cluster might help with its stability.
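
(For clusters running on integrated Raft storage, that promotion delay is Autopilot’s server_stabilization_time, which defaults to 10s; it can be inspected and changed as below, where 30s is only an illustrative value.)

vault operator raft autopilot get-config
vault operator raft autopilot set-config -server-stabilization-time=30s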

I mean that even restarting the pods one by one we ended up with the same status… no leader, with more or less the same logs that I attached.

I see this in the logs that I attached in the main comment:

"request_path":"/v1/auth/kubernetes/login","start_time":"2024-03-24T13:16:43Z","status_code":500}

Could this be the root cause?
However, it happened after

"request_path":"/v1/sys/step-down","start_time":"2024-03-24T13:15:22Z","status_code":500}

so I am not sure

I see that the default value is 10s… so it should be fine I think

Here are more logs about standby and lock acquisition.
As you can see, all 3 Vault nodes were in standby from 12:12:54.302 until 13:07:26.617.
In between, we restarted them 3-4 times.

2024-03-24 00:19:30.394	vault-1	{"@level":"info","@message":"acquired lock, enabling active operation","@module":"core","@timestamp":"2024-03-23T23:19:30.394345Z"}
2024-03-24 08:19:30.540	vault-0	{"@level":"info","@message":"acquired lock, enabling active operation","@module":"core","@timestamp":"2024-03-24T07:19:30.540702Z"}
2024-03-24 12:12:54.302	vault-2	{"@level":"info","@message":"entering standby mode","@module":"core","@timestamp":"2024-03-24T11:12:54.215612Z"}
2024-03-24 12:21:39.485	vault-0	{"@level":"info","@message":"entering standby mode","@module":"core","@timestamp":"2024-03-24T11:21:39.445200Z"}
2024-03-24 12:39:27.658	vault-1	{"@level":"info","@message":"entering standby mode","@module":"core","@timestamp":"2024-03-24T11:39:27.566548Z"}
2024-03-24 12:55:23.305	vault-0	{"@level":"info","@message":"entering standby mode","@module":"core","@timestamp":"2024-03-24T11:55:23.259256Z"}
2024-03-24 12:55:29.098	vault-1	{"@level":"info","@message":"entering standby mode","@module":"core","@timestamp":"2024-03-24T11:55:29.060388Z"}
2024-03-24 12:55:37.045	vault-2	{"@level":"info","@message":"entering standby mode","@module":"core","@timestamp":"2024-03-24T11:55:37.010761Z"}
2024-03-24 13:06:16.444	vault-2	{"@level":"info","@message":"entering standby mode","@module":"core","@timestamp":"2024-03-24T12:06:16.388519Z"}
2024-03-24 13:07:26.617	vault-0	{"@level":"info","@message":"entering standby mode","@module":"core","@timestamp":"2024-03-24T12:07:26.560633Z"}
2024-03-24 13:17:38.974	vault-1	{"@level":"info","@message":"acquired lock, enabling active operation","@module":"core","@timestamp":"2024-03-24T12:17:38.974918Z"}
2024-03-24 21:17:39.100	vault-2	{"@level":"info","@message":"acquired lock, enabling active operation","@module":"core","@timestamp":"2024-03-24T20:17:39.100790Z"}

These logs are from the second Vault cluster (so not related to the logs I attached in the main comment), which suffered the same issue.

Actually, I found more logs in the other cluster, where we had enabled our central logging solution.

2024-03-24 08:19:30.540	vault-0	{"@level":"info","@message":"acquired lock, enabling active operation","@module":"core","@timestamp":"2024-03-24T07:19:30.540702Z"}
2024-03-24 08:19:31.222	vault-1	{"@level":"error","@message":"unlocking HA lock failed","@module":"core","@timestamp":"2024-03-24T07:19:31.222616Z","error":"mysql: unable to release lock, already released or not held by this session"}
2024-03-24 10:33:13.622	vault-1	{"@level":"error","@message":"forward request error","@module":"core","@timestamp":"2024-03-24T09:33:13.622949Z","error":"error during forwarding RPC request"}
2024-03-24 10:33:13.622	vault-1	{"@level":"error","@message":"error during forwarded RPC request","@module":"core","@timestamp":"2024-03-24T09:33:13.622898Z","error":"rpc error: code = Canceled desc = context canceled"}
2024-03-24 10:38:32.536	vault-1	{"@level":"error","@message":"failed to acquire lock","@module":"core","@timestamp":"2024-03-24T09:38:32.536739Z","error":"invalid connection"}
2024-03-24 10:51:14.904	vault-1	{"@level":"error","@message":"failed to acquire lock","@module":"core","@timestamp":"2024-03-24T09:51:14.904615Z","error":"invalid connection"}
2024-03-24 10:51:24.195	vault-2	{"@level":"error","@message":"failed to acquire lock","@module":"core","@timestamp":"2024-03-24T09:51:24.195805Z","error":"invalid connection"}
2024-03-24 10:54:38.241	vault-2	{"@level":"error","@message":"failed to acquire lock","@module":"core","@timestamp":"2024-03-24T09:54:38.241953Z","error":"invalid connection"}
2024-03-24 12:12:54.302	vault-2	{"@level":"info","@message":"entering standby mode","@module":"core","@timestamp":"2024-03-24T11:12:54.215612Z"}
2024-03-24 12:21:05.964	vault-0	{"@level":"error","@message":"unlocking HA lock failed","@module":"core","@timestamp":"2024-03-24T11:21:05.964229Z","error":"mysql: unable to release lock, already released or not held by this session"}
2024-03-24 12:21:31.726	vault-2	{"@level":"error","@message":"forward request error","@module":"core","@timestamp":"2024-03-24T11:21:31.726721Z","error":"error during forwarding RPC request"}
2024-03-24 12:21:31.726	vault-2	{"@level":"error","@message":"error during forwarded RPC request","@module":"core","@timestamp":"2024-03-24T11:21:31.726657Z","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 172.16.3.34:8201: i/o timeout\""}
2024-03-24 12:21:39.485	vault-0	{"@level":"info","@message":"entering standby mode","@module":"core","@timestamp":"2024-03-24T11:21:39.445200Z"}
2024-03-24 12:23:08.584	vault-2	{"@level":"error","@message":"forward request error","@module":"core","@timestamp":"2024-03-24T11:23:08.584546Z","error":"error during forwarding RPC request"}
2024-03-24 12:23:08.584	vault-2	{"@level":"error","@message":"error during forwarded RPC request","@module":"core","@timestamp":"2024-03-24T11:23:08.584489Z","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing remote error: tls: internal error\""}
2024-03-24 12:24:32.818	vault-2	{"@level":"error","@message":"forward request error","@module":"core","@timestamp":"2024-03-24T11:24:32.818784Z","error":"error during forwarding RPC request"}
2024-03-24 12:24:32.818	vault-2	{"@level":"error","@message":"error during forwarded RPC request","@module":"core","@timestamp":"2024-03-24T11:24:32.818701Z","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing remote error: tls: internal error\""}
2024-03-24 12:25:13.644	vault-1	{"@level":"error","@message":"forward request error","@module":"core","@timestamp":"2024-03-24T11:25:13.644322Z","error":"error during forwarding RPC request"}
2024-03-24 12:25:13.644	vault-1	{"@level":"error","@message":"error during forwarded RPC request","@module":"core","@timestamp":"2024-03-24T11:25:13.644219Z","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing remote error: tls: internal error\""}
2024-03-24 12:25:24.840	vault-1	{"@level":"error","@message":"forward request error","@module":"core","@timestamp":"2024-03-24T11:25:24.840407Z","error":"error during forwarding RPC request"}
2024-03-24 12:25:24.840	vault-1	{"@level":"error","@message":"error during forwarded RPC request","@module":"core","@timestamp":"2024-03-24T11:25:24.840340Z","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing remote error: tls: internal error\""}
2024-03-24 12:25:34.441	vault-1	{"@level":"error","@message":"forward request error","@module":"core","@timestamp":"2024-03-24T11:25:34.441819Z","error":"error during forwarding RPC request"}
2024-03-24 12:25:34.441	vault-1	{"@level":"error","@message":"error during forwarded RPC request","@module":"core","@timestamp":"2024-03-24T11:25:34.441775Z","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing remote error: tls: internal error\""}
2024-03-24 12:25:43.841	vault-1	{"@level":"error","@message":"forward request error","@module":"core","@timestamp":"2024-03-24T11:25:43.841209Z","error":"error during forwarding RPC request"}
2024-03-24 12:25:43.841	vault-1	{"@level":"error","@message":"error during forwarded RPC request","@module":"core","@timestamp":"2024-03-24T11:25:43.841151Z","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing remote error: tls: internal error\""}
2024-03-24 12:26:02.249	vault-2	{"@level":"error","@message":"forward request error","@module":"core","@timestamp":"2024-03-24T11:26:02.249413Z","error":"error during forwarding RPC request"}
2024-03-24 12:26:02.249	vault-2	{"@level":"error","@message":"error during forwarded RPC request","@module":"core","@timestamp":"2024-03-24T11:26:02.249358Z","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing remote error: tls: internal error\""}
2024-03-24 12:26:12.875	vault-1	{"@level":"error","@message":"forward request error","@module":"core","@timestamp":"2024-03-24T11:26:12.875382Z","error":"error during forwarding RPC request"}
2024-03-24 12:26:12.875	vault-1	{"@level":"error","@message":"error during forwarded RPC request","@module":"core","@timestamp":"2024-03-24T11:26:12.875337Z","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing remote error: tls: internal error\""}
2024-03-24 12:26:28.641	vault-2	{"@level":"error","@message":"forward request error","@module":"core","@timestamp":"2024-03-24T11:26:28.641197Z","error":"error during forwarding RPC request"}
2024-03-24 12:26:28.641	vault-2	{"@level":"error","@message":"error during forwarded RPC request","@module":"core","@timestamp":"2024-03-24T11:26:28.641138Z","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing remote error: tls: internal error\""}
2024-03-24 12:27:04.046	vault-2	{"@level":"error","@message":"forward request error","@module":"core","@timestamp":"2024-03-24T11:27:04.046778Z","error":"error during forwarding RPC request"}
2024-03-24 12:27:04.046	vault-2	{"@level":"error","@message":"error during forwarded RPC request","@module":"core","@timestamp":"2024-03-24T11:27:04.046714Z","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing remote error: tls: internal error\""}
2024-03-24 12:27:34.042	vault-1	{"@level":"error","@message":"forward request error","@module":"core","@timestamp":"2024-03-24T11:27:34.042392Z","error":"error during forwarding RPC request"}
2024-03-24 12:27:34.042	vault-1	{"@level":"error","@message":"error during forwarded RPC request","@module":"core","@timestamp":"2024-03-24T11:27:34.042352Z","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing remote error: tls: internal error\""}
2024-03-24 12:30:12.898	vault-1	{"@level":"error","@message":"forward request error","@module":"core","@timestamp":"2024-03-24T11:30:12.898954Z","error":"error during forwarding RPC request"}
2024-03-24 12:30:12.898	vault-1	{"@level":"error","@message":"error during forwarded RPC request","@module":"core","@timestamp":"2024-03-24T11:30:12.898896Z","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing remote error: tls: internal error\""}
2024-03-24 12:31:12.823	vault-1	{"@level":"error","@message":"forward request error","@module":"core","@timestamp":"2024-03-24T11:31:12.823953Z","error":"error during forwarding RPC request"}
2024-03-24 12:31:12.823	vault-1	{"@level":"error","@message":"error during forwarded RPC request","@module":"core","@timestamp":"2024-03-24T11:31:12.823911Z","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing remote error: tls: internal error\""}
2024-03-24 12:33:17.188	vault-2	{"@level":"error","@message":"forward request error","@module":"core","@timestamp":"2024-03-24T11:33:17.188879Z","error":"error during forwarding RPC request"}
2024-03-24 12:33:17.188	vault-2	{"@level":"error","@message":"error during forwarded RPC request","@module":"core","@timestamp":"2024-03-24T11:33:17.188810Z","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing remote error: tls: internal error\""}
2024-03-24 12:34:21.543	vault-1	{"@level":"error","@message":"forward request error","@module":"core","@timestamp":"2024-03-24T11:34:21.543184Z","error":"error during forwarding RPC request"}
2024-03-24 12:34:21.543	vault-1	{"@level":"error","@message":"error during forwarded RPC request","@module":"core","@timestamp":"2024-03-24T11:34:21.543116Z","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing remote error: tls: internal error\""}
2024-03-24 12:35:28.824	vault-2	{"@level":"error","@message":"forward request error","@module":"core","@timestamp":"2024-03-24T11:35:28.824038Z","error":"error during forwarding RPC request"}
2024-03-24 12:35:28.824	vault-2	{"@level":"error","@message":"error during forwarded RPC request","@module":"core","@timestamp":"2024-03-24T11:35:28.823980Z","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing remote error: tls: internal error\""}
2024-03-24 12:36:41.074	vault-1	{"@level":"error","@message":"forward request error","@module":"core","@timestamp":"2024-03-24T11:36:41.074737Z","error":"error during forwarding RPC request"}
2024-03-24 12:36:41.074	vault-1	{"@level":"error","@message":"error during forwarded RPC request","@module":"core","@timestamp":"2024-03-24T11:36:41.074700Z","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing remote error: tls: internal error\""}
2024-03-24 12:38:09.843	vault-1	{"@level":"error","@message":"forward request error","@module":"core","@timestamp":"2024-03-24T11:38:09.843338Z","error":"error during forwarding RPC request"}
2024-03-24 12:38:09.843	vault-1	{"@level":"error","@message":"error during forwarded RPC request","@module":"core","@timestamp":"2024-03-24T11:38:09.843291Z","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing remote error: tls: internal error\""}
2024-03-24 12:39:27.658	vault-1	{"@level":"info","@message":"entering standby mode","@module":"core","@timestamp":"2024-03-24T11:39:27.566548Z"}
2024-03-24 12:39:37.849	vault-2	{"@level":"error","@message":"forward request error","@module":"core","@timestamp":"2024-03-24T11:39:37.849279Z","error":"error during forwarding RPC request"}
2024-03-24 12:39:37.849	vault-2	{"@level":"error","@message":"error during forwarded RPC request","@module":"core","@timestamp":"2024-03-24T11:39:37.849237Z","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing remote error: tls: internal error\""}
2024-03-24 12:51:14.290	vault-2	{"@level":"error","@message":"forward request error","@module":"core","@timestamp":"2024-03-24T11:51:14.290445Z","error":"error during forwarding RPC request"}
2024-03-24 12:51:14.290	vault-2	{"@level":"error","@message":"error during forwarded RPC request","@module":"core","@timestamp":"2024-03-24T11:51:14.290384Z","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing remote error: tls: internal error\""}
2024-03-24 12:53:33.725	vault-2	{"@level":"error","@message":"forward request error","@module":"core","@timestamp":"2024-03-24T11:53:33.725875Z","error":"error during forwarding RPC request"}
2024-03-24 12:53:33.725	vault-2	{"@level":"error","@message":"error during forwarded RPC request","@module":"core","@timestamp":"2024-03-24T11:53:33.725829Z","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing remote error: tls: internal error\""}
2024-03-24 12:54:35.612	vault-2	{"@level":"error","@message":"forward request error","@module":"core","@timestamp":"2024-03-24T11:54:35.612175Z","error":"error during forwarding RPC request"}
2024-03-24 12:54:35.612	vault-2	{"@level":"error","@message":"error during forwarded RPC request","@module":"core","@timestamp":"2024-03-24T11:54:35.612127Z","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing remote error: tls: internal error\""}
2024-03-24 12:55:23.305	vault-0	{"@level":"info","@message":"entering standby mode","@module":"core","@timestamp":"2024-03-24T11:55:23.259256Z"}
2024-03-24 12:55:29.098	vault-1	{"@level":"info","@message":"entering standby mode","@module":"core","@timestamp":"2024-03-24T11:55:29.060388Z"}
2024-03-24 12:55:37.045	vault-2	{"@level":"info","@message":"entering standby mode","@module":"core","@timestamp":"2024-03-24T11:55:37.010761Z"}
2024-03-24 13:06:16.444	vault-2	{"@level":"info","@message":"entering standby mode","@module":"core","@timestamp":"2024-03-24T12:06:16.388519Z"}
2024-03-24 13:07:26.617	vault-0	{"@level":"info","@message":"entering standby mode","@module":"core","@timestamp":"2024-03-24T12:07:26.560633Z"}
2024-03-24 13:17:38.974	vault-1	{"@level":"info","@message":"acquired lock, enabling active operation","@module":"core","@timestamp":"2024-03-24T12:17:38.974918Z"}
2024-03-24 21:17:39.100	vault-2	{"@level":"info","@message":"acquired lock, enabling active operation","@module":"core","@timestamp":"2024-03-24T20:17:39.100790Z"}

I specifically meant stopping all pods, and then bringing up one, as leader, before adding others to the cluster.

Are you sure? How long are you leaving between pod restarts / start-ups? Also, have you considered running more than three nodes, to increase your overall fault tolerance, given the instability in your environment?
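
(If you do go that way, it should just be a matter of raising the HA replica count in your Helm values, e.g.:)

vault:
  server:
    ha:
      enabled: true
      replicas: 5   # illustrative value: two extra standbys for more failover headroom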

OK, now I get it; actually I don’t remember if we tried this.

What value do you suggest?
Yes, we can add more nodes.