Vault core lock issue

Hi All,

We are running a 2-node Vault cluster and a 3-node Consul server cluster.
Vault uses Consul as its storage backend, and all servers are deployed in a Kubernetes cluster.

Consul Server Version: 1.8.3
Vault Server Version: 1.7.3
Consul agent Version: 1.8.4

Occasionally we observe the following errors on the Consul server:
2022-11-04T05:05:10.928Z [INFO] agent.server.serf.lan: serf: EventMemberFailed: vault-1 10.x.x.x
2022-11-04T05:05:10.928Z [INFO] agent.server: member failed, marking health critical: member=vault-1
2022-11-04T05:05:10.939Z [WARN] agent.server.kvs: Rejecting lock of key due to lock-delay: key=vault/core/lock expire_time="2022-11-04 05:05:25.936845916 +0000 UTC m=+3246519.644580801"
2022-11-04T05:05:11.068Z [INFO] agent.server.serf.lan: serf: EventMemberJoin: vault-1 10.x.x.x
2022-11-04T05:05:11.068Z [INFO] agent.server: member joined, marking health alive: member=vault-1
2022-11-04T05:05:12.477Z [WARN] agent.server.kvs: Rejecting lock of key due to lock-delay: key=vault/core/lock expire_time="2022-11-04 05:05:25.936845916 +0000 UTC m=+3246519.644580801"
2022-11-04T05:05:15.942Z [WARN] agent.server.kvs: Rejecting lock of key due to lock-delay: key=vault/core/lock expire_time="2022-11-04 05:05:25.936845916 +0000 UTC m=+3246519.644580801"
2022-11-04T05:05:17.482Z [WARN] agent.server.kvs: Rejecting lock of key due to lock-delay: key=vault/core/lock expire_time="2022-11-04 05:05:25.936845916 +0000 UTC m=+3246519.644580801"
2022-11-04T05:05:20.946Z [WARN] agent.server.kvs: Rejecting lock of key due to lock-delay: key=vault/core/lock expire_time="2022-11-04 05:05:25.936845916 +0000 UTC m=+3246519.644580801"
2022-11-04T05:05:22.487Z [WARN] agent.server.kvs: Rejecting lock of key due to lock-delay: key=vault/core/lock expire_time="2022-11-04 05:05:25.936845916 +0000 UTC m=+3246519.644580801"

During this time there seems to be no active Vault node; both Vault nodes are in standby mode.

On the Consul agent of the Vault server, we observe the following errors:
2022-11-04T05:05:25.957Z [ERROR] agent.client: RPC failed to server: method=KVS.Apply server=10.4.10.7:8300 error="rpc error making call: rpc error making call: invalid session "ce713bb8-e669-2c3e-6706-8b2af933c2d6""
2022-11-04T05:05:25.957Z [ERROR] agent.http: Request error: method=PUT url=/v1/kv/vault/core/lock?acquire=ce713bb8-e669-2c3e-6706-8b2af933c2d6&flags=3304740253564472344 from=127.0.0.1:59652 error="rpc error making call: rpc error making call: invalid session "ce713bb8-e669-2c3e-6706-8b2af933c2d6""

Also, while the issue is occurring, the other Vault node does not become leader, and for more than 15 seconds there is no Vault leader in the cluster.
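
As a rough cross-check of that window, the gap between the `EventMemberFailed` timestamp and the `expire_time` in the lock-delay warnings can be computed directly from the log lines. The Python sketch below is only an illustration: the helper names are made up for this post, and it assumes the log timestamps and `expire_time` are both in UTC and formatted exactly as in the lines pasted above.

```python
from datetime import datetime, timezone


def parse_log_ts(ts: str) -> datetime:
    """Parse a Consul log timestamp such as 2022-11-04T05:05:10.928Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)


def parse_expire_time(value: str) -> datetime:
    """Parse the expire_time field, ignoring the zone and monotonic-clock suffix."""
    date_part, time_part = value.split()[:2]
    whole, _, frac = time_part.partition(".")
    # Python datetimes hold microseconds, so nanoseconds are truncated.
    frac = (frac or "0")[:6]
    parsed = datetime.strptime(f"{date_part} {whole}.{frac}", "%Y-%m-%d %H:%M:%S.%f")
    return parsed.replace(tzinfo=timezone.utc)


def lock_delay_window(failed_ts: str, expire_time: str) -> float:
    """Seconds between the member-failed event and the end of the lock-delay."""
    return (parse_expire_time(expire_time) - parse_log_ts(failed_ts)).total_seconds()


if __name__ == "__main__":
    window = lock_delay_window(
        "2022-11-04T05:05:10.928Z",
        "2022-11-04 05:05:25.936845916 +0000 UTC m=+3246519.644580801",
    )
    print(f"lock-delay window: {window:.3f} s")
```

With the values from the logs above this comes out at roughly 15 seconds, which matches the period during which we see no Vault leader.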

It seems that Consul's liveness checking of nodes determined that vault-1 had probably failed.

It therefore released the leadership lock owned by that node, and placed it into lock-delay state for 15 seconds, during which, no Vault node was allowed to become leader.

This is a safety measure to ensure that the failed node had genuinely realised it was no longer the leader, to avoid data consistency problems if there were multiple Vault nodes trying to be leader at once.
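
To make the sequence concrete, here is a toy Python model of the behaviour described above. It is a sketch for illustration, not Consul's actual implementation: the `ToyKV` class and its method names are hypothetical, time is passed in as plain numbers, and the 15-second value is Consul's default session lock-delay (assumed here to be unchanged in this setup).

```python
from dataclasses import dataclass, field
from typing import Dict, Set

LOCK_DELAY_SECONDS = 15.0  # Consul's default session lock-delay (assumed unchanged)


@dataclass
class ToyKV:
    """Toy model of session-based locks with lock-delay (not Consul's real code)."""

    sessions: Set[str] = field(default_factory=set)
    holders: Dict[str, str] = field(default_factory=dict)        # key -> session id
    delay_until: Dict[str, float] = field(default_factory=dict)  # key -> end of delay

    def create_session(self, sid: str) -> None:
        self.sessions.add(sid)

    def invalidate_session(self, sid: str, now: float) -> None:
        """Drop the session, release its locks, and start a lock-delay on each key."""
        self.sessions.discard(sid)
        for key, holder in list(self.holders.items()):
            if holder == sid:
                del self.holders[key]
                self.delay_until[key] = now + LOCK_DELAY_SECONDS

    def acquire(self, key: str, sid: str, now: float) -> bool:
        """Try to take the lock; False means rejected (held or in lock-delay)."""
        if sid not in self.sessions:
            raise ValueError(f"invalid session {sid!r}")
        if now < self.delay_until.get(key, 0.0):
            return False  # "Rejecting lock of key due to lock-delay"
        if self.holders.get(key, sid) != sid:
            return False  # another live session still holds the lock
        self.holders[key] = sid
        return True


if __name__ == "__main__":
    kv = ToyKV()
    kv.create_session("vault-0")
    kv.create_session("vault-1")
    kv.acquire("vault/core/lock", "vault-1", now=0.0)          # vault-1 is leader
    kv.invalidate_session("vault-1", now=10.0)                 # member marked failed
    print(kv.acquire("vault/core/lock", "vault-0", now=12.0))  # still in lock-delay
    print(kv.acquire("vault/core/lock", "vault-0", now=25.0))  # delay has expired
```

In this model the standby's attempts are rejected until the delay expires, and the old leader's attempt to re-acquire with its previous session fails with an invalid-session error, which lines up with the two kinds of log messages in the question.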
