Hi team,
I’m writing to ask for help in improving the custom plugin upgrade process for our Kubernetes StatefulSet running Vault.
Our current setup is as follows:
We have developed our own plugins for Vault.
We have 3 replicas of the Vault pod in the StatefulSet with “RollingUpdate” strategy.
When a pod starts, its init container checks whether it ships a new plugin version and, if so, upgrades the plugin by registering its checksum.
The main pod container just runs the Vault server.
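To make the init-container check concrete, here is a minimal sketch of the comparison it performs. It is not the actual script: `file_sha256` and `needs_registration` are hypothetical helper names, and the currently registered checksum is assumed to have been fetched already (for example from the `sha256` field that `vault plugin info` reports).

```python
import hashlib
from typing import Optional


def file_sha256(path: str) -> str:
    """Return the hex SHA-256 digest of the plugin binary at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in chunks so large binaries are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_registration(local_sha: str, registered_sha: Optional[str]) -> bool:
    """True when the catalog has no entry yet or holds a different checksum.

    Assumption: `registered_sha` is None when the plugin was never registered.
    """
    if registered_sha is None:
        return True
    return local_sha.lower() != registered_sha.lower()
```

If `needs_registration` returns true, the init container would go on to register the new checksum (for example with `vault plugin register -sha256=...`). That step is cluster-wide, which is what makes the rolling update below tricky.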
One of the possible upgrade scenarios is as follows:
1. A new Vault image is updated in the StatefulSet.
2. The Vault-2 pod restarts. It was the leader pod. Now Vault-1 is selected to be the leader pod.
3. Vault-2 finds that it’s running with a new plugin version that is different from the currently registered version.
4. Vault-2 registers the new plugin version.
5. Vault-2 starts running the main container with the new Vault version and enters standby mode.
6. Vault-1 restarts. Vault-0 becomes the active pod and the leader.
7. Vault-0 cannot start running the plugin because it has the old binary that doesn’t match the new registered checksum.
8. Vault-1 starts running the new Vault version and enters standby mode.
9. Vault-0 restarts. Vault-2 is selected to be the leader pod.
10. Vault-2 starts running the new plugin version.
In this scenario, there is downtime from step 4 to step 10 because the leader pod can’t serve requests to the plugin (the checksums do not match). It can last up to 2 minutes. This is the worst-case scenario. Sometimes, Vault-2 is immediately selected as the leader, in which case there is almost no downtime.
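The scenario above can be modelled with a small toy simulation. This is only a sketch under simplifying assumptions: `simulate_rollout` is a hypothetical name, each pod has either the "old" or the "new" binary, the first upgraded pod registers the new checksum for the whole cluster, and the plugin is treated as usable only when the leader's binary matches the registered checksum. Timing and already-running plugin processes are ignored.

```python
from typing import Dict, List, Tuple


def simulate_rollout(restart_order: List[int],
                     leader_after: List[int]) -> List[Tuple[int, bool]]:
    """Return (leader, plugin_usable) after each pod restart.

    restart_order: pod indices in the order the StatefulSet restarts them.
    leader_after:  the pod acting as leader once the matching restart is done.
    """
    binary: Dict[int, str] = {pod: "old" for pod in restart_order}
    registered = "old"
    timeline = []
    for pod, leader in zip(restart_order, leader_after):
        binary[pod] = "new"   # the restarted pod now runs the new image
        registered = "new"    # its init container registers the new checksum
        timeline.append((leader, binary[leader] == registered))
    return timeline


# Worst case from the steps above: leadership moves 2 -> 1 -> 0 -> 2.
for leader, usable in simulate_rollout([2, 1, 0], [1, 0, 2]):
    print(f"leader=vault-{leader} plugin_usable={usable}")
```

In this model the plugin only becomes usable again once an upgraded pod holds leadership, which is why the order of leader elections decides how long the outage lasts.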
I’m wondering how we can improve the worst-case scenario to decrease the downtime.
Thank you in advance
PS: I found that a request to a leader pod that runs an old plugin version sometimes succeeds, and sometimes the same request fails with the error message `failed to run existence check (checksums did not match)`.
What determines whether the request succeeds or fails?
maxb, August 18, 2023, 7:41pm
This issue came up in the Vault issue tracker before:
Opened 08:29AM - 21 Apr 23 UTC, closed 07:31PM - 14 Jul 23 UTC. Labels: ecosystem/plugin, k8s
**Describe the bug**
I am trying to register a custom plugin I wrote. I've built an OCI image based on `vault:1.12.5` and added my custom plugin.
```console
$ kubectl exec -it -n vault vault-0 -c vault -- sha256sum /etc/vault/vault_plugins/vault-plugin-secrets-nats
13c753a26991858faf820604c6422c31e49368481a18335c6540ac28a7ce2aac /etc/vault/vault_plugins/vault-plugin-secrets-nats
$ vault plugin info secret vault-plugin-secrets-nats
Key Value
--- -----
args [--tls-skip-verify --ca-cert=/vault/tls/ca.crt]
builtin false
command vault-plugin-secrets-nats
deprecation_status n/a
name vault-plugin-secrets-nats
sha256 13c753a26991858faf820604c6422c31e49368481a18335c6540ac28a7ce2aac
version n/a
```
However, when I look at the logs of my Vault instance, I get errors like:
```
2023-04-21T08:17:38.476Z [INFO] core: successfully setup plugin catalog: plugin-directory=/etc/vault/vault_plugins
2023-04-21T08:17:38.479Z [INFO] core: successfully mounted: type=system version="v1.12.5+builtin.vault" path=sys/ namespace="ID: root. Path: "
2023-04-21T08:17:38.479Z [INFO] core: successfully mounted: type=identity version="v1.12.5+builtin.vault" path=identity/ namespace="ID: root. Path: "
2023-04-21T08:17:38.480Z [DEBUG] core: spawning a new plugin process: plugin_name=vault-plugin-secrets-nats id=Ed8aZMet4m
2023-04-21T08:17:38.690Z [ERROR] core: failed to create mount entry: path=nats-secrets/
error=
| invalid backend version: 2 errors occurred:
| \t* checksums did not match
| \t* checksums did not match
|
2023-04-21T08:17:38.691Z [WARN] core: skipping plugin-based mount entry: path=nats-secrets/
2023-04-21T08:17:38.691Z [INFO] core: successfully mounted: type=vault-plugin-secrets-nats version="" path=nats-secrets/ namespace="ID: root. Path: "
```
After deleting the pod three times this worked.
```
2023-04-21T08:30:48.504Z [INFO] core: upgrading plugin information: plugins=[]
2023-04-21T08:30:48.504Z [INFO] core: successfully setup plugin catalog: plugin-directory=/etc/vault/vault_plugins
2023-04-21T08:30:48.506Z [INFO] core: successfully mounted: type=system version="v1.12.5+builtin.vault" path=sys/ namespace="ID: root. Path: "
2023-04-21T08:30:48.508Z [INFO] core: successfully mounted: type=identity version="v1.12.5+builtin.vault" path=identity/ namespace="ID: root. Path: "
2023-04-21T08:30:48.509Z [DEBUG] core: spawning a new plugin process: plugin_name=vault-plugin-secrets-nats id=UPBhZnUJ7I
2023-04-21T08:30:48.588Z [INFO] secrets.vault-plugin-secrets-nats.vault-plugin-secrets-nats_a9ca485e.vault-plugin-secrets-nats: configuring client automatic mTLS
2023-04-21T08:30:48.660Z [DEBUG] secrets.vault-plugin-secrets-nats.vault-plugin-secrets-nats_a9ca485e.vault-plugin-secrets-nats: starting plugin: path=/etc/vault/vault_plugins/vault-plugin-secrets-nats args=["/etc/vault/vault_plugins/vault-plugin-secrets-nats", "--tls-skip-verify", "--ca-cert=/vault/tls/ca.crt"]
2023-04-21T08:30:48.660Z [DEBUG] secrets.vault-plugin-secrets-nats.vault-plugin-secrets-nats_a9ca485e.vault-plugin-secrets-nats: plugin started: path=/etc/vault/vault_plugins/vault-plugin-secrets-nats pid=29
2023-04-21T08:30:48.660Z [DEBUG] secrets.vault-plugin-secrets-nats.vault-plugin-secrets-nats_a9ca485e.vault-plugin-secrets-nats: waiting for RPC address: path=/etc/vault/vault_plugins/vault-plugin-secrets-nats
2023-04-21T08:30:48.671Z [ERROR] secrets.vault-plugin-secrets-nats.vault-plugin-secrets-nats_a9ca485e.vault-plugin-secrets-nats.vault-plugin-secrets-nats: plugin tls init: error="error parsing wrapping token: square/go-jose: compact JWS format must have three parts" timestamp=2023-04-21T08:30:48.670Z
2023-04-21T08:30:48.674Z [INFO] secrets.vault-plugin-secrets-nats.vault-plugin-secrets-nats_a9ca485e.vault-plugin-secrets-nats: plugin process exited: path=/etc/vault/vault_plugins/vault-plugin-secrets-nats pid=29
2023-04-21T08:30:48.787Z [DEBUG] secrets.vault-plugin-secrets-nats.vault-plugin-secrets-nats_a9ca485e.vault-plugin-secrets-nats: starting plugin: metadata=true path=/etc/vault/vault_plugins/vault-plugin-secrets-nats args=["/etc/vault/vault_plugins/vault-plugin-secrets-nats", "--tls-skip-verify", "--ca-cert=/vault/tls/ca.crt"]
2023-04-21T08:30:48.788Z [DEBUG] secrets.vault-plugin-secrets-nats.vault-plugin-secrets-nats_a9ca485e.vault-plugin-secrets-nats: plugin started: metadata=true path=/etc/vault/vault_plugins/vault-plugin-secrets-nats pid=33
2023-04-21T08:30:48.788Z [DEBUG] secrets.vault-plugin-secrets-nats.vault-plugin-secrets-nats_a9ca485e.vault-plugin-secrets-nats: waiting for RPC address: metadata=true path=/etc/vault/vault_plugins/vault-plugin-secrets-nats
2023-04-21T08:30:48.796Z [DEBUG] secrets.vault-plugin-secrets-nats.vault-plugin-secrets-nats_a9ca485e.vault-plugin-secrets-nats.vault-plugin-secrets-nats: plugin address: metadata=true network=unix address=/tmp/plugin454960758 timestamp=2023-04-21T08:30:48.796Z
2023-04-21T08:30:48.796Z [DEBUG] secrets.vault-plugin-secrets-nats.vault-plugin-secrets-nats_a9ca485e.vault-plugin-secrets-nats: using plugin: metadata=true version=4
2023-04-21T08:30:48.859Z [TRACE] secrets.vault-plugin-secrets-nats.vault-plugin-secrets-nats_a9ca485e.vault-plugin-secrets-nats: setup: transport=gRPC status=started
2023-04-21T08:30:48.859Z [TRACE] secrets.vault-plugin-secrets-nats.vault-plugin-secrets-nats_a9ca485e.vault-plugin-secrets-nats.stdio: waiting for stdio data: metadata=true
2023-04-21T08:30:48.862Z [TRACE] secrets.vault-plugin-secrets-nats.vault-plugin-secrets-nats_a9ca485e.vault-plugin-secrets-nats: setup: transport=gRPC status=finished err=<nil> took=3.425275ms
2023-04-21T08:30:48.862Z [TRACE] secrets.vault-plugin-secrets-nats.vault-plugin-secrets-nats_a9ca485e.vault-plugin-secrets-nats: special paths: transport=gRPC status=started
2023-04-21T08:30:48.863Z [TRACE] secrets.vault-plugin-secrets-nats.vault-plugin-secrets-nats_a9ca485e.vault-plugin-secrets-nats: special paths: transport=gRPC status=finished took="745.36µs"
2023-04-21T08:30:48.863Z [TRACE] secrets.vault-plugin-secrets-nats.vault-plugin-secrets-nats_a9ca485e.vault-plugin-secrets-nats: type: transport=gRPC status=started
2023-04-21T08:30:48.864Z [TRACE] secrets.vault-plugin-secrets-nats.vault-plugin-secrets-nats_a9ca485e.vault-plugin-secrets-nats: type: transport=gRPC status=finished took="488.678µs"
2023-04-21T08:30:48.864Z [TRACE] secrets.vault-plugin-secrets-nats.vault-plugin-secrets-nats_a9ca485e.vault-plugin-secrets-nats: version: transport=gRPC status=started
2023-04-21T08:30:48.864Z [TRACE] secrets.vault-plugin-secrets-nats.vault-plugin-secrets-nats_a9ca485e.vault-plugin-secrets-nats: version: transport=gRPC status=finished took="560.649µs"
2023-04-21T08:30:48.864Z [TRACE] secrets.vault-plugin-secrets-nats.vault-plugin-secrets-nats_a9ca485e.vault-plugin-secrets-nats: cleanup: transport=gRPC status=started
2023-04-21T08:30:48.865Z [TRACE] secrets.vault-plugin-secrets-nats.vault-plugin-secrets-nats_a9ca485e.vault-plugin-secrets-nats: cleanup: transport=gRPC status=finished took="756.553µs"
2023-04-21T08:30:48.866Z [DEBUG] secrets.vault-plugin-secrets-nats.vault-plugin-secrets-nats_a9ca485e.vault-plugin-secrets-nats.stdio: received EOF, stopping recv loop: metadata=true err="rpc error: code = Unavailable desc = error reading from server: EOF"
2023-04-21T08:30:48.867Z [INFO] secrets.vault-plugin-secrets-nats.vault-plugin-secrets-nats_a9ca485e.vault-plugin-secrets-nats: plugin process exited: metadata=true path=/etc/vault/vault_plugins/vault-plugin-secrets-nats pid=33
2023-04-21T08:30:48.867Z [DEBUG] secrets.vault-plugin-secrets-nats.vault-plugin-secrets-nats_a9ca485e.vault-plugin-secrets-nats: plugin exited: metadata=true
2023-04-21T08:30:48.867Z [INFO] core: successfully mounted: type=vault-plugin-secrets-nats version="" path=nats-secrets/ namespace="ID: root. Path: "
```
**Expected behavior**
I'd expect to be able to register the plugin successfully on the first attempt.
**Environment:**
* Vault Server Version (retrieve with `vault status`):
```
Key Value
--- -----
Seal Type shamir
Initialized true
Sealed false
Total Shares 5
Threshold 3
Version 1.12.5
Build Date 2023-03-23T12:51:46Z
Storage Type file
Cluster Name vault-cluster-45b7d7a0
Cluster ID c3efee13-6b46-8025-03aa-0a9abeab41ca
HA Enabled false
```
* Vault CLI Version (retrieve with `vault version`): `Vault v1.12.3 (209b3dd99fe8ca320340d08c70cff5f620261f9b), built 2023-02-02T09:07:27Z`
I believe the essence of the problem is that the Vault plugin mechanism is fundamentally incompatible with Kubernetes. It appears to be designed for deployments on VMs, where plugins are only updated whilst stable Vault servers continue to run undisturbed.
I proposed in the above-linked issue:
I wonder if HashiCorp would be willing to have a conversation about making the checksum verification of plugins optional … the current approach doesn’t seem well suited to maintaining uptime of a cluster during upgrade in K8s?
There was a response, but I declined to take the lead on pursuing it, as I personally do not run Vault on Kubernetes.