Custom plugin upgrade in Kubernetes StatefulSet

tatyanab · August 17, 2023, 3:19pm

Hi team,

I’m writing to ask for help in improving the custom plugin upgrade process for our Kubernetes StatefulSet running Vault.

Our current setup is as follows:

We have developed our own plugins for Vault.
We have 3 replicas of the Vault pod in the StatefulSet with “RollingUpdate” strategy.
When a pod starts, its init container checks whether the image carries a new plugin version and, if so, upgrades the plugin by registering its checksum.
The main pod container just runs the Vault server.
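
A minimal sketch of the init-container check described above, not the actual script. The function names are hypothetical, the plugin type `secret` is an assumption, and it assumes `sha256sum`, `awk` and an authenticated `vault` CLI are available in the init container.

```shell
#!/bin/sh
# Sketch of the init-container check; all function names are hypothetical.

# Print the SHA-256 hex digest of a plugin binary.
plugin_sha256() {
    sha256sum "$1" | awk '{print $1}'
}

# Succeed (exit 0) when the on-disk checksum differs from the registered one.
needs_register() {
    [ "$1" != "$2" ]
}

# Register the binary only when its checksum is not the one in the catalog.
# Arguments: plugin name, path to binary, checksum currently registered.
upgrade_plugin_if_changed() {
    name=$1; binary=$2; registered=$3
    local_sum=$(plugin_sha256 "$binary")
    if needs_register "$local_sum" "$registered"; then
        # Assumption: the plugin is a secrets engine, hence the "secret" type.
        vault plugin register -sha256="$local_sum" secret "$name"
    fi
}
```

The registration changes the checksum that every pod in the cluster is verified against, including pods that still hold the old binary, which is why the order of restarts matters in the scenario below.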

One of the possible upgrade scenarios is as follows:

1. A new Vault image is updated in the StatefulSet.
2. The Vault-2 pod restarts. It was the leader pod. Now Vault-1 is selected to be the leader pod.
3. Vault-2 finds that it’s running with a new plugin version that is different from the currently registered version.
4. Vault-2 registers the new plugin version.
5. Vault-2 starts running the main container with the new Vault version and enters standby mode.
6. Vault-1 restarts. Vault-0 becomes the active pod and the leader.
7. Vault-0 cannot start running the plugin because it has the old binary that doesn’t match the new registered checksum.
8. Vault-1 starts running the new Vault version and enters standby mode.
9. Vault-0 restarts. Vault-2 is selected to be the leader pod.
10. Vault-2 starts running the new plugin version.

In this scenario there is downtime from step 4 to step 10, because the leader pod can’t serve requests to the plugin (the checksums do not match). It can last up to 2 minutes. This is the worst case. Sometimes Vault-2 is selected as the leader immediately, in which case there is almost no downtime.

I’m wondering how we can improve the worst-case scenario to decrease the downtime.
Thank you in advance

P.S. I found that a request to a leader pod running the old plugin version sometimes succeeds, and sometimes the same request fails with the error message “failed to run existence check (checksums did not match)”.
What determines whether the request succeeds or fails?
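
One way to narrow this down is to check, at the moment a request fails, which pods hold a binary that differs from the registered checksum. The following is only a sketch under assumptions: the namespace, container name, plugin path and function names are hypothetical placeholders, and it assumes `kubectl` access to the pods.

```shell
#!/bin/sh
# Sketch: report which pods hold a plugin binary that differs from the
# registered checksum. Namespace, container and path are assumptions.
NAMESPACE=${NAMESPACE:-vault}
CONTAINER=${CONTAINER:-vault}
PLUGIN_PATH=${PLUGIN_PATH:-/vault/plugins/my-plugin}

# Print the SHA-256 of the plugin binary inside one pod.
pod_plugin_sha256() {
    kubectl exec -n "$NAMESPACE" "$1" -c "$CONTAINER" -- sha256sum "$PLUGIN_PATH" \
        | awk '{print $1}'
}

# Print one "<pod> match" or "<pod> mismatch" line per pod.
# Arguments: registered checksum, then pod names.
report_pods() {
    registered=$1; shift
    for pod in "$@"; do
        sum=$(pod_plugin_sha256 "$pod")
        if [ "$sum" = "$registered" ]; then
            echo "$pod match"
        else
            echo "$pod mismatch"
        fi
    done
}
```

It could be called as `report_pods "$REGISTERED_SHA" vault-0 vault-1 vault-2`, where `REGISTERED_SHA` is the `sha256` value reported by `vault plugin info` for the plugin.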

maxb · August 18, 2023, 7:41pm

This issue came up in the Vault issue tracker before:

github.com/hashicorp/vault

Cannot mount custom plugin - checksums did not match

opened 08:29AM - 21 Apr 23 UTC

closed 07:31PM - 14 Jul 23 UTC

siredmar

ecosystem/plugin k8s

**Describe the bug**

I try to register a custom plugin i wrote. I've built a OCI image based on `vault:1.12.5` and added my custom plugin.

```console
$ kubectl exec -it -n vault vault-0 -c vault -- sha256sum /etc/vault/vault_plugins/vault-plugin-secrets-nats
13c753a26991858faf820604c6422c31e49368481a18335c6540ac28a7ce2aac  /etc/vault/vault_plugins/vault-plugin-secrets-nats
$ vault plugin info secret vault-plugin-secrets-nats
Key                   Value
---                   -----
args                  [--tls-skip-verify --ca-cert=/vault/tls/ca.crt]
builtin               false
command               vault-plugin-secrets-nats
deprecation_status    n/a
name                  vault-plugin-secrets-nats
sha256                13c753a26991858faf820604c6422c31e49368481a18335c6540ac28a7ce2aac
version               n/a
```

However, when i log at the logs of my vault instance i get errors like:

```
2023-04-21T08:17:38.476Z [INFO] core: successfully setup plugin catalog: plugin-directory=/etc/vault/vault_plugins
2023-04-21T08:17:38.479Z [INFO] core: successfully mounted: type=system version="v1.12.5+builtin.vault" path=sys/ namespace="ID: root. Path: "
2023-04-21T08:17:38.479Z [INFO] core: successfully mounted: type=identity version="v1.12.5+builtin.vault" path=identity/ namespace="ID: root. Path: "
2023-04-21T08:17:38.480Z [DEBUG] core: spawning a new plugin process: plugin_name=vault-plugin-secrets-nats id=Ed8aZMet4m
2023-04-21T08:17:38.690Z [ERROR] core: failed to create mount entry: path=nats-secrets/
  error=
  | invalid backend version: 2 errors occurred:
  | \t* checksums did not match
  | \t* checksums did not match
  |
2023-04-21T08:17:38.691Z [WARN] core: skipping plugin-based mount entry: path=nats-secrets/
2023-04-21T08:17:38.691Z [INFO] core: successfully mounted: type=vault-plugin-secrets-nats version="" path=nats-secrets/ namespace="ID: root. Path: "
```

After deleting the pod three times this worked.

```
2023-04-21T08:30:48.504Z [INFO] core: upgrading plugin information: plugins=[]
2023-04-21T08:30:48.504Z [INFO] core: successfully setup plugin catalog: plugin-directory=/etc/vault/vault_plugins
2023-04-21T08:30:48.506Z [INFO] core: successfully mounted: type=system version="v1.12.5+builtin.vault" path=sys/ namespace="ID: root. Path: "
2023-04-21T08:30:48.508Z [INFO] core: successfully mounted: type=identity version="v1.12.5+builtin.vault" path=identity/ namespace="ID: root. Path: "
2023-04-21T08:30:48.509Z [DEBUG] core: spawning a new plugin process: plugin_name=vault-plugin-secrets-nats id=UPBhZnUJ7I
2023-04-21T08:30:48.588Z [INFO] secrets.vault-plugin-secrets-nats.vault-plugin-secrets-nats_a9ca485e.vault-plugin-secrets-nats: configuring client automatic mTLS
2023-04-21T08:30:48.660Z [DEBUG] secrets.vault-plugin-secrets-nats.vault-plugin-secrets-nats_a9ca485e.vault-plugin-secrets-nats: starting plugin: path=/etc/vault/vault_plugins/vault-plugin-secrets-nats args=["/etc/vault/vault_plugins/vault-plugin-secrets-nats", "--tls-skip-verify", "--ca-cert=/vault/tls/ca.crt"]
2023-04-21T08:30:48.660Z [DEBUG] secrets.vault-plugin-secrets-nats.vault-plugin-secrets-nats_a9ca485e.vault-plugin-secrets-nats: plugin started: path=/etc/vault/vault_plugins/vault-plugin-secrets-nats pid=29
2023-04-21T08:30:48.660Z [DEBUG] secrets.vault-plugin-secrets-nats.vault-plugin-secrets-nats_a9ca485e.vault-plugin-secrets-nats: waiting for RPC address: path=/etc/vault/vault_plugins/vault-plugin-secrets-nats
2023-04-21T08:30:48.671Z [ERROR] secrets.vault-plugin-secrets-nats.vault-plugin-secrets-nats_a9ca485e.vault-plugin-secrets-nats.vault-plugin-secrets-nats: plugin tls init: error="error parsing wrapping token: square/go-jose: compact JWS format must have three parts" timestamp=2023-04-21T08:30:48.670Z
2023-04-21T08:30:48.674Z [INFO] secrets.vault-plugin-secrets-nats.vault-plugin-secrets-nats_a9ca485e.vault-plugin-secrets-nats: plugin process exited: path=/etc/vault/vault_plugins/vault-plugin-secrets-nats pid=29
2023-04-21T08:30:48.787Z [DEBUG] secrets.vault-plugin-secrets-nats.vault-plugin-secrets-nats_a9ca485e.vault-plugin-secrets-nats: starting plugin: metadata=true path=/etc/vault/vault_plugins/vault-plugin-secrets-nats args=["/etc/vault/vault_plugins/vault-plugin-secrets-nats", "--tls-skip-verify", "--ca-cert=/vault/tls/ca.crt"]
2023-04-21T08:30:48.788Z [DEBUG] secrets.vault-plugin-secrets-nats.vault-plugin-secrets-nats_a9ca485e.vault-plugin-secrets-nats: plugin started: metadata=true path=/etc/vault/vault_plugins/vault-plugin-secrets-nats pid=33
2023-04-21T08:30:48.788Z [DEBUG] secrets.vault-plugin-secrets-nats.vault-plugin-secrets-nats_a9ca485e.vault-plugin-secrets-nats: waiting for RPC address: metadata=true path=/etc/vault/vault_plugins/vault-plugin-secrets-nats
2023-04-21T08:30:48.796Z [DEBUG] secrets.vault-plugin-secrets-nats.vault-plugin-secrets-nats_a9ca485e.vault-plugin-secrets-nats.vault-plugin-secrets-nats: plugin address: metadata=true network=unix address=/tmp/plugin454960758 timestamp=2023-04-21T08:30:48.796Z
2023-04-21T08:30:48.796Z [DEBUG] secrets.vault-plugin-secrets-nats.vault-plugin-secrets-nats_a9ca485e.vault-plugin-secrets-nats: using plugin: metadata=true version=4
2023-04-21T08:30:48.859Z [TRACE] secrets.vault-plugin-secrets-nats.vault-plugin-secrets-nats_a9ca485e.vault-plugin-secrets-nats: setup: transport=gRPC status=started
2023-04-21T08:30:48.859Z [TRACE] secrets.vault-plugin-secrets-nats.vault-plugin-secrets-nats_a9ca485e.vault-plugin-secrets-nats.stdio: waiting for stdio data: metadata=true
2023-04-21T08:30:48.862Z [TRACE] secrets.vault-plugin-secrets-nats.vault-plugin-secrets-nats_a9ca485e.vault-plugin-secrets-nats: setup: transport=gRPC status=finished err=<nil> took=3.425275ms
2023-04-21T08:30:48.862Z [TRACE] secrets.vault-plugin-secrets-nats.vault-plugin-secrets-nats_a9ca485e.vault-plugin-secrets-nats: special paths: transport=gRPC status=started
2023-04-21T08:30:48.863Z [TRACE] secrets.vault-plugin-secrets-nats.vault-plugin-secrets-nats_a9ca485e.vault-plugin-secrets-nats: special paths: transport=gRPC status=finished took="745.36µs"
2023-04-21T08:30:48.863Z [TRACE] secrets.vault-plugin-secrets-nats.vault-plugin-secrets-nats_a9ca485e.vault-plugin-secrets-nats: type: transport=gRPC status=started
2023-04-21T08:30:48.864Z [TRACE] secrets.vault-plugin-secrets-nats.vault-plugin-secrets-nats_a9ca485e.vault-plugin-secrets-nats: type: transport=gRPC status=finished took="488.678µs"
2023-04-21T08:30:48.864Z [TRACE] secrets.vault-plugin-secrets-nats.vault-plugin-secrets-nats_a9ca485e.vault-plugin-secrets-nats: version: transport=gRPC status=started
2023-04-21T08:30:48.864Z [TRACE] secrets.vault-plugin-secrets-nats.vault-plugin-secrets-nats_a9ca485e.vault-plugin-secrets-nats: version: transport=gRPC status=finished took="560.649µs"
2023-04-21T08:30:48.864Z [TRACE] secrets.vault-plugin-secrets-nats.vault-plugin-secrets-nats_a9ca485e.vault-plugin-secrets-nats: cleanup: transport=gRPC status=started
2023-04-21T08:30:48.865Z [TRACE] secrets.vault-plugin-secrets-nats.vault-plugin-secrets-nats_a9ca485e.vault-plugin-secrets-nats: cleanup: transport=gRPC status=finished took="756.553µs"
2023-04-21T08:30:48.866Z [DEBUG] secrets.vault-plugin-secrets-nats.vault-plugin-secrets-nats_a9ca485e.vault-plugin-secrets-nats.stdio: received EOF, stopping recv loop: metadata=true err="rpc error: code = Unavailable desc = error reading from server: EOF"
2023-04-21T08:30:48.867Z [INFO] secrets.vault-plugin-secrets-nats.vault-plugin-secrets-nats_a9ca485e.vault-plugin-secrets-nats: plugin process exited: metadata=true path=/etc/vault/vault_plugins/vault-plugin-secrets-nats pid=33
2023-04-21T08:30:48.867Z [DEBUG] secrets.vault-plugin-secrets-nats.vault-plugin-secrets-nats_a9ca485e.vault-plugin-secrets-nats: plugin exited: metadata=true
2023-04-21T08:30:48.867Z [INFO] core: successfully mounted: type=vault-plugin-secrets-nats version="" path=nats-secrets/ namespace="ID: root. Path: "
```

**Expected behavior**

I'd expect, that i can successfully register the plugin on the first approach.

**Environment:**

* Vault Server Version (retrieve with `vault status`):

```
Key             Value
---             -----
Seal Type       shamir
Initialized     true
Sealed          false
Total Shares    5
Threshold       3
Version         1.12.5
Build Date      2023-03-23T12:51:46Z
Storage Type    file
Cluster Name    vault-cluster-45b7d7a0
Cluster ID      c3efee13-6b46-8025-03aa-0a9abeab41ca
HA Enabled      false
```

* Vault CLI Version (retrieve with `vault version`): `Vault v1.12.3 (209b3dd99fe8ca320340d08c70cff5f620261f9b), built 2023-02-02T09:07:27Z`

I believe the essence of the problem is that the Vault plugin mechanism is fundamentally incompatible with Kubernetes. It appears to be designed for deployments on VMs, where plugins are only updated whilst stable Vault servers continue to run undisturbed.

I proposed in the above-linked issue:

I wonder if HashiCorp would be willing to have a conversation about making the checksum verification of plugins optional … the current approach doesn’t seem well suited to maintaining uptime of a cluster during upgrade in K8s?

There was a response, but I declined to take the lead on pursuing it, as I personally do not run Vault on Kubernetes.
