Vault auto-unseal suddenly stopped working

Hey folks,

I’ve been running Vault on GCP without problems for at least six weeks, but after it was restarted today, auto-unseal no longer works.

Here’s what I see in the logs from when Vault was stopped and then started again:

==> Vault shutdown triggered
2021-01-19T08:26:35.883Z [INFO]  core: marked as sealed
2021-01-19T08:26:35.883Z [INFO]  core: stopping cluster listeners
2021-01-19T08:26:35.883Z [INFO]  core.cluster-listener: forwarding rpc listeners stopped
2021-01-19T08:26:36.163Z [INFO]  core.cluster-listener: rpc listeners successfully shut down
2021-01-19T08:26:36.163Z [INFO]  core: cluster listeners successfully shut down
2021-01-19T08:26:36.163Z [INFO]  core: vault is sealed
==> Vault server configuration:

    GCP KMS Crypto Key: vault-auto-unseal
      GCP KMS Key Ring: gateway-frankfurt-46b07e
       GCP KMS Project: relaycorp-cloud-gateway
        GCP KMS Region: europe-west3
           Api Address: http://10.72.1.16:8200
                   Cgo: disabled
       Cluster Address: https://vault-1.vault-internal:8201
            Go Version: go1.14.7
            Listener 1: tcp (addr: "[::]:8200", cluster address: "[::]:8201", max_request_duration: "1m30s", max_request_size: "33554432", tls: "disabled")
             Log Level: info
                 Mlock: supported: true, enabled: false
         Recovery Mode: false
               Storage: gcs (HA available)
               Version: Vault v1.5.4
           Version Sha: 1a730771ec70149293efe91e1d283b10d255c6d1

==> Vault server started! Log data will stream in below:

2021-01-19T08:27:04.637Z [INFO]  proxy environment: http_proxy= https_proxy= no_proxy=
2021-01-19T08:27:05.169Z [INFO]  core: stored unseal keys supported, attempting fetch
2021-01-19T08:27:05.185Z [WARN]  failed to unseal core: error="stored unseal keys are supported, but none were found"
2021-01-19T08:27:10.185Z [INFO]  core: stored unseal keys supported, attempting fetch
2021-01-19T08:27:10.200Z [WARN]  failed to unseal core: error="stored unseal keys are supported, but none were found"
2021-01-19T08:27:11.654Z [INFO]  core.autoseal: seal configuration missing, but cannot check old path as core is sealed: seal_type=recovery
2021-01-19T08:27:15.200Z [INFO]  core: stored unseal keys supported, attempting fetch
2021-01-19T08:27:15.215Z [WARN]  failed to unseal core: error="stored unseal keys are supported, but none were found"
2021-01-19T08:27:16.651Z [INFO]  core.autoseal: seal configuration missing, but cannot check old path as core is sealed: seal_type=recovery

The last three lines have been repeating non-stop in the logs since the outage started.
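
For anyone wanting to confirm which message is looping in a longer log, here is a minimal sketch that tallies WARN and ERROR messages while ignoring their timestamps. It assumes the default text log layout shown above (timestamp, level in square brackets, message); `LOG_LINE` and `summarize_warnings` are names made up for this illustration, not part of Vault.

```python
import re
from collections import Counter

# Assumed layout, taken from the log excerpt above: "<timestamp> [LEVEL]  <message>"
LOG_LINE = re.compile(r"^(?P<ts>\S+)\s+\[(?P<level>[A-Z]+)\]\s+(?P<msg>.*)$")


def summarize_warnings(lines):
    """Count WARN/ERROR messages, grouping identical messages regardless of timestamp."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group("level") in {"WARN", "ERROR"}:
            counts[match.group("msg")] += 1
    return counts


# Lines copied from the log excerpt above.
sample = [
    '2021-01-19T08:27:05.169Z [INFO]  core: stored unseal keys supported, attempting fetch',
    '2021-01-19T08:27:05.185Z [WARN]  failed to unseal core: error="stored unseal keys are supported, but none were found"',
    '2021-01-19T08:27:10.185Z [INFO]  core: stored unseal keys supported, attempting fetch',
    '2021-01-19T08:27:10.200Z [WARN]  failed to unseal core: error="stored unseal keys are supported, but none were found"',
    '2021-01-19T08:27:11.654Z [INFO]  core.autoseal: seal configuration missing, but cannot check old path as core is sealed: seal_type=recovery',
    '2021-01-19T08:27:15.200Z [INFO]  core: stored unseal keys supported, attempting fetch',
    '2021-01-19T08:27:15.215Z [WARN]  failed to unseal core: error="stored unseal keys are supported, but none were found"',
]

for message, count in summarize_warnings(sample).most_common():
    print(f"{count}x {message}")
```

A single message dominating the tally, as it does here, suggests one persistent failure (the stored keys cannot be found) rather than several unrelated ones.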

I haven’t changed the auto-unseal configuration recently, nor have I changed the KMS key. I’ve just checked and the key still exists.

I’m using the official Helm chart (v0.8.0), with this values.yml file and this partial config file. And here are the Terraform resources for the auto-unseal on GCP.

Any idea what could be going on, or how to debug this further?

Thanks!

I should point out that I didn’t stop the server. I suspect the vault-agent-injector did, given the activity I saw in its logs around the same time:

2021-01-19T08:26:16.809Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=30s
2021-01-19T08:26:16.816Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=30s
2021-01-19T08:26:16.867Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=30s
2021-01-19T08:26:17.148Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=30s
2021-01-19T08:26:26.855Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=30s
2021-01-19T08:26:30.606Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=30s
2021-01-19T08:26:30.606Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=30s
2021-01-19T08:26:30.610Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=30s
2021-01-19T08:26:30.730Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=30s
Listening on ":8080"...
2021-01-19T08:26:53.333Z [INFO]  handler: Starting handler..
Updated certificate bundle received. Updating certs...
2021-01-19T08:27:38.901Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=30s

I’ve also just killed the pods, but the new ones have the exact same issue according to the logs.

Is your GCP credentials JSON still valid?
I.e., account_file_path

This is really embarrassing but… I accidentally set a lifecycle rule on Vault’s GCS bucket that empties the bucket. I meant to set it on a different bucket.
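
In case it helps anyone who lands here with the same "none were found" warning: with GCS as the storage backend, the stored keys presumably live in the bucket alongside everything else, so a rule that deletes objects takes them out too. Below is a rough sketch for spotting such a rule. It assumes the bucket's lifecycle configuration has already been exported as JSON (for example with `gsutil lifecycle get gs://<bucket>`, where `<bucket>` is a placeholder). `find_delete_rules` and the example config are hypothetical, written for illustration only.

```python
import json


def find_delete_rules(lifecycle_config):
    """Return the lifecycle rules whose action deletes objects."""
    rules = lifecycle_config.get("rule", [])
    return [r for r in rules if r.get("action", {}).get("type") == "Delete"]


# Hypothetical lifecycle config: one rule deletes every object (age 0 days),
# the other only moves older objects to a cheaper storage class.
example = json.loads("""
{
  "rule": [
    {"action": {"type": "Delete"}, "condition": {"age": 0}},
    {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
     "condition": {"age": 30}}
  ]
}
""")

for rule in find_delete_rules(example):
    print("Delete rule with condition:", rule["condition"])
```

Any Delete rule on the bucket Vault uses for storage deserves a second look, especially one with a very low `age` condition.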

@mikegreen, thank you for looking into this anyway!