Vault moved into Sealed state suddenly (1.4.3)

Team - I have a Vault cluster launched in GCP with 2 Vault instances. All of a sudden this morning I noticed that one of the Vault instances is not up, which resulted in a sealed state. TBH I am not sure what went wrong, as this is the first time I have seen something like this, and I need your expertise to help me with troubleshooting. Sharing some of the logs, which may point to the issue.

==> Vault server configuration:

          Listener 1: tcp (addr: "[::]:8200", cluster address: "[::]:8201", max_request_duration: "1m30s", max_request_size: "33554432", tls: "enabled")
           Log Level: info
               Mlock: supported: true, enabled: false
       Recovery Mode: false
             Storage: etcd (HA available)
             Version: Vault v1.4.3

==> Vault server started! Log data will stream in below:

2021-01-12T23:03:51.721Z [INFO] proxy environment: http_proxy= https_proxy= no_proxy=
2021-01-12T23:03:52.452Z [INFO] core: stored unseal keys supported, attempting fetch
2021-01-12T23:03:52.508Z [INFO] core.cluster-listener.tcp: starting listener: listener_address=[::]:8201
2021-01-12T23:03:52.508Z [INFO] core.cluster-listener: serving cluster requests: cluster_listen_address=[::]:8201
2021-01-12T23:03:52.508Z [INFO] core: vault is unsealed
2021-01-12T23:03:52.508Z [INFO] core: entering standby mode
2021-01-12T23:03:52.524Z [INFO] core: unsealed with stored keys: stored_keys_used=1
{"level":"warn","ts":"2021-01-12T23:55:42.537Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-788e8e41-18b4-4554-b0be-8dfcda3cd540/vault-etcd.sedvip-dev.svc.cluster.local:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: leader changed"}
{"level":"warn","ts":"2021-01-18T23:11:24.362Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-788e8e41-18b4-4554-b0be-8dfcda3cd540/vault-etcd.sedvip-dev.svc.cluster.local:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
{"level":"warn","ts":"2021-01-18T23:11:26.173Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-788e8e41-18b4-4554-b0be-8dfcda3cd540/vault-etcd.sedvip-dev.svc.cluster.local:2379","attempt":1,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
2021-01-18T23:11:26.173Z [ERROR] core: key rotation periodic upgrade check failed: error="context deadline exceeded"
{"level":"warn","ts":"2021-01-18T23:12:12.359Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-788e8e41-18b4-4554-b0be-8dfcda3cd540/vault-etcd.sedvip-dev.svc.cluster.local:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
{"level":"warn","ts":"2021-01-18T23:12:14.565Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-788e8e41-18b4-4554-b0be-8dfcda3cd540/vault-etcd.sedvip-dev.svc.cluster.local:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: leader changed"}
{"level":"warn","ts":"2021-01-18T23:12:14.565Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-788e8e41-18b4-4554-b0be-8dfcda3cd540/vault-etcd.sedvip-dev.svc.cluster.local:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: leader changed"}
{"level":"warn","ts":"2021-01-18T23:12:19.819Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-788e8e41-18b4-4554-b0be-8dfcda3cd540/vault-etcd.sedvip-dev.svc.cluster.local:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
{"level":"warn","ts":"2021-01-18T23:12:21.334Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-788e8e41-18b4-4554-b0be-8dfcda3cd540/vault-etcd.sedvip-dev.svc.cluster.local:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
{"level":"warn","ts":"2021-01-18T23:12:27.121Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-788e8e41-18b4-4554-b0be-8dfcda3cd540/vault-etcd.sedvip-dev.svc.cluster.local:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
{"level":"warn","ts":"2021-01-18T23:15:09.378Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-788e8e41-18b4-4554-b0be-8dfcda3cd540/vault-etcd.sedvip-dev.svc.cluster.local:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
2021-01-18T23:15:14.493Z [INFO] core: acquired lock, enabling active operation
{"level":"warn","ts":"2021-01-18T23:15:18.351Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-788e8e41-18b4-4554-b0be-8dfcda3cd540/vault-etcd.sedvip-dev.svc.cluster.local:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
{"level":"warn","ts":"2021-01-18T23:15:19.590Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-788e8e41-18b4-4554-b0be-8dfcda3cd540/vault-etcd.sedvip-dev.svc.cluster.local:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
2021-01-18T23:15:19.590Z [ERROR] core: error performing key upgrades: error="error reloading master key: error reloading master key: failed to read master key path: context deadline exceeded"
2021-01-18T23:15:19.590Z [INFO] core: marked as sealed
{"level":"warn","ts":"2021-01-18T23:15:20.701Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-788e8e41-18b4-4554-b0be-8dfcda3cd540/vault-etcd.sedvip-dev.svc.cluster.local:2379","attempt":0,"error":"rpc error: code = Canceled desc = context canceled"}
2021-01-18T23:15:20.701Z [INFO] core: stopping cluster listeners
2021-01-18T23:15:20.701Z [INFO] core.cluster-listener: forwarding rpc listeners stopped
2021-01-18T23:15:21.012Z [INFO] core.cluster-listener: rpc listeners successfully shut down
2021-01-18T23:15:21.012Z [INFO] core: cluster listeners successfully shut down
2021-01-18T23:15:23.541Z [INFO] core: vault is sealed
2021-01-19T03:39:14.260Z [INFO] http: TLS handshake error from 10.245.135.49:12662: remote error: tls: unknown certificate
{"level":"warn","ts":"2021-01-19T03:39:15.079Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-788e8e41-18b4-4554-b0be-8dfcda3cd540/vault-etcd.sedvip-dev.svc.cluster.local:2379","attempt":0,"error":"rpc error: code = Unauthenticated desc = etcdserver: invalid auth token"}
{"level":"warn","ts":"2021-01-19T03:44:25.520Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-788e8e41-18b4-4554-b0be-8dfcda3cd540/vault-etcd.sedvip-dev.svc.cluster.local:2379","attempt":0,"error":"rpc error: code = Unauthenticated desc = etcdserver: invalid auth token"}
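
To confirm the state from outside the pod as well, I run the following; the pod name and namespace below are placeholders for my environment:

# seal/HA status of the affected instance (it reports Sealed: true after the event above)
kubectl -n sedvip-dev exec vault-0 -- vault status

# tail of the server log around the seal event
kubectl -n sedvip-dev logs vault-0 --tail=50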

Also, all my etcd instances repeatedly show common messages like the one below.

2021-01-19 01:01:38.437900 I | auth: deleting token TQIeTjyjmSBnEkKy.1167591 for user root

Some more information from the Kubernetes events of an etcd instance:
Events:
Type     Reason     Age                     From      Message
----     ------     ----                    ----      -------
Warning  Unhealthy  93s (x26251 over 11d)   kubelet   (combined from similar events): Readiness probe failed: ==> Bash debug is on
==> [DEBUG] Probing etcd cluster
==> [DEBUG] Probe command: "etcdctl --user root:<> --cert=/opt/bitnami/etcd/certs/client/cert.pem --key=/opt/bitnami/etcd/certs/client/key.pem --cacert=/opt/bitnami/etcd/certs/client/ca.crt endpoint health"
{"level":"warn","ts":"2021-01-20T06:44:01.695Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-dd5b9d5f-a762-40bf-8fe6-796d8d879199/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
127.0.0.1:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Error: unhealthy cluster
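
For reference, this is roughly how I inspect the cluster from inside one of the etcd pods, reusing the probe's client certificates; the root password variable and the use of the --cluster flag are assumptions about my setup:

export ETCDCTL_API=3
ETCD_FLAGS="--user root:$ETCD_ROOT_PASSWORD --cert=/opt/bitnami/etcd/certs/client/cert.pem --key=/opt/bitnami/etcd/certs/client/key.pem --cacert=/opt/bitnami/etcd/certs/client/ca.crt"

# leader, raft term and DB size for every member in the cluster
etcdctl $ETCD_FLAGS endpoint status --cluster --write-out=table

# confirm all members are still registered
etcdctl $ETCD_FLAGS member list --write-out=table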

Looking forward to your advice.

It almost sounds like Vault cannot assemble the unseal keys. Any indication that you're losing your connection to Consul?

Thanks for the response. Actually, we are using etcd as the storage backend, and apart from the above logs I don't find any other suspicious indicators that explicitly point to a connection loss.

"Context deadline exceeded" and "failed to commit proposal" are the errors seen most frequently on the etcd instances.

Do you have any ideas for troubleshooting this? It is causing our Vault cluster to become unstable.

Not sure if this info is of any help or gives you a clue. I exec'd into one of the etcd instances to run the command-line tool etcdctl. With the following put operation I am seeing the context-deadline issue again; ideally it should return OK.

etcdctl put username viswa

{"level":"warn","ts":"2021-01-20T07:57:51.708Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-563fd340-a45a-41df-9ba0-3078be41fdfa/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest connection error: connection closed"}
Error: context deadline exceeded
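
For what it's worth, here are a couple of extra checks I can run from the same pod (assuming client auth is already configured through environment variables, as it was for the put above):

# the same write with a longer client-side timeout, to separate "slow" from "down"
etcdctl --command-timeout=30s put username viswa

# check whether any member has raised an alarm (e.g. NOSPACE), which would also reject writes
etcdctl alarm list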

Sorry, I don't have any experience with etcd. Is this OSS or Enterprise?

Can you start up Vault with debug logging enabled? Maybe that will give you a better clue as to why it can't re-read the master key.

vault server -config=/etc/vault/config-file.hcl -log-level=debug

The other option is to run 'vault debug' and capture the disconnect, but I would try the logging first.
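
Something along these lines is what I had in mind; the duration and output path are just examples:

# collect metrics, pprof and server-status snapshots over a capture window
vault debug -duration=5m -interval=30s -output=/tmp/vault-debug.tar.gz

# quick check of the seal/HA state at any point
vault status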

We are using the OSS version of Vault.

What other possible reasons could there be for this error, where etcd loses its connection with Vault and almost never comes back, with its instances restarting continuously?

With debug enabled inside the etcd pods, I see the following events.

==> [DEBUG] Probing etcd cluster
==> [DEBUG] Probe command: "etcdctl --user root:0UwbGgJ1aihto0IO --cert=/opt/bitnami/etcd/certs/client/cert.pem --key=/opt/bitnami/etcd/certs/client/key.pem --cacert=/opt/bitnami/etcd/certs/client/ca.crt endpoint health"
{"level":"warn","ts":"2021-01-21T13:32:33.493Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-411a24c0-6236-46a3-a905-08194c29782c/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
127.0.0.1:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Error: unhealthy cluster
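
One more thing I want to rule out is slow disk on the etcd members, since as far as I understand repeated "failed to commit proposal: context deadline exceeded" errors can simply mean fsync is not keeping up. The plain-HTTP metrics endpoint below is an assumption about how this chart exposes metrics and may need the same TLS flags:

# fsync and backend-commit latency histograms from the local member
curl -s http://127.0.0.1:2379/metrics | grep -E 'etcd_disk_wal_fsync_duration_seconds|etcd_disk_backend_commit_duration_seconds'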

Here is a new log I found today of Vault going into a sealed state again.

Can someone take a close look and let me know your thoughts?

2021-01-29T15:45:17.793Z [ERROR] core: key rotation periodic upgrade check failed: error="context deadline exceeded"
{"level":"warn","ts":"2021-01-29T15:45:18.483Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-28584bac-868c-42d3-9ed6-aeeb7a8b182d/r01-etcd.vault-notls.svc.cluster.local:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
2021-01-29T15:45:18.483Z [ERROR] core: error performing key upgrades: error="error checking for key upgrades: context deadline exceeded"
2021-01-29T15:45:18.483Z [INFO] core: marked as sealed
{"level":"warn","ts":"2021-01-29T15:45:23.493Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-28584bac-868c-42d3-9ed6-aeeb7a8b182d/r01-etcd.vault-notls.svc.cluster.local:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
2021-01-29T15:45:25.214Z [INFO] core: stopping cluster listeners
2021-01-29T15:45:25.214Z [INFO] core.cluster-listener: forwarding rpc listeners stopped
2021-01-29T15:45:25.489Z [INFO] core.cluster-listener: rpc listeners successfully shut down
2021-01-29T15:45:25.489Z [INFO] core: cluster listeners successfully shut down
2021-01-29T15:45:25.511Z [INFO] core: vault is sealed
2021-01-29T15:45:25.512Z [ERROR] core: key rotation periodic upgrade check failed: error="Vault is sealed"

Were you able to resolve this issue?