Vault standby doesn't become active immediately when vault-active is down

Hi Team,

We are using the public Helm chart for Vault, version 0.22.0, from https://helm.releases.hashicorp.com with HA enabled. The config is:

      ha:
        enabled: true
        replicas: 3
        config: |
          plugin_directory = "/vault/plugins"  # path of custom plugin binaries
          ha_storage "consul" {
            address              = "vault-consul-server:8500"
            path                 = "vault"
            scheme               = "http"
            tls_disable          = 1
          }
          service_registration "kubernetes" {}
          ui = true
          disable_mlock = true
          listener "tcp" {
            tls_disable = 1
            address         = "[::]:8200"
            cluster_address = "[::]:8201"
          }
          storage "s3" {}

But failover from vault-active to a standby still doesn't happen in time, and we get the error: W1114 14:54:49.600619 7 controller.go:1112] Service "vault/vault-active" does not have any active Endpoint.

What else is required for a standby to take over immediately as soon as the active node is down?

There will ALWAYS be a short period of downtime to promote a new node to active.

This is necessary to ensure the outgoing leader really has stopped writing to storage before a new one starts being active.
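You can watch a failover complete by checking each node's HA mode before and after the test. A rough sketch; the pod names (vault-0/1/2) and namespace are assumptions based on your Helm values:

```shell
# Print each Vault pod's HA Mode (active/standby); rerun to watch the transition.
# Pod names and the "vault" namespace are assumptions from your chart config.
for pod in vault-0 vault-1 vault-2; do
  echo "$pod:"
  kubectl exec -n vault "$pod" -- vault status 2>/dev/null | grep "HA Mode"
done
```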

@maxb Thanks for the response, but how much downtime is expected? Our standby takes over only after 25 minutes.

OK, 25 mins does seem a bit ridiculous, yes.

It’s impossible to give a specific time since it can vary so much with performance of storage backend, other configuration options, and size of data in Vault, but I’d generally expect something between a few seconds and a few minutes.

Please post your Vault server logs, to show what is happening during this extended period. Include logs from a period starting before an active node fails, through to after a new node has finished becoming active. Take the logs from the node that became active.
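With the official Helm chart, the Vault server logs are the pod logs. Something like the following should capture them around the failover window; the namespace, pod names, and container name are assumptions based on your setup:

```shell
# Capture logs from each Vault pod covering the failover window.
# -c selects the vault container, since the pods also run sidecars.
# The "vault" namespace and pod names are assumptions from your chart values.
for pod in vault-0 vault-1 vault-2; do
  kubectl logs -n vault "$pod" -c vault --timestamps --since=30m > "$pod.log"
done
```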

Sure, let me create a failover scenario again and send you the logs.

@maxb Which logs will be helpful for analysis?
I have some from the k8s cluster, but they are not sufficient to analyze. How can I get the Vault server logs?

vault-1

==> Vault server configuration:

              HA Storage: consul
             Api Address: http://10.110.182.49:8200
                     Cgo: disabled
         Cluster Address: https://vault-1.vault-internal:8201
              Go Version: go1.17.13
              Listener 1: tcp (addr: "[::]:8200", cluster address: "[::]:8201", max_request_duration: "1m30s", max_request_size: "33554432", tls: "disabled")
               Log Level: info
                   Mlock: supported: true, enabled: false
           Recovery Mode: false
                 Storage: s3
                 Version: Vault v1.11.3, built 2022-08-26T10:27:10Z
             Version Sha: 17250b25303c6418c283c95b1d5a9c9f16174fe8

==> Vault server started! Log data will stream in below:

2022-11-28T17:08:26.928Z [INFO]  proxy environment: http_proxy="" https_proxy="" no_proxy=""
2022-11-28T17:08:27.720Z [WARN]  ha.consul: appending trailing forward slash to path
2022-11-28T17:08:27.909Z [INFO]  core: Initializing version history cache for core
2022-11-28T17:08:27.913Z [INFO]  core: stored unseal keys supported, attempting fetch
2022-11-28T17:08:28.076Z [INFO]  core.cluster-listener.tcp: starting listener: listener_address=[::]:8201
2022-11-28T17:08:28.076Z [INFO]  core.cluster-listener: serving cluster requests: cluster_listen_address=[::]:8201
2022-11-28T17:08:28.080Z [INFO]  core: vault is unsealed
2022-11-28T17:08:28.082Z [INFO]  core: entering standby mode
2022-11-28T17:08:28.138Z [INFO]  core: unsealed with stored key
2022-11-28T17:40:24.291Z [ERROR] core: failed to acquire lock: error="failed to read lock: Unexpected response code: 500 (rpc error making call: EOF)"

vault-0

==> Vault server configuration:

              HA Storage: consul
             Api Address: http://10.110.205.115:8200
                     Cgo: disabled
         Cluster Address: https://vault-0.vault-internal:8201
              Go Version: go1.17.13
              Listener 1: tcp (addr: "[::]:8200", cluster address: "[::]:8201", max_request_duration: "1m30s", max_request_size: "33554432", tls: "disabled")
               Log Level: info
                   Mlock: supported: true, enabled: false
           Recovery Mode: false
                 Storage: s3
                 Version: Vault v1.11.3, built 2022-08-26T10:27:10Z
             Version Sha: 17250b25303c6418c283c95b1d5a9c9f16174fe8

==> Vault server started! Log data will stream in below:

2022-10-25T16:05:26.019Z [INFO]  proxy environment: http_proxy="" https_proxy="" no_proxy=""
2022-10-25T16:05:26.791Z [WARN]  ha.consul: appending trailing forward slash to path
2022-10-25T16:05:26.931Z [INFO]  core: Initializing version history cache for core
2022-10-25T16:05:26.936Z [INFO]  core: stored unseal keys supported, attempting fetch
2022-10-25T16:05:27.078Z [INFO]  core.cluster-listener.tcp: starting listener: listener_address=[::]:8201
2022-10-25T16:05:27.078Z [INFO]  core.cluster-listener: serving cluster requests: cluster_listen_address=[::]:8201
2022-10-25T16:05:27.082Z [INFO]  core: vault is unsealed
2022-10-25T16:05:27.083Z [INFO]  core: entering standby mode
2022-10-25T16:05:27.126Z [INFO]  core: unsealed with stored key
2022-10-26T01:03:24.210Z [ERROR] core: failed to acquire lock: error="failed to read lock: Unexpected response code: 500 (rpc error getting client: failed to get conn: dial tcp <nil>->10.110.220.235:8300: connect: connection refused)"

vault-2

Defaulted container "vault" out of: vault, audit-log-rotator, fluent-bit, artifactory-secrets-plugin (init)
==> Vault server configuration:

              HA Storage: consul
             Api Address: http://10.110.214.90:8200
                     Cgo: disabled
         Cluster Address: https://vault-2.vault-internal:8201
              Go Version: go1.17.13
              Listener 1: tcp (addr: "[::]:8200", cluster address: "[::]:8201", max_request_duration: "1m30s", max_request_size: "33554432", tls: "disabled")
               Log Level: info
                   Mlock: supported: true, enabled: false
           Recovery Mode: false
                 Storage: s3
                 Version: Vault v1.11.3, built 2022-08-26T10:27:10Z
             Version Sha: 17250b25303c6418c283c95b1d5a9c9f16174fe8

2022-11-27T17:05:34.840Z [INFO]  proxy environment: http_proxy="" https_proxy="" no_proxy=""
2022-11-27T17:05:35.713Z [WARN]  ha.consul: appending trailing forward slash to path
2022-11-27T17:05:35.844Z [INFO]  core: Initializing version history cache for core
2022-11-27T17:05:35.857Z [INFO]  core: stored unseal keys supported, attempting fetch
==> Vault server started! Log data will stream in below:

2022-11-27T17:05:35.971Z [INFO]  core.cluster-listener.tcp: starting listener: listener_address=[::]:8201
2022-11-27T17:05:35.971Z [INFO]  core.cluster-listener: serving cluster requests: cluster_listen_address=[::]:8201
2022-11-27T17:05:35.973Z [INFO]  core: vault is unsealed
2022-11-27T17:05:35.975Z [INFO]  core: entering standby mode
2022-11-27T17:05:36.067Z [INFO]  core: unsealed with stored key
2022-11-27T17:10:13.728Z [INFO]  core: acquired lock, enabling active operation
2022-11-27T17:10:13.993Z [INFO]  core: post-unseal setup starting
2022-11-27T17:10:14.069Z [INFO]  core: loaded wrapping token key
2022-11-27T17:10:14.141Z [INFO]  core: upgrading plugin information: plugins=[]
2022-11-27T17:10:14.141Z [INFO]  core: successfully setup plugin catalog: plugin-directory=/vault/plugins
2022-11-27T17:10:14.371Z [WARN]  secrets.artifactory.artifactory_458cff14.artifactory: error closing client during Kill: metadata=true err="rpc error: code = Canceled desc = grpc: the client connection is closing"
2022-11-27T17:10:14.371Z [WARN]  secrets.artifactory.artifactory_458cff14.artifactory: plugin failed to exit gracefully: metadata=true
2022-11-27T17:10:14.373Z [ERROR] secrets.artifactory.artifactory_458cff14.artifactory: plugin process exited: metadata=true path=/vault/plugins/artifactory pid=754 error="signal: killed"
2022-11-27T17:10:14.373Z [INFO]  core: successfully mounted backend: type=artifactory path=artifactory/
2022-11-27T17:10:14.379Z [INFO]  core: successfully mounted backend: type=identity path=identity/
2022-11-27T17:10:14.401Z [INFO]  core: successfully mounted backend: type=consul path=consul/
2022-11-27T17:10:14.442Z [INFO]  core: successfully mounted backend: type=kv path=workload/
2022-11-27T17:10:14.469Z [INFO]  core: successfully mounted backend: type=kv path=platform/
2022-11-27T17:10:14.493Z [INFO]  core: successfully mounted backend: type=kv path=kv/
2022-11-27T17:10:14.494Z [INFO]  core: successfully mounted backend: type=system path=sys/
2022-11-27T17:10:14.517Z [INFO]  core: successfully mounted backend: type=pki path=pki/
2022-11-27T17:10:14.517Z [INFO]  core: successfully mounted backend: type=cubbyhole path=cubbyhole/
2022-11-27T17:10:14.538Z [INFO]  core: successfully mounted backend: type=aws path=plt-aws/root/
2022-11-27T17:10:14.538Z [INFO]  core: successfully mounted backend: type=aws path=aws/9998-wkl-a/
2022-11-27T17:10:14.538Z [INFO]  core: successfully mounted backend: type=aws path=aws/9998-wkl-t/
2022-11-27T17:10:14.538Z [INFO]  core: successfully mounted backend: type=aws path=aws/9998-wkl-p/
2022-11-27T17:10:14.538Z [INFO]  core: successfully mounted backend: type=aws path=aws/9998-wkl-d/
2022-11-27T17:10:14.780Z [INFO]  core: successfully enabled credential backend: type=token path=token/ namespace="ID: root. Path: "
2022-11-27T17:10:14.801Z [INFO]  core: successfully enabled credential backend: type=approle path=approle/ namespace="ID: root. Path: "
2022-11-27T17:10:14.823Z [INFO]  core: successfully enabled credential backend: type=okta path=okta/ namespace="ID: root. Path: "
2022-11-27T17:10:14.845Z [INFO]  core: successfully enabled credential backend: type=jwt path=gitlab-jwt/ namespace="ID: root. Path: "
2022-11-27T17:10:14.855Z [INFO]  core: successfully enabled credential backend: type=aws path=aws/ namespace="ID: root. Path: "
2022-11-27T17:10:14.906Z [INFO]  core: restoring leases
2022-11-27T17:10:14.908Z [INFO]  rollback: starting rollback manager
2022-11-27T17:10:15.197Z [INFO]  expiration: lease restore complete
2022-11-27T17:10:15.266Z [INFO]  identity: entities restored
2022-11-27T17:10:15.281Z [INFO]  identity: groups restored
2022-11-27T17:10:15.374Z [WARN]  core.cluster-listener: no TLS config found for ALPN: ALPN=["req_fw_sb-act_v1"]
2022-11-27T17:10:15.481Z [INFO]  core: usage gauge collection is disabled
2022-11-27T17:10:15.574Z [INFO]  core: post-unseal setup complete
2022-11-27T17:11:15.593Z [INFO]  expiration: revoked lease: lease_id=sys/wrapping/wrap/h2f6eb02553a89833c00c6475958c4f1cfdb45d039a4e8d959381303ba39db82d
==> Vault reload triggered
2022-11-27T17:30:00.639Z [INFO]  audit: reloading file audit backend: path=file/
==> Vault reload triggered
2022-11-27T18:00:00.589Z [INFO]  audit: reloading file audit backend: path=file/
==> Vault reload triggered
2022-11-27T18:30:00.601Z [INFO]  audit: reloading file audit backend: path=file/
==> Vault reload triggered
2022-11-27T19:00:00.608Z [INFO]  audit: reloading file audit backend: path=file/
2022-11-27T19:30:00.617Z [INFO]  audit: reloading file audit backend: path=file/
==> Vault reload triggered
2022-11-27T20:00:00.641Z [INFO]  audit: reloading file audit backend: path=file/
==> Vault reload triggered
==> Vault reload triggered
2022-11-27T20:30:00.649Z [INFO]  audit: reloading file audit backend: path=file/
==> Vault reload triggered
2022-11-27T21:00:00.658Z [INFO]  audit: reloading file audit backend: path=file/
==> Vault reload triggered
2022-11-27T21:30:00.665Z [INFO]  audit: reloading file audit backend: path=file/
==> Vault reload triggered
2022-11-27T22:00:00.672Z [INFO]  audit: reloading file audit backend: path=file/
==> Vault reload triggered
2022-11-27T22:30:00.679Z [INFO]  audit: reloading file audit backend: path=file/
==> Vault reload triggered
2022-11-27T23:00:00.686Z [INFO]  audit: reloading file audit backend: path=file/
==> Vault reload triggered
2022-11-27T23:30:00.693Z [INFO]  audit: reloading file audit backend: path=file/
==> Vault reload triggered
2022-11-28T00:00:00.703Z [INFO]  audit: reloading file audit backend: path=file/
==> Vault reload triggered

@maxb Here are the logs from the three Vault pods. Can you please help?

@maxb At present we have 1 active and 2 standby nodes using Consul as HA storage, and failover took 45 seconds… Can we run Vault with two active nodes and one standby?

Vault is a leadership-based single-master system: you can have as many standby nodes as you like, but only ever one active node, which is elected from amongst all the nodes (i.e. you don't choose which node is active).

You’ve given 3 sets of logs, but they don’t cover the same time periods, and none of them shows a complete failover: a remote node ceasing to be active, followed by the node writing the logs stepping up to become active.

It may be advantageous to increase the log level to DEBUG too.
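With the official chart, one way to do this (a sketch; I'm assuming the chart's `server.logLevel` value applies to your release) is:

```yaml
server:
  logLevel: "debug"   # alternatively, set log_level = "debug" in the server config stanza
```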

Also, could you give more details about your Consul setup? Some of the log lines mention Vault not getting a valid response from Consul:
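Output along these lines would help show the state of the Consul cluster; the pod name below is an assumption based on the `vault-consul-server:8500` address in your config:

```shell
# Show Consul cluster membership and Raft peer health.
# Pod name "vault-consul-server-0" and the "vault" namespace are assumptions.
kubectl exec -n vault vault-consul-server-0 -- consul members
kubectl exec -n vault vault-consul-server-0 -- consul operator raft list-peers
```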