Vault sealed: failed to load local aliases from storage

Today we experienced a problem with our Vault (1.9.2), installed on EKS (1.21) and using Consul (1.11.2) as the storage backend. During the night, without any changes being applied, Vault got sealed.

After restarting all Vault pods, the logs showed the following:

2022-05-10T07:51:55.062Z [ERROR] core: post-unseal setup failed: error="failed to load local aliases from storage: failed to read packed storage item: failed to read packed storage entry: Get \"http://10.14.37.10:8500/v1/kv/vault/logical/1dcddd60-bf76-19bd-1846-7eb72266581b/packer/local-aliases/buckets/94\": read tcp 10.14.37.24:47986->10.14.37.10:8500: read: connection reset by peer"
2022-05-10T07:51:55.064Z [ERROR] core: error checking health: error="Get \"http://10.14.37.10:8500/v1/kv/vault/core/cluster/local/info\": dial tcp 10.14.37.10:8500: connect: connection refused"
2022-05-10T07:51:55.064Z [ERROR] core: error checking health: error="Get \"http://10.14.37.10:8500/v1/kv/vault/core/cluster/local/info\": dial tcp 10.14.37.10:8500: connect: connection refused"
2022-05-10T07:51:55.065Z [ERROR] core: error checking health: error="Get \"http://10.14.37.10:8500/v1/kv/vault/core/cluster/local/info\": dial tcp 10.14.37.10:8500: connect: connection refused"
2022-05-10T07:51:55.065Z [ERROR] core: error checking health: error="Get \"http://10.14.37.10:8500/v1/kv/vault/core/cluster/local/info\": dial tcp 10.14.37.10:8500: connect: connection refused"
2022-05-10T07:51:55.065Z [DEBUG] core: finished triggering standbyStopCh for runStandby
2022-05-10T07:51:55.065Z [DEBUG] core: shutting down periodic key rotation checker
2022-05-10T07:51:55.070Z [DEBUG] core: shutting down periodic leader refresh
2022-05-10T07:51:55.070Z [DEBUG] core: shutting down leader elections
2022-05-10T07:51:55.067Z [ERROR] core: key rotation periodic upgrade check failed: error="Get \"http://10.14.37.10:8500/v1/kv/vault/core/upgrade/1\": dial tcp 10.14.37.10:8500: connect: connection refused"
2022-05-10T07:51:55.150Z [DEBUG] core: stop channel triggered in runStandby
2022-05-10T07:51:55.150Z [DEBUG] core: runStandby done
2022-05-10T07:51:55.150Z [INFO]  core: stopping cluster listeners
2022-05-10T07:51:55.150Z [INFO]  core.cluster-listener: forwarding rpc listeners stopped
2022-05-10T07:51:55.404Z [INFO]  core.cluster-listener: rpc listeners successfully shut down
2022-05-10T07:51:55.404Z [INFO]  core: cluster listeners successfully shut down
2022-05-10T07:51:55.404Z [DEBUG] core: sealing barrier
2022-05-10T07:51:55.430Z [INFO]  core: vault is sealed

The only solution we found to keep Vault unsealed was to configure Vault to talk directly to the consul-server, instead of to the Consul agents:

storage "consul" {
  path = "vault"
  address = "consul-server:8500"
  token = "xxxxxhiddenxxxxxxxx"
}

We know this is not a recommended practice, but the recommended configuration (pointing Vault at the local Consul agent) no longer works:

storage "consul" {
  path = "vault"
  address = "HOST_IP:8500"
  token = "xxxxxxxxxxxhiddenxxxxxxx"
}
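To confirm whether the agent address Vault is pointed at is actually usable as a storage backend, one simple check is whether the Consul HTTP API on that address responds and reports a raft leader. Below is a minimal, hedged sketch of such a check; it uses Consul's real `/v1/status/leader` endpoint, but the function name, retry parameters, and the injectable `fetch` hook are illustrative choices, not anything from Vault or Consul itself.

```python
import json
import time
import urllib.request


def consul_has_leader(base_url, fetch=None, retries=3, delay=1.0):
    """Return True if the Consul HTTP API at base_url reports a raft leader.

    `fetch` is injectable so the logic can be tested without a live agent;
    by default it performs a real HTTP GET against the agent.
    """
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url, timeout=2) as resp:
                return resp.read().decode()

    url = base_url.rstrip("/") + "/v1/status/leader"
    for _ in range(retries):
        try:
            # /v1/status/leader returns a JSON string like "10.14.35.123:8300",
            # or "" when no leader is currently elected.
            leader = json.loads(fetch(url))
            if leader:
                return True
        except OSError:
            # Agent unreachable: connection refused/reset, as in the Vault logs.
            pass
        time.sleep(delay)
    return False
```

Running this against both `consul-server:8500` and the host-IP agent address from inside a Vault pod would show whether the difference is reachability of the agent itself or the agent having lost its servers.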

Any idea of what might have happened?

Your Vault logs are just saying “I can no longer talk to Consul”.

To investigate what happened to the Consul agent you should now look at its logs.

Hello maxb, thanks for your answer.

Yes indeed. Here are the logs of one of the Consul agents after it restarted:

==> Starting Consul agent...
           Version: '1.11.2'
           Node ID: 'b36b29c3-c948-9de6-dabb-386075aa873a'
         Node name: 'ip-10-14-37-154.eu-west-1.compute.internal'
        Datacenter: 'xxxxx-consul' (Segment: '')
            Server: false (Bootstrap: false)
       Client Addr: [0.0.0.0] (HTTP: 8500, HTTPS: -1, gRPC: 8502, DNS: 8600)
      Cluster Addr: 10.14.37.170 (LAN: 8301, WAN: 8302)
           Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false, Auto-Encrypt-TLS: false

==> Log data will now stream in as it occurs:

2022-05-10T07:52:44.929Z [WARN]  agent: Node name "ip-10-14-37-154.eu-west-1.compute.internal" will not be discoverable via DNS due to invalid characters. Valid characters include all alpha-numerics and dashes.
2022-05-10T07:52:45.224Z [WARN]  agent.auto_config: Node name "ip-10-14-37-154.eu-west-1.compute.internal" will not be discoverable via DNS due to invalid characters. Valid characters include all alpha-numerics and dashes.
2022-05-10T07:52:45.226Z [INFO]  agent.client.serf.lan: serf: EventMemberJoin: ip-10-14-37-154.eu-west-1.compute.internal 10.14.37.170
2022-05-10T07:52:45.226Z [INFO]  agent.router: Initializing LAN area manager
2022-05-10T07:52:45.226Z [INFO]  agent.client.serf.lan: serf: Attempting re-join to previously known node: consul-server-0: 10.14.35.123:8301
2022-05-10T07:52:45.226Z [INFO]  agent: Started DNS server: address=0.0.0.0:8600 network=tcp
2022-05-10T07:52:45.226Z [INFO]  agent: Started DNS server: address=0.0.0.0:8600 network=udp
2022-05-10T07:52:45.325Z [INFO]  agent: Starting server: address=[::]:8500 network=tcp protocol=http
2022-05-10T07:52:45.325Z [WARN]  agent: DEPRECATED Backwards compatibility with pre-1.9 metrics enabled. These metrics will be removed in a future version of Consul. Set `telemetry { disable_compat_1.9 = true }` to disable them.
2022-05-10T07:52:45.325Z [INFO]  agent: Started gRPC server: address=[::]:8502 network=tcp
2022-05-10T07:52:45.325Z [INFO]  agent: Retry join is supported for the following discovery methods: cluster=LAN discovery_methods="aliyun aws azure digitalocean gce k8s linode mdns os packet scaleway softlayer tencentcloud triton vsphere"
2022-05-10T07:52:45.325Z [INFO]  agent: Joining cluster...: cluster=LAN
2022-05-10T07:52:45.325Z [INFO]  agent: (LAN) joining: lan_addresses=[consul-server-0.consul-server.vault.svc:8301, consul-server-1.consul-server.vault.svc:8301, consul-server-2.consul-server.vault.svc:8301, consul-server-3.consul-server.vault.svc:8301]
2022-05-10T07:52:45.325Z [INFO]  agent: started state syncer
2022-05-10T07:52:45.325Z [INFO]  agent: Consul agent running!
2022-05-10T07:52:45.325Z [WARN]  agent.router.manager: No servers available
2022-05-10T07:52:45.325Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No known Consul servers"
2022-05-10T07:52:45.326Z [INFO]  agent.client.serf.lan: serf: EventMemberJoin: consul-server-0 10.14.35.123
2022-05-10T07:52:45.326Z [INFO]  agent.client.serf.lan: serf: EventMemberJoin: consul-server-2 10.14.37.105
2022-05-10T07:52:45.326Z [WARN]  agent.client.memberlist.lan: memberlist: Refuting a suspect message (from: ip-10-14-37-154.eu-west-1.compute.internal)
2022-05-10T07:52:45.326Z [INFO]  agent.client: adding server: server="consul-server-0 (Addr: tcp/10.14.35.123:8300) (DC: hiddenname-consul)"
2022-05-10T07:52:45.326Z [INFO]  agent.client: adding server: server="consul-server-2 (Addr: tcp/10.14.37.105:8300) (DC: hiddenname-consul)"
2022-05-10T07:52:45.326Z [INFO]  agent.client.serf.lan: serf: EventMemberJoin: ip-10-14-36-142.eu-west-1.compute.internal 10.14.36.251
2022-05-10T07:52:45.326Z [INFO]  agent.client.serf.lan: serf: EventMemberJoin: consul-server-3 10.14.37.34
2022-05-10T07:52:45.327Z [INFO]  agent.client.serf.lan: serf: EventMemberJoin: ip-10-14-37-10.eu-west-1.compute.internal 10.14.37.103
2022-05-10T07:52:45.327Z [INFO]  agent.client: adding server: server="consul-server-3 (Addr: tcp/10.14.37.34:8300) (DC: hiddenname-consul)"
2022-05-10T07:52:45.327Z [INFO]  agent.client.serf.lan: serf: EventMemberJoin: consul-server-1 10.14.36.175
2022-05-10T07:52:45.327Z [WARN]  agent: grpc: addrConn.createTransport failed to connect to {hiddenname-consul-10.14.35.123:8300 0 consul-server-0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp <nil>->10.14.35.123:8300: operation was canceled". Reconnecting...
2022-05-10T07:52:45.327Z [INFO]  agent.client.serf.lan: serf: EventMemberJoin: ip-10-14-35-148.eu-west-1.compute.internal 10.14.35.170
2022-05-10T07:52:45.327Z [INFO]  agent.client: adding server: server="consul-server-1 (Addr: tcp/10.14.36.175:8300) (DC: hiddenname-consul)"
2022-05-10T07:52:45.327Z [INFO]  agent.client.serf.lan: serf: Re-joined to previously known node: consul-server-0: 10.14.35.123:8301
2022-05-10T07:52:45.327Z [WARN]  agent: grpc: addrConn.createTransport failed to connect to {hiddenname-consul-10.14.35.123:8300 0 consul-server-0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp <nil>->10.14.35.123:8300: operation was canceled". Reconnecting...
2022-05-10T07:52:45.573Z [INFO]  agent: (LAN) joined: number_of_nodes=4
2022-05-10T07:52:45.573Z [INFO]  agent: Join cluster completed. Synced with initial agents: cluster=LAN num_agents=4
2022-05-10T07:52:47.624Z [INFO]  agent: Synced node info
2022-05-10T07:53:05.824Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/metrics from=10.14.37.183:39634 error="Permission denied"
2022-05-10T07:53:12.758Z [INFO]  agent: Newer Consul version available: new_version=1.12.0 current_version=1.11.2
2022-05-10T07:53:33.453Z [INFO]  agent.client.serf.lan: serf: EventMemberLeave: ip-10-14-36-142.eu-west-1.compute.internal 10.14.36.251
2022-05-10T07:53:35.734Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/metrics from=10.14.37.183:39634 error="Permission denied"
2022-05-10T07:53:49.889Z [INFO]  agent.client.memberlist.lan: memberlist: Updating address for left or failed node ip-10-14-36-142.eu-west-1.compute.internal from 10.14.36.251:8301 to 10.14.36.35:8301
2022-05-10T07:53:49.889Z [INFO]  agent.client.serf.lan: serf: EventMemberJoin: ip-10-14-36-142.eu-west-1.compute.internal 10.14.36.35
2022-05-10T07:54:05.735Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/metrics from=10.14.37.183:39634 error="Permission denied"

The most relevant message seems to be this one (10.14.35.123 is one of the server pods, consul-server-0):

2022-05-10T07:52:45.327Z [WARN]  agent: grpc: addrConn.createTransport failed to connect to {hiddenname-consul-10.14.35.123:8300 0 consul-server-0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp <nil>->10.14.35.123:8300: operation was canceled". Reconnecting...

And here are the ERROR logs of one of the consul-server pods:

$ k logs consul-server-0 -f | grep ERROR
2022-05-10T06:53:17.551Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/metrics from=10.14.37.183:57698 error="Permission denied"
2022-05-10T06:53:47.551Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/metrics from=10.14.37.183:57698 error="Permission denied"
2022-05-10T06:54:08.456Z [ERROR] agent.server.memberlist.lan: memberlist: Failed fallback ping: read tcp 10.14.35.123:36692->10.14.36.140:8301: i/o timeout
2022-05-10T06:54:17.551Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/metrics from=10.14.37.183:57698 error="Permission denied"
2022-05-10T06:56:17.551Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/metrics from=10.14.37.183:57698 error="Permission denied"
2022-05-10T06:56:30.457Z [ERROR] agent.server.memberlist.lan: memberlist: Failed fallback ping: read tcp 10.14.35.123:40958->10.14.37.101:8301: i/o timeout
2022-05-10T06:56:30.725Z [ERROR] agent.server.rpc: multiplex conn accept failed: conn=from=10.14.37.101:36478 error="read tcp 10.14.35.123:8300->10.14.37.101:36478: read: connection reset by peer"
2022-05-10T06:56:47.550Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/metrics from=10.14.37.183:57698 error="Permission denied"
2022-05-10T07:03:17.551Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/metrics from=10.14.37.183:57698 error="Permission denied"
2022-05-10T07:03:26.455Z [ERROR] agent.server.memberlist.lan: memberlist: Failed fallback ping: read tcp 10.14.35.123:47460->10.14.37.101:8301: i/o timeout
2022-05-10T07:03:47.550Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/metrics from=10.14.37.183:57698 error="Permission denied"
2022-05-10T07:04:06.456Z [ERROR] agent.server.memberlist.lan: memberlist: Failed fallback ping: read tcp 10.14.35.123:45946->10.14.36.140:8301: i/o timeout
2022-05-10T07:04:17.550Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/metrics from=10.14.37.183:57698 error="Permission denied"
2022-05-10T07:17:17.551Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/metrics from=10.14.37.183:57698 error="Permission denied"
2022-05-10T07:17:40.025Z [ERROR] agent.server.memberlist.lan: memberlist: Received invalid msgType (71) from=10.14.35.123:38254
2022-05-10T07:17:47.551Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/metrics from=10.14.37.183:57698 error="Permission denied"
2022-05-10T07:17:54.590Z [ERROR] agent.server.memberlist.lan: memberlist: Received invalid msgType (71) from=10.14.35.123:38438
2022-05-10T07:18:17.551Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/metrics from=10.14.37.183:57698 error="Permission denied"
2022-05-10T07:49:17.551Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/metrics from=10.14.37.183:57698 error="Permission denied"
2022-05-10T07:49:25.318Z [ERROR] agent.server.rpc: Problem peeking connection: conn=from=10.14.35.170:60444 err=EOF

It seems there was no communication between the agent and server pods.
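The "Failed fallback ping" errors above point at the gossip layer: memberlist normally probes peers over UDP and only falls back to TCP, so seeing the TCP fallback time out suggests both paths between pods were failing. A quick way to narrow this down is to test TCP reachability of the intra-cluster ports from the affected pod. The sketch below is illustrative only; the port numbers are Consul's documented defaults (8300 server RPC, 8301 LAN serf), while the function and the injectable `connect` hook are assumptions for the example.

```python
import socket

# Consul's default intra-cluster ports (TCP side only; serf also uses UDP).
CONSUL_PORTS = {"server_rpc": 8300, "serf_lan": 8301}


def check_tcp_ports(host, ports=CONSUL_PORTS, timeout=2.0, connect=None):
    """Return {name: bool} for TCP reachability of each port on host.

    `connect` is injectable for testing; by default it opens a real TCP
    connection and closes it immediately.
    """
    if connect is None:
        def connect(h, p):
            with socket.create_connection((h, p), timeout=timeout):
                return True

    results = {}
    for name, port in ports.items():
        try:
            results[name] = bool(connect(host, port))
        except OSError:
            results[name] = False
    return results
```

Note this only covers TCP; a UDP-only block (e.g. a security group or network policy dropping UDP 8301) would still break gossip while these checks pass.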

Thanks in advance.

Apparently that warning is expected, according to New WARN in 1.10.0 caused by shuffling the servers in the gRPC ClientConn pool · Issue #10603 · hashicorp/consul · GitHub

Yes indeed. You’d better investigate/fix that next. Not having access to your network, I have nothing more I can usefully say.

Thanks maxb.

Yes, after a deeper investigation we came to the same conclusion about that message, so we are stuck again.