Vault pod stuck with no `Active Node Address` causing `local node not active but active cluster node not found`

I have a k8s vault deployment that has been working fine for a very long time ( and duplicated to multiple gke clusters ) using GCS as storage and HA_storage

Yesterday after a routine pod restart of one of the vault pods this specific pod has failed to rejoin the cluster properly

➜ vault status
Key                      Value
---                      -----
Recovery Seal Type       shamir
Initialized              true
Sealed                   false
Total Recovery Shares    5
Threshold                3
Version                  1.8.5
Storage Type             gcs
Cluster Name             xxxxx
Cluster ID               xxxxx
HA Enabled               true
HA Cluster               n/a
HA Mode                  standby
Active Node Address      <none>

Normally when i had this kind of issues was always due to failing to access the ha_storage backend and so to discover the other nodes but I confirmed access to GCS is working fine and no errors are printed in the logs.

I started another on-demand pod with TRACE logs enabled and no errors are printed at all, the startup looks ok but then it keeps printing [TRACE] core: found new active node information, refreshing without ever doing anything ( it would seem ) with the information it gathered

==> Vault server configuration:                                                                                                                                                                                    
                                                                                                                                                                                                                   
             Api Address: https://vault.XXXXXX
                     Cgo: disabled                                                                                                                                                                                 
         Cluster Address: https://10.118.72.144:8201                                                                                                                                                               
              Go Version: go1.16.15                                                                                                                                                                                
              Listener 1: tcp (addr: "10.118.72.144:8200", cluster address: "10.118.72.144:8201", max_request_duration: "1m30s", max_request_size: "33554432", tls: "disabled", x_forwarded_for_authorized_addrs: "
[10.118.0.0/16]", x_forwarded_for_hop_skips: "0", x_forwarded_for_reject_not_authorized: "false", x_forwarded_for_reject_not_present: "false")                                                                     
              Listener 2: tcp (addr: "127.0.0.1:8200", cluster address: "127.0.0.1:8201", max_request_duration: "1m30s", max_request_size: "33554432", tls: "disabled")                                            
               Log Level: trace                                                                                                                                                                                    
                   Mlock: supported: true, enabled: true
           Recovery Mode: false
                 Storage: gcs (HA available)
                 Version: Vault v1.8.12
             Version Sha: 20187f81c3cc573f2223ef255a5679d921dfa365

2022-06-22T09:30:17.889Z [INFO]  proxy environment: http_proxy="" https_proxy="" no_proxy=""
2022-06-22T09:30:17.889Z [DEBUG] storage.gcs: configuring backend
2022-06-22T09:30:17.889Z [DEBUG] storage.gcs: creating client
2022-06-22T09:30:17.891Z [DEBUG] storage.gcs: configuration: bucket=XXXXXX chunk_size=8388608 ha_enabled=true max_parallel=0
2022-06-22T09:30:17.891Z [DEBUG] storage.gcs: creating client
2022-06-22T09:30:18.149Z [DEBUG] core: set config: sanitized config="{\"api_addr\":\"https://vault.XXXXX\",\"cache_size\":0,\"cluster_addr\":\"http://10.118.72.144:8201\",\"cluste
r_cipher_suites\":\"\",\"cluster_name\":\"XXXXXX\",\"default_lease_ttl\":0,\"default_max_request_duration\":0,\"disable_cache\":false,\"disable_clustering\":false,\"disable_indexing\":false,\"
disable_mlock\":false,\"disable_performance_standby\":false,\"disable_printable_check\":false,\"disable_sealwrap\":false,\"disable_sentinel_trace\":false,\"enable_response_header_hostname\":false,\"enable_respon
se_header_raft_node_id\":false,\"enable_ui\":true,\"listeners\":[{\"config\":{\"address\":\"10.118.72.144:8200\",\"tls_disable\":\"true\",\"x_forwarded_for_authorized_addrs\":[\"10.118.0.0/16\"],\"x_forwarded_fo
r_reject_not_authorized\":\"false\",\"x_forwarded_for_reject_not_present\":\"false\"},\"type\":\"tcp\"},{\"config\":{\"address\":\"127.0.0.1:8200\",\"tls_disable\":\"true\"},\"type\":\"tcp\"}],\"log_format\":\"u
nspecified\",\"log_level\":\"\",\"max_lease_ttl\":0,\"pid_file\":\"\",\"plugin_directory\":\"\",\"raw_storage_endpoint\":false,\"seals\":[{\"disabled\":false,\"type\":\"gcpckms\"}],\"storage\":{\"cluster_addr\":
\"http://10.118.72.144:8201\",\"disable_clustering\":false,\"redirect_addr\":\"https://vault.XXXXX\",\"type\":\"gcs\"},\"telemetry\":{\"add_lease_metrics_namespace_labels\":false,
\"circonus_api_app\":\"\",\"circonus_api_token\":\"\",\"circonus_api_url\":\"\",\"circonus_broker_id\":\"\",\"circonus_broker_select_tag\":\"\",\"circonus_check_display_name\":\"\",\"circonus_check_force_metric_
activation\":\"\",\"circonus_check_id\":\"\",\"circonus_check_instance_id\":\"\",\"circonus_check_search_tag\":\"\",\"circonus_check_tags\":\"\",\"circonus_submission_interval\":\"\",\"circonus_submission_url\":
\"\",\"disable_hostname\":true,\"dogstatsd_addr\":\"\",\"dogstatsd_tags\":null,\"lease_metrics_epsilon\":3600000000000,\"maximum_gauge_cardinality\":500,\"metrics_prefix\":\"\",\"num_lease_metrics_buckets\":168,
\"prometheus_retention_time\":86400000000000,\"stackdriver_debug_logs\":false,\"stackdriver_location\":\"\",\"stackdriver_namespace\":\"\",\"stackdriver_project_id\":\"\",\"statsd_address\":\"127.0.0.1:8125\",\"
statsite_address\":\"\",\"usage_gauge_period\":600000000000}}"
2022-06-22T09:30:18.149Z [DEBUG] storage.cache: creating LRU cache: size=0
2022-06-22T09:30:18.241Z [DEBUG] cluster listener addresses synthesized: cluster_addresses=[10.118.72.144:8201, 127.0.0.1:8201]
2022-06-22T09:30:18.242Z [DEBUG] would have sent systemd notification (systemd not present): notification=READY=1
2022-06-22T09:30:18.242Z [INFO]  core: stored unseal keys supported, attempting fetch
==> Vault server started! Log data will stream in below:


vault-test-1 vault 2022-06-22T09:50:54.416Z [INFO]  proxy environment: http_proxy="" https_proxy="" no_proxy=""                                                                                                    
vault-test-1 vault 2022-06-22T09:50:54.416Z [DEBUG] storage.gcs: configuring backend                                                                                                                               
vault-test-1 vault 2022-06-22T09:50:54.416Z [DEBUG] storage.gcs: creating client                                                                                                                                   
vault-test-1 vault 2022-06-22T09:50:54.418Z [DEBUG] storage.gcs: configuration: bucket=XXXXXX chunk_size=8388608 ha_enabled=true max_parallel=0                                                 
vault-test-1 vault 2022-06-22T09:50:54.418Z [DEBUG] storage.gcs: creating client                                                                                                                                   
vault-test-1 vault 2022-06-22T09:50:54.711Z [DEBUG] core: set config: sanitized config={"api_addr":"https://vault.XXXXX","cache_size":0,"cluster_addr":"http://10.118.72.145:8201",
"cluster_cipher_suites":"","cluster_name":"XXXXXXX","default_lease_ttl":0,"default_max_request_duration":0,"disable_cache":false,"disable_clustering":false,"disable_indexing":false,"disable_mlo
ck":false,"disable_performance_standby":false,"disable_printable_check":false,"disable_sealwrap":false,"disable_sentinel_trace":false,"enable_response_header_hostname":false,"enable_response_header_raft_node_id"
:false,"enable_ui":true,"listeners":[{"config":{"address":"10.118.72.145:8200","tls_disable":"true","x_forwarded_for_authorized_addrs":["10.118.0.0/16"],"x_forwarded_for_reject_not_authorized":"false","x_forward
ed_for_reject_not_present":"false"},"type":"tcp"},{"config":{"address":"127.0.0.1:8200","tls_disable":"true"},"type":"tcp"}],"log_format":"unspecified","log_level":"","max_lease_ttl":0,"pid_file":"","plugin_dire
ctory":"","raw_storage_endpoint":false,"seals":[{"disabled":false,"type":"gcpckms"}],"storage":{"cluster_addr":"http://10.118.72.145:8201","disable_clustering":false,"redirect_addr":"https://vault.XXXXXXX","type":"gcs"},"telemetry":{"add_lease_metrics_namespace_labels":false,"circonus_api_app":"","circonus_api_token":"","circonus_api_url":"","circonus_broker_id":"","circonus_broker_select_
tag":"","circonus_check_display_name":"","circonus_check_force_metric_activation":"","circonus_check_id":"","circonus_check_instance_id":"","circonus_check_search_tag":"","circonus_check_tags":"","circonus_submi
ssion_interval":"","circonus_submission_url":"","disable_hostname":true,"dogstatsd_addr":"","dogstatsd_tags":null,"lease_metrics_epsilon":3600000000000,"maximum_gauge_cardinality":500,"metrics_prefix":"","num_le
ase_metrics_buckets":168,"prometheus_retention_time":86400000000000,"stackdriver_debug_logs":false,"stackdriver_location":"","stackdriver_namespace":"","stackdriver_project_id":"","statsd_address":"127.0.0.1:812
5","statsite_address":"","usage_gauge_period":600000000000}}                                                                                                                                                       
vault-test-1 vault 2022-06-22T09:50:54.711Z [DEBUG] storage.cache: creating LRU cache: size=0                                                                                                                      
vault-test-1 vault 2022-06-22T09:50:54.844Z [DEBUG] cluster listener addresses synthesized: cluster_addresses=[10.118.72.145:8201, 127.0.0.1:8201]                                                                 
vault-test-1 vault 2022-06-22T09:50:54.845Z [DEBUG] would have sent systemd notification (systemd not present): notification=READY=1                                                                               
vault-test-1 vault 2022-06-22T09:50:54.845Z [INFO]  core: stored unseal keys supported, attempting fetch
vault-test-1 vault 2022-06-22T09:50:54.962Z [DEBUG] core: starting cluster listeners
vault-test-1 vault 2022-06-22T09:50:54.962Z [INFO]  core.cluster-listener.tcp: starting listener: listener_address=10.118.72.145:8201
vault-test-1 vault 2022-06-22T09:50:54.962Z [INFO]  core.cluster-listener.tcp: starting listener: listener_address=127.0.0.1:8201
vault-test-1 vault 2022-06-22T09:50:54.962Z [INFO]  core.cluster-listener: serving cluster requests: cluster_listen_address=10.118.72.145:8201
vault-test-1 vault 2022-06-22T09:50:54.962Z [INFO]  core.cluster-listener: serving cluster requests: cluster_listen_address=127.0.0.1:8201
vault-test-1 vault 2022-06-22T09:50:54.963Z [INFO]  core: vault is unsealed
vault-test-1 vault 2022-06-22T09:50:54.963Z [INFO]  core: unsealed with stored key
vault-test-1 vault 2022-06-22T09:50:54.963Z [INFO]  core: entering standby mode
vault-test-1 vault 2022-06-22T09:50:57.495Z [TRACE] core: found new active node information, refreshing


vault-test-1 vault 2022-06-22T09:50:59.983Z [TRACE] core: found new active node information, refreshing


vault-test-1 vault 2022/06/22 09:51:00 [ERR] Error flushing to statsd! Err: write udp 127.0.0.1:46574->127.0.0.1:8125: write: connection refused
vault-test-1 vault 2022-06-22T09:51:02.480Z [TRACE] core: found new active node information, refreshing
vault-test-1 vault 2022-06-22T09:51:04.988Z [TRACE] core: found new active node information, refreshing
vault-test-1 vault 2022/06/22 09:51:07 [ERR] Error flushing to statsd! Err: write udp 127.0.0.1:51053->127.0.0.1:8125: write: connection refused
vault-test-1 vault 2022-06-22T09:51:07.490Z [TRACE] core: found new active node information, refreshing
vault-test-1 vault 2022-06-22T09:51:09.988Z [TRACE] core: found new active node information, refreshing
vault-test-1 vault 2022-06-22T09:51:10.178Z [TRACE] core: found new active node information, refreshing

vault-test-1 vault 2022-06-22T09:51:12.490Z [TRACE] core: found new active node information, refreshing


vault-test-1 vault 2022/06/22 09:51:13 [ERR] Error flushing to statsd! Err: write udp 127.0.0.1:40666->127.0.0.1:8125: write: connection refused
vault-test-1 vault 2022-06-22T09:51:14.994Z [TRACE] core: found new active node information, refreshing
vault-test-1 vault 2022-06-22T09:51:17.492Z [TRACE] core: found new active node information, refreshing
vault-test-1 vault 2022-06-22T09:51:19.990Z [TRACE] core: found new active node information, refreshing
vault-test-1 vault 2022/06/22 09:51:20 [ERR] Error flushing to statsd! Err: write udp 127.0.0.1:47160->127.0.0.1:8125: write: connection refused
vault-test-1 vault 2022-06-22T09:51:22.495Z [TRACE] core: found new active node information, refreshing
vault-test-1 vault 2022-06-22T09:51:24.991Z [TRACE] core: found new active node information, refreshing
vault-test-1 vault 2022/06/22 09:51:25 [ERR] Error flushing to statsd! Err: write udp 127.0.0.1:46119->127.0.0.1:8125: write: connection refused
vault-test-1 vault 2022-06-22T09:51:27.493Z [TRACE] core: found new active node information, refreshing
vault-test-1 vault 2022-06-22T09:51:29.617Z [TRACE] core: found new active node information, refreshing
vault-test-1 vault 2022-06-22T09:51:30.001Z [TRACE] core: found new active node information, refreshing
vault-test-1 vault 2022/06/22 09:51:32 [ERR] Error flushing to statsd! Err: write udp 127.0.0.1:39911->127.0.0.1:8125: write: connection refused
vault-test-1 vault 2022-06-22T09:51:32.496Z [TRACE] core: found new active node information, refreshing
vault-test-1 vault 2022-06-22T09:51:34.995Z [TRACE] core: found new active node information, refreshing
vault-test-1 vault 2022-06-22T09:51:37.498Z [TRACE] core: found new active node information, refreshing
vault-test-1 vault 2022/06/22 09:51:37 [ERR] Error flushing to statsd! Err: write udp 127.0.0.1:56269->127.0.0.1:8125: write: connection refused
vault-test-1 vault 2022-06-22T09:51:40.004Z [TRACE] core: found new active node information, refreshing
vault-test-1 vault 2022-06-22T09:51:42.492Z [TRACE] core: found new active node information, refreshing
vault-test-1 vault 2022/06/22 09:51:44 [ERR] Error flushing to statsd! Err: write udp 127.0.0.1:55360->127.0.0.1:8125: write: connection refused
vault-test-1 vault 2022-06-22T09:51:44.994Z [TRACE] core: found new active node information, refreshing
vault-test-1 vault 2022-06-22T09:51:47.510Z [TRACE] core: found new active node information, refreshing
vault-test-1 vault 2022-06-22T09:51:49.999Z [TRACE] core: found new active node information, refreshing
vault-test-1 vault 2022/06/22 09:51:50 [ERR] Error flushing to statsd! Err: write udp 127.0.0.1:59818->127.0.0.1:8125: write: connection refused
vault-test-1 vault 2022-06-22T09:51:52.503Z [TRACE] core: found new active node information, refreshing
vault-test-1 vault 2022-06-22T09:51:53.948Z [TRACE] core: found new active node information, refreshing
vault-test-1 vault 2022-06-22T09:51:54.998Z [TRACE] core: found new active node information, refreshing
vault-test-1 vault 2022/06/22 09:51:56 [ERR] Error flushing to statsd! Err: write udp 127.0.0.1:57403->127.0.0.1:8125: write: connection refused
vault-test-1 vault 2022-06-22T09:51:57.519Z [TRACE] core: found new active node information, refreshing
vault-test-1 vault 2022-06-22T09:52:00.003Z [TRACE] core: found new active node information, refreshing
vault-test-1 vault 2022-06-22T09:52:02.500Z [TRACE] core: found new active node information, refreshing
vault-test-1 vault 2022/06/22 09:52:02 [ERR] Error flushing to statsd! Err: write udp 127.0.0.1:41362->127.0.0.1:8125: write: connection refused
vault-test-1 vault 2022-06-22T09:52:05.000Z [TRACE] core: found new active node information, refreshing

for reference this is the configuration of the pod

api_addr     = "https://vault.XXXXX"
cluster_addr = "http://10.118.72.145:8201"
listener "tcp" {
  address     = "10.118.72.145:8200"
  tls_disable = "true"
  x_forwarded_for_authorized_addrs = ["10.118.0.0/16"]
  x_forwarded_for_reject_not_present = "false"
  x_forwarded_for_reject_not_authorized = "false"
}
listener "tcp" {
  address     = "127.0.0.1:8200"
  tls_disable = "true"
}
cluster_name = "XXXXXXXX"
ui           = true
telemetry {
  statsd_address   = "127.0.0.1:8125"
  disable_hostname = true
}
// raw_storage_endpoint = true

storage "gcs" {
  bucket = "XXXXXXXXXX"
  ha_enabled = "true"
}

seal "gcpckms" {
  project     = "XXXXXXXX"
  region      = "global"
  key_ring    = "XXXXXX"
  crypto_key  = "XXXXXXX"
}

any help would be extremely appreciated.

While investigated i noticed that this is the only cluster where in the GCS bucket i don’t have a core/leader/XXXXX file … something / someone must have deleted that file causing the new pods to fail to identify the leader

Hopefully a step down / failvoer of the active master will recreate that file

For reference, on vault operator step-down on the active master the missing file in the bucket was recreated and the whol HA setup worked.

nice find, I wouldn’t have thought of looking at the leaders’s storage.