HA is not really HA - vault could not elect a new master and failed - barrier init check failed

Alegloop · June 28, 2020, 10:58am

Vault version: 1.2.2
GKE cluster version 1.14

Our HA Vault cluster on GKE failed because a new master was not elected. We run a private Vault cluster with HA settings on GKE, deployed as a StatefulSet with 3 replicas. To restore the cluster we had to scale it to 0 and back to 3; adding additional nodes or restarting the existing ones did not help.
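For anyone who wants to reproduce the recovery step, here is a rough sketch of the scale-down and scale-up we did. The StatefulSet name `vault` and the namespace `default` are placeholders, not our real values, and the helper names are made up for illustration. By default it only prints the `kubectl` commands instead of running them.

```python
import subprocess


def scale_commands(statefulset, namespace, replicas=3):
    """Build the kubectl commands that scale a StatefulSet to 0 and back up.

    `statefulset` and `namespace` are whatever your deployment uses;
    the values in the example below are placeholders.
    """
    base = ["kubectl", "--namespace", namespace, "scale", "statefulset", statefulset]
    return [base + ["--replicas=0"], base + [f"--replicas={replicas}"]]


def restart_by_scaling(statefulset, namespace, replicas=3, dry_run=True):
    """Print the commands (dry_run=True) or execute them in order."""
    cmds = scale_commands(statefulset, namespace, replicas)
    for cmd in cmds:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if kubectl fails
            subprocess.run(cmd, check=True)
    return cmds


if __name__ == "__main__":
    restart_by_scaling("vault", "default")  # placeholder names, dry run only
```

In practice you would probably want to wait until all pods have terminated before scaling back up; this sketch does not do that.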

These are the logs:

a. Error received on our application side:
“local node not active but active cluster node not found”

b. Vault side logs:
{“log”:“xxxxxT15:03:00.111Z [WARN] core: leadership lost, stopping active operation\n”,“stream”:“stderr”}

c. After looking at the logs from the days before Vault failed, we saw this entry:

{“log”:“xxxxxxxT15:02:59.960Z [ERROR] core: barrier init check failed: error=“failed to check for initialization: failed to read object: Get https://www.googleapis.com/storage/v1/b//o?alt=json\u0026delimiter=%2F\u0026pageToken=\u0026prefix=core%2F\u0026prettyPrint=false\u0026projection=full\u0026versions=false: read tcp x.x.x.x:43102-\u003ex.x.x.x:443: read: connection reset by peer”\n”,“stream”:“stderr”,“time”:“xxxxxxT15:02:59.964582898Z”}
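To catch the "no active node" state from (a) earlier, one option is to poll Vault's `/v1/sys/leader` endpoint on every replica and alert when none of them reports a leader. Below is a minimal sketch of that idea. The function names are invented for illustration, the node addresses are whatever your pods expose, and I am assuming that an empty `leader_address` means the node does not know of an active node.

```python
import json
import urllib.request


def fetch_leader_status(addr, timeout=5.0):
    """GET /v1/sys/leader from one Vault node and return the parsed JSON.

    `addr` is a base URL such as "https://vault-0.example:8200" (placeholder).
    A sealed or unreachable node will raise an exception here.
    """
    with urllib.request.urlopen(f"{addr}/v1/sys/leader", timeout=timeout) as resp:
        return json.load(resp)


def has_active_leader(status):
    """True if this node runs in HA mode and knows the active node's address."""
    return bool(status.get("ha_enabled")) and bool(status.get("leader_address"))


def cluster_is_leaderless(statuses):
    """True if no node in the list reports an active leader."""
    return not any(has_active_leader(s) for s in statuses)
```

With a private CA you would also need to pass an SSL context to `urlopen`. The check itself is just `cluster_is_leaderless([fetch_leader_status(a) for a in addrs])`, run on a schedule.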

We partially followed this Google guide when setting up Vault:
https://codelabs.developers.google.com/codelabs/vault-on-gke/index.html

A few people confirmed that our setup was configured correctly.

We believe, but are not 100% sure, that this is a relevant issue:

Is it really a Go language bug, and as a result Vault can't be 100% HA?
Has anyone had a similar problem and can share more details?
How can we prevent this from happening again?

Thanks

EDIT:

We also contacted Google, and they think it is a Go language problem too.
