Vault version: 1.2.2
GKE cluster version 1.14
We had HA vault GKE cluster fail because a new master was not elected. We run private vault cluster with HA settings on GKE with 3 replicas as a statefulset deployment. To restore vault cluster we had to scale it to 0 and back to 3 - adding additional nodes or restarting them did not help.
This are the logs:
a. Error received on our application side:
“local node not active but active cluster node not found”
b. Vault side logs:
{“log”:“xxxxxT15:03:00.111Z [WARN] core: leadership lost, stopping active operation\n”,“stream”:“stderr”}
c. After looking at logs from days prior to vault failing, we saw this logs:
{“log”:“xxxxxxxT15:02:59.960Z [ERROR] core: barrier init check failed: error=“failed to check for initialization: failed to read object: Get https://www.googleapis.com/storage/v1/b//o?alt=json\u0026delimiter=%2F\u0026pageToken=\u0026prefix=core%2F\u0026prettyPrint=false\u0026projection=full\u0026versions=false: read tcp x.x.x.x:43102-\u003ex.x.x.x:443: read: connection reset by peer”\n”,“stream”:“stderr”,“time”:“xxxxxxT15:02:59.964582898Z”}
We were partially following this google’s guide when creating vault:
https://codelabs.developers.google.com/codelabs/vault-on-gke/index.html
Few people confirmed our setup was configured OK.
We believe but not 100% sure this is a relevant issue:
Is it really Go language bug and as a result vault cant be 100% HA?
Did someone have similar problem and can share more details?
How to prevent this from happening again?
Thanks
EDIT:
We also contacted google and they think its Go language problem too