Hey folks,
I have a 3-node HA cluster of Vault OSS v1.5.2 in Azure with an Azure LB (regular, not App Gateway) and MySQL storage backend. The VMs are in a scale set, if it matters.
Individual nodes come up fine and form a cluster as expected. I haven’t really tested Vault’s functionality itself yet, but I assume that part works.
My problem, as the title says, is that a new leader appears to get elected every few minutes.
This is what I see in the leader node log when it loses leadership:
Aug 28 12:15:06 vault-vm000008 vault[9964]: 2020-08-28T12:15:06.956Z [WARN] core: leadership lost, stopping active operation
Aug 28 12:15:06 vault-vm000008 vault[9964]: 2020-08-28T12:15:06.956Z [INFO] core: pre-seal teardown starting
Aug 28 12:15:07 vault-vm000008 vault[9964]: 2020-08-28T12:15:07.456Z [INFO] rollback: stopping rollback manager
Aug 28 12:15:07 vault-vm000008 vault[9964]: 2020-08-28T12:15:07.456Z [INFO] core: pre-seal teardown complete
Aug 28 12:15:07 vault-vm000008 vault[9964]: [mysql] 2020/08/28 12:15:07 connection.go:135: write tcp 10.51.7.11:43538->13.68.105.208:3306: write: broken pipe
Aug 28 12:15:07 vault-vm000008 vault[9964]: [mysql] 2020/08/28 12:15:07 connection.go:135: write tcp 10.51.7.11:43148->13.68.105.208:3306: write: broken pipe
Aug 28 12:15:07 vault-vm000008 vault[9964]: 2020-08-28T12:15:07.459Z [ERROR] core: unlocking HA lock failed: error="mysql: unable to release lock, already released or not held by this session"
Aug 28 12:15:07 vault-vm000008 vault[9964]: 2020-08-28T12:15:07.460Z [WARN] core.cluster-listener: no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
13.68.105.208 above is the IP address of my Azure MySQL instance.
And this is what I see on the new leader:
Aug 28 12:15:08 vault-vm000007 vault[9921]: 2020-08-28T12:15:08.830Z [INFO] core: acquired lock, enabling active operation
Aug 28 12:15:08 vault-vm000007 vault[9921]: 2020-08-28T12:15:08.860Z [INFO] core: post-unseal setup starting
Aug 28 12:15:08 vault-vm000007 vault[9921]: 2020-08-28T12:15:08.870Z [INFO] core: loaded wrapping token key
Aug 28 12:15:08 vault-vm000007 vault[9921]: 2020-08-28T12:15:08.870Z [INFO] core: successfully setup plugin catalog: plugin-directory=
Aug 28 12:15:08 vault-vm000007 vault[9921]: 2020-08-28T12:15:08.872Z [INFO] core: successfully mounted backend: type=system path=sys/
Aug 28 12:15:08 vault-vm000007 vault[9921]: 2020-08-28T12:15:08.872Z [INFO] core: successfully mounted backend: type=identity path=identity/
Aug 28 12:15:08 vault-vm000007 vault[9921]: 2020-08-28T12:15:08.872Z [INFO] core: successfully mounted backend: type=cubbyhole path=cubbyhole/
Aug 28 12:15:08 vault-vm000007 vault[9921]: 2020-08-28T12:15:08.878Z [INFO] core: successfully enabled credential backend: type=token path=token/
Aug 28 12:15:08 vault-vm000007 vault[9921]: 2020-08-28T12:15:08.879Z [INFO] core: restoring leases
Aug 28 12:15:08 vault-vm000007 vault[9921]: 2020-08-28T12:15:08.879Z [INFO] rollback: starting rollback manager
Aug 28 12:15:08 vault-vm000007 vault[9921]: 2020-08-28T12:15:08.880Z [INFO] expiration: lease restore complete
Aug 28 12:15:08 vault-vm000007 vault[9921]: 2020-08-28T12:15:08.881Z [INFO] identity: entities restored
Aug 28 12:15:08 vault-vm000007 vault[9921]: 2020-08-28T12:15:08.882Z [INFO] identity: groups restored
Aug 28 12:15:09 vault-vm000007 vault[9921]: 2020-08-28T12:15:09.343Z [INFO] core: usage gauge collection is disabled
Aug 28 12:15:09 vault-vm000007 vault[9921]: 2020-08-28T12:15:09.353Z [INFO] core: post-unseal setup complete
And lastly, here’s my config file:
listener "tcp" {
  address       = "127.0.0.1:8200"
  tls_cert_file = "/etc/vault.d/certificate.crt"
  tls_key_file  = "/etc/vault.d/certificate.pem"
}

listener "tcp" {
  address       = "LOCAL_IP_ADDRESS:8200"
  tls_cert_file = "/etc/vault.d/certificate.crt"
  tls_key_file  = "/etc/vault.d/certificate.pem"
}

storage "mysql" {
  address     = "DB_HOST"
  tls_ca_file = "/etc/vault.d/azure_mysql_tls_ca.crt"
  username    = "DB_USERNAME"
  password    = "DB_PASSWORD"
  database    = "DB_NAME"
  ha_enabled  = "true"
}

seal "azurekeyvault" {
  tenant_id  = "${azure_tenant_id}"
  vault_name = "${key_vault_name}"
  key_name   = "${key_vault_key_name}"
}

api_addr     = "https://LOCAL_IP_ADDRESS:8200"
cluster_addr = "https://LOCAL_IP_ADDRESS:8201"
ui           = true
This happens every 120 seconds, so I’m guessing it’s caused by a timeout somewhere, perhaps a MySQL setting. But I don’t know MySQL well at all and wouldn’t know what to look for.
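For what it’s worth, I was planning to poke at the server-side timeout variables with something like this, though I don’t know which of them (if any) would apply to Vault’s connections:

```sql
-- list all timeout-related server variables on the Azure MySQL instance
SHOW VARIABLES LIKE '%timeout%';

-- wait_timeout in particular is how long the server keeps an idle
-- (non-interactive) connection open before killing it
SHOW VARIABLES LIKE 'wait_timeout';
```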
I’ve tried with api_addr set to the hostname of the Azure LB, as well as set to the individual nodes’ IP addresses as shown above. Didn’t help, so I’ve set it back to the Azure LB hostname. I’ve also tried with and without the cluster_addr line. Also didn’t help.
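One thing I haven’t tried yet: the MySQL storage backend docs appear to list some connection-pool tuning options, so maybe recycling connections before whatever is dropping them kicks in would help. If I’m reading the docs right, it would look something like this (the last two settings are my reading of the docs; I haven’t confirmed they’re supported in 1.5.2):

```hcl
storage "mysql" {
  address     = "DB_HOST"
  tls_ca_file = "/etc/vault.d/azure_mysql_tls_ca.crt"
  username    = "DB_USERNAME"
  password    = "DB_PASSWORD"
  database    = "DB_NAME"
  ha_enabled  = "true"

  # NOTE: parameter names taken from the storage backend docs; verify
  # they exist in your Vault version before relying on them
  max_idle_connections    = "0"   # don't keep idle connections in the pool
  max_connection_lifetime = "60"  # recycle connections after 60 seconds
}
```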
If it matters, the individual nodes can connect to each other on ports 8200-8201, but access to port 8200 from external networks is blocked; only the load balancer can reach the individual nodes. That’s why my api_addr is set to the LB hostname (per the docs).
My Azure LB health probe uses tight, fast timing settings and detects the new leader within 10 seconds, but that could still be a problem given how often this happens. The nodes can reach each other fine on port 8201 (or I guess leader election wouldn’t work at all).
So I was hoping someone might have encountered this before and can share how they’ve fixed it. Thanks!