Vault HA cluster constantly electing a new leader

Hey folks,

I have a 3-node HA cluster of Vault OSS v1.5.2 in Azure, behind an Azure Load Balancer (a regular LB, not App Gateway), with an Azure MySQL storage backend. The VMs are in a scale set, if that matters.

Individual nodes come up fine and form a cluster as expected. I haven’t tested Vault’s actual functionality yet, but I assume it works.

My problem, as the title says, is that a new leader appears to get elected every few minutes.

This is what I see in the leader node log when it loses leadership:

Aug 28 12:15:06 vault-vm000008 vault[9964]: 2020-08-28T12:15:06.956Z [WARN]  core: leadership lost, stopping active operation
Aug 28 12:15:06 vault-vm000008 vault[9964]: 2020-08-28T12:15:06.956Z [INFO]  core: pre-seal teardown starting
Aug 28 12:15:07 vault-vm000008 vault[9964]: 2020-08-28T12:15:07.456Z [INFO]  rollback: stopping rollback manager
Aug 28 12:15:07 vault-vm000008 vault[9964]: 2020-08-28T12:15:07.456Z [INFO]  core: pre-seal teardown complete
Aug 28 12:15:07 vault-vm000008 vault[9964]: [mysql] 2020/08/28 12:15:07 connection.go:135: write tcp 10.51.7.11:43538->13.68.105.208:3306: write: broken pipe
Aug 28 12:15:07 vault-vm000008 vault[9964]: [mysql] 2020/08/28 12:15:07 connection.go:135: write tcp 10.51.7.11:43148->13.68.105.208:3306: write: broken pipe
Aug 28 12:15:07 vault-vm000008 vault[9964]: 2020-08-28T12:15:07.459Z [ERROR] core: unlocking HA lock failed: error="mysql: unable to release lock, already released or not held by this session"
Aug 28 12:15:07 vault-vm000008 vault[9964]: 2020-08-28T12:15:07.460Z [WARN]  core.cluster-listener: no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]

13.68.105.208 above is the IP address of my Azure MySQL instance.

And this is what I see on the new leader:

Aug 28 12:15:08 vault-vm000007 vault[9921]: 2020-08-28T12:15:08.830Z [INFO]  core: acquired lock, enabling active operation
Aug 28 12:15:08 vault-vm000007 vault[9921]: 2020-08-28T12:15:08.860Z [INFO]  core: post-unseal setup starting
Aug 28 12:15:08 vault-vm000007 vault[9921]: 2020-08-28T12:15:08.870Z [INFO]  core: loaded wrapping token key
Aug 28 12:15:08 vault-vm000007 vault[9921]: 2020-08-28T12:15:08.870Z [INFO]  core: successfully setup plugin catalog: plugin-directory=
Aug 28 12:15:08 vault-vm000007 vault[9921]: 2020-08-28T12:15:08.872Z [INFO]  core: successfully mounted backend: type=system path=sys/
Aug 28 12:15:08 vault-vm000007 vault[9921]: 2020-08-28T12:15:08.872Z [INFO]  core: successfully mounted backend: type=identity path=identity/
Aug 28 12:15:08 vault-vm000007 vault[9921]: 2020-08-28T12:15:08.872Z [INFO]  core: successfully mounted backend: type=cubbyhole path=cubbyhole/
Aug 28 12:15:08 vault-vm000007 vault[9921]: 2020-08-28T12:15:08.878Z [INFO]  core: successfully enabled credential backend: type=token path=token/
Aug 28 12:15:08 vault-vm000007 vault[9921]: 2020-08-28T12:15:08.879Z [INFO]  core: restoring leases
Aug 28 12:15:08 vault-vm000007 vault[9921]: 2020-08-28T12:15:08.879Z [INFO]  rollback: starting rollback manager
Aug 28 12:15:08 vault-vm000007 vault[9921]: 2020-08-28T12:15:08.880Z [INFO]  expiration: lease restore complete
Aug 28 12:15:08 vault-vm000007 vault[9921]: 2020-08-28T12:15:08.881Z [INFO]  identity: entities restored
Aug 28 12:15:08 vault-vm000007 vault[9921]: 2020-08-28T12:15:08.882Z [INFO]  identity: groups restored
Aug 28 12:15:09 vault-vm000007 vault[9921]: 2020-08-28T12:15:09.343Z [INFO]  core: usage gauge collection is disabled
Aug 28 12:15:09 vault-vm000007 vault[9921]: 2020-08-28T12:15:09.353Z [INFO]  core: post-unseal setup complete

And lastly, here’s my config file:

listener "tcp" {
  address = "127.0.0.1:8200"
  tls_cert_file = "/etc/vault.d/certificate.crt"
  tls_key_file = "/etc/vault.d/certificate.pem"
}

listener "tcp" {
  address = "LOCAL_IP_ADDRESS:8200"
  tls_cert_file = "/etc/vault.d/certificate.crt"
  tls_key_file = "/etc/vault.d/certificate.pem"
}

storage "mysql" {
  address = "DB_HOST"
  tls_ca_file = "/etc/vault.d/azure_mysql_tls_ca.crt"
  username = "DB_USERNAME"
  password = "DB_PASSWORD"
  database = "DB_NAME"
  ha_enabled = "true"
}

seal "azurekeyvault" {
  tenant_id = "${azure_tenant_id}"
  vault_name = "${key_vault_name}"
  key_name = "${key_vault_key_name}"
}

api_addr = "https://LOCAL_IP_ADDRESS:8200"
cluster_addr = "https://LOCAL_IP_ADDRESS:8201"
ui = true

This happens every 120 seconds, so I’m guessing some timeout is responsible, perhaps a MySQL setting, but I don’t know MySQL well at all and wouldn’t know what to look for.

I’ve tried api_addr set to the hostname of the Azure LB, as well as to each node’s own IP address as shown above. Neither helped, so I’ve set it back to the Azure LB hostname.

I’ve also tried with and without the cluster_addr line. Also didn’t help.

If it matters, the individual nodes can connect to each other on ports 8200-8201, but access to port 8200 from outside the network is blocked; only the load balancer can reach the individual nodes. That’s why my api_addr is set to the LB hostname (per the docs).

The Azure LB health probe has tight timing settings and picks up the new leader within about 10 seconds, but that could still be a problem given how often this happens. The nodes can reach each other fine on port 8201 (or I guess the standbys wouldn’t be able to forward requests to the active node at all).

So I was hoping someone might have encountered this before and can share how they’ve fixed it. Thanks!

I couldn’t find the Edit button again after using it once earlier, so I’ll just add this as a reply to myself. I looked at the connections to the database and saw this:

+------+----------+------------------+------+---------+------+-----------+---------------------------------------------------------------------+
| Id   | User     | Host             | db   | Command | Time | State     | Info                                                                |
+------+----------+------------------+------+---------+------+-----------+---------------------------------------------------------------------+
| 3800 | vaultapp | 10.51.7.10:1416  | NULL | Sleep   |    2 |           | NULL                                                                |
| 3801 | vaultapp | 10.51.7.10:25218 | NULL | Sleep   |    2 |           | NULL                                                                |
| 3807 | vaultapp | 10.51.7.11:25729 | NULL | Sleep   |    3 |           | NULL                                                                |
| 3808 | vaultapp | 10.51.7.11:26244 | NULL | Sleep   |    8 |           | NULL                                                                |
| 3814 | vaultapp | 10.51.7.12:31876 | NULL | Sleep   |    2 |           | NULL                                                                |
| 3815 | vaultapp | 10.51.7.12:4994  | NULL | Sleep   |    2 |           | NULL                                                                |
| 3836 | vaultapp | 10.51.7.11:8960  | NULL | Sleep   |   73 |           | NULL                                                                |
| 3838 | vaultapp | 10.51.7.10:11968 | NULL | Execute |  191 | User lock | SELECT GET_LOCK('core/lock', 2147483647), IS_USED_LOCK('core/lock') |
| 3840 | vaultapp | 10.51.7.11:8194  | NULL | Sleep   |   73 |           | NULL                                                                |
| 3841 | vaultapp | 10.51.7.12:17026 | NULL | Execute |   71 | User lock | SELECT GET_LOCK('core/lock', 2147483647), IS_USED_LOCK('core/lock') |
| 3842 | vaultapp | 10.51.7.10:3073  | NULL | Query   |    0 | starting  | show full processlist                                               |
+------+----------+------------------+------+---------+------+-----------+---------------------------------------------------------------------+
11 rows in set (0.01 sec)

Is it weird that two different nodes are running the SELECT GET_LOCK query? In this case the two nodes in question were the standbys. The leader was process 3840, and when the Time field on that row reached 120, a new leader was elected and the whole thing started over in the same pattern.
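If it helps anyone reading this later: as far as I can tell from the MySQL docs, the two standbys sitting in GET_LOCK is expected (they’re blocked waiting for the HA lock), and you can check which connection actually holds the lock at any given moment with something like this (the column alias is just mine):

SELECT IS_USED_LOCK('core/lock') AS lock_holder_connection_id;
-- returns the Id (as shown in SHOW FULL PROCESSLIST) of the session holding the lock, or NULL if nobody holds it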

Have you read the first paragraph of the MySQL storage backend docs page?

Yes, I had read it, though I’d forgotten about it. I think I dismissed it because those two MySQL settings default to 8 hours while my leader re-election happens every 2 minutes, so I didn’t see a connection between the two. I can certainly tweak them and see what happens.
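For anyone else who wouldn’t know what to look for either, checking the current values is straightforward (plain MySQL, nothing Azure-specific as far as I know):

SHOW VARIABLES WHERE Variable_name IN ('wait_timeout', 'interactive_timeout');
-- both default to 28800 seconds (8 hours)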

Alright, I’ve updated wait_timeout and interactive_timeout on the MySQL instance from the default of 28800 (8 hours) to 172800 (48 hours). After I did that, I saw a couple more re-elections every 2 minutes, so I stopped and restarted Vault on each node. It’s now been 7 minutes and no leader election has taken place.
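For completeness, on a self-managed MySQL the equivalent change would be something along these lines, though on Azure Database for MySQL I don’t believe the admin account gets SUPER, so the same two values have to be changed as server parameters through the portal or CLI instead:

-- self-managed MySQL equivalent; on Azure these are set as server parameters
SET GLOBAL wait_timeout = 172800;
SET GLOBAL interactive_timeout = 172800;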

So thanks for the pointer!

The only thing now is that I don’t know whether those new values are a good idea, but since nothing except Vault can connect to that DB, I’m not too concerned about it.

My tl;dr would be: why not just use integrated storage to back Vault instead of MySQL?

MySQL is easier for us to manage. There’s a learning curve to operating Raft (I know from managing Consul), troubleshooting can be problematic since most people have no experience with it, and we’re not paying HashiCorp for support. Managed Azure MySQL frees us from a number of those worries.