Hello,
I’m having some issues when HA mode is configured.
We are using Vault server 1.7.2 with HA enabled and a PostgreSQL database as the storage backend.
Initially, Vault is configured with HA disabled on two standalone hosts. Each standalone node has its own DB, and its configuration points to that host's DB.
At some point, the two hosts are paired into HA mode (active/standby). A virtual IP (VIP) is assigned to the "active" node. The DB on the standby becomes read-only, the DB on the active node stays writable, and data from the active node is replicated to the standby.
The Vault configuration is updated to HA mode on both hosts and the servers are restarted. The database connection is set to use the virtual IP, so that both Vault instances point to the same writable DB.
Configuration example:
listener "tcp" {
  address       = "x.x.x.x:8200"
  tls_cert_file = "cert.pem"
  tls_key_file  = "cert-key.pem"
}

api_addr     = "https://xx.x.x.xxx:8200"
cluster_addr = "https://xx.x.x.xxx:8201"

storage "postgresql" {
  connection_url = "postgres://user@xxx.xxx.x.xxx:5432/db?sslmode=verify-ca&sslrootcert=ca.pem&sslcert=cert.pem&sslkey=cert-key.pem"
  ha_enabled     = "true"
}
disable_mlock = true
This scenario works well. However, issues start when a switchover is performed. During switchover, the VIP moves to the other host, and the DBs switch roles: the read-only DB becomes writable, and the writable DB restarts and becomes read-only.
The virtual IP does point to the correct (writable) DB.
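For reference, this is roughly how we verify which role the DB behind the VIP has after a switchover. It is only a minimal sketch, not our production check: the DSN, credentials, and the lib/pq driver are placeholders mirroring the Vault connection_url above.

package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // PostgreSQL driver (placeholder; we reuse the same TLS params as Vault)
)

func main() {
	// Hypothetical DSN through the VIP, mirroring the Vault connection_url.
	dsn := "postgres://user@xxx.xxx.x.xxx:5432/db?sslmode=verify-ca&sslrootcert=ca.pem&sslcert=cert.pem&sslkey=cert-key.pem"

	db, err := sql.Open("postgres", dsn)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// pg_is_in_recovery() returns true on a standby (read-only) and false on the primary.
	var inRecovery bool
	if err := db.QueryRow("SELECT pg_is_in_recovery()").Scan(&inRecovery); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("DB behind VIP is standby (read-only): %v\n", inRecovery)
}

After the switchover this reports false through the VIP, i.e. the VIP is routing to the new writable primary.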
On the "former standby" Vault node, errors like the following are displayed:
core: key rotation periodic upgrade check failed: error="write tcp xxx.xxx.xxx.xx:40358-><VIP address>:5432: write: connection reset by peer"
Then the following issues are seen:
- The Vault server simply stops responding and requests time out. Only restarting the server makes it work again.
- Also, following a restart, both nodes may for some time be seen as active, and only later does one of them become standby.
Could you please advise what the problem may be and how it can be fixed?
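Our working theory is that connections in the Vault node's database/sql pool were opened against the old primary before the switchover and go stale when that DB restarts, which would explain the "connection reset by peer" errors. The sketch below only illustrates that general database/sql behavior with a hypothetical DSN; it is not Vault's actual storage code, and SetConnMaxLifetime is shown just to demonstrate that bounding connection lifetime forces a re-dial through the VIP.

package main

import (
	"context"
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // PostgreSQL driver (placeholder)
)

func main() {
	// Hypothetical DSN pointing at the VIP, as in our Vault config.
	db, err := sql.Open("postgres", "postgres://user@xxx.xxx.x.xxx:5432/db?sslmode=verify-ca")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Idle connections opened before the switchover can stay in the pool while
	// the old primary restarts underneath them. Bounding their lifetime forces
	// periodic re-dials through the VIP to whichever node is currently primary.
	db.SetConnMaxLifetime(30 * time.Second)

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Reusing a stale pooled connection is the kind of operation that can fail
	// with "connection reset by peer", similar to the Vault log line above.
	if err := db.PingContext(ctx); err != nil {
		log.Printf("ping through VIP failed: %v", err)
	}
}

We have not confirmed whether Vault exposes a way to tune this for the postgresql backend, so any pointer here would help.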
Thanks in advance,
Amir
Update: Vault trace log before the connection reset:
2021-06-06T15:54:36.611Z [DEBUG] storage.cache: creating LRU cache: size=0
2021-06-06T15:54:36.613Z [DEBUG] cluster listener addresses synthesized: cluster_addresses=[0.0.0.0:8201]
2021-06-06T15:54:36.615Z [DEBUG] would have sent systemd notification (systemd not present): notification=READY=1
2021-06-06T15:54:37.687Z [DEBUG] core: unseal key supplied: migrate=false
2021-06-06T15:54:37.695Z [DEBUG] core: starting cluster listeners
2021-06-06T15:54:37.695Z [INFO] core.cluster-listener.tcp: starting listener: listener_address=0.0.0.0:8201
2021-06-06T15:54:37.695Z [INFO] core.cluster-listener: serving cluster requests: cluster_listen_address=[::]:8201
2021-06-06T15:54:37.695Z [INFO] core: vault is unsealed
2021-06-06T15:54:37.696Z [INFO] core: entering standby mode
2021-06-06T15:54:37.704Z [TRACE] core: found new active node information, refreshing
2021-06-06T15:54:37.735Z [DEBUG] core: parsing information for new active node: active_cluster_addr=https://xx.x.x.xxx:8201 active_redirect_addr=https://xx.x.x.xxx:8200
2021-06-06T15:54:37.735Z [DEBUG] core: refreshing forwarding connection
2021-06-06T15:54:37.735Z [DEBUG] core: clearing forwarding clients
2021-06-06T15:54:37.735Z [DEBUG] core: done clearing forwarding clients
2021-06-06T15:54:37.736Z [DEBUG] core: done refreshing forwarding connection
2021-06-06T15:54:37.737Z [DEBUG] core.cluster-listener: creating rpc dialer: alpn=req_fw_sb-act_v1 host=fw-d16ff282-1ec2-b858-4f46-5997964dbda4
2021-06-06T15:54:37.956Z [DEBUG] core.cluster-listener: performing client cert lookup
2021-06-06T16:10:26.696Z [ERROR] core: failed to acquire lock: error="read tcp 192.168.253.11:40302->192.168.5.222:5432: read: connection reset by peer"