Vault HA with PostgreSQL storage backend fails after DB shutdown during switchover

Hello,

I’m having some issues when HA mode is configured.

We are running Vault server 1.7.2 with HA enabled and PostgreSQL as the storage backend.

Initially, Vault is configured with HA disabled on two standalone hosts. Each standalone node has its own DB, and its configuration points to that host's DB.

At some point, the two hosts are paired into HA mode (active/standby). A virtual IP (VIP) is assigned to the "active" node. The DB on the standby becomes read-only, the DB on the active node stays writable, and data on the active node is replicated to the standby.

The Vault configuration is updated to HA mode on both hosts and the servers are restarted. The database connection is set to use the virtual IP, so both Vault instances point to the same writable DB.

Configuration example:

listener "tcp" {
  address       = "x.x.x.x:8200"
  tls_cert_file = "cert.pem"
  tls_key_file  = "cert-key.pem"
}

api_addr     = "https://xx.x.x.xxx:8200"
cluster_addr = "https://xx.x.x.xxx:8201"

storage "postgresql" {
  connection_url = "postgres://user@xxx.xxx.x.xxx:5432/db?sslmode=verify-ca&sslrootcert=ca.pem&sslcert=cert.pem&sslkey=cert-key.pem"
  ha_enabled     = "true"
}

disable_mlock = true
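For context, with ha_enabled = "true" the PostgreSQL backend keeps its HA leader lock in the vault_ha_locks table (schema per the Vault storage backend docs: ha_key, ha_identity, ha_value, valid_until). A quick way to watch the lock through the VIP (connection details masked as above, TLS client-cert parameters trimmed):

psql "postgres://user@<VIP>:5432/db?sslmode=verify-ca" \
  -c "SELECT ha_key, ha_identity, valid_until FROM vault_ha_locks;"

As far as I can tell, the node holding the lock keeps renewing valid_until, and ha_identity identifies the holder.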

This scenario works well. However, the issues start when a switchover is performed. During switchover, the VIP moves to the other host and the DBs swap roles: the read-only DB becomes writable, and the writable DB restarts and becomes read-only.

After the switchover, the virtual IP indeed points to the correct (writable) DB.
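This can be confirmed from each Vault host by asking the DB behind the VIP which role it reports (same masked connection string as in the config, TLS parameters trimmed):

psql "postgres://user@<VIP>:5432/db?sslmode=verify-ca" \
  -c "SELECT pg_is_in_recovery();"
# 'f' = writable primary, 't' = read-only standby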

On the "former standby" Vault node, the following error is displayed:
core: key rotation periodic upgrade check failed: error="write tcp xxx.xxx.xxx.xx:40358-><VIP address>:5432: write: connection reset by peer"

Then the following issues are seen:

  1. The Vault server simply stops responding and requests time out. Only restarting the server makes it work again.
  2. Also, following a restart, both nodes may for some time be seen as active, and one of them only becomes standby later (see the check after this list).
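To observe issue 2, each node's role can be checked directly (bypassing the VIP) via the health endpoint, which by default returns HTTP 200 on the active node and 429 on an unsealed standby:

# node addresses are placeholders for the two real hosts
curl -sk -o /dev/null -w "%{http_code}\n" https://<node1>:8200/v1/sys/health
curl -sk -o /dev/null -w "%{http_code}\n" https://<node2>:8200/v1/sys/health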

Could you please advise what the problem may be and how it can be fixed?

Thanks in advance,
Amir

Update: Vault trace log before the connection reset:

2021-06-06T15:54:36.611Z [DEBUG] storage.cache: creating LRU cache: size=0
2021-06-06T15:54:36.613Z [DEBUG] cluster listener addresses synthesized: cluster_addresses=[0.0.0.0:8201]
2021-06-06T15:54:36.615Z [DEBUG] would have sent systemd notification (systemd not present): notification=READY=1
2021-06-06T15:54:37.687Z [DEBUG] core: unseal key supplied: migrate=false
2021-06-06T15:54:37.695Z [DEBUG] core: starting cluster listeners
2021-06-06T15:54:37.695Z [INFO] core.cluster-listener.tcp: starting listener: listener_address=0.0.0.0:8201
2021-06-06T15:54:37.695Z [INFO] core.cluster-listener: serving cluster requests: cluster_listen_address=[::]:8201
2021-06-06T15:54:37.695Z [INFO] core: vault is unsealed
2021-06-06T15:54:37.696Z [INFO] core: entering standby mode
2021-06-06T15:54:37.704Z [TRACE] core: found new active node information, refreshing
2021-06-06T15:54:37.735Z [DEBUG] core: parsing information for new active node: active_cluster_addr=https://xx.x.x.xxx:8201 active_redirect_addr=https://xx.x.x.xxx:8200
2021-06-06T15:54:37.735Z [DEBUG] core: refreshing forwarding connection
2021-06-06T15:54:37.735Z [DEBUG] core: clearing forwarding clients
2021-06-06T15:54:37.735Z [DEBUG] core: done clearing forwarding clients
2021-06-06T15:54:37.736Z [DEBUG] core: done refreshing forwarding connection
2021-06-06T15:54:37.737Z [DEBUG] core.cluster-listener: creating rpc dialer: alpn=req_fw_sb-act_v1 host=fw-d16ff282-1ec2-b858-4f46-5997964dbda4
2021-06-06T15:54:37.956Z [DEBUG] core.cluster-listener: performing client cert lookup
2021-06-06T16:10:26.696Z [ERROR] core: failed to acquire lock: error="read tcp 192.168.253.11:40302->192.168.5.222:5432: read: connection reset by peer"