We are running a Vault cluster (3 nodes in Oracle Cloud Infrastructure) behind an Oracle load balancer.
We are seeing timeouts when trying to connect to the UI. Today we did some load testing to see what is happening and observed a few things:
- The load balancer health check appears to have failed, and the LB failed over to a different node (which was not the leader).
- After 7 minutes, the LB failed back to the original node (which was the leader). During this period we saw multiple connection timeouts.
Can someone help with this issue? Is there LB tuning that needs to be done, or is this caused by LB limitations, as I see mentioned in the Vault documentation?
We are using Oracle Object Storage as our storage backend.
$ cat /etc/vault.d/vault.hcl
cluster_name = "vault"
default_lease_ttl = "5m"
telemetry {
  prometheus_retention_time = "24h"
  disable_hostname = true
}
listener "tcp" {
  address = "0.0.0.0:8200"
  #tls_cert_file = "/etc/vault.d/vault.crt"
  tls_disable = "true"
  #tls_key_file = "/etc/vault.d/key.pem"
  telemetry {
    unauthenticated_metrics_access = true
  }
}
log_level = "INFO"
max_lease_ttl = "30m"
seal "ocikms" {
  auth_type_api_key = "false"
  crypto_endpoint = "<endpoint>"
  key_id = "<OCID>"
  management_endpoint = "<mgmt endpoint>"
}
storage "oci" {
  auth_type_api_key = "false"
  bucket_name = "vault"
  ha_enabled = "true"
  lock_bucket_name = "vault_lock"
  namespace_name = "<namespace>"
  redirect_addr = "https://<vault_lb_url>:8200"
  api_addr = "https://<vault_lb_url>:8200"
  cluster_addr = "http://<vault_lb_IP>:8201"
}
ui = "true"
The LB health check hits the following path over HTTP on port 8200 and expects status code 200:
/v1/sys/health
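To make the check easier to reproduce by hand, here is a minimal Python sketch of what I understand the LB to be doing against each node. The status code meanings are the defaults listed in the Vault docs for /v1/sys/health. The function names (`probe`, `describe_health`, `lb_considers_healthy`) and the node address in the usage comment are placeholders I made up for illustration, not part of our setup or of any Vault tooling.

```python
import urllib.error
import urllib.request

# Default status codes documented for Vault's /v1/sys/health endpoint.
HEALTH_STATES = {
    200: "initialized, unsealed, active",
    429: "unsealed, standby",
    472: "disaster recovery secondary",
    473: "performance standby",
    501: "not initialized",
    503: "sealed",
}


def describe_health(status_code):
    """Translate a /v1/sys/health status code into a readable node state."""
    return HEALTH_STATES.get(status_code, "unknown")


def lb_considers_healthy(status_code, expected=200):
    """Mimic an LB check that only accepts one status code (200 in our case)."""
    return status_code == expected


def probe(base_url, timeout=3.0):
    """Return the HTTP status of <base_url>/v1/sys/health, or None if unreachable."""
    url = base_url.rstrip("/") + "/v1/sys/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on non-2xx responses; the code itself is what we want.
        return err.code
    except OSError:
        # Connection refused, DNS failure, or timeout.
        return None


# Example usage (hypothetical node address, replace with a real one):
#   code = probe("http://10.0.0.11:8200")
#   print(code, describe_health(code), lb_considers_healthy(code))
```

If I read the docs correctly, a standby node answers this endpoint with 429 by default, so a check that accepts only 200 would mark every node except the active one as unhealthy.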
Any help on how to fix this is highly appreciated.