Vault K8s HA Raft Certificate Error

Hello guys,

I'm trying to set up Vault HA on Kubernetes with Raft, on an AKS cluster, following this guide:

Vault on Kubernetes Deployment Guide | Vault - HashiCorp Learn

I also use Azure Key Vault. Ever since I started configuring TLS, I've been running into issues.

In particular, retry_join is not working for me. Log excerpt from one of the nodes trying to join:

2021-01-18T18:27:58.317Z [INFO] core: attempting to join possible raft leader node: leader_addr=https://vault-0.vault-internal:8200

2021-01-18T18:27:58.539Z [WARN] core: join attempt failed: error="error during raft bootstrap init call: Put \"https://10.244.0.99:8200/v1/sys/storage/raft/bootstrap/challenge\": x509: certificate is valid for MYPUBLICIP, not 10.244.0.99"
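For what it's worth, the x509 failure is ordinary SAN matching: the challenge request was sent to the raw pod IP, and a certificate that only carries DNS SANs can never satisfy an IP check. A throwaway local demonstration with openssl (names taken from the log; requires OpenSSL 1.1.1+ for `-addext`):

```shell
# Create a self-signed cert whose SANs contain only a DNS name,
# mimicking a cert issued for the pod hostname but not the pod IP.
openssl req -x509 -newkey rsa:2048 -nodes -keyout k.pem -out c.pem \
  -days 1 -subj '/CN=vault-0.vault-internal' \
  -addext 'subjectAltName=DNS:vault-0.vault-internal'

# The DNS name matches the certificate ...
openssl x509 -in c.pem -noout -checkhost vault-0.vault-internal

# ... but a check against the pod IP fails, because IP checks only
# consider IP SANs, and this cert has none.
openssl x509 -in c.pem -noout -checkip 10.244.0.99
```

So as long as the join handshake ends up dialing `10.244.0.99` directly, the cert would need that IP as an IP SAN, regardless of which DNS names it covers.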

The relevant section of my listener and Raft storage configuration:

listener "tcp" {
      tls_disable = 0
      address = "[::]:8200"
      cluster_address = "[::]:8201"
      tls_ca_cert_file = "/vault/userconfig/vault-server-tls/root-ca.pem"
      tls_cert_file   = "/vault/userconfig/vault-server-tls/vault.crt"
      tls_key_file    = "/vault/userconfig/vault-server-tls/vault.key"
    }

    storage "raft" {
      path = "/vault/data"
      retry_join {
        leader_api_addr = "https://vault-0.vault-internal:8200"
        leader_ca_cert_file = "/vault/userconfig/vault-server-tls/root-ca.pem"
        leader_client_cert_file = "/vault/userconfig/vault-server-tls/vault.crt"
        leader_client_key_file = "/vault/userconfig/vault-server-tls/vault.key"
      }

      retry_join {
        leader_api_addr = "https://vault-1.vault-internal:8200"
        leader_ca_cert_file = "/vault/userconfig/vault-server-tls/root-ca.pem"
        leader_client_cert_file = "/vault/userconfig/vault-server-tls/vault.crt"
        leader_client_key_file = "/vault/userconfig/vault-server-tls/vault.key"
      }

      retry_join {
        leader_api_addr = "https://vault-2.vault-internal:8200"
        leader_ca_cert_file = "/vault/userconfig/vault-server-tls/root-ca.pem"
        leader_client_cert_file = "/vault/userconfig/vault-server-tls/vault.crt"
        leader_client_key_file = "/vault/userconfig/vault-server-tls/vault.key"
      }

      retry_join {
        leader_api_addr = "https://vault-3.vault-internal:8200"
        leader_ca_cert_file = "/vault/userconfig/vault-server-tls/root-ca.pem"
        leader_client_cert_file = "/vault/userconfig/vault-server-tls/vault.crt"
        leader_client_key_file = "/vault/userconfig/vault-server-tls/vault.key"
      }

      retry_join {
        leader_api_addr = "https://vault-4.vault-internal:8200"
        leader_ca_cert_file = "/vault/userconfig/vault-server-tls/root-ca.pem"
        leader_client_cert_file = "/vault/userconfig/vault-server-tls/vault.crt"
        leader_client_key_file = "/vault/userconfig/vault-server-tls/vault.key"
      }
    }

I'm really wondering why the error mentions that the certificate doesn't contain the pod IP. The Raft storage configuration specifies DNS names like vault-0.vault-internal. I certainly cannot add the pod IPs to the certificate, since I don't know them before Vault is deployed.
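One way to cover all the pod hostnames without knowing pod IPs in advance is to issue the server certificate with a wildcard DNS SAN over the headless service, so every vault-N.vault-internal name is valid. A minimal CSR sketch (the service name vault-internal and namespace default are assumptions; adjust to your release):

```shell
# Hypothetical CSR config: wildcard SANs over the headless service
# cover vault-0.vault-internal, vault-1.vault-internal, etc.
cat > vault-csr.conf <<'EOF'
[req]
prompt = no
distinguished_name = dn
req_extensions = san
[dn]
CN = vault.default.svc.cluster.local
[san]
subjectAltName = DNS:*.vault-internal, DNS:*.vault-internal.default.svc.cluster.local, DNS:vault.default.svc.cluster.local, IP:127.0.0.1
EOF

# Generate a key and a CSR carrying those SANs.
openssl req -new -newkey rsa:2048 -nodes -keyout vault.key \
  -out vault.csr -config vault-csr.conf

# Confirm the SANs made it into the request before submitting it to the CA.
openssl req -in vault.csr -noout -text | grep -A1 'Subject Alternative Name'
```

Note this still only helps if the join handshake dials a DNS name the wildcard covers, not a raw pod IP.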

If anyone has any idea what might be causing this, I'd appreciate it. I will update this topic if I figure it out myself.

I made some progress on this. I changed the way I use the Helm chart: instead of modifying only values.yaml, I added an override YAML, just the way the tutorial advises. The error I received may be connected to the probe settings I added to my override file:

  # For HA configuration and because we need to manually init the vault,
  # we need to define custom readiness/liveness Probe settings
  readinessProbe:
    enabled: true
    path: "/v1/sys/health?standbyok=true&sealedcode=204&uninitcode=204"
  livenessProbe:
    enabled: true
    path: "/v1/sys/health?standbyok=true"
    initialDelaySeconds: 60

Not sure if this really was the issue, but it is working for me now. If you encounter similar problems, feel free to reply to this thread; maybe we can pinpoint the error together.

Last update on this matter: I don't think the readiness/liveness probes had anything to do with it. The behavior is somewhat inconsistent. Sometimes the cluster initializes fine and retry_join works as expected; sometimes I get the error I mentioned when opening this thread. The only workaround for me is re-deploying until it works.

Still cannot determine the root cause in my setup for this behavior.
