Can't get replicas to sync - Helm + GKE + manually specified listeners and services

I'm trying to install Vault as an HA cluster with Raft storage on GKE. The idea is to keep it off the WAN while still exposing it via cross-project networking. So I have the chart's default services disabled via the Helm value server.service.enabled = false, plus two manually deployed services: an internal LoadBalancer for cross-project access and a headless ClusterIP replacing the chart's vault-internal service. I also have a self-signed cert from an internal CA.

I think my most recent problems are coming from TLS issues. Last night I temporarily settled for no replication because the vault-1 instance was still failing to auto-unseal, so server.ha.replicas is currently set to 1. I've been hammering at the various listener and cert options, so there may still be leftover issues. Also, the manual config section is nested under the raft section because the chart wasn't injecting it at all when the config block was placed one level up.

I'm really looking for pointers on fundamental issues with this approach, namely any missing parameters needed for networking or for automatic communication between the replicas and the leader. What I kept running into is that the vault-1 pod could not automatically connect and unseal based on the vault-0 pod, even though vault-0 would boot, read from gcpckms, and unseal (using an existing PVC) with no intervention. vault-1, on the other hand, would not automatically join the cluster (it's unclear from the docs I read whether that's even supposed to happen without intervention), and when I manually issued a vault operator raft join it failed with TLS errors (the last error I saw said it was attempting a TLS connection but couldn't negotiate the Raft protocol; I lost the exact message).
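
For reference, the manual join I attempted looked roughly like this (reconstructed from memory; vault-0.vault-internal resolves via the headless service below, and ca.crt is the internal CA mounted from the vault-tls secret):

# run the join from inside vault-1, pointing at vault-0's TLS listener
kubectl exec -ti vault-1 -- sh -c \
  'vault operator raft join \
     -leader-ca-cert="$(cat /etc/vault/tls/ca.crt)" \
     https://vault-0.vault-internal:8200'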

Helm values.yaml:

server:
  enabled: true

  mode: ha

  extraEnvironmentVars:
    VAULT_SEAL: gcpckms
    VAULT_GCPKMS_PROJECT_ID: "MYPROJECT"
    VAULT_GCPKMS_KEY_RING: "vault-helm-unseal-kr"
    VAULT_GCPKMS_KEY_NAME: "vault-unseal-key"
    VAULT_GCPKMS_LOCATION: "MYREGION"
    VAULT_CACERT: "/etc/vault/tls/ca.crt"

  ha:
    enabled: true
    replicas: 1  # temporarily 1 while vault-1 fails to auto-unseal (see above)
    apiAddr: "https://vault-internal:8200"
    clusterAddr: "https://vault-internal:8201"

    service:
      enabled: true
      type: ClusterIP
      clusterIP: None

    raft:
      enabled: true
      setSize: 2

      config: |
        ui = true

        storage "raft" {
          path = "/vault/data"
        }

        log_level = "debug"

        seal "gcpckms" {
          project    = "MYPROJECT"
          region     = "MYREGION"
          key_ring   = "MYKEYRING"
          crypto_key   = "vault-init"
        }

        # Local listener (no TLS)
        listener "tcp" {
          address     = "127.0.0.1:8200"
          # cluster_address = "127.0.0.1:8201"
          tls_disable = 1
        }

        # Cross-project listener (with TLS); the chart substitutes POD_IP at startup
        listener "tcp" {
          address         = "POD_IP:8200"
          cluster_address = "POD_IP:8201"

          tls_cert_file = "/etc/vault/tls/vault.crt"
          tls_key_file  = "/etc/vault/tls/vault.key"
          # tls_client_ca_cert = "/etc/vault/tls/ca.crt"

          tls_disable_client_certs = true
        }

        service_registration "kubernetes" {}

  service:
    enabled: false

  # Define volumes to mount the TLS secret
  volumes:
    - name: vault-tls
      secret:
        secretName: vault-tls

  # Mount the TLS secret into the container
  volumeMounts:
    - name: vault-tls
      mountPath: /etc/vault/tls
      readOnly: true
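
One thing I suspect is missing for automatic joining is a retry_join stanza inside storage "raft". This is my best guess at what it should look like (an untested sketch; the leader hostname assumes the per-pod DNS name created by the headless vault-internal service):

storage "raft" {
  path = "/vault/data"

  # untested: tells a standby which leader API to dial and which CA to trust
  retry_join {
    leader_api_addr     = "https://vault-0.vault-internal:8200"
    leader_ca_cert_file = "/etc/vault/tls/ca.crt"
  }
}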

External networking service:

apiVersion: v1
kind: Service
metadata:
  name: vault
  namespace: default
  labels:
    app: vault
  annotations:
    cloud.google.com/load-balancer-type: Internal
    networking.gke.io/internal-load-balancer-allow-global-access: 'true'
spec:
  ports:
    - name: vault-port
      protocol: TCP
      port: 443
      targetPort: 8200
  selector:
    app.kubernetes.io/instance: vault
    app.kubernetes.io/name: vault
  clusterIP: 10.0.91.200
  clusterIPs:
    - 10.0.91.200
  type: LoadBalancer
  sessionAffinity: None
  loadBalancerIP: 10.0.96.2
  loadBalancerSourceRanges:
    - 0.0.0.0/0
  externalTrafficPolicy: Local
  healthCheckNodePort: 31610
  ipFamilies:
    - IPv4
  ipFamilyPolicy: SingleStack
  allocateLoadBalancerNodePorts: true
  internalTrafficPolicy: Cluster

Internal networking service:

apiVersion: v1
kind: Service
metadata:
  name: vault-internal
  namespace: default
  labels:
    app: vault
    component: server
spec:
  ports:
    - name: vault-port
      protocol: TCP
      port: 8200  # Vault API port
      targetPort: 8200
    - name: vault-cluster-port
      protocol: TCP
      port: 8201  # Vault cluster (Raft) port
      targetPort: 8201
  selector:
    app.kubernetes.io/instance: vault
    app.kubernetes.io/name: vault
    component: server  # matches the chart's server pod labels
  clusterIP: None  # headless, so each pod gets its own DNS record
  publishNotReadyAddresses: true
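
To sanity-check that the headless service actually creates per-pod DNS records, I have been running something along these lines (nslookup availability depends on the image):

# the headless service should resolve, along with per-pod records
kubectl exec -ti vault-0 -- nslookup vault-internal
kubectl exec -ti vault-0 -- nslookup vault-0.vault-internal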

Recent logs from the vault-1 pod (showing the gist of the problem):

==> Vault server configuration:

Administrative Namespace:
             Api Address: https://vault-internal:8200
                     Cgo: disabled
         Cluster Address: https://vault-internal:8201
   Environment Variables: HOME, HOSTNAME, HOST_IP, KUBERNETES_PORT, KUBERNETES_PORT_443_TCP, KUBERNETES_PORT_443_TCP_ADDR, KUBERNETES_PORT_443_TCP_PORT, KUBERNETES_PORT_443_TCP_PROTO, KUBERNETES_SERVICE_HOST, KUBERNETES_SERVICE_PORT, KUBERNETES_SERVICE_PORT_HTTPS, NAME, PATH, POD_IP, PWD, SHLVL, SKIP_CHOWN, SKIP_SETCAP, VAULT_ADDR, VAULT_AGENT_INJECTOR_SVC_PORT, VAULT_AGENT_INJECTOR_SVC_PORT_443_TCP, VAULT_AGENT_INJECTOR_SVC_PORT_443_TCP_ADDR, VAULT_AGENT_INJECTOR_SVC_PORT_443_TCP_PORT, VAULT_AGENT_INJECTOR_SVC_PORT_443_TCP_PROTO, VAULT_AGENT_INJECTOR_SVC_SERVICE_HOST, VAULT_AGENT_INJECTOR_SVC_SERVICE_PORT, VAULT_AGENT_INJECTOR_SVC_SERVICE_PORT_HTTPS, VAULT_API_ADDR, VAULT_CACERT, VAULT_CLUSTER_ADDR, VAULT_GCPKMS_KEY_NAME, VAULT_GCPKMS_KEY_RING, VAULT_GCPKMS_LOCATION, VAULT_GCPKMS_PROJECT_ID, VAULT_K8S_NAMESPACE, VAULT_K8S_POD_NAME, VAULT_PORT, VAULT_PORT_443_TCP, VAULT_PORT_443_TCP_ADDR, VAULT_PORT_443_TCP_PORT, VAULT_PORT_443_TCP_PROTO, VAULT_SEAL, VAULT_SERVICE_HOST, VAULT_SERVICE_PORT, VAULT_SERVICE_PORT_VAULT_PORT, VERSION
              Go Version: go1.22.8
              Listener 1: tcp (addr: "127.0.0.1:8200", cluster address: "127.0.0.1:8201", disable_request_limiter: "false", max_request_duration: "1m30s", max_request_size: "33554432", tls: "disabled")
              Listener 2: tcp (addr: "10.0.94.50:8200", cluster address: "10.0.94.50:8201", disable_request_limiter: "false", max_request_duration: "1m30s", max_request_size: "33554432", tls: "enabled")
               Log Level: trace
                   Mlock: supported: true, enabled: false
           Recovery Mode: false
                 Storage: raft (HA available)
                 Version: Vault v1.18.1, built 2024-10-29T14:21:31Z
             Version Sha: f479e5c85462477c9334564bc8f69531cdb03b65

==> Vault server started! Log data will stream in below:

2025-02-22T01:12:55.700Z [INFO]  proxy environment: http_proxy="" https_proxy="" no_proxy=""
2025-02-22T01:12:55.701Z [WARN]  storage.raft.fsm: raft FSM db file has wider permissions than needed: needed=-rw------- existing=-rw-rw----
2025-02-22T01:12:55.715Z [DEBUG] storage.raft.fsm: time to open database: elapsed=14.464576ms path=/vault/data/vault.db
2025-02-22T01:12:55.732Z [DEBUG] service_registration.kubernetes: "namespace": "default"
2025-02-22T01:12:55.732Z [DEBUG] service_registration.kubernetes: "pod_name": "vault-1"
2025-02-22T01:12:56.036Z [INFO]  incrementing seal generation: generation=1
2025-02-22T01:12:56.037Z [DEBUG] core: set config: sanitized config="{\"administrative_namespace_path\":\"\",\"api_addr\":\"\",\"cache_size\":0,\"cluster_addr\":\"https://vault-internal:8201\",\"cluster_cipher_suites\":\"\",\"cluster_name\":\"\",\"default_lease_ttl\":0,\"default_max_request_duration\":0,\"detect_deadlocks\":\"\",\"disable_cache\":false,\"disable_clustering\":false,\"disable_indexing\":false,\"disable_mlock\":true,\"disable_performance_standby\":false,\"disable_printable_check\":false,\"disable_sealwrap\":false,\"disable_sentinel_trace\":false,\"enable_response_header_hostname\":false,\"enable_response_header_raft_node_id\":false,\"enable_ui\":true,\"experiments\":null,\"imprecise_lease_role_tracking\":false,\"introspection_endpoint\":false,\"listeners\":[{\"config\":{\"address\":\"127.0.0.1:8200\",\"tls_disable\":1},\"type\":\"tcp\"},{\"config\":{\"address\":\"10.0.94.50:8200\",\"tls_cert_file\":\"/etc/vault/tls/vault.crt\",\"tls_disable_client_certs\":true,\"tls_key_file\":\"/etc/vault/tls/vault.key\"},\"type\":\"tcp\"}],\"log_format\":\"\",\"log_level\":\"trace\",\"log_requests_level\":\"\",\"max_lease_ttl\":0,\"pid_file\":\"\",\"plugin_directory\":\"\",\"plugin_file_permissions\":0,\"plugin_file_uid\":0,\"plugin_tmpdir\":\"\",\"raw_storage_endpoint\":false,\"seals\":[{\"disabled\":false,\"name\":\"gcpckms\",\"type\":\"gcpckms\"}],\"service_registration\":{\"type\":\"kubernetes\"},\"storage\":{\"cluster_addr\":\"\",\"disable_clustering\":false,\"raft\":{\"max_entry_size\":\"\"},\"redirect_addr\":\"\",\"type\":\"raft\"}}"
2025-02-22T01:12:56.037Z [DEBUG] storage.cache: creating LRU cache: size=0
2025-02-22T01:12:56.037Z [INFO]  core: Initializing version history cache for core
2025-02-22T01:12:56.037Z [INFO]  events: Starting event system
2025-02-22T01:12:56.040Z [DEBUG] cluster listener addresses synthesized: cluster_addresses=[127.0.0.1:8201, 10.0.94.50:8201]
2025-02-22T01:12:56.044Z [INFO]  core: stored unseal keys supported, attempting fetch
2025-02-22T01:12:56.044Z [WARN]  failed to unseal core: error="stored unseal keys are supported, but none were found"
2025-02-22T01:12:56.111Z [DEBUG] would have sent systemd notification (systemd not present): notification=READY=1
2025-02-22T01:13:01.045Z [INFO]  core: stored unseal keys supported, attempting fetch
2025-02-22T01:13:01.045Z [WARN]  failed to unseal core: error="stored unseal keys are supported, but none were found"
2025-02-22T01:13:04.976Z [INFO]  core: security barrier not initialized
2025-02-22T01:13:04.976Z [INFO]  core.autoseal: recovery seal configuration missing, but cannot check old path as core is sealed
2025-02-22T01:13:06.046Z [INFO]  core: stored unseal keys supported, attempting fetch
2025-02-22T01:13:06.046Z [WARN]  failed to unseal core: error="stored unseal keys are supported, but none were found"
2025-02-22T01:13:09.971Z [INFO]  core: security barrier not initialized
2025-02-22T01:13:09.971Z [INFO]  core.autoseal: recovery seal configuration missing, but cannot check old path as core is sealed
2025-02-22T01:13:11.046Z [INFO]  core: stored unseal keys supported, attempting fetch
2025-02-22T01:13:11.047Z [WARN]  failed to unseal core: error="stored unseal keys are supported, but none were found"
2025-02-22T01:13:14.974Z [INFO]  core: security barrier not initialized
2025-02-22T01:13:14.974Z [INFO]  core.autoseal: recovery seal configuration missing, but cannot check old path as core is sealed
2025-02-22T01:13:16.047Z [INFO]  core: stored unseal keys supported, attempting fetch
2025-02-22T01:13:16.048Z [WARN]  failed to unseal core: error="stored unseal keys are supported, but none were found"
2025-02-22T01:13:19.970Z [INFO]  core: security barrier not initialized
2025-02-22T01:13:19.970Z [INFO]  core.autoseal: recovery seal configuration missing, but cannot check old path as core is sealed
2025-02-22T01:13:21.048Z [INFO]  core: stored unseal keys supported, attempting fetch
2025-02-22T01:13:21.048Z [WARN]  failed to unseal core: error="stored unseal keys are supported, but none were found"
2025-02-22T01:13:24.989Z [INFO]  core: security barrier not initialized
2025-02-22T01:13:24.989Z [INFO]  core.autoseal: recovery seal configuration missing, but cannot check old path as core is sealed
2025-02-22T01:13:26.049Z [INFO]  core: stored unseal keys supported, attempting fetch
2025-02-22T01:13:26.049Z [WARN]  failed to unseal core: error="stored unseal keys are supported, but none were found"
2025-02-22T01:13:29.978Z [INFO]  core: security barrier not initialized
2025-02-22T01:13:29.978Z [INFO]  core.autoseal: recovery seal configuration missing, but cannot check old path as core is sealed
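
My reading of the loop above is that vault-1 never gets past being uninitialized: it never joins the Raft cluster, so there are no stored unseal keys for the gcpckms seal to fetch. Since the join failures looked TLS-related, the next thing I plan to verify is that the serving cert's SANs cover the service and per-pod DNS names, roughly like this (assuming the secret keys match the mounted filenames, and GNU base64/openssl locally):

# dump the SANs on the serving cert; I believe it needs to cover
# vault-internal plus the per-pod names like vault-0.vault-internal
kubectl get secret vault-tls -o jsonpath='{.data.vault\.crt}' | base64 -d \
  | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'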