Issues joining raft storage when using -tls-server-name

I am having an issue where I can’t get nodes to join the raft cluster when setting the -tls-server-name flag.

We are trying to use a wildcard cert from Let's Encrypt.

I have tried setting leader_tls_servername in env vars (with extraEnvironmentVars: in the values.yaml), in the retry_join stanza, and at the command line.

I receive failures as if the flag wasn’t set.

vault operator raft join \
    -tls-server-name=*.mgmt-vault.example.com \
    https://vault-0.vault-internal:8200

core: failed to get raft challenge: leader_addr=https://vault-0.vault-internal:8200 error="error during raft bootstrap init call: Put \"https://vault-0.vault-internal:8200/v1/sys/storage/raft/bootstrap/challenge\": x509: certificate is valid for *.mgmt-vault.example.com, not vault-0.vault-internal"

Our Helm values file is below.
Thank you for your time.

# Vault Helm Chart Value Overrides
global:
  enabled: true
  tlsDisable: false

injector:
  enabled: true

server:
  image:
    repository: "hashicorp/vault"
    tag: "latest"
  readinessProbe:
    enabled: true
    port: 8200
    failureThreshold: 2
    initialDelaySeconds: 5
    periodSeconds: 5
    successThreshold: 1
    timeoutSeconds: 3
  livenessProbe:
    enabled: false
    path: "/v1/sys/health?standbyok=true"
    port: 8200
    failureThreshold: 2
    initialDelaySeconds: 60
    periodSeconds: 5
    successThreshold: 1
    timeoutSeconds: 3
  extraEnvironmentVars:
    VAULT_TLS_SERVER_NAME: "*.mgmt-vault.example.com"
    VAULT_ADDR: "https://localhost:8200"
  volumes:
    - name: userconfig-mgmt-vault-tls
      secret:
        defaultMode: 420
        secretName: mgmt-vault-tls
  volumeMounts:
    - mountPath: /vault/userconfig/mgmt-vault-tls
      name: userconfig-mgmt-vault-tls
      readOnly: true
  auditStorage:
    enabled: false
  certs:
    secretName: mgmt-vault-tls
  standalone:
    enabled: false
  service:
    enabled: true
    active:
      enabled: true
    standby:
      enabled: true
    instanceSelector:
      enabled: true
    publishNotReadyAddresses: true
    externalTrafficPolicy: Local
    port: 8200
    targetPort: 8200
    annotations: {}
  ha:
    enabled: true
    replicas: 3
    raft:
      enabled: true
      setNodeId: true
      config: |
        ui = true
        seal "awskms" {
          region     = "us-east-1"
          kms_key_id = "alias/vault-kms-unseal-hive-mgmt"
        }
        listener "tcp" {
          address = "0.0.0.0:8200"
          cluster_address = "0.0.0.0:8201"
          tls_cert_file = "/vault/userconfig/mgmt-vault-tls/tls.crt"
          tls_key_file  = "/vault/userconfig/mgmt-vault-tls/tls.key"
        }
        storage "raft" {
          path = "/vault/data"
          retry_join {
            address = "https://localhost:8200"
            leader_tls_servername = "*.mgmt-vault.example.com"
            leader_api_addr = "https://vault-0.vault-internal:8200"
          }
          retry_join {
            address = "https://localhost:8200"
            leader_tls_servername = "*.mgmt-vault.example.com"
            leader_api_addr = "https://vault-1.vault-internal:8200"
          }
          retry_join {
            address = "https://localhost:8200"
            leader_tls_servername = "*.mgmt-vault.example.com"
            leader_api_addr = "https://vault-2.vault-internal:8200"
          }
        }
        disable_mlock = true
        service_registration "kubernetes" {}
  serviceAccount:
    create: false
    name: "vault-kms-iam-role"
    serviceDiscovery:
      enabled: true
# Vault UI
ui:
  enabled: true
  serviceType: "LoadBalancer"
  serviceNodePort: null
  externalPort: 8200
  externalTrafficPolicy: Local
  activeVaultPodOnly: true

  # For Added Security, edit the below
  #loadBalancerSourceRanges:
  #   - < Your IP RANGE Ex. 10.0.0.0/16 >
  #   - < YOUR SINGLE IP Ex. 1.78.23.3/32 >

You have not specified the version of Vault in use. There are changelog entries mentioning past bug fixes related to leader_tls_servername.

Setting leader_tls_servername in env vars seems wrong: it is not a valid environment variable name for Vault. However, in the values.yaml content you shared, that is not what you actually set, so maybe that is not a problem.

Passing -tls-server-name at the command line would not work: the flag specifies the TLS server name the Vault CLI should expect when it sends the join instruction to the new Vault server, not the TLS server name the new Vault server should expect when it reaches out to find a leader.

It is not related to your problem, but tag: "latest" is a bad configuration: you are exposing yourself to unplanned upgrades to arbitrary newer Vault versions in the future. You absolutely must not use latest here.
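As a sketch of pinning the version, the tag can be set explicitly at upgrade time. The release name vault, the chart reference hashicorp/vault, the values.yaml path, and the function name pin_vault_image are all assumptions about your setup, and 1.13.1 is only an example version.

```shell
#!/bin/sh
# Sketch: pin the Vault image tag instead of relying on "latest".
# Assumptions (not from the thread): release name "vault",
# chart "hashicorp/vault", values file "values.yaml".
pin_vault_image() {
  tag="$1"   # exact Vault version to run, for example 1.13.1
  helm upgrade vault hashicorp/vault \
    --values values.yaml \
    --set-string server.image.tag="$tag"
}
```

Calling pin_vault_image 1.13.1 would then run the upgrade with that tag. Replacing "latest" with the same version string under server.image.tag in values.yaml should have the same effect and keeps the pin in version control.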

I am uncertain whether the VAULT_TLS_SERVER_NAME environment variable, which is mostly used by Vault client code, would affect server-to-server communication.

address is not a valid key to have in a retry_join block.
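For illustration, here is a minimal sketch that renders retry_join blocks containing only leader_api_addr and leader_tls_servername, the two keys from your config that belong there. The function name render_retry_join is hypothetical; the hostnames and server name are copied from your values file.

```shell
#!/bin/sh
# Sketch: print one retry_join block per expected peer, without the
# invalid "address" key. Values are taken from the posted values.yaml.
LEADER_TLS_SERVERNAME="*.mgmt-vault.example.com"

render_retry_join() {
  replicas="$1"   # number of Vault pods (3 in the posted config)
  i=0
  while [ "$i" -lt "$replicas" ]; do
    printf 'retry_join {\n'
    printf '  leader_api_addr       = "https://vault-%s.vault-internal:8200"\n' "$i"
    printf '  leader_tls_servername = "%s"\n' "$LEADER_TLS_SERVERNAME"
    printf '}\n'
    i=$((i + 1))
  done
}
```

Running render_retry_join 3 prints three blocks that could be pasted inside the storage "raft" stanza in place of the current ones.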

Thanks for taking the time

To your points:
We are testing this on 1.13.1/latest. I have made the suggested change from latest to an exact version. The other examples of misconfiguration have been cleaned up.
Thank you for better explaining those.

  extraEnvironmentVars:
    VAULT_TLS_SERVER_NAME: "*.mgmt-vault.example.com"

This does work for the client and I had added it in hopes it would also work server side. It didn’t.

What you explain here is what I am experiencing. My question is, how do I set the TLS server name the new Vault server should expect when it reaches out to find a leader?

I had initially assumed it was leader_tls_servername = "*.mgmt-vault.example.com" in the retry_join section.

With the retry_join sections, I expect not to need to run the join command manually.

It does look like the configuration in your retry_join blocks is correct to me, so it is weird that it is not taking effect.

It might even be worth opening a GitHub issue to report that, as it seems like it might be a bug.

If you need to trigger the join some other way than the configuration file, it appears the join HTTP API does support setting the server name (https://developer.hashicorp.com/vault/api-docs/system/storage/raft#join-a-raft-cluster), but that feature has not been exposed in the vault operator raft join CLI command. You could open a separate GitHub issue about that feature gap as well, if you felt like it.
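A rough sketch of calling that endpoint with curl follows. It assumes the request field is named leader_tls_servername, mirroring the retry_join key (please confirm against the linked API docs). NEW_NODE_ADDR is a hypothetical address for the node being joined and has to be a name its certificate covers. The function names are made up for this example, and I have not verified that this avoids the behaviour you are seeing.

```shell
#!/bin/sh
# Sketch: ask a new node to join the raft cluster through the HTTP API.
# NEW_NODE_ADDR is hypothetical; the other values come from the thread.
NEW_NODE_ADDR="https://vault-1.mgmt-vault.example.com:8200"
LEADER_API_ADDR="https://vault-0.vault-internal:8200"
LEADER_TLS_SERVERNAME="*.mgmt-vault.example.com"

# Build the JSON request body.
# Assumption: the API accepts a "leader_tls_servername" field.
join_payload() {
  printf '{"leader_api_addr":"%s","leader_tls_servername":"%s","retry":true}\n' \
    "$LEADER_API_ADDR" "$LEADER_TLS_SERVERNAME"
}

# Send the join request to the NEW node, which then contacts the leader.
raft_join() {
  join_payload | curl --silent --show-error --fail \
    --request POST \
    --header 'Content-Type: application/json' \
    --data @- \
    "$NEW_NODE_ADDR/v1/sys/storage/raft/join"
}
```

Running join_payload on its own prints the body so it can be checked before raft_join sends anything.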