Troubles running Vault HA in Nomad: "can't communicate with the active node..."

Hi
I am quite new to Vault and have spent several hours going through the docs, but I can’t get this working as expected.

I would like to run Vault as a workload in Nomad with a Consul backend (I already use Nomad and Consul). So I created two identical job files, one for vault-a and one for vault-b (see below).

The Nomad job service vault-a (and vault-b) gets registered in Consul by Nomad, and Traefik is used for HTTPS offloading and proxying. That works as expected: vault-a and vault-b are reachable at https://vault-a.apps.example.com and https://vault-b.apps.example.com respectively.

variable "datacenters" {
  type = list(string)
  default = ["dc1"]
}
variable "namespace" {
  type = string
  default = "default"
}
variable "host_network" {
  type = string
  default = ""
}
job "vault-a" {
  datacenters = var.datacenters
  namespace = var.namespace

  type = "service"

  group "vault" {
    count = 1

    network {
      mode = "host"
      port "tcp" {
        host_network = var.host_network
        static = 25081
      }
      port "cluster" {
        host_network = var.host_network
        static = 25082
      }
    }

    task "vault" {
      template {
        change_mode = "restart"
        destination = "local/config.hcl"
        data = <<EOH
ui = true
cluster_name = "my-cluster"

storage "consul" {
  address = "172.17.0.1:8500"
  path = "vault/"
}

service_registration "consul" {
  address = "172.17.0.1:8500"
}

listener "tcp" {
  address = "[::]:{{ env "NOMAD_PORT_tcp" }}"
  cluster_address  = "[::]:{{ env "NOMAD_PORT_cluster" }}"
  tls_disable = 1
}

api_addr = "https://vault-a.apps.example.com:443"
EOH
      }

      driver = "docker"

      config {
        image = "vault:1.9.2"
        # cap_add = ["IPC_LOCK"]
        privileged = true
        volumes = [
          "local/config.hcl:/etc/vault/config.hcl",
        ]
        args = [
          "server",
          "-config", "/etc/vault",
        ]

        ports = [
          "tcp",
          "cluster",
        ]
      }

      service {
        name = "vault-a"
        tags = [
          "traefik.enable=true",
          "traefik.http.routers.vault-a.rule=HostRegexp(`vault-a.{domain:.*}`)",
          "traefik.http.routers.vault-a.middlewares=vault-a-https",

          "traefik.http.middlewares.vault-a-https.redirectscheme.scheme=https",

          "traefik.http.routers.vault-a-https.rule=HostRegexp(`vault-a.{domain:.*}`)",
          "traefik.http.routers.vault-a-https.tls=true",
        ]
        port = "tcp"
      }
    }
  }
}

The issue I see is on the standby Vault node: “This is a standby Vault node but can’t communicate with the active node via request forwarding. Sign in at the active node to use the Vault UI.” I don’t see what I am doing wrong. I even switched to identical static ports on both jobs. I guess the question is: how does Vault find the other node? Because we already use Consul, my guess would be Consul, and I see this in Consul:

Logs:

==> Vault server configuration:

             Api Address: https://vault-a.apps.example.com:443
                     Cgo: disabled
         Cluster Address: https://vault-a.apps.example.com:444
              Go Version: go1.17.5
              Listener 1: tcp (addr: "[::]:25081", cluster address: "192.168.2.151:25082", max_request_duration: "1m30s", max_request_size: "33554432", tls: "disabled")
               Log Level: info
                   Mlock: supported: true, enabled: true
           Recovery Mode: false
                 Storage: consul (HA available)
                 Version: Vault v1.9.2
             Version Sha: f4c6d873e2767c0d6853b5d9ffc77b0d297bfbdf

==> Vault server started! Log data will stream in below:

I noticed that the cluster address looks different, with port 444?

Could you give me a hint?

Thanks in advance

Would you be able to provide your full Vault and Consul configs (redacting sensitive bits where necessary)? I’ve never worked with Nomad so I’m not sure what’s getting inherited where.

By default (when cluster_addr is not set), the cluster port is your API port number plus 1. So with api_addr on port 443, the cluster address ends up on port 444, which matches what your log shows.
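
Applied to the job file in the question, one way to avoid the derived port is to set cluster_addr explicitly in the rendered Vault config. The following is a minimal sketch, not a tested configuration. It assumes that Nomad's NOMAD_ADDR_cluster runtime variable (the host IP:port pair for the port labelled "cluster") resolves to an address the other Vault node can reach directly, without going through Traefik.

```hcl
# Fragment for the template's data block in the vault-a job (sketch).

# Address that clients are redirected to; this still goes through Traefik.
api_addr = "https://vault-a.apps.example.com:443"

# Address that other Vault nodes use for request forwarding.
# Assumption: NOMAD_ADDR_cluster renders as <host ip>:25082 and is
# reachable from the host running vault-b.
cluster_addr = "https://{{ env "NOMAD_ADDR_cluster" }}"
```

The vault-b job would need the same cluster_addr line alongside its own api_addr. Without an explicit cluster_addr, the standby tries to reach the active node on port 444 of the Traefik hostname, which nothing in the posted job appears to serve.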

Here’s an example Vault config (HTTPS and related components still need to be added)

cluster_name = "my_cluster.vault"
api_addr     = "https://192.168.1.20:8200"
cluster_addr = "https://192.168.1.20:8201"
ui           = true

storage "consul" {
  address = "server.vault.consul:8501" # This should match what's in your Consul Agent config, this side needs to include the Consul port
}

listener "tcp" {
  address         = "0.0.0.0:8200"
  cluster_address = "192.168.1.20:8201"
}

Also note that you need to configure a Consul agent on your Vault nodes. Something along these lines should work (you may need to account for HTTPS, certs, etc.).

server = false
retry_join = [] # Add your retry_join criteria here; should point to your Consul storage nodes

ports = {
  http     = -1
  https    = 8501
  serf_lan = 8301
  server   = 8300
  dns      = 8600
  serf_wan = -1
}

server_name = "server.vault.consul" # This should match what's in your Vault config
datacenter  = "vault"
data_dir    = "/opt/consul/data"
bind_addr   = "0.0.0.0"
client_addr = "0.0.0.0"

Make sure your Vault nodes can communicate with each other over your selected ports (API and Cluster ports - should allow ingress and egress). Likewise your Vault nodes must be able to communicate with your Consul nodes on the desired ports as well.

I’m not entirely sure how the Vault nodes find each other, but I suspect it’s either related to the cluster_name parameter or being joined to a common storage backend. I’d love to learn more about that as well.