Nomad client discovering incorrect IP address for server

I have two machines I’d like to connect as a Nomad cluster.

  • Both machines are connected to a Tailscale VPN. I would like to use the Tailscale interface for all interactions between the nodes, because I run services on them that I want to be accessible remotely via the VPN.
  • Both machines are running Consul and have discovered each other. Machine A is configured with server=true and bootstrap=true; Machine B is not. As far as I can tell, Consul is working just fine.
  • Both nodes are running Nomad. Machine A is configured as both a server and a client, and Machine B is configured only as a client.

From what I’ve read, this should be enough for the Nomad instances to discover each other and form a cluster with one server and two clients (with one machine acting in both roles).
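
As I understand it, that discovery goes through Nomad’s consul stanza; I don’t set it explicitly, so the documented defaults should apply. For reference, a sketch of those defaults (this is not something from my config):

consul {
  # Defaults (sketch): where to reach the local Consul agent, the service
  # names Nomad registers, and the auto-join behaviour for servers/clients.
  address             = "127.0.0.1:8500"
  server_service_name = "nomad"
  client_service_name = "nomad-client"
  auto_advertise      = true
  server_auto_join    = true
  client_auto_join    = true
}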

However, when Machine B’s Nomad instance tries to connect to Machine A, I see:

client.server_mgr: no servers available
client: registration waiting on servers
client.consul: bootstrap contacting Consul DCs: consul_dcs=["dc1"]
client: error discovering nomad servers:
error=
| 1 error occurred:
|         * address 192.168.1.129: missing port in address

The interesting thing here is that this IP address is the correct LAN address for Machine A, not that machine’s address on the Tailscale VPN. So the right machine is being discovered via Consul, but the wrong address is being used to connect to it.

In the Consul UI, all addresses shown are in the Tailscale VPN IP range.

Does anyone know why Nomad might use a different address for a machine than the one advertised by Consul?

To be clear - both nodes are configured via NixOS with:

bind_addr = "{{ GetInterfaceIP \"tailscale0\" }}";
advertise = {
  http = "{{ GetInterfaceIP \"tailscale0\" }}";
  rpc = "{{ GetInterfaceIP \"tailscale0\" }}";
  serf = "{{ GetInterfaceIP \"tailscale0\" }}";
};   
addresses = {
  http = "{{ GetInterfaceIP \"tailscale0\" }}";
  rpc = "{{ GetInterfaceIP \"tailscale0\" }}";
  serf = "{{ GetInterfaceIP \"tailscale0\" }}";
};
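
One thing I’m unsure about is whether Nomad is ending up with Consul’s node-level address (the catalog node Address, which comes from the Consul agent’s advertise address) rather than the service address shown in the UI. In case that turns out to be the relevant knob, here is a sketch of the Consul-side setting, assuming advertise_addr accepts the same go-sockaddr template syntax:

# Consul agent config (sketch): pin the address the agent advertises,
# and therefore the node Address in the catalog, to the Tailscale interface.
bind_addr      = "{{ GetInterfaceIP \"tailscale0\" }}"
advertise_addr = "{{ GetInterfaceIP \"tailscale0\" }}"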

Hi @alaroldai, can you post the complete Nomad agent configs? There are a few fields that can play into how Nomad discovers other agents. Also, are you starting the agents with the -dev flag (which imposes certain defaults)?
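
To be concrete, the fields I have in mind are the consul stanza’s auto-join flags, the advertise addresses, and any explicit join configuration; roughly something like this (a sketch, the values are only examples):

server {
  server_join {
    retry_join     = ["<server-address>"]   # placeholder; explicit join list, if configured
    retry_max      = 3
    retry_interval = "15s"
  }
}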

Sure - they both share this:


"addresses" = {
"http" = "{{ GetInterfaceIP \"tailscale0\" }}"

"rpc" = "{{ GetInterfaceIP \"tailscale0\" }}"

"serf" = "{{ GetInterfaceIP \"tailscale0\" }}"
}

"advertise" = {
"http" = "{{ GetInterfaceIP \"tailscale0\" }}"

"rpc" = "{{ GetInterfaceIP \"tailscale0\" }}"

"serf" = "{{ GetInterfaceIP \"tailscale0\" }}"
}

"bind_addr" = "{{ GetInterfaceIP \"tailscale0\" }}"

"client" = {
"enabled" = true

"host_network" "tailscale0" {
"interface" = "tailscale0"
}

"network_interface" = "tailscale0"
}

"data_dir" = "/var/lib/nomad"

"log_level" = "trace"

"plugin" "docker" "config" {
"allow_privileged" = true
}

"telemetry" = {
"collection_interval" = "1s"

"disable_hostname" = true

"prometheus_metrics" = true

"publish_allocation_metrics" = true

"publish_node_metrics" = true
}

Machine A (which should be the server) also has:


"server" = {
"bootstrap_expect" = 1

"enabled" = true
}

Machine A combines that with several host volume declarations, which I expect aren’t relevant.

Neither agent is being run with the -dev flag. The Nomad version is 1.3.7 and the Consul version is 1.14.0, in case those are relevant.
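
For what it’s worth, one way to take Consul discovery out of the picture while debugging seems to be pointing Machine B’s client directly at Machine A over Tailscale; a sketch (the address is a placeholder, 4647 is the default RPC port):

client {
  enabled = true
  servers = ["<machine-a-tailscale-ip>:4647"]
}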

@alaroldai did you solve your problem?