Nomad client_auto_join is selecting the incorrect datacenter

We currently have clusters in 2 physical locations, “hq2” and “centralus”. We have Consul servers configured in both with WAN federation, as well as Nomad configured in both with multi-region federation.

In Consul, the DCs are hq2 and centralus.
In Nomad, the regions are hq2 and centralus, and the datacenters have the same names.

We want to upgrade our entire centralus setup, so we built a new Consul DC called centralus-2 and a new Nomad DC called centralus-2, leaving it in the centralus region.

Consul starts fine and shows the hq2, centralus, and centralus-2 DCs. On the Nomad side, we left the servers offline and attempted to add a client node with client_auto_join. The region was set to centralus and the DC to centralus-2. When the client started, it ignored the Nomad servers in the centralus region and attempted to connect to the hq2 servers, which resulted in TLS failures because the cert name didn't match expectations. If I change the Consul client on this new node to be in the centralus DC instead of the centralus-2 DC, the Nomad auto-join succeeds.
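For reference, the relevant part of the new client's Nomad config looks roughly like the sketch below. Only the region, datacenter, and client_auto_join values come from the setup described above; everything else is omitted, and the layout is an assumption about a typical agent config file.

```hcl
# Sketch of the new Nomad client's agent config (not the full file).
region     = "centralus"    # existing Nomad region
datacenter = "centralus-2"  # new Nomad DC inside that region

client {
  enabled = true
}

consul {
  # Discover Nomad servers through the local Consul agent.
  # The local Consul agent on this node is in the centralus-2 Consul DC.
  client_auto_join = true
}
```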

My understanding is that Nomad clients are supposed to first attempt to connect to Nomad servers in the same Nomad region, but based on the above scenario, the choice seems to depend more on the Consul DC. The newly configured centralus-2 Consul DC has no Nomad services registered, so it looks like the first Nomad client in the Nomad centralus-2 DC is reaching out to other Consul DCs and picking up servers from the wrong Nomad region as a result.

Does anyone have an idea what is going on here?

Maybe try to disable the Consul server auto join to prevent it from getting confused.

I did just figure this out yesterday. I feel like Nomad should be a bit smarter about it, but what appears to be happening is that Nomad searches Consul for the server_service_name, picks one entry, and attempts to join it. I wasn’t setting server_service_name on any of the nodes, so they were all registering under the default name, ‘nomad’, and the client was grabbing a service entry from the wrong region's datacenter.

The solution was to define a different server_service_name for each Nomad region, which allows auto-join to work as expected.
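A minimal sketch of what that looks like in the agent config for the centralus region. The service name "nomad-centralus" is a hypothetical example, not the name I actually used; the point is only that each region gets its own value. As far as I can tell, clients look up the same server_service_name that the servers register under, so it needs to match on both servers and clients within a region.

```hcl
# Sketch: Nomad agents (servers and clients) in the centralus region.
region = "centralus"

consul {
  # Hypothetical per-region name; the default is "nomad".
  server_service_name = "nomad-centralus"

  server_auto_join = true
  client_auto_join = true
}
```

Agents in the hq2 region would get their own distinct value in the same way (for example "nomad-hq2"), so a client in one region never finds the other region's servers under its lookup name.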
