We currently have clusters in two physical locations, “hq2” and “centralus”. Consul servers run in both with WAN federation, and Nomad runs in both with multi-region federation.
In Consul, the DCs are hq2 and centralus.
In Nomad, the regions are hq2 and centralus, and the datacenters have the same names.
We want to upgrade our entire centralus setup, so we built a new Consul DC called centralus-2 and a new Nomad DC called centralus-2, leaving it in the centralus region.
Consul starts fine and shows the hq2, centralus, and centralus-2 DCs. On the Nomad side we left the servers offline and attempted to add a client node with client_auto_join. The region was set to centralus and the DC to centralus-2. When the client started, it ignored the Nomad servers in the centralus region and attempted to connect to the hq2 servers, which resulted in TLS failures because the cert name didn't match expectations. If I change the Consul client on this new node to be in the centralus DC instead of the centralus-2 DC, the Nomad auto-join succeeds.
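For reference, the Nomad client configuration on the new node looks roughly like the sketch below. The region, datacenter, and client_auto_join values are the ones described above; the Consul address is an assumed default for a local agent, and TLS settings are omitted.

```hcl
# Sketch of the Nomad client agent config on the new node.
# region/datacenter values come from the description above.
region     = "centralus"
datacenter = "centralus-2"

client {
  enabled = true
}

consul {
  # Assumed: local Consul agent on the default port. In the failing case
  # this agent is a member of the Consul centralus-2 DC.
  address          = "127.0.0.1:8500"
  client_auto_join = true
}
```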
My understanding is that Nomad clients are supposed to first attempt to connect to Nomad servers in the same Nomad region, but based on the above scenario, server discovery seems to be driven by the Consul DC rather than the Nomad region. The newly configured centralus-2 Consul DC has no Nomad services registered, so it seems like the first Nomad client in the Nomad centralus-2 DC is reaching out to other Consul DCs and picking up servers from the wrong Nomad region as a result.
Does anyone have an idea what is going on here?