Can't use fully qualified domain names with manual clustering?

I set up three servers, each with its own FQDN on a different name server.

I couldn’t get the cluster to finish its vote until I switched to using IP addresses in the advertise{} and retry_join parameters.

It looked like one of the domain names was somehow resolving to a loopback address (127.0.0.1) instead of its public IP, in both that server’s logs and its peers’ logs. This seems like a typo-level bug: resolving a domain name is relatively easy.

But maybe this behavior is expected?

I can definitely reproduce this and provide whatever logs you want.
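For reference, here is a minimal sketch of the IP-based variant that did finish the vote (203.0.113.10 is a placeholder, not the real address):

server_join {
	retry_join = ["203.0.113.10:4648"]
}

advertise {
	http = "203.0.113.10:4646"
	rpc = "203.0.113.10:4647"
	serf = "203.0.113.10:4648"
}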

Hi @ayjayt! It would be helpful if we could see what the Nomad configuration files look like for this.

Okay @tgross

Example one

log_level = "DEBUG"
data_dir = "/tmp/nomadserver"

datacenter = "linode"
leave_on_terminate = true

server {
	enabled = true
	bootstrap_expect = 3
	server_join {
		retry_join = ["example.com:4648"]
	}
}

advertise {
	http = "example.com:4646"
	rpc = "example.com:4647"
	serf = "example.com:4648"
}

Example two is 99% the same; only the advertise addresses change.

log_level = "DEBUG"
data_dir = "/tmp/nomadserver"

datacenter = "aws"
leave_on_terminate = true

server {
	enabled = true
	bootstrap_expect = 3
	server_join {
		retry_join = ["example.com:4648"]
	}
}

advertise {
	http = "example2.com:4646"
	rpc = "example2.com:4647"
	serf = "example2.com:4648"
}

Example 3 follows the same pattern; its advertise block just uses example3.com.
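In other words, the third advertise block is just:

advertise {
	http = "example3.com:4646"
	rpc = "example3.com:4647"
	serf = "example3.com:4648"
}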

In example1, you’ve got the retry_join and the advertise addresses the same. Is that a redaction error? It would probably work anyway, because the other two servers will reach out to it, but I just wanted to clear that up.

This seems like it should be working. Do these CNAMEs resolve to the public-facing IP addresses on all 3 nodes? That is, none of them has split-horizon DNS where it might be getting a local address and then advertising that via serf to the other nodes?

example1 is configured as written, and it does work. I’m using this as a proof of concept and will migrate to Consul when there’s real work to be done. The docs for manual clustering and retry_join don’t say whether you should list all servers, or what specifically you should put for the first bootstrapped server, but it works as is.
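If listing every server is the intended pattern, I assume the server_join block would look something like this, though I haven’t tried it:

server_join {
	retry_join = ["example.com:4648", "example2.com:4648", "example3.com:4648"]
}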

The domains have no CNAME because they don’t redirect; they all have A records that resolve directly to their IPs. I just double-checked that.

The ifconfig output on each machine is a bit different: most have local IPs, one doesn’t; most have a network interface for their static IP, one doesn’t. I wouldn’t necessarily expect that to affect how a domain resolves, but technically it could.

I’m guessing you’ll want logs. I’m going to feel bad if this is my error or if I can’t reproduce it.

I agree with you there, but this is definitely an odd situation! What I’d be looking out for here is something like dnsmasq on one host configured to be “smart” and resolving that host’s FQDN to a localhost IP, combined with Nomad doing something “smart” and resolving the names to IPs before sending them into the serf memberlist.
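If something like that is going on, one workaround worth trying is to take the resolver out of the loop and advertise an interface IP directly. If I remember right, the advertise values accept go-sockaddr templates, so something like this should work (eth0 is an assumption here; substitute whatever the public-facing interface actually is):

advertise {
	# eth0 is an assumed interface name; use the actual public-facing interface
	http = "{{ GetInterfaceIP \"eth0\" }}:4646"
	rpc = "{{ GetInterfaceIP \"eth0\" }}:4647"
	serf = "{{ GetInterfaceIP \"eth0\" }}:4648"
}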

I’m guessing you’ll want logs.

That would be great!

I’m going to feel bad if this is my error or if I can’t reproduce it.

No worries. Even if we can’t reproduce it, this points to either documentation issues or perhaps an edge case we weren’t considering.

This issue has been moved to tech debt for the moment, but I will come back to it once I have time for housekeeping.