Is setting advertise_addr_wan required when configuring a Consul cluster to auto-join?

The Nomad auto-join demo here:

sets advertise_addr_wan to the PUBLIC IP of the EC2 instance here:

I’m still learning all of this, but my best understanding of the broader configuration is that Consul uses the WAN address in this scenario because the Consul/Nomad servers span multiple AZs and therefore multiple subnets. So, Consul needs the public IP in order to automatically form a cluster.
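For context, here’s roughly what I understand that part of the server config to look like (the IPs below are placeholders I made up, not values from the demo):

```hcl
# Sketch of the relevant Consul server settings from the demo (placeholder IPs).
bind_addr          = "10.0.1.10"     # the instance's private IP
advertise_addr     = "10.0.1.10"     # what peers inside the datacenter use
advertise_addr_wan = "203.0.113.10"  # the instance's PUBLIC IP, the setting in question
```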

I wanted to remove the public IP from these servers because the services they all run have no need for any inbound traffic from the outside world, and that seems more secure, but then (obviously) the Consul cluster formation immediately fell over.

My Question:
Is this public IP required in order for Consul to form a cluster in this type of (autojoining) setup? Or, in other words, is this public IP the tradeoff that allows autojoining across AZs to work?

Sub Question
If I did away with this autojoin luxury, is it possible to have Consul form a cluster across multiple AZs without the public IP somehow? Or is that always required?

Hi @josh.m.sharpe. Thanks for using Nomad!

Based on this Consul link, it looks like it doesn’t necessarily have to be a public IP.

Also, here’s an article on configuring networking for Consul WAN Federation. I’d recommend taking a look here to see if your network configuration meets all the requirements.

Guessing this will show my lack of understanding of routing, but…

That network configuration info says: “all server nodes must be able to talk to each other” - OK, I can ping the private IPs across subnets - so they can… “talk”.

That section then goes on to suggest that setting bind_addr to a private IP prevents the RPC server from accepting connections across the WAN. But why does it attempt to connect over the WAN to begin with? Is there some way I can have it attempt to connect over the private network, which is seemingly already routable?
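For reference, this is roughly the private-only server config I’m hoping to end up with (a guess on my part; the IPs and tag key/value are placeholders):

```hcl
# Sketch of a private-only Consul server config (placeholder values).
server           = true
bootstrap_expect = 3

bind_addr      = "10.0.1.10"   # private IP, routable from the other subnets
advertise_addr = "10.0.1.10"   # no advertise_addr_wan at all

# Cloud auto-join; as far as I can tell, the AWS provider returns private IPs by default.
retry_join = ["provider=aws tag_key=consul-cluster tag_value=my-cluster"]
```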

I’d also note that, at this point, I have no interest in configuring Consul across multiple datacenters, which is what that article seems to be intended for.

Our architecture will exist in a single AWS region, across a few availability zones in that region - I don’t think(?) that counts as multiple ‘datacenters’ - but please correct me if I’m wrong.

Hi @josh.m.sharpe,

Are you still working on this, or have you got it worked out? In case you are still working through this issue, here are a few things for you.

First, from the Consul glossary:

“We define a datacenter to be a networking environment that is private, low latency, and high bandwidth. This excludes communication that would traverse the public internet, but for our purposes multiple availability zones within a single EC2 region would be considered part of a single datacenter.”

From that definition, it seems that you likely don’t have multiple datacenters. Can you confirm that you only have one set of servers?

Second, this repo has some sample code and guidance that you might want to reference and compare to your configuration.

Third, do you have any logs you can share showing what you see after you remove advertise_addr_wan? Also, if you could provide your full server & client configs with secrets removed, that would be really helpful.

Fourth, what version of Consul & Nomad are you using? I ask because the retry_join_ec2 option has been deprecated for a while now. Here is a link to the current Consul on EC2 documentation. It’s probably worth looking at that and making sure your config matches those requirements.
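For reference, the deprecated option looked roughly like this, and the retry_join cloud auto-join syntax replaces it (the tag key/value below are just placeholders):

```hcl
# Deprecated form (roughly): a dedicated retry_join_ec2 block
retry_join_ec2 {
  tag_key   = "consul-cluster"
  tag_value = "my-cluster"
}

# Current form: cloud auto-join via retry_join with a provider string
retry_join = ["provider=aws tag_key=consul-cluster tag_value=my-cluster"]
```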

Finally, I would say it is worth cross-posting this in the Consul discuss forum. With the information I have right now, it’s difficult to tell whether the issue is on the Consul or the Nomad side, but it’s possible the Consul folks will see an immediate fix.

Thanks,

@DerekStrickland and the Nomad Team

I did get this worked out, but I had to more or less completely reinvent the networking in order to do so.

Our deployment is in a single region across multiple availability zones, much like the repo I referenced. However, it made no sense to me that any of the Nomad servers or clients would have public IPs - security group or not, they don’t need them. Honestly, that was the fundamental issue for me. It threw me way off course seeing these definitely-should-be-internal servers sitting there with public IPs.

Nomad/Consul versions are the latest as of this writing.

In order to remove the public IPs, I had to do at least these things:

  1. Set map_public_ip_on_launch=false on the subnets where the Consul/Nomad servers lived
  2. Add ‘aws_nat_gateways’ so the Nomad clients can still reach the internet (rough Terraform sketch of these two steps below)
  3. Rework the Consul setup scripts that referenced the public IP so they no longer do, which eliminated advertise_addr_wan
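Here’s a rough Terraform sketch of steps 1 and 2 (the resource names, CIDRs, and the referenced VPC/subnet/route table are placeholders, not copied from the repo):

```hcl
# Step 1: stop auto-assigning public IPs in the server/client subnets.
resource "aws_subnet" "nomad_private" {
  vpc_id                  = aws_vpc.main.id
  cidr_block              = "10.0.1.0/24"
  map_public_ip_on_launch = false
}

# Step 2: a NAT gateway (in a public subnet) so the private instances can still reach the internet.
resource "aws_eip" "nat" {
  domain = "vpc"
}

resource "aws_nat_gateway" "nomad" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public.id
}

# Route the private subnet's outbound traffic through the NAT gateway.
resource "aws_route" "private_outbound" {
  route_table_id         = aws_route_table.private.id
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = aws_nat_gateway.nomad.id
}
```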

And yeah, FWIW, based on what I know now, this was 100% an AWS/Consul configuration issue, not Nomad.

Wow. That sounds like it was quite the adventure. Thanks for sharing though! Hopefully, someone else in the community will benefit from your efforts.

Cheers!

@DerekStrickland and the Nomad Team