"Datacenter" on the WAN. Am I subverting Consul?

Hello,

These are very early steps for me with Consul, mind.

So, I’ve spun up a “datacenter” cluster of 5 server agents. I’ve deliberately disabled the serf_wan port (set to -1), and I’m using retry_join. All the agents have regular/public IPv4 addressing (although not Internet-routable), and they span 3 different continents, with latencies between them ranging from 40 to 220ms depending on the source and destination locations.
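For reference, each server runs with roughly this config (the addresses, datacenter name and paths here are placeholders, not my real values):

```hcl
# Rough sketch of one server's config (placeholder values).
server           = true
bootstrap_expect = 5
datacenter       = "global"
data_dir         = "/opt/consul"

# Serf WAN gossip disabled on purpose.
ports {
  serf_wan = -1
}

# The other servers, spread across the three continents.
retry_join = ["10.0.1.10", "10.1.1.10", "10.2.1.10"]
```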

They join, and the UI shows them as expected. I added a key to the KV store, and all nodes respond with the value. I bring two nodes down, and the DNS service records update accordingly. A Vault cluster is also being monitored. I have no standalone agents yet, but so far it looks good!
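My smoke tests were along these lines (the key and service names are just examples):

```shell
# Write a key on one node, then read it back from the others.
consul kv put test/key "hello"
consul kv get test/key

# Query service SRV records via Consul DNS (default port 8600),
# before and after taking nodes down.
dig @127.0.0.1 -p 8600 consul.service.consul SRV
dig @127.0.0.1 -p 8600 vault.service.consul SRV
```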

But, am I subverting Consul?! I’m asking because all the documentation tells me this shouldn’t happen: that all the nodes should live in the same subnet, and that at most WAN federation can be used via Serf WAN, which can sync service status but will not sync the KV store.

But so far I seem to be getting away with it, so I wonder: what internals in Consul would actually work against this kind of deployment of a “datacenter” in a global WAN environment? What could go wrong here?

Your considerations would be very much appreciated.

Thanks!

One of the main reasons that isn’t recommended is latency, which can lead to leadership churn. Full details can be viewed at this link. Are you using Consul Enterprise or Open Source?

For ease of reference, this seems like the most relevant section from that link for your question.

The value of `raft_multiplier` is a scaling factor and directly affects the following parameters:

| Param | Default value |
| --- | --- |
| HeartbeatTimeout | 1000ms |
| ElectionTimeout | 1000ms |
| LeaderLeaseTimeout | 500ms |

By default, Consul uses a scaling factor of 5 (i.e. `raft_multiplier: 5`), which results in the following values:

| Param | Value | Calculation |
| --- | --- | --- |
| HeartbeatTimeout | 5000ms | 5 × 1000ms |
| ElectionTimeout | 5000ms | 5 × 1000ms |
| LeaderLeaseTimeout | 2500ms | 5 × 500ms |

**NOTE:** Wide networks with more latency will perform better with larger values of `raft_multiplier`.

The trade-off is between leader stability and time to recover from an actual leader failure. A short multiplier minimizes failure detection and election time but may be triggered frequently in high-latency situations. This can cause constant leadership churn and associated unavailability. A high multiplier reduces the chances that spurious failures will cause leadership churn, but it does this at the expense of taking longer to detect real failures, and thus taking longer to restore cluster availability.
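If you do end up needing to loosen those timings for this topology, it’s set under the server’s `performance` stanza; a minimal sketch (the value here is purely illustrative, not a recommendation):

```hcl
performance {
  # Scaling factor for the Raft timings listed above (valid range 1-10).
  # Larger values tolerate more latency but slow real failure detection.
  raft_multiplier = 8
}
```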

Hi. Thanks for jumping in @DerekStrickland.

I’m using the open-source version. Is there any particular limitation of the OSS version versus Enterprise that you see as relevant for this type of scenario?

I did look into that setting, as I also reviewed it for Vault using integrated Raft, and thought the default was good enough. There may be situations where local network hiccups cause delays beyond these timeouts, but I don’t foresee that happening in more than two locations at the same time (hence the 5 nodes). Let’s see how it behaves. I’ll also put monitoring in place for the telemetry leadership metrics.
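For the monitoring, something along these lines should do (default agent address/port assumed; the metric names come from Consul’s telemetry docs):

```shell
# Pull the local agent's metrics and keep only the Raft ones,
# e.g. consul.raft.leader.lastContact and consul.raft.state.candidate.
curl -s http://127.0.0.1:8500/v1/agent/metrics \
  | jq '[.Samples[], .Counters[]] | map(select(.Name | contains("raft")))'
```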

One thing I was wondering too was whether Consul has hard-coded settings based on the type of subnet of the IPv4 addressing. In other words, whether it would handle things differently depending on whether the agent address is “IANA private” or not.

Regards.

No limitations that I am aware of. I was just thinking that, if you were running Enterprise, you should check with your TAM to see whether that configuration is supported. For now, I’d just monitor the leadership elections, which you are already planning to do.

Regarding public IPs, this thread indicates it’s allowable but would require you to bind to that IP address explicitly.
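That is, something like this in each server’s config (the address is a placeholder; each node would use its own):

```hcl
# Bind and advertise the node's routable address explicitly,
# rather than letting Consul pick an interface.
bind_addr      = "203.0.113.10"
advertise_addr = "203.0.113.10"
```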


“It does not differentiate between public or private addresses”. Yep, that’s it 🙂

Thanks a lot Derek, you were most helpful!

Kind regards