Minimal HA cluster for Nomad

What is the minimum number of nodes needed to run a Nomad cluster in HA with Consul and Vault, including two Nomad worker nodes?

The reference architecture for Nomad describes the following setup:

[diagram: Nomad reference architecture]

My goal is to have a system where one server can go down and all other services continue running on a second server. A small downtime window (5 minutes) would be acceptable as long as no manual intervention is required to get the workloads running again, so ideally I would need only two servers. From what I understand, however, Consul, Nomad, and Vault require a “control plane” with at least 3 servers to guarantee failover. Would a 3-node cluster be able to handle Nomad workloads by making these servers also Nomad clients, as in Scenario 1 (similar to running in -dev mode on a single server)?
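In other words, something like this on each of the three nodes (just a sketch of what I mean, with placeholder values, not a tested config):

```hcl
# nomad.hcl on each of the three nodes in Scenario 1: one agent acts as both
# server and client, so the same machines also schedule the workloads.
datacenter = "dc1"
data_dir   = "/opt/nomad/data"

server {
  enabled          = true
  bootstrap_expect = 3
}

client {
  enabled = true
}
```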

In general, I am now wondering whether I can run the Consul and Nomad clusters on the same nodes to save computing resources for a small cluster: do Nomad and Consul server nodes need dedicated servers (as seen in the reference architecture), or can I run them on the same servers? If running on the same servers is OK, do I need to run Consul clients on them as well for Nomad to work (is this possible at all?), or is it enough to only run Consul agents in server mode on these hosts? The setup (Scenario 2) would then look like this:

[diagram: Scenario 2 — three combined server nodes plus dedicated client nodes]
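Concretely, on each of the three combined hosts I imagine configs roughly like this (a sketch only; the datacenter name, IPs, and paths are placeholders I made up):

```hcl
# consul.hcl — Consul agent in server mode. My assumption is that this same
# agent also serves the local HTTP API on 127.0.0.1:8500, so Nomad on this
# host would not need a separate Consul client agent.
datacenter       = "dc1"
data_dir         = "/opt/consul/data"
server           = true
bootstrap_expect = 3
retry_join       = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # placeholder IPs

# nomad.hcl — Nomad agent in server mode on the same host.
datacenter = "dc1"
data_dir   = "/opt/nomad/data"

server {
  enabled          = true
  bootstrap_expect = 3
}

consul {
  address = "127.0.0.1:8500"  # talk to the local Consul agent
}
```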

Looking forward to the simplest possible HA setup :slight_smile:


Hi @davosian1! The reference architecture is what you need if you’ve got a “serious” production cluster, and if you were to become a Nomad Enterprise customer, our support folks would definitely push hard for it.

If you run the Consul and Vault servers on the same hosts as the Nomad servers, you’re likely to see a lot of performance-related glitches. For example, if you run a Nomad job that registers a service in Consul and needs a dynamic secret from Vault, that’s going to create a spike in disk IO because all three will be writing to disk in a very short window. A spike in CPU or disk IO can cause delays in applying Raft messages, which can in turn lead to Raft timeouts and leadership election flapping. So I would not recommend this for most folks, especially because unfortunately a lot of organizations have poor monitoring.

That being said…

  • if you have good monitoring of your hosts’ CPU, memory, and disk, and
  • if you’re not trying to run on “burstable” VMs that get throttled, and
  • if your cluster isn’t too busy

… then you may be able to get away with the Scenario 2 topology. I’ve personally done that for small production clusters (say ~10 client nodes and ~30-40 allocations) without too much trouble. Just recognize that it’s not the “happy path”, and if you end up having to open a GitHub issue to report mysterious performance problems, that’ll be what we’ll tell you :grinning:
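As a rough sketch of what “good monitoring” means in config terms, these are the kinds of telemetry stanzas I’d enable so that CPU and disk spikes actually show up in your dashboards (the intervals and retention values below are just examples, not recommendations):

```hcl
# nomad.hcl — expose metrics in Prometheus format and publish per-node and
# per-allocation metrics so you can see scheduling and resource pressure.
telemetry {
  collection_interval        = "10s"
  prometheus_metrics         = true
  publish_allocation_metrics = true
  publish_node_metrics       = true
}

# consul.hcl — the equivalent on the Consul side.
telemetry {
  prometheus_retention_time = "60s"
  disable_hostname          = true
}

# vault.hcl — and on the Vault side.
telemetry {
  prometheus_retention_time = "30s"
  disable_hostname          = true
}
```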

As an aside, the Nomad team is working on putting together some more prescriptive advice for small clusters (or even standalone single-node “clusters”), so having examples from folks where this works out would be great. Let us know how this goes!


Oh, and I probably wouldn’t recommend Scenario 1 unless you have a very small workload. The Nomad, Consul, and Vault servers use enough resources that you won’t have a lot of room left on the host to run the actual workloads. You could maybe go with a beefier instance, but in that case it’d probably be simpler to split out the client nodes anyway. There are a couple of other advantages of Scenario 2 over Scenario 1:

  • You don’t need to run the Nomad servers as root.
  • You don’t need to fiddle with the client reserved configuration to account for the servers (see the sketch below).
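To illustrate that second point: if you did colocate clients and servers (Scenario 1), you’d want something like the following in the client config, sized for your hardware (the numbers below are made up):

```hcl
# nomad.hcl client stanza on a combined host — hold back resources so the
# scheduler can't place allocations into the headroom the servers need.
client {
  enabled = true

  reserved {
    cpu    = 2000  # MHz reserved for the Nomad/Consul/Vault servers
    memory = 4096  # MB
    disk   = 10240 # MB
  }
}
```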

Hi @tgross,

I am expecting loads in the range of 20-30 allocations with only little traffic for this cluster, so I will follow your advice and test with Scenario 2 (3 server nodes running Vault, Consul, and Nomad simultaneously, and 2 client nodes with Nomad and Consul clients). The setup would also be flexible enough to scale up if the need arises. To see where things are heading, I am planning on setting up Grafana and Alertmanager.
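For the two client nodes, I am planning on configs along these lines (again just a sketch; IPs, paths, and the datacenter name are placeholders):

```hcl
# consul.hcl — Consul agent in client mode on each worker node.
datacenter = "dc1"
data_dir   = "/opt/consul/data"
server     = false
retry_join = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # the three server hosts

# nomad.hcl — Nomad agent in client mode; with the consul block set, it should
# be able to discover the Nomad servers through Consul.
datacenter = "dc1"
data_dir   = "/opt/nomad/data"

client {
  enabled = true
}

consul {
  address = "127.0.0.1:8500"  # local Consul client agent
}
```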

Thanks for comparing and contrasting the different options. :+1: