Two-server failover using Nomad

I am just learning about orchestrators, and this is my first time using one (Nomad in this case).

Our software (video stream processing) is installed on customer premises. Currently we use Docker Compose to deploy it as a set of services running in Docker containers.

Now we have a client who needs failover and has provided us with two physical servers: one to be used as the main server and the other as the failover. We’d like to have our services installed on both servers but run them on the failover only when the main server goes down.

We would rather not make any changes to our code. We have one service that processes the video streams and raises events (VA service), and another that handles the database, users, events, etc. (Client Service). The Client Service knows about the VA service and assigns work to it. We don’t want these services running on the failover server while the main server is up, since that would cause streams to be processed twice, i.e. produce duplicate events.

We are configuring our database in master-master mode so that the data is available on the failover when the main server fails (we are trying this on Postgres, our current database, and have previously done it on MySQL).

I have set up Nomad on both machines, each running both a Nomad server and a client, and tried to configure them so that if the machine hosting the Nomad leader and its client shuts down, the Nomad server on the failover takes over and runs the tasks on its own client. I have been unable to achieve this: if I shut down the main server, nothing happens on the failover machine.

# Leader config file
datacenter = "dc1"
data_dir   = "/opt/nomad/data"
bind_addr  = "192.168.1.38"
name = "server1"

server {
  enabled          = true
  bootstrap_expect = 2 # number of servers to wait for before bootstrapping
  server_join {
    retry_join = [
      "192.168.1.38",
      "192.168.1.20",
    ]
    retry_max      = 0 # 0 means retry indefinitely
    retry_interval = "15s"
  }
}

client {
  enabled = true
  servers = [
    "192.168.1.38",
    "192.168.1.20",
  ]
}


# Follower config file
datacenter = "dc1"
data_dir   = "/opt/nomad/data"
bind_addr  = "192.168.1.20"
name = "server2"

server {
  enabled          = true
  bootstrap_expect = 2 # number of servers to wait for before bootstrapping
  server_join {
    retry_join = [
      "192.168.1.38",
      "192.168.1.20",
    ]
    retry_max      = 0 # 0 means retry indefinitely
    retry_interval = "15s"
  }
}

client {
  enabled = true
  servers = [
    "192.168.1.38",
    "192.168.1.20",
  ]
}

  1. Is Nomad even the right tool for this situation?
  2. Can I achieve what I am trying to do using Nomad?

Great question. I would like to accomplish the same thing, so I am very interested in the answer. Quite disappointing that there was no response…

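I think the main issue with the OP’s configuration is that it uses only two server nodes. Quorum is a majority of the servers, (N/2)+1 rounded down, so with two servers the quorum is two and the cluster cannot tolerate losing either of them.
In this case, if one node goes down, no quorum can be reached, so the remaining server cannot elect a leader and nothing gets scheduled.
Solution: add a third Nomad server node. With three servers the quorum is still two, so one of them can fail.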
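
As a rough sketch of what that third node could look like, assuming a small extra machine or VM at the hypothetical address 192.168.1.50 named server3: it could run only the Nomad server (no client), so it takes part in leader election without ever running the video services. The two existing configs would presumably also need the third address in retry_join and, for a freshly bootstrapped cluster, bootstrap_expect = 3.

```hcl
# Third, server-only agent. The host name and IP are assumptions.
datacenter = "dc1"
data_dir   = "/opt/nomad/data"
bind_addr  = "192.168.1.50"
name       = "server3"

server {
  enabled          = true
  bootstrap_expect = 3 # assumed to be set to 3 on all three servers
  server_join {
    retry_join = [
      "192.168.1.38",
      "192.168.1.20",
      "192.168.1.50",
    ]
    retry_interval = "15s"
  }
}

# No client block: the client is disabled by default, so no jobs are
# scheduled on this machine.
```

Running `nomad server members` on any of the machines should then list three servers, one of them marked as leader.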

On top of that, I would recommend using Consul and Consul Connect for an HA setup.

Did something similar in my home setup, which can be found here:

Full automatic failover if one node goes down.

Thanks for the response, @matthias!

Ahh, I see:

I have set up nomad on both machines with nomad server and client both on each machine

So you’re saying that if the two machines were set up only as Nomad client nodes and the Nomad server (single or cluster) was running elsewhere, the failover would work as expected, but it didn’t because taking out one of the two servers leaves the Nomad servers without quorum?

Exactly. If you want resiliency, you need at least three server nodes. Then, if one server goes down, the remaining two still form a majority and the cluster stays up.
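
For the variant where the servers run elsewhere, the two machines would only need a client block. A minimal sketch, assuming three Nomad servers at the hypothetical addresses 10.0.0.11 to 10.0.0.13:

```hcl
# Client-only agent for the main machine. The server addresses are
# hypothetical; the failover machine would use its own bind_addr and name.
datacenter = "dc1"
data_dir   = "/opt/nomad/data"
bind_addr  = "192.168.1.38"
name       = "server1"

client {
  enabled = true
  server_join {
    retry_join = [
      "10.0.0.11",
      "10.0.0.12",
      "10.0.0.13",
    ]
  }
}
```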

If you have one server and three client nodes, shouldn’t that still work, though? Obviously the single server is a SPOF, but if one client node is lost I would expect the server to still be able to reschedule services onto the remaining two.

Totally fine as long as the three nodes are running on their own VMs.
In that case, you could shut down one node (e.g., to update the OS of the VM) and the cluster will stay up.
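
Coming back to the original requirement (run on the main machine, move to the failover only when the main one is down), a job along these lines might be a starting point once the servers can keep quorum. This is only a sketch: the job, group, task, and image names are hypothetical, and it assumes the main client keeps the node name server1 from the config above.

```hcl
# Sketch of a job file. Job, group, task, and image names are hypothetical.
job "va-service" {
  datacenters = ["dc1"]
  type        = "service"

  group "va" {
    # A single instance, so streams are not processed twice.
    count = 1

    # Prefer the main machine while it is available. An affinity is a
    # preference, not a hard rule, so the group can still be placed on the
    # failover machine when the main one is down.
    affinity {
      attribute = "${node.unique.name}"
      value     = "server1"
      weight    = 100
    }

    task "va" {
      driver = "docker"

      config {
        image = "example/va-service:latest" # hypothetical image
      }
    }
  }
}
```

Two caveats, as far as I know: Nomad only reschedules once it considers the main client down (missed heartbeats), so if the main machine is merely unreachable but still running, its containers may keep going and duplicates are still possible. And once the main machine comes back, the allocation is not moved back automatically just because of the affinity.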