Job persistence - single server cluster

Does Nomad persist jobs in a single server setup? If I gracefully stop the (single) Nomad server and restart it, will I get my jobs back?

More precisely:
I’m reading here:
“In the case of an unrecoverable server failure in a single server cluster, data loss is inevitable since data was not replicated to any other servers. This is why a single server deploy is never recommended.”

What is an unrecoverable server failure? When can it happen? If my VM reboots without gracefully stopping the server? Or in the case of an error inside the Nomad server?

Background:
My product is deployed on premise to customers, not in the cloud. Medium and large scale deployments will have multiple Nomad servers as recommended. But I need a cost-effective deployment model for my smallest customers. (No high availability needed, single server can cope with the load.) Adding multiple VMs will cost me too much.

A single server will persist data to disk. Stopping a Nomad server (or client, for that matter) doesn’t impact running jobs unless the client is out of touch with the server and fails a heartbeat, in which case its jobs will get rescheduled. (Caveat: if you’re running in -dev mode, neither of these holds.)
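To make that concrete, a minimal single-node agent config might look like the sketch below. The paths and the heartbeat timeout are illustrative assumptions, not values from this thread:

```hcl
# Single-node setup: one agent acting as both server and client.
# State (including submitted jobs) is persisted under data_dir,
# which is why jobs come back after a graceful stop and restart.
data_dir = "/opt/nomad/data"   # example path, not a required default

server {
  enabled          = true
  bootstrap_expect = 1         # a single-server "cluster"

  # How long the server waits after missed client heartbeats before
  # marking the node down and rescheduling its jobs.
  heartbeat_grace = "10s"      # illustrative value
}

client {
  enabled = true
}
```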

Even in the case where Nomad crashes or the server isn’t shut down gracefully, the writes to the underlying BoltDB should be safely atomic (and if not, that’s a bug we’d love to know about!). But there are cases outside of Nomad’s control that could cause data loss: the host file system could become corrupted, the disk could fail, etc. In that situation, the Nomad server’s data could be damaged or deleted, and there’s no way for a solo Nomad server to recover from it. With multiple servers, the impacted server can just have its state wiped out and it’ll automatically get its data re-synced from the other servers.

So a single Nomad server isn’t recommended. That being said, it sounds like you’re shipping more of an “appliance” scenario where there’s a single VM with the server+client+application, and Nomad is acting more as a supervisor? You could probably get away with this use case so long as whatever is starting Nomad can also clean up the state store if Nomad can’t start and then re-launch all the jobs.
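A minimal sketch of that supervisor idea, assuming a start command that returns success once the server is healthy (in practice something like a service start plus a health check; `nomad agent` itself runs in the foreground). The function name, variable names, and reset policy are all hypothetical, not anything Nomad ships with:

```shell
#!/usr/bin/env bash
# start_or_reset DATA_DIR START_CMD [ARGS...]
# Try a normal start with existing state; if that fails (e.g. a damaged
# state store), wipe DATA_DIR and try once more from scratch.
start_or_reset() {
  local data_dir="$1"; shift
  RESUBMIT_JOBS=0
  # First attempt: reuse the persisted state store.
  if "$@"; then
    return 0
  fi
  # Start failed: clear the state store and retry with a clean slate.
  rm -rf "${data_dir:?}"/*
  "$@" || return 1
  # Fresh state means the caller must re-submit every job spec
  # (e.g. `nomad job run` for each bundled job file).
  RESUBMIT_JOBS=1
}
```

After a reset, the wrapper signals (via `RESUBMIT_JOBS` here) that the jobs need to be re-launched, which matches the "re-launch all the jobs" step above.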