Job persistence - single server cluster

Does Nomad persist jobs in a single server setup? If I gracefully stop the (single) Nomad server and restart it, will I get my jobs back?

More precisely:
I’m reading here:
“In the case of an unrecoverable server failure in a single server cluster, data loss is inevitable since data was not replicated to any other servers. This is why a single server deploy is never recommended.”

What is an unrecoverable server failure? When can it happen? If my VM reboots without gracefully stopping the server? Or in the case of an error inside the Nomad server?

Background:
My product is deployed on-premises to customers, not in the cloud. Medium- and large-scale deployments will have multiple Nomad servers, as recommended. But I need a cost-effective deployment model for my smallest customers. (No high availability is needed; a single server can cope with the load.) Adding multiple VMs would cost me too much.

A single server will persist data to disk. Stopping a Nomad server (or client, for that matter) doesn't impact running jobs unless the client is out of touch with the server and fails a heartbeat, in which case jobs will get rescheduled. (Caveat: if you're running in -dev mode, neither of these is true.)
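
To make that concrete, here's a minimal sketch of running a single persistent agent (the path is a placeholder, and in practice you'd put the equivalent settings in a config file):

```bash
# Single agent acting as both server and client, with state persisted
# to disk under -data-dir (path is a placeholder).
nomad agent -server -client -bootstrap-expect=1 -data-dir=/opt/nomad/data

# By contrast, -dev mode keeps everything in memory, so all state is
# lost when the agent exits.
nomad agent -dev
```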

Even in the case where Nomad crashes or the server isn't shut down gracefully, the writes to the underlying BoltDB should be safely atomic (and if not, that's a bug we'd love to know about!). But there are cases outside of Nomad's control that could cause data loss: the host file system could become corrupted, the disk could fail, etc. In that situation the Nomad server's data could be damaged or deleted, and there's no way for a solo Nomad server to recover from it. With multiple servers, the impacted server can just have its state wiped and it'll automatically re-sync its data from the other servers.
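
To illustrate the multi-server case, recovering one damaged member might look roughly like this (paths, addresses, and the systemd unit name are assumptions about your environment):

```bash
# On the damaged server only: discard its local state and rejoin the cluster;
# it will re-sync its data from the remaining healthy servers.
sudo systemctl stop nomad
sudo rm -rf /opt/nomad/data/server   # server state lives under the data_dir
sudo systemctl start nomad
nomad server join 10.0.0.11          # address of any healthy server
nomad server members                 # confirm it rejoined the cluster
```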

So a single Nomad server isn’t recommended. That being said, it sounds like you’re shipping more of an “appliance” scenario where there’s a single VM with the server+client+application, and Nomad is acting more as a supervisor? You could probably get away with this use case so long as whatever is starting Nomad can also clean up the state store if Nomad can’t start and then re-launch all the jobs.
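
A rough sketch of what such a wrapper could look like, assuming the job specs ship with the appliance (all paths and the systemd unit name are invented for illustration):

```bash
#!/usr/bin/env bash
# Hypothetical appliance start script: start Nomad, and if it never reaches a
# working state, wipe the state store and resubmit the shipped jobs.
set -euo pipefail

DATA_DIR=/opt/nomad/data   # must match data_dir in the agent config
JOB_DIR=/opt/myapp/jobs    # job specs shipped with the appliance

wait_for_leader() {
  # `nomad status` fails while there is no cluster leader to answer queries.
  for _ in $(seq 1 30); do
    nomad status >/dev/null 2>&1 && return 0
    sleep 2
  done
  return 1
}

systemctl start nomad
if ! wait_for_leader; then
  # Could not come up cleanly: discard the damaged state store and start fresh.
  systemctl stop nomad || true
  rm -rf "${DATA_DIR:?}"
  systemctl start nomad
  wait_for_leader || exit 1
  # Re-launch all the jobs that ship with the appliance.
  for job in "$JOB_DIR"/*.nomad; do
    nomad job run "$job"
  done
fi
```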


This is similar to what I'm trying to do. I'm testing in a single VM at the moment that I shut down when not in use. Starting the VM back up inevitably ends with Nomad complaining that there is no cluster leader. I'm not sure what I'm doing wrong here, but it seems like the data is getting corrupted every time.

Any suggestions on how to resolve this?

@XenoPhage, does stopping and starting the “single server” VM change its IP address?

The “no leader” problem would occur if the IP changes.

For on-premises setups, I have always used a machine with a static IP for the Nomad server. Rebooting it hasn't caused lost-leader problems.
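
For example, something along these lines, with the IP as a placeholder for the machine's static address:

```bash
# Bind the single server/client agent to the machine's static IP so the
# address recorded in Raft stays the same across reboots.
nomad agent -server -client -bootstrap-expect=1 \
  -data-dir=/opt/nomad/data -bind=10.0.0.5
```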
