Hi, this question is not really related to “large production usage” of Nomad; it’s more about exploring Nomad in small home environments for managing a few nodes, so bear with me.
The idea is simple: given a few small-scale nodes (my small home cluster, built on not-very-reliable hardware), I want to run a few services on them. I’d like to keep the job specification files on my laptop and control the cluster from there as needed: I’d run the server on the laptop, apply changes to the configuration, and let Nomad apply the missing pieces. If something goes wrong, I’d be able to inspect allocation logs from the server node (the laptop), but without having it running all the time.
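For concreteness, here’s roughly what I imagine the two agent configs would look like; just a sketch, with made-up paths and addresses:

```hcl
# Laptop: server only, started on demand (sketch; data_dir is illustrative).
data_dir  = "/opt/nomad/data"
bind_addr = "0.0.0.0"

server {
  enabled          = true
  bootstrap_expect = 1
}
```

```hcl
# Each node: client only, pointing at the laptop (address is made up).
data_dir = "/opt/nomad/data"

client {
  enabled = true
  servers = ["192.168.1.10:4647"]
}
```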
This is mostly to cut down on the requirements for managing the services, of course. While there are other options (e.g. Ansible), I was wondering how Nomad would behave without a persistently running server (I could always just run the server on one of the nodes, for one).
Does it make sense, given this use case? Can nomad handle that?
Cheers
As far as I am aware, you always need to have a server running for the cluster to operate (even if the cluster is just one node/server). Though I’m sure someone will say otherwise if this is not the case.
I’ve been running Nomad in several configurations at home, including a single raspi with both the client and server running on it (initial testing). This does work, but jobs have the potential to mess with the server itself, since clients run as root whereas servers are not required to. This setup is basically the same as spinning up a dev instance on your local machine, so it’s not recommended for a “production” workload. It also runs into the issue where the cluster stalls when doing updates and having to restart the single server.
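For reference, that combined setup is roughly equivalent to running a single agent with both roles enabled, something like this sketch (paths illustrative):

```hcl
# One agent acting as both server and client, similar to `nomad agent -dev`.
data_dir = "/opt/nomad/data"

server {
  enabled          = true
  bootstrap_expect = 1
}

client {
  enabled = true
}
```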
I’ve currently settled on having a few dedicated raspis to run the servers (I initially had one, but now have three). I only interact with them when performing updates and maintenance, which is done through Ansible (makes it pretty easy to manage). Having three means I can take a server down for updates/maintenance without bringing my cluster out of action.
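The server raspis themselves just run a plain server-only config along these lines (a sketch; the retry_join addresses are made up):

```hcl
# Dedicated server config, one per raspi; expects a quorum of three.
data_dir = "/opt/nomad/data"

server {
  enabled          = true
  bootstrap_expect = 3

  server_join {
    retry_join = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]
  }
}
```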
From here you could then make the servers ephemeral: instead of updating the old servers in place, you deploy a freshly built image and replace the old one.
I believe at one point I did run three machines in total, each running as both a server and a node. This does give you the benefit of keeping the cluster running with one of the machines down; however, you still run into the same permissions issue as the “dev” mode setup.
If you’re planning on running this as part of a homelab, I don’t see too much of an issue with running the server/node-on-the-same-machine setup and clustering multiple machines together (assuming limited hardware), as long as you understand the risks it involves regarding security and the potential load on those machines.
I will specify here that I only run Nomad as part of my home lab, so I might not follow HashiCorp’s recommended setup; the info here is maybe not the best to take to heart if you’re planning a proper production workload.
Hello @CarbonCollins and thank you for your kind answer. It’s great to hear about your setup; knowing you could run a small fleet of servers on raspis makes me think I’ll probably be able to run the server on my nodes as well.
The load won’t be much and the security concerns are close to zero in my case; it’s mostly a solution I was considering for management convenience. But I’ll definitely consider other options as well if this setup is not recommended.
I will reiterate that running a server and node on the same machine is not a recommended setup. You are basically relying on whatever jobs you have running not breaching containment (and that is assuming your job driver even has containment), so there is the inherent risk that a job can gain full, direct access to the servers (and the machine, for that matter) which manage the cluster as a whole.
I would still recommend separate servers and nodes if you have the hardware available to do so.
I’ve got some good news for you, I hope. We are currently working on some improvements around running edge workloads on Nomad, and one of the features is the ability to configure clients and/or task groups to tolerate long periods of connection loss from the servers.
The way Nomad works today is that tasks will continue to run on the disconnected client until the client reconnects, unless you have configured them to stop. However, when the client reconnects, its tasks get restarted. This isn’t ideal if you need/want zero downtime, but it might already be acceptable for your use case.
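(The “configured them to stop” part is the group-level `stop_after_client_disconnect` attribute. A minimal sketch, with job and task names made up:)

```hcl
job "example" {
  group "svc" {
    # Once the client has been disconnected this long, the allocation is
    # stopped so the servers can reschedule it elsewhere.
    stop_after_client_disconnect = "30m"

    task "app" {
      driver = "docker"

      config {
        image = "nginx:alpine"
      }
    }
  }
}
```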
In 1.3 we are adding the ability for tasks to resume without a restart once the client reconnects. Once this feature lands, you’ll be able to configure a duration beyond `heartbeat_grace` during which the client will transition to `disconnected` rather than `down` at the server, and its allocations will be marked `unknown` rather than `lost`. If the client reconnects before that duration expires, its allocations will attempt to resume rather than restart, assuming they haven’t failed while disconnected. You will also be able to configure them with an “always resume” setting in case you don’t want them to ever expire.
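To make that concrete, here’s a sketch of how this is shaping up on the job spec side; the group-level attribute shown here is `max_client_disconnect`, and the details may still change before release:

```hcl
job "example" {
  group "svc" {
    # Duration a disconnected client's allocations stay "unknown" and
    # eligible to resume on reconnect, rather than being marked "lost".
    # (Name/shape may change before 1.3 ships.)
    max_client_disconnect = "72h"

    task "app" {
      driver = "docker"

      config {
        image = "nginx:alpine"
      }
    }
  }
}
```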
I am hopeful that between the two operating modes, you’ll be able to get where you want to go.