Single-node Nomad (combined server/client agent) feedback

Hi all,

The Nomad team is currently exploring how we can better support single-node Nomad deployments. These would be deployments that are running the Nomad agent as both a server and a client simultaneously.

We want to support this use-case for Nomad (with some clearly communicated caveats around HA and noisy neighbors), but we want to make sure to put the proper technical guard-rails in place when we do. This would allow for simpler and less expensive Nomad deployments for test/dev “clusters” or for very cost-conscious users in prod.

Does anybody have feedback on this sort of use? Has anybody run into issues specific to mixed-use Nomad agents? Are there any technical guardrails you would like to see to enable this?

Similarly, if anybody has run a small cluster of mixed agents (i.e. 3 nodes running both servers and clients) and has thoughts, we would be interested to hear feedback.

Please let us know!

  • Mike, Nomad Product Manager
4 Likes

Hey Mike

I have been running something like that (3 to 4 nodes, all in server & client mode) for around 6 months. For personal use only, but I think I might go with Nomad as the next infrastructure solution for my cost-aware client (which wouldn’t benefit from Kubernetes and still would like to benefit from the modern architecture approach).

Technically, I’m not a devops/sysops person, but the whole experience has been pretty smooth so far. I ran into different problems related to CSI volumes, but with regard to running nodes in mixed mode I can’t say I’ve hit any issues so far. I have reserved some resources on each node just for Nomad to avoid running into weird problems (like processes suddenly OOMing), and I have some systemd services in place for draining the node upon shutdown and ensuring Nomad restarts in case of a crash.

Aside from that, I keep everything really simple and rely on the internal networking provided by the hosting provider (so there’s no proper isolation between services, but I chose to accept that bargain), and unfortunately I don’t have suggestions for improving the general experience, as it has been pretty great overall so far :).

What I am a little worried about, though, is fragmentation around Nomad’s ecosystem (Levant vs nomad-pack - neither feels quite production-ready) - but I mentioned that in another topic :see_no_evil:

2 Likes

Hey @rwojsznis appreciate the feedback and glad that mixed clients are going well for you. The processes you put in place seem like the right ones. If you haven’t already, I’d also set some reserved memory and cpu in the client blocks in the agent config - client Stanza - Agent Configuration | Nomad by HashiCorp (this might be what you were referring to though!)
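Something like this sketch, with placeholder values that you’d want to size to your own nodes:

```hcl
client {
  enabled = true

  # Hold back some CPU/memory for the Nomad agent itself and other host
  # processes so workloads can't starve them. Values are illustrative only.
  reserved {
    cpu    = 500 # MHz
    memory = 512 # MB
  }
}
```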

Something small to note: if your combined agents are clustering together, I would avoid having 4 servers in a normal state. An even number of server nodes (outside of a temporary failure) can make the raft cluster unhappy: with 4 servers the quorum is still 3, so you only tolerate one failure, the same as with 3 servers, while adding election overhead. If each client is isolated from the others though, then any number is fine :).

(Regarding Pack v Levant, you aren’t alone in feeling this way. I think I am partially to blame for this! We slow-rolled Pack out a bit too much just to make sure the interest was there before going all in on it. The interest from the community definitely is there, but we’ve got to close out 1.4 (which should be a great release!) on our end before really circling back to make it production-grade. So, acknowledged that we’re in a bit of a weird spot right now. It’s something we’re aware of and will fix, but it’ll take a little bit of time.)

2 Likes

I second this :100:

small deployments :white_check_mark:
cost-aware :white_check_mark:
Cloud native, but not K8s :white_check_mark:

In my case it’s part hobby project, part design study for edge deployment. In terms of cost, this is both dollar cost and resource consumption cost. My environment is:

  • 2 Raspberry Pi 4
  • 2 Raspberry Pi 3B
  • 4 Raspberry Pi Zeros
  • 7 Compute Modules
  • Wired network to all except the Pi Zeros

Dedicating Nomad and Consul to the 3+1 Pi 3/4 wasn’t an option, so they have run as both server and client for over a year. I have had some strange behaviour (agents losing their jobs, servers losing quorum, etc.), but I think this is attributable to the test environment, which has high temperature variation… so I’m pinning it on hardware. Nomad itself handled failure very well in most cases – servers could die and come back, and agents on that node would recover their jobs and carry on happily.

The hard part, at least for me, was understanding from the documentation what config goes where. Not that it is badly written, but I would have benefitted from a clearer distinction between client and server configuration.

My environment is Vault + Consul + Nomad, in that order, all on the hardware above (the pi-zeros really come in handy as members of a Vault cluster). Since Nomad now supports native service discovery, I’m thinking about experimenting with dropping Consul to recover some resources, though in my experience it has worked really well.

In terms of guardrails, it would be nice for Nomad to be aware of what’s running underneath it. I often had jobs fail due to conflicting resource allocation requests which Nomad thought it was able to fulfill, but which were OOM-killed when they came into conflict with e.g. Consul running on the same node.

Hi,

Sorry for resurrecting an older thread, but I would also be very interested in clients and servers running on the same machine being supported. My reasons are similar to @brucellino1’s: having HA without having to dedicate three hosts solely to that.

I’m currently running a single server cluster with the Nomad/Consul/Vault servers co-located on a single host. The big downside I’m seeing here: Whenever I want to update that host, I have to first take down the entire cluster.
At the same time, I’m planning to move away from a single physical server to multiple small machines so that I can do individual updates and reboots.
Needing three physical servers which only host the Nomad/Consul servers seems like a waste, especially considering that the current server host sits at around 98% idle most of the time.

In short: Yes, I believe officially supporting running client and server on the same node is a great idea, especially for smaller setups.

Also a question to @brucellino1 and @rwojsznis if I may: How are you running the server/client on the same node? With the -dev flag and a single agent? With a single agent and both the client and server configs in the same config file? Or with two agents using different ports?

Hey @mmeier86,

Thanks for the feedback!

Having HA without having to dedicate three hosts solely to that.

Just want to note that this would only be “HA” at the application level. For instance, if you have two allocations/instances of an app running and one dies due to code failure, then single-server Nomad would keep it up and healthy. But of course you aren’t HA in the case of VM failure. Probably obvious, but just wanted to clarify in case!

How are you running the server/client on the same node? With the -dev flag and a single agent? With a single agent and both the client and server configs in the same config file?

If you want to run real workloads on the same node, don’t use -dev, as it won’t save your data and it turns off ACLs. I would use a single agent with both client & server configs in the same file, but I don’t think there’s a reason you couldn’t use two agents with different ports.

I would also add some extra Reserved memory and CPU to the client config to account for the additional work the “server” portion is doing - client Block - Agent Configuration | Nomad | HashiCorp Developer
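In case it helps anyone reading along, a rough sketch of a combined single-agent config could look something like this (paths and values are illustrative, not a recommendation):

```hcl
# One agent acting as both server and client (sketch).
datacenter = "dc1"
data_dir   = "/opt/nomad/data" # persisted, unlike -dev

server {
  enabled          = true
  bootstrap_expect = 1 # single-server cluster, so no HA at the Nomad level
}

client {
  enabled = true

  # Reserve a bit extra since the server portion shares the node.
  reserved {
    cpu    = 500
    memory = 512
  }
}

acl {
  enabled = true
}
```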

Thanks for your answer, @mnomitch!

Yes, that was clear. :slight_smile: My main motivation for multiple servers is to make the Nomad cluster itself HA. What I want to get rid of is the need to take down all of my jobs when I do maintenance on the single Nomad server host. Once I’ve got my Pi cluster set up, I can do a “nomad node drain” dance with the Nomad server nodes and restart them one after the other while my cluster and all my jobs stay up, save for the short interruption when nodes get drained.

Overkill for a Homelab? Absolutely. :sweat_smile: But taking down everything (in the right order) when doing OS updates was getting really annoying. :wink:

@mnomitch Having a single node operation mode officially supported would be a huge benefit for me. For those of us that like to run our own hosted VMs, but just need the basics to deploy personal projects, Nomad is the best option out there.

I have no need for running a quorum of servers, as I only run a server on one VM, and a single client on another VM.

The 1.4 release with variables, and the direct integration with Traefik for service discovery, has made Nomad incredibly useful for small deployments. I’m really enjoying the focus on meeting the needs of small deployments that don’t want to step into the world of Kubernetes or some “serverless” hosted solution.
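For anyone curious, this is roughly the shape of it as I understand it - a sketch with made-up names, ports, and tags that a Traefik instance using its Nomad provider could pick up:

```hcl
job "my-app" {
  datacenters = ["dc1"]

  group "web" {
    network {
      port "http" {} # dynamically assigned port
    }

    # Register in Nomad's native service catalog (no Consul needed) and
    # expose tags for Traefik's Nomad provider. Everything here is illustrative.
    service {
      name     = "my-app"
      port     = "http"
      provider = "nomad"

      tags = [
        "traefik.enable=true",
        "traefik.http.routers.my-app.rule=Host(`app.example.com`)",
      ]
    }

    task "app" {
      driver = "docker"

      config {
        image = "nginx:alpine"
        ports = ["http"]
      }
    }
  }
}
```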

1 Like

Appreciate the feedback Larry, and I’m glad the stuff in 1.4 is a good fit - spread the word! :slight_smile:

A single-node option would be amazing for slowly adopting Nomad. I’m exploring Nomad after coming from a k3s setup because of the sheer weight of even “lightweight” k3s.

It also opens up a new kind of feature: solution deployment ‘interfaces’. We have this for individual applications in containers, but you need to glue those together to make a solution. That glue can be k8s YAML, Nomad jobs, or even Docker Compose…

None of these are compatible though, so you end up having to switch. Same sort of solution, but you don’t need HA or 5 nodes? Better write some systemd unit files or Docker Compose. Now have the business need to scale? Ah well, you’ve now got to go rewrite it into k8s or Nomad jobs.

It would be amazing to start with single nodes, build up larger installations, and just ‘shift’ the jobs to the newer instance. This would be great for the service mesh too. If there is a remote workload, it’s still all in the mesh, even if it’s some tiny IoT device, just modelled as a separate datacenter.

FWIW I’ve had an excellent experience so far with running single-node Nomad using the “unofficial” setup of an agent as both a server and client simultaneously. The use-case is many on-prem/edge metal hosts for SDN/CNF and IoT related workloads. Ideally they can eventually scale horizontally with clusters as needed, and being able to use similar infrastructure in the cloud is a big win.

I too found k3s and other lightweight k8s too burdensome for the edge.

I guess it would be nice to have some “official” recommendations on combined server/client configurations.

Same thing as mentioned above :+1:

Looking for a cost-effective/modern solution for deploying/maintaining single-node setups for small things, ideally with the ability to easily scale this setup when the need arises. Examples: MVPs, POCs, even just simple landing pages.

Not happy with k3s at all, as it eats around 700 MB for doing nothing, and in general Kubernetes looks like too much over-engineering here.

1 Like

I have 3 Nomad servers that are also running as clients. I run only the fabio proxy and promtail on them, to have log collection and nice URLs. They are registered in Nomad in a separate datacenter to make sure nothing else will get scheduled on them.

I have had an overall good experience with that. However, restarting a server that is also a client can be unpleasant, especially in the case of configuration errors, i.e. with longer downtime the allocations can become stuck or something. What I recommend is, before restarting, first drain the node and make sure nothing is running on it.

Hi @mnomitch, are there any updates on this topic? Very interested in deploying a lean single-server node (for testing, internal staging, etc.) and then having the flexibility to scale to multiple nodes.

Hi @mnomitch

I’m in the homelab camp. Starting with one node as server and agent, I might add more nodes if I need more compute.

The biggest challenge right now is: how. How do I configure this so Nomad runs as agent and server on the same machine, without having to spend weeks and weeks deciphering mysterious and misleading messages?

So far I’ve only been able to get Nomad working with the -dev flag.

What would be amazing is a guide. One that is the next step from the getting started guide, but without -dev:

  • getting started: using Nomad for local dev (this is the current one)
  • next steps: how to set up a single-node cluster without -dev and Consul
  • then maybe a community collection of setup examples to help newcomers and novices follow best practices, with ready-to-go and working example setups. Yes, you have one example. No, it’s not enough.

I’d love to see further investment in this area! I run a 5-node cluster at home and a couple of boxes on different clouds. At each “site”, I run Consul, Vault and Nomad (each site being its own region), plus an assortment of apps, so no dedicated control plane at all. At home I run these on a variety of platforms, Linux on amd64/arm/arm64/mips and macOS on amd64/arm64, all of them consumer-grade hardware, so nothing too fancy (at least 4 cores, 4 GB of RAM).

It’s been a journey of three years or so, and while my home cluster runs beautifully, I’ve had less success with the cloud nodes, running on a single core and 2 GB of RAM. I’m now spoiled by my home cluster in terms of operational and development/deployment workflows, but keep struggling with tuning and configuration of these smaller, stand-alone nodes, especially given I’m not running with -dev mode enabled, have ACLs on, and my ISP loves to unplug me for hours every few months.

I understand my use case is somewhat out there, so even general guidance on avoiding pitfalls, such as “failed to reconcile member” logs, would be awesome!

I’m in the middle of moving from one monolithic VPS to a couple of small cloud servers.
My setup is purely containerized. I tried to move to Podman, but it’s neither at feature parity nor is its protocol stable. Docker might be stale, but it has served me well.

Until now, I was working with docker compose, docker run and an (un)healthy mix of shell scripts to run game servers, a private cloud, DNS, Pi-hole, etc.

With Nomad, I hope to migrate to a solution which will eventually make maintenance easier, but the onboarding is quite hard right now.

Most tutorials are targeted at hyperscalers and not bare metal. There’s also no “migration guide” for those coming from pure Docker or Docker Swarm, or even fiddly Raspberry Pi setups.

I have yet to learn how to mount host volumes or run my shell scripts (which I’ll need until I know how to do things better) alongside the containers. I intend to start with a single-node setup as an infrastructure backbone, running services until I figure out how to properly add nodes.
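From what I’ve pieced together so far, host volumes look roughly like the following - a sketch with made-up paths and names, not something I’ve battle-tested. The volume is declared on the client and then claimed and mounted in the job:

```hcl
# Agent config (client side): expose a host directory as a named volume.
client {
  enabled = true

  host_volume "app-data" {
    path      = "/srv/app-data"
    read_only = false
  }
}
```

```hcl
# Jobspec side: claim the host volume in the group and mount it into the task.
job "example" {
  datacenters = ["dc1"]

  group "app" {
    volume "data" {
      type      = "host"
      source    = "app-data"
      read_only = false
    }

    task "web" {
      driver = "docker"

      volume_mount {
        volume      = "data"
        destination = "/data"
      }

      config {
        image = "nginx:alpine"
      }
    }
  }
}
```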

As such, I really would appreciate guides aimed at those coming from the docker run and docker compose world

Just another +1 for a supported model for single host. :slight_smile:

1 Like

Likewise, I would like to see this. I have a cheap single server that hosts a few non-critical WordPress (and similar) sites with dedicated databases and cron jobs.

I’d like to use Nomad and Terraform to manage them, where CPU and memory use is not considered and doesn’t require declaration.

I’m aware that this could result in a noisy neighbor situation but it’s not really a concern in my use case - as has been demonstrated over the last few years of operation.

However, the benefit of using Nomad/TF would be improved deployment ergonomics, a nice dashboard, and a pathway to copy a deployment configuration to an HA configuration should the need arise.

Right now it’s all Docker Compose, but that’s not as nice.

Does anybody have feedback on this sort of use?

Hi. What I did on our premises, with ~300 machines, is put the Nomad servers in a separate datacenter called “servers”. That way, we more strictly control which jobs run on the servers, and the DevOps team is specifically aware of what jobs run on the Nomad server machines.

The overall bottom line: if you run something heavy on the Nomad servers, the Nomad scheduler will start lagging, jobs will be stuck pending, and the website will lag. So just pick jobs with predictable load. Also, running fabio on the Nomad servers and serving the web page from there is useful when clients are rotating.
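For anyone copying this pattern, the job side is just a matter of which datacenters a job lists - a sketch with made-up names:

```hcl
# Only jobs that explicitly list the "servers" datacenter (fabio, log
# shippers, etc.) land on the server machines; everything else stays in
# the regular datacenters. Image and settings are illustrative.
job "fabio" {
  datacenters = ["servers"]

  group "proxy" {
    task "fabio" {
      driver = "docker"

      config {
        image        = "fabiolb/fabio"
        network_mode = "host"
      }
    }
  }
}
```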

It works really well, but there were initial issues. The issue is Nomad interfering with other things. The important thing is to clean up old jobs with nomad stop -purge. What we observed is that failed jobs use more memory than successful ones. Sometimes, with hundreds of jobs, Nomad’s memory spikes drastically, and there were cases where the Nomad server went OOM on our 370 GB machine. We keep it clean by running nomad stop -purge on old jobs; that way the Nomad server’s memory usage stays below 5 GB. I tried investigating and posting GitHub issues, however I was never able to reproduce it consistently, and I think newer Nomad versions are much better at managing memory, or we are purging old jobs often enough.

What would be really useful is separate garbage collector settings per Nomad namespace. Like, I don’t care about Jenkins or GitHub-runner workers, but services should never be garbage collected. There is “batch_eval_gc_threshold”, but I would want to set a “service_everything_gc_threshold” to infinity.
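For context, as far as I know the GC thresholds that exist today are global server settings, so they apply to every namespace alike - something like this (values illustrative):

```hcl
server {
  enabled = true

  # Global garbage collection tuning; there is currently no per-namespace knob.
  job_gc_threshold        = "4h"  # how long dead jobs stick around
  eval_gc_threshold       = "1h"
  batch_eval_gc_threshold = "24h" # separate threshold for batch job evals
  deployment_gc_threshold = "1h"
}
```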

Are there any technical guardrails you would like to see to enable this?

Yes: