The Nomad team is currently exploring how we can better support single-node Nomad deployments. These would be deployments that are running the Nomad agent as both a server and a client simultaneously.
We want to support this use-case for Nomad (with some clearly communicated caveats around HA and noisy neighbors), but we want to make sure to put the proper technical guard-rails in place when we do. This would allow for simpler and less expensive Nomad deployments for test/dev “clusters” or for very cost-conscious users in prod.
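For anyone curious what this looks like in practice, a single-node deployment is just one agent with both roles enabled. A minimal sketch (paths are illustrative, not a recommendation):

```hcl
# agent.hcl - one Nomad agent acting as both server and client
data_dir = "/opt/nomad/data"

server {
  enabled          = true
  bootstrap_expect = 1   # single-server raft "cluster"
}

client {
  enabled = true
}
```

Started with `nomad agent -config=agent.hcl`, this gives you scheduling and execution on the same machine, with the HA caveats mentioned above.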
Does anybody have feedback on this sort of use? Has anybody run into issues specific to mixed-use Nomad agents? Are there any technical guardrails you would like to see to enable this?
Similarly, if anybody has run a small cluster of mixed agents (i.e. 3 nodes running both servers and clients) and has thoughts, we would be interested to hear feedback.
I have been running something like that (3 to 4 nodes, all in server & client mode) for around 6 months. For personal use only, but I think I might go with Nomad as the next infrastructure solution for my cost-aware client (which wouldn’t benefit from Kubernetes but would still like to benefit from a modern architecture approach).
Technically, I’m not a devops/sysops person, but the whole experience has been pretty smooth so far. I ran into various problems related to CSI volumes, but as for running nodes in mixed mode, I can’t say I’ve hit any issues yet. I have reserved some resources on each node just for Nomad to avoid weird problems (like processes suddenly OOMing), and I have some systemd services in place for draining the node on shutdown and ensuring Nomad restarts after a crash.
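The drain-on-shutdown and restart-on-crash behaviour can both live in the systemd unit itself; something along these lines (paths and timings are illustrative, not my exact setup):

```ini
# /etc/systemd/system/nomad.service (excerpt) - illustrative sketch
[Service]
ExecStart=/usr/local/bin/nomad agent -config=/etc/nomad.d
# Drain allocations off this node before the agent stops.
ExecStop=/usr/local/bin/nomad node drain -self -enable -yes
# Bring the agent back if it crashes.
Restart=on-failure
RestartSec=5
```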
Aside from that, I keep everything really simple and rely on the internal networking provided by the hosting provider (so no proper isolation between services is happening, but I chose to accept that bargain). Unfortunately, I don’t have suggestions for improving the general experience, as it has been pretty great thus far overall :).
What worries me a little bit, though, is the fragmentation around Nomad’s ecosystem (Levant vs. nomad-pack - neither feels production-ready) - but I mentioned that in another topic.
Hey @rwojsznis appreciate the feedback and glad that mixed clients are going well for you. The processes you put in place seem like the right ones. If you haven’t already, I’d also set some reserved memory and cpu in the client blocks in the agent config - client Stanza - Agent Configuration | Nomad by HashiCorp (this might be what you were referring to though!)
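For reference, the reservation lives in the `reserved` block inside the client config; something like this (the numbers are illustrative and should be sized to what runs alongside Nomad on the host):

```hcl
client {
  enabled = true

  # Carve out headroom for the OS, the Nomad agent itself, and any
  # co-located services so allocations cannot be scheduled onto it.
  reserved {
    cpu    = 500  # MHz
    memory = 512  # MB
  }
}
```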
Something small to note: If your combined agents are clustering together, I would avoid having 4 servers in a normal state. An even number of server nodes (outside of the case of a temporary failure) can make the raft cluster unhappy. If each client is isolated from the others though, then any number is fine :).
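Concretely, for three clustered combined agents you would set the expected voter count on each server to an odd number:

```hcl
server {
  enabled          = true
  bootstrap_expect = 3  # 3 (or 5) voters keeps raft quorum math happy
}
```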
(Regarding Pack v Levant, you aren’t alone in feeling this way. I think I am partially to blame for this! We slow-rolled Pack out a bit too much just to make sure the interest was there before going all in on it. The interest from the community definitely is there, but we’ve got to close out 1.4 (which should be a great release!) on our end before really circling back to make it production-grade. So, acknowledged that we’re in a bit of a weird spot right now. It’s something we’re aware of and will fix, but it’ll take a little bit of time.)
Cloud native, but not K8s
In my case it’s part hobby project, part design study for edge deployment. In terms of cost, this is both dollar cost and resource consumption cost. My environment is:
2 Raspberry Pi 4
2 Raspberry Pi 3b
4 Raspberry Pi Zeros
7 Compute modules
wired network to all except pi-zeros
Dedicating Nomad and Consul to the 3+1 Pi 3/4 nodes wasn’t an option, so they have run as both server and client for over a year. I have had some strange behaviour (agents losing their jobs, servers losing quorum, etc.), but I think this is attributable to the test environment, which has high temperature variation… so I’m pinning it on hardware. Nomad itself handled failure very well in most cases – servers could die and come back, and agents on that node would recover their jobs and carry on happily.
The hard part, at least for me, was understanding from the documentation what config goes where. Not that it is badly written; I would have benefitted from a clearer distinction between client and server configuration.
My environment is Vault + Consul + Nomad, in that order, all on the hardware above (the Pi Zeros really come in handy as members of a Vault cluster). Since Nomad now supports native service discovery, I’m thinking about experimenting with dropping Consul to recover some resources, though in my experience it has worked really well.
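Native service discovery is just a matter of setting the provider in the job’s `service` block (job name, port, and image here are illustrative; this needs Nomad 1.3+):

```hcl
job "web" {
  group "app" {
    network {
      port "http" { to = 8080 }
    }

    # Registered in Nomad's built-in catalog - no Consul required.
    service {
      name     = "web"
      port     = "http"
      provider = "nomad"
    }

    task "server" {
      driver = "docker"
      config {
        image = "nginx:alpine"
        ports = ["http"]
      }
    }
  }
}
```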
In terms of guardrails, it would be nice for Nomad to be aware of what’s running underneath it. I often had jobs fail due to conflicting resource allocation requests which Nomad thought it was able to fulfill, but which were OOM-killed when they came into conflict with e.g. Consul running on the same node.