I set up a test cluster on 5 virtual Linux nodes:
- 3 server & client nodes
- 2 pure client nodes
When I had consul run on the pure clients, both consul and the nomad client generated a bunch of errors.
Hi Lindsay!
Consul servers participate in the raft state. All nodes in the cluster run Consul agents. A machine can be a Consul agent and a Consul server at the same time.
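If it helps, you can see this at a glance on any node by asking the local agent for the member list (just a sketch, assuming default ports; the Type column shows which agents are also servers):
consul members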
It’s not clear whether by this you’re referring to the Nomad cluster or the Consul cluster. Most likely you mean that the Consul and Nomad servers are running on the same 3 machines, and that those machines are also Nomad agents.
If you’re running the Nomad agent and server on the same node, this may be a source of your errors – see
There is also a discussion collecting feedback from experience in this situation:
The issue of running Consul and Nomad servers on the same node, meanwhile, is discussed here:
I wonder if any of these is relevant to your situation?
Can you help us to understand by describing the errors that are occurring in some more detail?
Thanks!
Hi Bruce, thank you for the very prompt and detailed reply, on a weekend no less!
Sorry for not getting back to you sooner, but I went down a rabbit hole browsing the link you supplied; it helped my understanding a lot, thank you. This is a test cluster I’m setting up: 5 VMs hosted on a cluster of 5 Proxmox nodes, so it has lots of spare capacity.
I have a working cluster now: 3 server/client nodes and 2 client-only nodes, all showing as healthy in the Nomad and Consul web GUIs. I can allocate jobs and they get distributed across all nodes. Service discovery seems to work fine; I tried the HAProxy load balancing example and it was returning queries distributed across all nodes. Configs are included below.
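(Side note for anyone following along: an easy extra sanity check for service discovery is to query Consul’s DNS interface directly from any node. This is just a sketch; the service name is a placeholder from the HAProxy guide and will differ depending on the job:)
dig @127.0.0.1 -p 8600 demo-webapp.service.consul SRV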
The only thing that puzzles me is that I get the following errors in the service status on the client-only Nomad/Consul nodes:
Nomad:
Mar 05 15:07:25 n905 nomad[658]: 2023-03-05T15:07:25.968+1000 [ERROR] http: request failed: method=GET path=/v1/agent/health?type=server error="{\"server\":{\"ok\":false,\"message\":\"server not enabled\"}}" code=500
Consul:
Mar 05 15:08:41 n905 consul[646]: 2023-03-05T15:08:41.382+1000 [WARN] agent: Check socket connection failed: check=_nomad-check-36d7b549dc793df580412be7372e7d28495a9080 error="dial tcp 0.0.0.0:4648: connect: connection refused"
Mar 05 15:08:41 n905 consul[646]: 2023-03-05T15:08:41.382+1000 [WARN] agent: Check is now critical: check=_nomad-check-36d7b549dc793df580412be7372e7d28495a9080
It seems to be checking for server health on the client nodes, despite them having the server stanza disabled. Is this normal? Can I just safely ignore it?
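For reference, the checks registered with the local Consul agent can be listed like this (a sketch, assuming the default HTTP API address):
curl -s http://127.0.0.1:8500/v1/agent/checks
# the _nomad-check-... entry is the one dialing port 4648, which is Nomad's server serf port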
Thanks - Lindsay
5 nodes, all with a bridged IP on br0
All nodes: n901 - n905
Server/Client: n901, n902, n903
Client only: n904, n905
data_dir = "/opt/nomad/data"
bind_addr = "0.0.0.0"
server {
enabled = true
bootstrap_expect = 3
}
client {
enabled = true
}
data_dir = "/opt/consul"
client_addr = "0.0.0.0"
server = true
bind_addr = "0.0.0.0" # Listen on all IPv4
bootstrap_expect=3
retry_join = ["n901","n902","n903"]
data_dir = "/opt/nomad/data"
bind_addr = "0.0.0.0"
server {
# license_path is required for Nomad Enterprise as of Nomad v1.1.1+
#license_path = "/etc/nomad.d/license.hclic"
enabled = false
}
client {
enabled = true
}
data_dir = "/opt/consul"
client_addr = "0.0.0.0"
bind_addr = "0.0.0.0" # Listen on all IPv4
retry_join = ["n901","n902","n903"]
@brucellino1 Hi Bruce, I think I solved the issue - my own ignorance, as expected.
When I set up the cluster I just rolled out preconfigured cloud-init VMs, which all registered as servers. I then edited the last two (n904, n905) to disable the server stanzas on them.
But they were still registered as servers in the cluster, just with a “left” status, so they were still getting queried to see if they were there. On each client I had to:
sudo systemctl stop nomad consul
sudo rm -rf /opt/nomad/*
sudo rm -rf /opt/consul/*
sudo shutdown -h now
Then on a running server:
nomad server force-leave n904
nomad server force-leave n905
nomad system gc
After that, I brought n904 & n905 back up, and they aren’t getting server health checks anymore.
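To double check, the member lists can be queried from any running server; n904 and n905 should no longer appear in the Nomad server list, and should show as clients in Consul (exact output varies by version):
nomad server members
consul members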
Sorry for the digression and thanks for your feedback and info!
Really liking the Nomad experience; sure, it’s been a learning curve, but it’s well documented, simple and makes sense.
I spent all last week trying to get a Kubernetes cluster going and that was a “poke my eyes out with a pointy stick” experience. I see why people always recommend using a managed K8s cluster. IMHO, it’s overengineered, fragile, poorly documented and ridiculously complex. Nomad feels like a serious, well-thought-out product, not a cobbled-together system of disparate parts.
Not at all! One learns far more by watching things fail than by watching them work perfectly!
Let’s start a club
Honestly, the gravity of K8s is impossible to ignore, especially if one has to do this stuff for a living. It’s a tool for building platforms though, and building is the operative word there. Nomad is already ready. They’re different things, but the problems they are used to solve are very often the same. When the problem is already well-defined, I feel like Nomad is a no-brainer. When you still don’t know what problems you need to solve, K8s’s open-endedness is actually a bit more of a plus.
So glad you’re having fun with Nomad
True, if someone asked me about running an orchestrator in-house, I’d definitely recommend Nomad, especially for SMBs; doable with a learning curve.
In the cloud? Managed all the way, and that’s, as far as I’m aware, K8s only at the moment. Sure, most admins could set up a Nomad cluster on Azure, AWS or elsewhere, but setting up a fault-tolerant, secure cluster takes a lot of expertise, not something the average IT dept will have or want the responsibility for.
Having said that, I’m seeing a lot of interest in Nomad on DevOps forums, so I wouldn’t be surprised if some cloud providers started offering managed Nomad; it’s gotta be less of a headache for them.
Next step is setting up Ceph-linked volumes; we’ll see if the fun continues.