We set up a Nomad cluster with a Consul cluster on top of Google Cloud infrastructure, following the best practices in the documentation. So far so good.
We ran some tests, but now we see a lot of empty clients on the Topology page in the Nomad UI.
Hi @voxsim. Terminal resources such as nodes are cleaned from the Nomad state via an internal garbage collector which runs according to a periodic schedule. This can be triggered manually via the nomad system gc command or via the /v1/system/gc API endpoint. If you wish Nomad to be more aggressive with its periodic garbage collection, you can set the node_gc_threshold server configuration option.
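For anyone who wants that last option spelled out, a minimal sketch of the server stanza in the agent configuration could look like the following; the 10m value is only an illustration, not a recommendation, so tune it to your environment:

server {
  enabled           = true
  node_gc_threshold = "10m"   # collect terminal client nodes after 10 minutes
}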
How long have the servers been in their left state, and what triggered this?
I stumbled upon this discussion when searching for “disconnected clients”.
I noticed today that I have a few nodes which have been in the disconnected state for more than three months!
$ nomad node status -verbose | grep 0265be10-283b-7a2e-9c65-b6a0898501bf
0265be10-283b-7a2e-9c65-b6a0898501bf my-aws-region my-node-name my-node-class 10.72.95.238 1.3.1 false eligible disconnected
$ nomad node status 0265be10-283b-7a2e-9c65-b6a0898501bf
error fetching node stats: Unexpected response code: 404 (No path to node)
ID = 0265be10-283b-7a2e-9c65-b6a0898501bf
Name = my-node-name
Class = my-node-class
DC = my-aws-region
Drain = false
Eligibility = eligible
Status = disconnected
CSI Controllers = <none>
CSI Drivers = <none>
Host Volumes = <none>
Host Networks = loopback
CSI Volumes = <none>
Driver Status = docker,exec,java,raw_exec
Node Events
Time Subsystem Message
2022-07-22T07:29:39Z Cluster Node heartbeat missed
2022-07-21T09:06:25Z Driver: java Healthy
2022-07-21T09:05:26Z Cluster Node registered
Allocated Resources
CPU Memory Disk
0/4472 MHz 0 B/15 GiB 0 B/95 GiB
Allocation Resource Utilization
CPU Memory
0/4472 MHz 0 B/15 GiB
error fetching node stats: actual resource usage not present
Allocations
ID Node ID Task Group Version Desired Status Created Modified
035284df 0265be10 <redacted> 0 run pending 3mo26d ago 3mo26d ago
2f1266d2 0265be10 <redacted> 0 run pending 3mo26d ago 3mo26d ago
ac69f8c4 0265be10 <redacted> 0 run pending 3mo26d ago 3mo26d ago
cbcbe500 0265be10 <redacted> 0 run pending 3mo26d ago 3mo26d ago
72d36385 0265be10 <redacted> 0 run pending 3mo26d ago 3mo26d ago
e36d7448 0265be10 <redacted> 0 run pending 3mo26d ago 3mo26d ago
The node doesn’t go away even if I run multiple nomad system gc commands.
What (surprisingly) made it actually work was marking the non-existent node for draining.
The drain ran to completion even though, again, this node does NOT exist in the infrastructure, and it produced the following output …
$ nomad_oregon node drain -enable -deadline 1m 0943d50b-c28a-63d9-dbd2-9bd8769d020b
2022-11-15T18:22:51Z: Ctrl-C to stop monitoring: will not cancel the node drain
2022-11-15T18:22:51Z: Node "0943d50b-c28a-63d9-dbd2-9bd8769d020b" drain strategy set
2022-11-15T18:22:51Z: Drain complete for node 0943d50b-c28a-63d9-dbd2-9bd8769d020b
2022-11-15T18:22:51Z: Alloc "453938c2-ae48-e3cf-bdea-92f05768acd8" marked for migration
2022-11-15T18:22:51Z: Alloc "5c53b51d-45b8-5f81-9eae-bf0061c3be5c" marked for migration
2022-11-15T18:22:51Z: Alloc "267472e4-bf2b-4836-386e-8802f268fed2" marked for migration
2022-11-15T18:22:51Z: Alloc "a11f4abd-aa1c-2859-0d53-2335ad676b17" marked for migration
2022-11-15T18:22:51Z: Alloc "a73b282a-91b6-89ea-b5ba-5ae8e11c61b8" marked for migration
2022-11-15T18:22:51Z: Alloc "a4c12833-7f67-d7fa-7e19-6cc59f732c64" marked for migration
2022-11-15T18:22:51Z: Alloc "267472e4-bf2b-4836-386e-8802f268fed2" draining
2022-11-15T18:22:51Z: Alloc "5c53b51d-45b8-5f81-9eae-bf0061c3be5c" draining
2022-11-15T18:22:52Z: Alloc "453938c2-ae48-e3cf-bdea-92f05768acd8" draining
2022-11-15T18:22:52Z: Alloc "a4c12833-7f67-d7fa-7e19-6cc59f732c64" draining
2022-11-15T18:22:52Z: Alloc "a11f4abd-aa1c-2859-0d53-2335ad676b17" draining
2022-11-15T18:22:52Z: Alloc "a73b282a-91b6-89ea-b5ba-5ae8e11c61b8" draining
2022-11-15T18:23:20Z: Alloc "267472e4-bf2b-4836-386e-8802f268fed2" status pending -> lost
2022-11-15T18:23:21Z: Alloc "a4c12833-7f67-d7fa-7e19-6cc59f732c64" status pending -> lost
2022-11-15T18:23:21Z: Alloc "a11f4abd-aa1c-2859-0d53-2335ad676b17" status pending -> lost
2022-11-15T18:23:21Z: Alloc "a73b282a-91b6-89ea-b5ba-5ae8e11c61b8" status pending -> lost
2022-11-15T18:23:21Z: Alloc "5c53b51d-45b8-5f81-9eae-bf0061c3be5c" status pending -> lost
2022-11-15T18:23:21Z: Alloc "453938c2-ae48-e3cf-bdea-92f05768acd8" status pending -> lost
2022-11-15T18:23:21Z: All allocations on node "0943d50b-c28a-63d9-dbd2-9bd8769d020b" have stopped
… and then the node went from disconnected to down.
I was then able to remove the node from the list by running nomad system gc.
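To sum it up for anyone else who hits this: the sequence that finally cleared the phantom node for me was a short-deadline drain followed by a manual garbage collection (the node ID below is a placeholder):

$ nomad node drain -enable -deadline 1m <stale-node-id>
# wait for the drain to finish; the node moves from disconnected to down
$ nomad system gc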