We set up a Nomad cluster with a Consul cluster on top of Google Cloud infrastructure, following the best practices in the documentation. So far so good.
We ran some tests, but now we see a lot of empty clients on the Topology page in the Nomad UI.
Hi @voxsim. Terminal resources such as nodes are cleaned from the Nomad state via an internal garbage collector which runs according to a periodic schedule. This can be triggered manually via the nomad system gc command or via the /v1/system/gc API endpoint. If you wish Nomad to be more aggressive with its periodic garbage collection, you can set the node_gc_threshold server configuration option.
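For anyone who wants that last option spelled out, a minimal sketch of the server stanza in the agent configuration could look like the following; the 10m value is only an illustration, not a recommendation, so tune it to your environment:

server {
  enabled           = true
  node_gc_threshold = "10m"   # collect terminal client nodes after 10 minutes
}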
How long have the servers been in their left state, and what triggered this?
I stumbled upon this discussion when searching for “disconnected clients”.
I noticed today that I have a few nodes which have been in the disconnected state for more than three months!
$ nomad node status -verbose | grep 0265be10-283b-7a2e-9c65-b6a0898501bf
0265be10-283b-7a2e-9c65-b6a0898501bf my-aws-region my-node-name my-node-class 10.72.95.238 1.3.1 false eligible disconnected
$ nomad node status 0265be10-283b-7a2e-9c65-b6a0898501bf
error fetching node stats: Unexpected response code: 404 (No path to node)
ID = 0265be10-283b-7a2e-9c65-b6a0898501bf
Name = my-node-name
Class = my-node-class
DC = my-aws-region
Drain = false
Eligibility = eligible
Status = disconnected
CSI Controllers = <none>
CSI Drivers = <none>
Host Volumes = <none>
Host Networks = loopback
CSI Volumes = <none>
Driver Status = docker,exec,java,raw_exec
Node Events
Time Subsystem Message
2022-07-22T07:29:39Z Cluster Node heartbeat missed
2022-07-21T09:06:25Z Driver: java Healthy
2022-07-21T09:05:26Z Cluster Node registered
Allocated Resources
CPU Memory Disk
0/4472 MHz 0 B/15 GiB 0 B/95 GiB
Allocation Resource Utilization
CPU Memory
0/4472 MHz 0 B/15 GiB
error fetching node stats: actual resource usage not present
Allocations
ID Node ID Task Group Version Desired Status Created Modified
035284df 0265be10 <redacted> 0 run pending 3mo26d ago 3mo26d ago
2f1266d2 0265be10 <redacted> 0 run pending 3mo26d ago 3mo26d ago
ac69f8c4 0265be10 <redacted> 0 run pending 3mo26d ago 3mo26d ago
cbcbe500 0265be10 <redacted> 0 run pending 3mo26d ago 3mo26d ago
72d36385 0265be10 <redacted> 0 run pending 3mo26d ago 3mo26d ago
e36d7448 0265be10 <redacted> 0 run pending 3mo26d ago 3mo26d ago
The node doesn’t go away even if I run multiple nomad system gc commands.
What (surprisingly) made it actually work was marking the non-existent node for draining.
The drain ran to completion even though, again, this node does NOT exist in the infrastructure, and it produced the following output …
$ nomad_oregon node drain -enable -deadline 1m 0943d50b-c28a-63d9-dbd2-9bd8769d020b
2022-11-15T18:22:51Z: Ctrl-C to stop monitoring: will not cancel the node drain
2022-11-15T18:22:51Z: Node "0943d50b-c28a-63d9-dbd2-9bd8769d020b" drain strategy set
2022-11-15T18:22:51Z: Drain complete for node 0943d50b-c28a-63d9-dbd2-9bd8769d020b
2022-11-15T18:22:51Z: Alloc "453938c2-ae48-e3cf-bdea-92f05768acd8" marked for migration
2022-11-15T18:22:51Z: Alloc "5c53b51d-45b8-5f81-9eae-bf0061c3be5c" marked for migration
2022-11-15T18:22:51Z: Alloc "267472e4-bf2b-4836-386e-8802f268fed2" marked for migration
2022-11-15T18:22:51Z: Alloc "a11f4abd-aa1c-2859-0d53-2335ad676b17" marked for migration
2022-11-15T18:22:51Z: Alloc "a73b282a-91b6-89ea-b5ba-5ae8e11c61b8" marked for migration
2022-11-15T18:22:51Z: Alloc "a4c12833-7f67-d7fa-7e19-6cc59f732c64" marked for migration
2022-11-15T18:22:51Z: Alloc "267472e4-bf2b-4836-386e-8802f268fed2" draining
2022-11-15T18:22:51Z: Alloc "5c53b51d-45b8-5f81-9eae-bf0061c3be5c" draining
2022-11-15T18:22:52Z: Alloc "453938c2-ae48-e3cf-bdea-92f05768acd8" draining
2022-11-15T18:22:52Z: Alloc "a4c12833-7f67-d7fa-7e19-6cc59f732c64" draining
2022-11-15T18:22:52Z: Alloc "a11f4abd-aa1c-2859-0d53-2335ad676b17" draining
2022-11-15T18:22:52Z: Alloc "a73b282a-91b6-89ea-b5ba-5ae8e11c61b8" draining
2022-11-15T18:23:20Z: Alloc "267472e4-bf2b-4836-386e-8802f268fed2" status pending -> lost
2022-11-15T18:23:21Z: Alloc "a4c12833-7f67-d7fa-7e19-6cc59f732c64" status pending -> lost
2022-11-15T18:23:21Z: Alloc "a11f4abd-aa1c-2859-0d53-2335ad676b17" status pending -> lost
2022-11-15T18:23:21Z: Alloc "a73b282a-91b6-89ea-b5ba-5ae8e11c61b8" status pending -> lost
2022-11-15T18:23:21Z: Alloc "5c53b51d-45b8-5f81-9eae-bf0061c3be5c" status pending -> lost
2022-11-15T18:23:21Z: Alloc "453938c2-ae48-e3cf-bdea-92f05768acd8" status pending -> lost
2022-11-15T18:23:21Z: All allocations on node "0943d50b-c28a-63d9-dbd2-9bd8769d020b" have stopped
… and then the node went from disconnected to down.
I was then able to remove the node from the list by running nomad system gc.
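To sum it up for anyone else who hits this: the sequence that finally cleared the phantom node for me was a short-deadline drain followed by a manual garbage collection (the node ID below is a placeholder):

$ nomad node drain -enable -deadline 1m <stale-node-id>
# wait for the drain to finish; the node moves from disconnected to down
$ nomad system gc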