Node will not leave cluster

I have a Nomad cluster with one host that will not leave the cluster. I can restart Nomad and watch the node go to initializing and then ready. However, if I stop it, the node just stays in ready. I can even completely shut the system down and it will still show as ready. I have to set the node as ineligible to keep anything from trying to place on it; only then can I shut it down.
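For the record, I have been marking it ineligible with something along the lines of:

nomad node eligibility -disable {node-id}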

I just upgraded everything to 0.9.4 and the problem is still there.

Thanks for your help!

Tim

Hey Tim,

To figure out what is happening, we will need some extra information.
Could you post the client config and the logs?

And the output of:
nomad node status -verbose {broken-node-id}

Here is the config; I will pull the logs and the output of the command above and post them. Thanks for the help.

datacenter = "colo1"
data_dir  = "/var/lib/nomad"

bind_addr = "0.0.0.0" # the default

client {
  enabled       = true
  options {
    "driver.raw_exec.enable" = "1"
    "driver.whitelist" = "docker"
  }
  reserved {
    cpu            = 500
    memory         = 512
    disk           = 2048
    reserved_ports = "22,80,8500-8600"
  }
  node_class = "prod"
}


consul {
  address             = "127.0.0.1:8500"
  server_service_name = "nomad"
  client_service_name = "nomad-client"
  auto_advertise      = true
  server_auto_join    = true
  client_auto_join    = true
}

Output of nomad node status -verbose:

ID          = 0d0a63ab-e0bf-2ef4-947f-ae0cea61a08d
Name        = docker-17
Class       = prod
DC          = colo1
Drain       = false
Eligibility = ineligible
Status      = ready
Uptime      = 162h4m35s

Drivers
Driver  Detected  Healthy  Message  Time
docker  true      true     Healthy  2019-08-05T19:18:39Z

Node Events
Time                  Subsystem  Message                                   Details
2019-08-05T19:40:22Z  Cluster    Node marked as ineligible for scheduling  <none>
2019-08-05T19:10:59Z  Cluster    Node marked as eligible for scheduling    <none>
2019-07-30T19:31:35Z  Drain      Node drain complete                       <none>
2019-07-30T19:31:35Z  Drain      Node drain strategy set                   <none>
2019-07-30T19:29:06Z  Drain      Node drain complete                       <none>
2019-07-30T19:29:05Z  Drain      Node drain strategy set                   <none>
2019-07-30T19:23:55Z  Cluster    Node marked as ineligible for scheduling  <none>
2019-07-26T14:47:57Z  Cluster    Node marked as eligible for scheduling    <none>
2019-07-26T14:46:33Z  Drain      Node drain complete                       <none>
2019-07-26T14:46:33Z  Drain      Node drain strategy set                   <none>

Allocated Resources
CPU          Memory      Disk
0/19500 MHz  0 B/31 GiB  0 B/263 GiB

Allocation Resource Utilization
CPU          Memory
0/19500 MHz  0 B/31 GiB

Host Resource Utilization
CPU           Memory          Disk
24/20000 MHz  753 MiB/31 GiB  15 GiB/294 GiB

Allocations
No allocations placed

Attributes
consul.datacenter             = colo1
consul.revision               = 40cec9846
consul.server                 = false
consul.version                = 1.5.1
cpu.arch                      = amd64
cpu.frequency                 = 2500
cpu.modelname                 = Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
cpu.numcores                  = 8
cpu.totalcompute              = 20000
driver.docker                 = 1
driver.docker.bridge_ip       = 172.17.0.1
driver.docker.os_type         = linux
driver.docker.runtimes        = runc
driver.docker.version         = 18.09.7
driver.docker.volumes.enabled = true
kernel.name                   = linux
kernel.version                = 4.15.0-55-generic
memory.totalbytes             = 33730695168
nomad.advertise.address       = redacted:4646
nomad.revision                = a81aa846a45fb8248551b12616287cb57c418cd6
nomad.version                 = 0.9.4
os.name                       = ubuntu
os.signals                    = SIGTTIN,SIGBUS,SIGINT,SIGPIPE,SIGURG,SIGALRM,SIGSYS,SIGTRAP,SIGSTOP,SIGTSTP,SIGXFSZ,SIGIO,SIGUSR2,SIGWINCH,SIGTERM,SIGTTOU,SIGABRT,SIGHUP,SIGKILL,SIGSEGV,SIGUSR1,SIGCONT,SIGILL,SIGIOT,SIGCHLD,SIGFPE,SIGPROF,SIGQUIT,SIGXCPU
os.version                    = 18.04
unique.cgroup.mountpoint      = /sys/fs/cgroup
unique.consul.name            = docker-17
unique.hostname               = docker-17
unique.network.ip-address     = redacted
unique.storage.bytesfree      = 284097679360
unique.storage.bytestotal     = 315990278144
unique.storage.volume         = /dev/sda2

Logs from host on join

Aug  6 13:51:16 docker-17 systemd[1]: Started Nomad Agent.
Aug  6 13:51:16 docker-17 nomad[27184]: ==> Loaded configuration from /etc/nomad.d/server.conf
Aug  6 13:51:16 docker-17 nomad[27184]: ==> Starting Nomad agent...
Aug  6 13:51:20 docker-17 dockerd[1738]: time="2019-08-06T13:51:20.695596051Z" level=warning msg="failed to retrieve runc version: unknown output format: runc version spec: 1.0.1-dev\n"
Aug  6 13:51:20 docker-17 nomad[27184]: ==> Nomad agent configuration:
Aug  6 13:51:20 docker-17 nomad[27184]:        Advertise Addrs: HTTP: redacted:4646
Aug  6 13:51:20 docker-17 nomad[27184]:             Bind Addrs: HTTP: 0.0.0.0:4646
Aug  6 13:51:20 docker-17 nomad[27184]:                 Client: true
Aug  6 13:51:20 docker-17 nomad[27184]:              Log Level: INFO
Aug  6 13:51:20 docker-17 nomad[27184]:                 Region: global (DC: colo1)
Aug  6 13:51:20 docker-17 nomad[27184]:                 Server: false
Aug  6 13:51:20 docker-17 nomad[27184]:                Version: 0.9.4
Aug  6 13:51:20 docker-17 nomad[27184]: ==> Nomad agent started! Log data will stream in below:
Aug  6 13:51:20 docker-17 nomad[27184]:     2019-08-06T13:51:16.673Z [WARN ] agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=/var/lib/nomad/plugins
Aug  6 13:51:20 docker-17 nomad[27184]:     2019-08-06T13:51:16.675Z [INFO ] agent: detected plugin: name=java type=driver plugin_version=0.1.0
Aug  6 13:51:20 docker-17 nomad[27184]:     2019-08-06T13:51:16.675Z [INFO ] agent: detected plugin: name=docker type=driver plugin_version=0.1.0
Aug  6 13:51:20 docker-17 nomad[27184]:     2019-08-06T13:51:16.675Z [INFO ] agent: detected plugin: name=rkt type=driver plugin_version=0.1.0
Aug  6 13:51:20 docker-17 nomad[27184]:     2019-08-06T13:51:16.675Z [INFO ] agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
Aug  6 13:51:20 docker-17 nomad[27184]:     2019-08-06T13:51:16.675Z [INFO ] agent: detected plugin: name=exec type=driver plugin_version=0.1.0
Aug  6 13:51:20 docker-17 nomad[27184]:     2019-08-06T13:51:16.675Z [INFO ] agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
Aug  6 13:51:20 docker-17 nomad[27184]:     2019-08-06T13:51:16.675Z [INFO ] agent: detected plugin: name=nvidia-gpu type=device plugin_version=0.1.0
Aug  6 13:51:20 docker-17 nomad[27184]:     2019-08-06T13:51:16.675Z [INFO ] client: using state directory: state_dir=/var/lib/nomad/client
Aug  6 13:51:20 docker-17 nomad[27184]:     2019-08-06T13:51:16.676Z [INFO ] client: using alloc directory: alloc_dir=/var/lib/nomad/alloc
Aug  6 13:51:20 docker-17 nomad[27184]:     2019-08-06T13:51:16.677Z [INFO ] client.fingerprint_mgr.cgroup: cgroups are available
Aug  6 13:51:20 docker-17 nomad[27184]:     2019-08-06T13:51:16.681Z [INFO ] client.fingerprint_mgr.consul: consul agent is available
Aug  6 13:51:20 docker-17 nomad[27184]:     2019-08-06T13:51:20.687Z [INFO ] client.plugin: starting plugin manager: plugin-type=driver
Aug  6 13:51:20 docker-17 nomad[27184]:     2019-08-06T13:51:20.687Z [INFO ] client.plugin: starting plugin manager: plugin-type=device
Aug  6 13:51:20 docker-17 nomad[27184]:     2019-08-06T13:51:20.693Z [INFO ] client.consul: discovered following servers: servers=10.144.202.8:4647,10.144.202.7:4647,10.144.202.9:4647
Aug  6 13:51:20 docker-17 nomad[27184]:     2019-08-06T13:51:20.697Z [INFO ] client: started client: node_id=0d0a63ab-e0bf-2ef4-947f-ae0cea61a08d
Aug  6 13:51:20 docker-17 consul[1103]:     2019/08/06 13:51:20 [INFO] agent: Synced service "_nomad-client-sqmkow7gzamfva3ooon2einyunniq3mn"
Aug  6 13:51:20 docker-17 consul[1103]: agent: Synced service "_nomad-client-sqmkow7gzamfva3ooon2einyunniq3mn"
Aug  6 13:51:20 docker-17 nomad[27184]:     2019-08-06T13:51:20.711Z [INFO ] client: node registration complete
Aug  6 13:51:20 docker-17 consul[1103]:     2019/08/06 13:51:20 [INFO] agent: Synced check "_nomad-check-4d980ee6a535dea7502a1ecc2ce396fc833ba4ba"
Aug  6 13:51:20 docker-17 consul[1103]: agent: Synced check "_nomad-check-4d980ee6a535dea7502a1ecc2ce396fc833ba4ba"
Aug  6 13:51:22 docker-17 consul[1103]:     2019/08/06 13:51:22 [INFO] agent: Synced check "_nomad-check-4d980ee6a535dea7502a1ecc2ce396fc833ba4ba"
Aug  6 13:51:22 docker-17 consul[1103]: agent: Synced check "_nomad-check-4d980ee6a535dea7502a1ecc2ce396fc833ba4ba"
Aug  6 13:51:28 docker-17 nomad[27184]:     2019-08-06T13:51:28.640Z [INFO ] client: node registration complete

Logs from host on nomad shutdown

Aug  6 13:54:19 docker-17 systemd[1]: Stopping Nomad Agent...
Aug  6 13:54:19 docker-17 nomad[27184]: ==> Caught signal: interrupt
Aug  6 13:54:19 docker-17 nomad[27184]:     2019-08-06T13:54:19.842Z [INFO ] agent: requesting shutdown
Aug  6 13:54:19 docker-17 nomad[27184]:     2019-08-06T13:54:19.842Z [INFO ] client: shutting down
Aug  6 13:54:19 docker-17 nomad[27184]:     2019-08-06T13:54:19.842Z [INFO ] client.plugin: shutting down plugin manager: plugin-type=device
Aug  6 13:54:19 docker-17 nomad[27184]:     2019-08-06T13:54:19.843Z [INFO ] client.plugin: plugin manager finished: plugin-type=device
Aug  6 13:54:19 docker-17 nomad[27184]:     2019-08-06T13:54:19.843Z [INFO ] client.plugin: shutting down plugin manager: plugin-type=driver
Aug  6 13:54:19 docker-17 nomad[27184]:     2019-08-06T13:54:19.845Z [INFO ] client.plugin: plugin manager finished: plugin-type=driver
Aug  6 13:54:19 docker-17 consul[1103]:     2019/08/06 13:54:19 [INFO] agent: Deregistered service "_nomad-client-sqmkow7gzamfva3ooon2einyunniq3mn"
Aug  6 13:54:19 docker-17 consul[1103]: agent: Deregistered service "_nomad-client-sqmkow7gzamfva3ooon2einyunniq3mn"
Aug  6 13:54:19 docker-17 consul[1103]:     2019/08/06 13:54:19 [INFO] agent: Deregistered check "_nomad-check-4d980ee6a535dea7502a1ecc2ce396fc833ba4ba"
Aug  6 13:54:19 docker-17 consul[1103]: agent: Deregistered check "_nomad-check-4d980ee6a535dea7502a1ecc2ce396fc833ba4ba"
Aug  6 13:54:19 docker-17 nomad[27184]:     2019-08-06T13:54:19.857Z [INFO ] agent: shutdown complete
Aug  6 13:54:19 docker-17 systemd[1]: nomad.service: Main process exited, code=exited, status=1/FAILURE
Aug  6 13:54:19 docker-17 systemd[1]: nomad.service: Failed with result 'exit-code'.
Aug  6 13:54:19 docker-17 systemd[1]: Stopped Nomad Agent.

I don’t see anything strange in the output and configs that would lead to this behavior.

Is there some way I can manually remove the machine from Nomad's database? I don't have the original machine anymore; it was removed, as it was supposed to be. For now I just have this ghost machine that is "reporting" CPU, memory, etc. to the Nomad cluster. I have it set as ineligible so my jobs don't fail, but I don't like the idea of it lingering in my cluster.

@timistim, have you run nomad node drain on the node to remove it, or was it shut down / did it crash? Draining a node should mark it ineligible, stop any allocations, and then mark it as garbage, which will cause the scheduler to deregister the node.
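For reference, a drain is usually started with something along the lines of:

nomad node drain -enable {node-id}

and turned back off with -disable.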

I did run that when I was originally removing the host; if I run it now, it just completes without error:

2019-08-12T10:42:06-05:00: Ctrl-C to stop monitoring: will not cancel the node drain
2019-08-12T10:42:06-05:00: Node "0d0a63ab-e0bf-2ef4-947f-ae0cea61a08d" drain strategy set
2019-08-12T10:42:06-05:00: All allocations on node "0d0a63ab-e0bf-2ef4-947f-ae0cea61a08d" have stopped.
2019-08-12T10:42:07-05:00: No drain strategy set for node 0d0a63ab-e0bf-2ef4-947f-ae0cea61a08d

The node still shows up afterwards:

ID        DC     Name       Class  Drain  Eligibility  Status
0d0a63ab  colo1  docker-17  prod   false  ineligible   ready

Also, if I look at it in the GUI I can actually see memory and CPU being reported for it. I have even manually run garbage collection in the environment and it is still there. This machine does not exist and was deleted a few days ago.
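(For reference, I forced the collection with something along the lines of nomad system gc.)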

What version were you running before you upgraded to 0.9.4? Did you update the entire cluster?

0.9.3, and yes, I upgraded the entire cluster.

Sorry it’s taken so long to get back to you here! You should be able to manually force the removal of the node by ID using the purge API call.
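That goes against the HTTP API; something like this should do it (adjust the address to one of your Nomad agents):

curl -X POST http://127.0.0.1:4646/v1/node/0d0a63ab-e0bf-2ef4-947f-ae0cea61a08d/purge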

Let me know if that works!

That appears to have worked! Thank you so much. My OCD feels much better now.

Awesome! Glad to hear it.