I have a Nomad cluster with one host that will not leave the cluster. I can restart Nomad and watch the node go to initializing and then ready. However, if I stop it, the node just stays in ready. I can even completely shut the system down and it will still show as ready. I have to set the node as ineligible to keep anything from being placed on it before I can shut it down.
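For reference, I mark the node ineligible with the standard CLI, something like:
nomad node eligibility -disable <node-id>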
I just upgraded everything to 0.9.4 and the problem is still there.
Thanks for your help!
Tim
eveld
August 6, 2019, 7:36am
2
Hey Tim,
To figure out what is happening, we will need some extra information.
Could you post the client config and the logs?
And the output of:
nomad node status -verbose {broken-node-id}
Here is the config; I will pull the logs, get the output of the command above, and post them. Thanks for the help.
datacenter = "colo1"
data_dir = "/var/lib/nomad"
bind_addr = "0.0.0.0" # the default
client {
enabled = true
options {
"driver.raw_exec.enable" = "1"
"driver.whitelist" = "docker"
}
reserved {
cpu = 500
memory = 512
disk = 2048
reserved_ports = "22,80,8500-8600"
}
node_class = "prod"
}
consul {
address = "127.0.0.1:8500"
server_service_name = "nomad"
client_service_name = "nomad-client"
auto_advertise = true
server_auto_join = true
client_auto_join = true
}
ID = 0d0a63ab-e0bf-2ef4-947f-ae0cea61a08d
Name = docker-17
Class = prod
DC = colo1
Drain = false
Eligibility = ineligible
Status = ready
Uptime = 162h4m35s
Drivers
Driver Detected Healthy Message Time
docker true true Healthy 2019-08-05T19:18:39Z
Node Events
Time Subsystem Message Details
2019-08-05T19:40:22Z Cluster Node marked as ineligible for scheduling <none>
2019-08-05T19:10:59Z Cluster Node marked as eligible for scheduling <none>
2019-07-30T19:31:35Z Drain Node drain complete <none>
2019-07-30T19:31:35Z Drain Node drain strategy set <none>
2019-07-30T19:29:06Z Drain Node drain complete <none>
2019-07-30T19:29:05Z Drain Node drain strategy set <none>
2019-07-30T19:23:55Z Cluster Node marked as ineligible for scheduling <none>
2019-07-26T14:47:57Z Cluster Node marked as eligible for scheduling <none>
2019-07-26T14:46:33Z Drain Node drain complete <none>
2019-07-26T14:46:33Z Drain Node drain strategy set <none>
Allocated Resources
CPU Memory Disk
0/19500 MHz 0 B/31 GiB 0 B/263 GiB
Allocation Resource Utilization
CPU Memory
0/19500 MHz 0 B/31 GiB
Host Resource Utilization
CPU Memory Disk
24/20000 MHz 753 MiB/31 GiB 15 GiB/294 GiB
Allocations
No allocations placed
Attributes
consul.datacenter = colo1
consul.revision = 40cec9846
consul.server = false
consul.version = 1.5.1
cpu.arch = amd64
cpu.frequency = 2500
cpu.modelname = Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
cpu.numcores = 8
cpu.totalcompute = 20000
driver.docker = 1
driver.docker.bridge_ip = 172.17.0.1
driver.docker.os_type = linux
driver.docker.runtimes = runc
driver.docker.version = 18.09.7
driver.docker.volumes.enabled = true
kernel.name = linux
kernel.version = 4.15.0-55-generic
memory.totalbytes = 33730695168
nomad.advertise.address = redacted:4646
nomad.revision = a81aa846a45fb8248551b12616287cb57c418cd6
nomad.version = 0.9.4
os.name = ubuntu
os.signals = SIGTTIN,SIGBUS,SIGINT,SIGPIPE,SIGURG,SIGALRM,SIGSYS,SIGTRAP,SIGSTOP,SIGTSTP,SIGXFSZ,SIGIO,SIGUSR2,SIGWINCH,SIGTERM,SIGTTOU,SIGABRT,SIGHUP,SIGKILL,SIGSEGV,SIGUSR1,SIGCONT,SIGILL,SIGIOT,SIGCHLD,SIGFPE,SIGPROF,SIGQUIT,SIGXCPU
os.version = 18.04
unique.cgroup.mountpoint = /sys/fs/cgroup
unique.consul.name = docker-17
unique.hostname = docker-17
unique.network.ip-address = redacted
unique.storage.bytesfree = 284097679360
unique.storage.bytestotal = 315990278144
unique.storage.volume = /dev/sda2
Logs from the host on join:
Aug 6 13:51:16 docker-17 systemd[1]: Started Nomad Agent.
Aug 6 13:51:16 docker-17 nomad[27184]: ==> Loaded configuration from /etc/nomad.d/server.conf
Aug 6 13:51:16 docker-17 nomad[27184]: ==> Starting Nomad agent...
Aug 6 13:51:20 docker-17 dockerd[1738]: time="2019-08-06T13:51:20.695596051Z" level=warning msg="failed to retrieve runc version: unknown output format: runc version spec: 1.0.1-dev\n"
Aug 6 13:51:20 docker-17 nomad[27184]: ==> Nomad agent configuration:
Aug 6 13:51:20 docker-17 nomad[27184]: Advertise Addrs: HTTP: redacted:4646
Aug 6 13:51:20 docker-17 nomad[27184]: Bind Addrs: HTTP: 0.0.0.0:4646
Aug 6 13:51:20 docker-17 nomad[27184]: Client: true
Aug 6 13:51:20 docker-17 nomad[27184]: Log Level: INFO
Aug 6 13:51:20 docker-17 nomad[27184]: Region: global (DC: colo1)
Aug 6 13:51:20 docker-17 nomad[27184]: Server: false
Aug 6 13:51:20 docker-17 nomad[27184]: Version: 0.9.4
Aug 6 13:51:20 docker-17 nomad[27184]: ==> Nomad agent started! Log data will stream in below:
Aug 6 13:51:20 docker-17 nomad[27184]: 2019-08-06T13:51:16.673Z [WARN ] agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=/var/lib/nomad/plugins
Aug 6 13:51:20 docker-17 nomad[27184]: 2019-08-06T13:51:16.675Z [INFO ] agent: detected plugin: name=java type=driver plugin_version=0.1.0
Aug 6 13:51:20 docker-17 nomad[27184]: 2019-08-06T13:51:16.675Z [INFO ] agent: detected plugin: name=docker type=driver plugin_version=0.1.0
Aug 6 13:51:20 docker-17 nomad[27184]: 2019-08-06T13:51:16.675Z [INFO ] agent: detected plugin: name=rkt type=driver plugin_version=0.1.0
Aug 6 13:51:20 docker-17 nomad[27184]: 2019-08-06T13:51:16.675Z [INFO ] agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
Aug 6 13:51:20 docker-17 nomad[27184]: 2019-08-06T13:51:16.675Z [INFO ] agent: detected plugin: name=exec type=driver plugin_version=0.1.0
Aug 6 13:51:20 docker-17 nomad[27184]: 2019-08-06T13:51:16.675Z [INFO ] agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
Aug 6 13:51:20 docker-17 nomad[27184]: 2019-08-06T13:51:16.675Z [INFO ] agent: detected plugin: name=nvidia-gpu type=device plugin_version=0.1.0
Aug 6 13:51:20 docker-17 nomad[27184]: 2019-08-06T13:51:16.675Z [INFO ] client: using state directory: state_dir=/var/lib/nomad/client
Aug 6 13:51:20 docker-17 nomad[27184]: 2019-08-06T13:51:16.676Z [INFO ] client: using alloc directory: alloc_dir=/var/lib/nomad/alloc
Aug 6 13:51:20 docker-17 nomad[27184]: 2019-08-06T13:51:16.677Z [INFO ] client.fingerprint_mgr.cgroup: cgroups are available
Aug 6 13:51:20 docker-17 nomad[27184]: 2019-08-06T13:51:16.681Z [INFO ] client.fingerprint_mgr.consul: consul agent is available
Aug 6 13:51:20 docker-17 nomad[27184]: 2019-08-06T13:51:20.687Z [INFO ] client.plugin: starting plugin manager: plugin-type=driver
Aug 6 13:51:20 docker-17 nomad[27184]: 2019-08-06T13:51:20.687Z [INFO ] client.plugin: starting plugin manager: plugin-type=device
Aug 6 13:51:20 docker-17 nomad[27184]: 2019-08-06T13:51:20.693Z [INFO ] client.consul: discovered following servers: servers=10.144.202.8:4647,10.144.202.7:4647,10.144.202.9:4647
Aug 6 13:51:20 docker-17 nomad[27184]: 2019-08-06T13:51:20.697Z [INFO ] client: started client: node_id=0d0a63ab-e0bf-2ef4-947f-ae0cea61a08d
Aug 6 13:51:20 docker-17 consul[1103]: 2019/08/06 13:51:20 [INFO] agent: Synced service "_nomad-client-sqmkow7gzamfva3ooon2einyunniq3mn"
Aug 6 13:51:20 docker-17 consul[1103]: agent: Synced service "_nomad-client-sqmkow7gzamfva3ooon2einyunniq3mn"
Aug 6 13:51:20 docker-17 nomad[27184]: 2019-08-06T13:51:20.711Z [INFO ] client: node registration complete
Aug 6 13:51:20 docker-17 consul[1103]: 2019/08/06 13:51:20 [INFO] agent: Synced check "_nomad-check-4d980ee6a535dea7502a1ecc2ce396fc833ba4ba"
Aug 6 13:51:20 docker-17 consul[1103]: agent: Synced check "_nomad-check-4d980ee6a535dea7502a1ecc2ce396fc833ba4ba"
Aug 6 13:51:22 docker-17 consul[1103]: 2019/08/06 13:51:22 [INFO] agent: Synced check "_nomad-check-4d980ee6a535dea7502a1ecc2ce396fc833ba4ba"
Aug 6 13:51:22 docker-17 consul[1103]: agent: Synced check "_nomad-check-4d980ee6a535dea7502a1ecc2ce396fc833ba4ba"
Aug 6 13:51:28 docker-17 nomad[27184]: 2019-08-06T13:51:28.640Z [INFO ] client: node registration complete
Logs from the host on Nomad shutdown:
Aug 6 13:54:19 docker-17 systemd[1]: Stopping Nomad Agent...
Aug 6 13:54:19 docker-17 nomad[27184]: ==> Caught signal: interrupt
Aug 6 13:54:19 docker-17 nomad[27184]: 2019-08-06T13:54:19.842Z [INFO ] agent: requesting shutdown
Aug 6 13:54:19 docker-17 nomad[27184]: 2019-08-06T13:54:19.842Z [INFO ] client: shutting down
Aug 6 13:54:19 docker-17 nomad[27184]: 2019-08-06T13:54:19.842Z [INFO ] client.plugin: shutting down plugin manager: plugin-type=device
Aug 6 13:54:19 docker-17 nomad[27184]: 2019-08-06T13:54:19.843Z [INFO ] client.plugin: plugin manager finished: plugin-type=device
Aug 6 13:54:19 docker-17 nomad[27184]: 2019-08-06T13:54:19.843Z [INFO ] client.plugin: shutting down plugin manager: plugin-type=driver
Aug 6 13:54:19 docker-17 nomad[27184]: 2019-08-06T13:54:19.845Z [INFO ] client.plugin: plugin manager finished: plugin-type=driver
Aug 6 13:54:19 docker-17 consul[1103]: 2019/08/06 13:54:19 [INFO] agent: Deregistered service "_nomad-client-sqmkow7gzamfva3ooon2einyunniq3mn"
Aug 6 13:54:19 docker-17 consul[1103]: agent: Deregistered service "_nomad-client-sqmkow7gzamfva3ooon2einyunniq3mn"
Aug 6 13:54:19 docker-17 consul[1103]: 2019/08/06 13:54:19 [INFO] agent: Deregistered check "_nomad-check-4d980ee6a535dea7502a1ecc2ce396fc833ba4ba"
Aug 6 13:54:19 docker-17 consul[1103]: agent: Deregistered check "_nomad-check-4d980ee6a535dea7502a1ecc2ce396fc833ba4ba"
Aug 6 13:54:19 docker-17 nomad[27184]: 2019-08-06T13:54:19.857Z [INFO ] agent: shutdown complete
Aug 6 13:54:19 docker-17 systemd[1]: nomad.service: Main process exited, code=exited, status=1/FAILURE
Aug 6 13:54:19 docker-17 systemd[1]: nomad.service: Failed with result 'exit-code'.
Aug 6 13:54:19 docker-17 systemd[1]: Stopped Nomad Agent.
eveld
August 12, 2019, 11:46am
7
I don’t see anything strange in the output and configs that would lead to this behavior.
Is there some way I can manually remove the machine from Nomad's database? I don't have the original machine anymore; it was removed, as intended. For now I just have this ghost machine "reporting" CPU, memory, etc. to the Nomad cluster. I have it set as ineligible so my jobs don't fail, but I don't like the idea of it lingering in my cluster.
@timistim, have you run nomad node drain on the node to remove it, or was it shut down / did it crash? Draining a node should mark it ineligible, stop any allocations, and then mark it for garbage collection, which will cause the scheduler to deregister the node.
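For example, something like:
nomad node drain -enable -yes <node-id>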
I did run that when I was originally removing the host; if I run it now, it just completes without error:
2019-08-12T10:42:06-05:00: Ctrl-C to stop monitoring: will not cancel the node drain
2019-08-12T10:42:06-05:00: Node "0d0a63ab-e0bf-2ef4-947f-ae0cea61a08d" drain strategy set
2019-08-12T10:42:06-05:00: All allocations on node "0d0a63ab-e0bf-2ef4-947f-ae0cea61a08d" have stopped.
2019-08-12T10:42:07-05:00: No drain strategy set for node 0d0a63ab-e0bf-2ef4-947f-ae0cea61a08d
The node still shows up in nomad node status:
0d0a63ab colo1 docker-17 prod false ineligible ready
Also, if I look at it in the GUI I can actually see memory and CPU being reported for it. I have even manually run garbage collection in the environment, and the node is still there. The machine does not exist and has been deleted for a few days now.
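I triggered the garbage collection through the HTTP API, something like this (assuming the default address and port):
curl -X PUT http://localhost:4646/v1/system/gc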
What version were you running before you upgraded to 0.9.4? Did you update the entire cluster?
0.9.3, and yes, I upgraded the entire cluster.
Sorry it's taken so long to get back to you here! You should be able to manually force the removal of the node by ID using the purge API call.
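Something like this, assuming the default address and port:
curl -X POST http://localhost:4646/v1/node/0d0a63ab-e0bf-2ef4-947f-ae0cea61a08d/purge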
Let me know if that works!
That appears to have worked!!! Thank you so much. My OCD feels much better now.
Awesome! Glad to hear it.