Nomad ghost job

Hi,

I’m running a Nomad 0.11.0 and Consul 1.7.2 cluster, and I have an old job that I can’t completely remove.

  • The job doesn’t show up in the Nomad UI.
  • Its associated services are still visible in the Consul UI.
  • Services removed with consul services deregister -id=.. reappear.
  • The Docker container associated with the old job is automatically restarted if I stop it manually.
  • If I submit the job again, the Docker part of the job is shown as failing in the Nomad UI.

I would greatly appreciate help with troubleshooting this strange situation.

This happened to me after rebooting the whole cluster at once: when I killed the containers, Nomad scheduled them again, but no job or status could be found for them.
My workaround is to deep-clean the Nomad nodes where the containers appear and reattach them to the cluster:

  • Drain the node
    (nothing should be running on it afterwards, except for the rogue job).
  • Stop Nomad and Docker.
  • Empty the Nomad and Docker working directories (often /var/lib/nomad and /var/lib/docker).
  • Trigger the garbage collector (on a cluster server: curl -XPUT http://127.0.0.1:4646/v1/system/gc).
  • Reboot the node, and add it to the cluster again if it doesn’t rejoin automatically.
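
The steps above could be scripted roughly as follows. This is only a sketch under assumptions that are not from my setup notes: the node uses systemd with units named nomad and docker, the data directories are at the default paths, and find supports -mindepth and -delete. The helper names run, clean_node and gc_cluster are made up for illustration. Emptying /var/lib/docker removes all images, containers and volumes on that node, so the script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch of the cleanup steps; DRY_RUN=1 (the default) only prints commands.
DRY_RUN="${DRY_RUN:-1}"
NOMAD_DIR="${NOMAD_DIR:-/var/lib/nomad}"          # assumed Nomad data dir
DOCKER_DIR="${DOCKER_DIR:-/var/lib/docker}"       # assumed Docker data dir
NOMAD_ADDR="${NOMAD_ADDR:-http://127.0.0.1:4646}" # Nomad HTTP API address

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Run on the affected client node (as root).
clean_node() {
    # Drain this node so nothing legitimate keeps running on it.
    run nomad node drain -enable -self -yes
    # Stop both daemons before touching their state (systemd assumed).
    run systemctl stop nomad docker
    # Empty both data dirs but keep the directories themselves.
    # Leftover mounts under the Nomad dir may need unmounting first.
    run find "$NOMAD_DIR" "$DOCKER_DIR" -mindepth 1 -delete
    run reboot
}

# Run on a cluster server: force a garbage collection.
gc_cluster() {
    run curl -X PUT "$NOMAD_ADDR/v1/system/gc"
}
```

Source the file and call clean_node on the affected node and gc_cluster on a server. Check the dry-run output first, then repeat with DRY_RUN=0. The order follows the list above, so gc_cluster would be called on a server after the node's directories are emptied and before the node comes back up, which means running clean_node in one go skips that ordering unless you split it up by hand.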

I had to clean up 2 out of 4 nodes this way before the ghost jobs were exorcised.