Nomad does not deregister Consul services at host shutdown

Hello, :wave:

Nomad 1.6.3 (ACL + TLS)
Consul 1.16.2 (ACL + TLS)
Vault 1.15.1 (ACL + TLS)
Single Node

I am seeing some strange behavior.

When I reboot my host, Nomad does not deregister the Consul services at shutdown.
So when the host comes back up, the Nomad jobs restart in a loop, because the old Consul services are still registered and failing their checks in Consul.

I have to manually stop/start each job in the Nomad UI.
After that, Nomad deregisters the services and restarts the job without any problem.

Which Nomad or Consul setting makes the services get deregistered when the server is shut down?

Thanks.

Hi,

A dirty workaround, inspired by this GitHub issue: option to restore eligibility after drain_on_shutdown · Issue #17093 · hashicorp/nomad · GitHub

In /etc/nomad/nomad.hcl I added:

leave_on_terminate = true
leave_on_interrupt = true

[...]

client {
  enabled = true

  servers = ["127.0.0.1:4647"]

  server_join {
    retry_join     = ["127.0.0.1"]
    retry_max      = 3
    retry_interval = "15s"
  }

  # Drain allocations from the node before the agent exits
  drain_on_shutdown {
    deadline           = "1m"
    force              = true
    ignore_system_jobs = true
  }

[...]

}

leave_on_terminate and leave_on_interrupt are set to true so that Nomad drains and stops its jobs on any signal, e.g. a host reboot or a restart of Nomad itself.

But with this setup, the node is not eligible after a reboot. So I created this systemd service:

# nomad-autoeligibility.service (referenced from nomad.service below)
[Unit]
Description=Nomad auto Eligibility node service
After=nomad.service

[Service]
Type=oneshot
Restart=on-failure
# Give the local Nomad agent time to come up before toggling eligibility
ExecStartPre=/bin/bash -c "/usr/bin/sleep 60"
ExecStart=/usr/local/bin/ansible-playbook -i localhost, nomad_autoeligibility.yml
User=root
Group=root

[Install]
WantedBy=multi-user.target

and my playbook has a task something like this:

    - name: "Nomad | Set Eligibility of node"
      ansible.builtin.shell: nomad node eligibility -enable -self
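For reference, a complete minimal playbook around that task could look like the following. This is my own sketch, not the author's exact file; the localhost/local-connection play and the ACL token note are assumptions.

# nomad_autoeligibility.yml (sketch)
- name: Restore Nomad node eligibility after reboot
  hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: "Nomad | Set Eligibility of node"
      ansible.builtin.shell: nomad node eligibility -enable -self
      # With ACLs enabled, the Nomad CLI may need a NOMAD_TOKEN
      # environment variable available to this task.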

But you could switch to a simple bash script instead, for example:
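Something like this could work (my own sketch; the script path and the NOMAD_TOKEN note are assumptions, and the 60-second wait just mirrors the ExecStartPre sleep in the unit above):

#!/usr/bin/env bash
# e.g. /usr/local/bin/nomad-autoeligibility.sh (hypothetical path)
set -euo pipefail

# Give the local Nomad agent time to come back up after the reboot
sleep 60

# With ACLs enabled, export NOMAD_TOKEN before this call if required
nomad node eligibility -enable -self

The unit's ExecStart would then point at this script instead of ansible-playbook, and the ExecStartPre sleep could be dropped.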

And as a last step, I added this to the [Unit] section of nomad.service:

Wants=network-online.target nomad-autoeligibility.service

To finish, run systemctl daemon-reload and voilà!
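As a side note, the same Wants= line could also live in a systemd drop-in instead of the packaged unit file; running systemctl edit nomad creates an override like this (a sketch, not part of the original setup):

# /etc/systemd/system/nomad.service.d/override.conf
[Unit]
Wants=network-online.target nomad-autoeligibility.service

A drop-in has the advantage of surviving package upgrades that replace nomad.service.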

Now when I reboot the host, Nomad stops all jobs and drains the node before shutdown. About one minute after Nomad starts back up, the node is eligible again and the jobs restart.
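To double-check after a reboot, the node's scheduling eligibility can be inspected from the CLI (assuming the CLI can reach the local agent; with ACLs a NOMAD_TOKEN may be needed):

nomad node status -self

The output should report the node as eligible once the oneshot service has run.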

If someone has a better idea, I will be happy to try it.

Thanks

I had a problem related to the policies applied to the Consul token: whenever I removed a job, Consul did not remove the registrations.
I was only able to solve it by granting write policy on agent and service in Consul.

Thanks @diegovitor

I think I already have this in my agent policy:

service_prefix "" {
  policy = "write"
}

node_prefix "" {
  policy = "write"
}

agent_prefix "" {
  policy = "write"
}

Thanks!