How can I ask "please nomad, kindly place <job> onto this node"

I noticed that a node which I expected to be running an allocated job isn’t. Is there a way to recover from that? I want to say, “please nomad, kindly place <job> onto this node”

The job definition definitely matches: there are 149 OTHER nodes, named similarly, which do run this job, and the job has constraints which match this node.

To be clear, there are other jobs allocated on this node that are running perfectly. I’d rather not run nomad node drain -enable -self followed by nomad node drain -disable -self, as those running jobs have established TCP connections I’d not want to reset.

Hi @jrwren. I would recommend looking at job constraints which could be used to ensure the job is placed on a specific node.

The constraint may look something like the following:

constraint {
    attribute = "${node.unique.id}"
    value     = "9afa5da1-8f39-25a2-48dc-ba31fd7c0023"
}
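
If you are unsure of a node’s unique ID, it can be looked up with the Nomad CLI, for example (the node ID below is a placeholder):

nomad node status                     # lists nodes with their short IDs and names
nomad node status -verbose <node-id>  # shows the full unique ID and node attributes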

Thanks,
jrasell and the Nomad team

As I tried to say, this is already done. The job constraint is in place, and there are 2 other jobs with the exact same constraint allocated on this node. The job which isn’t allocated on this node once was allocated there, but something happened to make it disappear, and nomad either isn’t aware or didn’t reallocate it.

Hi @jrwren. Do you have any logs or error messages to help understand what happened? Are you able to try stopping the job and resubmitting it?

Thanks,
jrasell and the Nomad team
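
For reference, and assuming a standard setup, stopping and resubmitting a job would look something like the sketch below (the job name and file are placeholders); note that this stops every allocation of the job, which is exactly the interruption described in the next reply:

nomad job stop <job>        # stops all allocations of the job
nomad job run <job>.nomad   # resubmits the job from its job file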

I don’t know how to look for logs specific to that node and job.

I do not know how to stop and resubmit the job without interrupting the other 149 allocations; doing so would interrupt service. We are planning on doing that in a few days or so for other reasons. I’m looking forward to seeing how the allocations change at that point, but it certainly would be nice to recover this one missing allocation on this one node.
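
For anyone in a similar spot, logs for a specific node and job can usually be narrowed down with the Nomad CLI; the job name and allocation ID here are placeholders:

nomad job status <job>         # lists the job's allocations along with their node IDs
nomad alloc status <alloc-id>  # shows recent task events for one allocation
nomad alloc logs <alloc-id>    # prints the task's stdout; add -stderr for stderr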

It turns out the underlying docker version installed on these systems suffers from a bug in which a dying container is sometimes not entirely cleaned up. The container shows in docker ps output as running, but there are no processes behind it. nomad sees it as running even though the health checks are unable to run, and this seems to prevent nomad from starting a new container since it thinks one is already running.
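
A rough way to confirm such a half-dead container (the container name is a placeholder) is to compare what docker ps claims against the actual process table:

docker ps                                      # container still listed as "Up"
docker inspect -f '{{.State.Pid}}' <container> # the PID docker believes is running
ps -p <pid>                                    # empty output means the process is gone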

Best fix: upgrade the docker package.
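
Until the upgrade is rolled out, one possible stopgap (untested here) is to force-remove the dead container so nomad notices the task is gone and reschedules it:

docker rm -f <container-id>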