Multiple groups in nomad job

jmwilkinson · September 8, 2021, 1:34am

I have a service that requires a DB to be up and running. As they are intrinsically related, I am trying to keep them together in the same job.

However, if I put them in the same group as separate tasks, I have observed the following:

The db will start coming up
The service will come up, and attempt to connect to the db
This fails, and the service crashes
The failure is noted by Nomad, and both tasks in the group are terminated to try again.
Obviously, the second attempt has the same result, as the DB now must go through initialization once again.
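For the single-group case in the steps above, one option is Nomad's `lifecycle` stanza together with a `restart` policy. The following is a minimal sketch, not a tested job: the job, group, and task names and the images are hypothetical. A prestart sidecar only guarantees that the db task is started first, not that the database is ready to accept connections, so the service may still need to retry.

```hcl
# Sketch only: names and images are hypothetical placeholders.
job "app" {
  datacenters = ["dc1"]

  group "app" {
    # Restart a failed task in place a number of times before the
    # allocation as a whole is marked as failed.
    restart {
      attempts = 10
      interval = "10m"
      delay    = "15s"
      mode     = "delay"
    }

    task "db" {
      driver = "docker"

      # Start this task before the main tasks and keep it running
      # alongside them for the life of the allocation.
      lifecycle {
        hook    = "prestart"
        sidecar = true
      }

      config {
        image = "postgres:13"
      }
    }

    task "service" {
      driver = "docker"

      config {
        image = "example/service:latest"
      }
    }
  }
}
```

With a restart policy like this, a crash of the service should restart only that task, leaving the db task running so it does not have to initialize again.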

So, I attempted to split it into two task groups. However, whichever group I place first will be allocated and executed, and apparently the job is waiting for that group to finish to do the next one. So even if I put the DB first, the service will never be allocated and thus never come up.

This seems like such a common use case that I assume I am missing something obvious. What is the recommended approach to do this in Nomad?

Thanks!

shantanugadgil · September 8, 2021, 3:00pm

There are a few ways in which I think this could be solved, but the first one which comes to mind is:

shantanugadgil · September 8, 2021, 3:09pm

Just realized that the question was for multiple groups and not tasks within a group.

What else came to mind was this:

jmwilkinson · September 8, 2021, 3:22pm

I’m afraid I don’t understand how those links could be used to address the problem I am having, with the group not being allocated. Could you perhaps elaborate on that?

shantanugadgil · September 8, 2021, 3:30pm

“not being allocated” sounds like inadequate resources specified in the task definitions?

jmwilkinson · September 8, 2021, 3:33pm

I split the job into two separate jobs, and one of the two is still not being allocated. So it’s possible there is some underlying issue which is causing confusion about groups and tasks.

The output of the eval status is:

host# nomad eval status dc4cb8eb
ID                 = dc4cb8eb
Create Time        = 13h21m ago
Modify Time        = 13h21m ago
Status             = blocked
Status Description = created to place remaining allocations
Type               = system
TriggeredBy        = queued-allocs
Priority           = 50
Placement Failures = N/A - In Progress

Failed Placements
Task Group "test" (failed to place 1 allocation):

Unfortunately, I am not sure how to proceed to debug this.

…

I am not specifying any resources in the task definition; I was not aware of that requirement. Do you have more info on it?
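On the resources question: as far as I know, the `resources` stanza is optional and Nomad falls back to small default CPU and memory values when it is omitted, so a missing stanza by itself should not block placement. A sketch of an explicit request, with a hypothetical task name and placeholder values:

```hcl
# Fragment of a task definition; the task name and the numbers
# are placeholders, not recommendations.
task "db" {
  driver = "docker"

  resources {
    cpu    = 500 # MHz
    memory = 512 # MB
  }
}
```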

jmwilkinson · September 9, 2021, 5:23pm

I found that this particular problem was caused by my not removing the db port from the network stanza of my second job. The second job could not be allocated because the port was already in use by the first job. Reporting could be improved here, since there is currently no information to help track down the problem.
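To illustrate the port conflict, here is a hedged sketch of a group-level `network` stanza. The group name, port labels, and port numbers are hypothetical. A static port can be held by only one allocation per client, while a dynamic port lets Nomad pick a free one.

```hcl
# Fragment of a group definition; labels and numbers are hypothetical.
group "service" {
  network {
    # Dynamic port: Nomad assigns a free host port and maps it
    # to port 8080 inside the task.
    port "http" {
      to = 8080
    }

    # A leftover static port like this one collides with any other
    # allocation on the same client that already reserves it:
    # port "db" {
    #   static = 5432
    # }
  }
}
```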

Unfortunately, it does not move me closer to an understanding of why the groups are behaving the way they are.

jmwilkinson · September 9, 2021, 7:40pm

I have solved it! And it was, indeed, user error.

Within my job I had the following constraint:

constraint {
    operator  = "distinct_property"
    attribute = "${node.datacenter}"
    value     = "1"
}

github.com/hashicorp/nomad

Feature request: make 'datacenter' optional or allow for a wildcard

opened 10:47AM - 05 Oct 20 UTC

benvanstaveren

type/enhancement theme/scheduling theme/jobspec stage/needs-discussion

Quick use case: we run 3 nomad clusters, each region is comprised of about 4 to …5 datacenters in said region, but our workload doesn't honestly care which datacenter within a region it runs on, as long as a job runs in a particular region. In order to make our developers' lives a bit easier, it'd be great if they can just omit the datacenter entry from the job file, or use something like `datacenter = "*"` Both of these would have the same effect, namely that when datacenter is blank or "*" that any datacenter is acceptable, this perhaps ties into the scheduler code where it could do something smart like give preference to nodes (matching all constraints, of course) in the "least loaded" datacenter. This would partner very well with the new spread scheduler, I feel. We're still on Nomad 0.10 so I'm not sure if this has been added/changed already.

But I had placed it at the job level, not the group level, and it was preventing the allocation from being deployed. Unfortunately, as with the ports, there does not seem to be much visibility into why a decision like that is being made, so silly errors like mine must be solved through experimentation and thinking really hard, rather than debug information.
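For reference, a sketch of the same constraint moved to the group level. The job and group names are hypothetical and the tasks are left out. As I read the constraint documentation, a job-level `distinct_property` counts allocations across every group in the job, so with a single datacenter and a value of "1" the second group has nowhere to go. At the group level the limit applies to each group separately.

```hcl
# Sketch only: names are hypothetical and tasks are omitted.
job "app" {
  datacenters = ["dc1"]

  group "db" {
    # At most one allocation of this group per datacenter.
    constraint {
      operator  = "distinct_property"
      attribute = "${node.datacenter}"
      value     = "1"
    }

    # task "db" { ... } goes here
  }

  group "service" {
    # The same limit, counted independently of the "db" group.
    constraint {
      operator  = "distinct_property"
      attribute = "${node.datacenter}"
      value     = "1"
    }

    # task "service" { ... } goes here
  }
}
```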
