I’m trying to start a Nomad batch job that runs two containers on two different hosts. My job consists of two task groups with the constraint distinct_hosts = true to force the containers to run on different hosts.
Problem: I would like these containers to start simultaneously. If one of the hosts is at capacity, the entire job should stay in the pending state until enough resources are available to schedule both containers. When I try this, one of the tasks starts running and the other stays in the “queued” state if one of the hosts is full.
I also tried setting the job parameter all_at_once = true. This didn’t help, and the documentation also states that it cannot be used for “atomic placement”. The outline of my job:
job "parallel-work" {
type = "batch"
all_at_once = true
constraint {
operator = "distinct_hosts"
value = "true"
}
# Both work1 and work2 should start about the same time
group "work1" {
task "main" {
# To be executed on host A
}
}
group "work2" {
task "main" {
# To be executed on host B
}
}
}
Is there another way to achieve this, or are there any plans to implement such functionality?
Interesting use case … could there be poststart tasks which kill the main task if the other task is not healthy? Maybe the health of the other task could be determined using something like dig or nslookup?
just-a-thought
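For illustration, a minimal sketch of what that could look like in one of the groups, assuming work2’s main task registers a Consul service; the service name work2-main, the exec driver, and the timings are placeholders, not anything from the original job:

  group "work1" {
    task "main" {
      # Actual workload, to be executed on host A
    }

    # Hypothetical poststart checker: polls the sibling group's service via
    # Consul DNS and exits non-zero if the peer never becomes resolvable,
    # so this allocation fails instead of running alone.
    task "peer-check" {
      lifecycle {
        hook    = "poststart"
        sidecar = false
      }

      driver = "exec"

      config {
        command = "/bin/sh"
        args = [
          "-c",
          "for i in $(seq 1 30); do nslookup work2-main.service.consul && exit 0; sleep 2; done; exit 1",
        ]
      }
    }
  }

Whether a failing checker actually takes the main task down depends on the group’s restart and reschedule stanzas, so that part would need testing.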
Sounds like an interesting approach. In my use case, there might be many parallel-work batch jobs in the queue waiting to be started. I guess one challenge could be a scenario where parallel-work1 has a task running on host A and parallel-work2 has another task running on host B, so neither can make progress if both need two hosts. Such cases could perhaps be dealt with by sleeping a random amount of time before checking the health of all expected tasks, so that one of the jobs would be killed and the other could make progress. The question, though, is how performant such an approach is when a batch job needs many nodes simultaneously and many other jobs are waiting in the queue.
It would be best if this could be detected at the scheduling stage, but Nomad is perhaps not primarily designed for such batch-heavy workloads.
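The randomized back-off could be folded into the hypothetical peer-check task sketched above, e.g. (again only a sketch; the 60-second jitter window, the work2-main service name, and the use of bash’s $RANDOM are arbitrary choices):

  # Variant of the peer-check config: sleep a random 0-59 seconds before
  # checking, so that when two jobs each hold one host, only one of them
  # is likely to give up and release its allocation.
  config {
    command = "/bin/bash"
    args = [
      "-c",
      "sleep $((RANDOM % 60)); nslookup work2-main.service.consul || exit 1",
    ]
  }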
After investigating multiple schedulers, the algorithm I was looking for here is called gang scheduling.
Nomad doesn’t support it, and neither does the default Kubernetes scheduler. However, Kubernetes scheduler plugins from projects like Volcano and Apache YuniKorn make gang scheduling possible, and HPC schedulers like Slurm support it as well.