Hi,
I’m trying to use Nomad to perform batch processing jobs. In these kinds of jobs, there is a “process” phase, where you run 10-100 indexes over a set of data, followed by a “reduce” phase, where you aggregate and summarize the results.
Ideally, I would use task groups for that: I’d declare a process group (with count = n) and a single reduce group (with count = 1) that summarizes the calculation.
The problem is that I cannot enforce an order/dependency, so that the reduce group always runs after the process group.
Is there a way to do this, or is there another composition I can use to make it work?
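For reference, here is a sketch of the layout I have in mind (job, group, and image names are just placeholders):

```hcl
job "batch-pipeline" {
  type = "batch"

  # Fan-out phase: N parallel workers
  group "process" {
    count = 25

    task "index" {
      driver = "docker"
      config {
        image = "example/indexer:latest" # placeholder image
      }
    }
  }

  # Aggregation phase: should only start after "process" finishes,
  # but there is no way (that I can find) to express this dependency
  group "reduce" {
    count = 1

    task "summarize" {
      driver = "docker"
      config {
        image = "example/reducer:latest" # placeholder image
      }
    }
  }
}
```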
Hi @yfarmad, Nomad itself does not do ordered jobs/groups. For these use cases I think it’s common to use something like Apache Airflow to issue parameterized jobs.
Thank you @shoenig for the tip.
I looked a bit at Airflow, but my main goal is to distribute batch processing across 10-100 nodes, and it seems Airflow is less about that (correct me if I’m wrong).
Airflow seems to be more about sequential data flow. It does support parallelism, but it’s less accessible: I can’t just mark a task with count = 25 to run it distributed as easily as Nomad does.
Hello! @yfarmad, you are right that Airflow is more for directed acyclic graph (DAG) ordering of processes. What you are looking for is two sequential phases (map → reduce).
You can do phase ordering within a task group with the task lifecycle stanza. Add `lifecycle { hook = "prestart" }` to the map task, and it will run before the reduce task. All prestart tasks run to completion before any of the main tasks (tasks without a lifecycle stanza) are started.
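For example, a single task group where the map task is marked as a prestart task (task names and images are illustrative):

```hcl
group "map-reduce" {
  count = 1

  # Prestart task: runs and must complete before any main task starts.
  # sidecar defaults to false, so Nomad expects this task to exit.
  task "map" {
    lifecycle {
      hook = "prestart"
    }
    driver = "docker"
    config {
      image = "example/mapper:latest" # placeholder image
    }
  }

  # Main task (no lifecycle stanza): starts only after all
  # prestart tasks have completed.
  task "reduce" {
    driver = "docker"
    config {
      image = "example/reducer:latest" # placeholder image
    }
  }
}
```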
Quick question about the data architecture: are the map & reduce steps downloading data into the task group allocation before processing? In that case, it would make more sense to combine the map & reduce steps into a single task group so they share a filesystem on the same node. Then you don’t have to send data over the network between the map and reduce steps.
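Tasks in the same group share the allocation directory, so the map task can write its intermediate output where the reduce task reads it. A sketch, assuming `exec`-driver tasks and hypothetical `run-map`/`run-reduce` commands:

```hcl
# Prestart map task writes results into the shared alloc dir
task "map" {
  lifecycle {
    hook = "prestart"
  }
  driver = "exec"
  config {
    command = "/bin/sh"
    args    = ["-c", "run-map > ${NOMAD_ALLOC_DIR}/data/map-output.json"]
  }
}

# Main reduce task reads the intermediate results from the same dir
task "reduce" {
  driver = "exec"
  config {
    command = "/bin/sh"
    args    = ["-c", "run-reduce < ${NOMAD_ALLOC_DIR}/data/map-output.json"]
  }
}
```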