Hi,
I’m trying to use Nomad to perform batch processing jobs. In these kinds of jobs, there is a “process” phase, where you run 10-100 indexes over a set of data, followed by a “reduce” phase, where you aggregate and summarize the results.
Ideally, I would use task groups for that: I’d declare a process group (with count = n) and a single reduce group (with count = 1) that summarizes the calculation.
The problem is that I cannot enforce an order/dependency, so that the reduce group always runs after the process group.
Is there a way to do this, or is there another composition I can use to make it work?
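For reference, here is a sketch of the layout I have in mind (job, group, and image names are just placeholders):

```hcl
job "batch-pipeline" {
  type = "batch"

  # Fan-out phase: N parallel workers
  group "process" {
    count = 25

    task "index" {
      driver = "docker"
      config {
        image = "example/indexer:latest" # placeholder image
      }
    }
  }

  # Aggregation phase: should only start after "process" finishes,
  # but there is no way (that I can find) to express this dependency
  group "reduce" {
    count = 1

    task "summarize" {
      driver = "docker"
      config {
        image = "example/reducer:latest" # placeholder image
      }
    }
  }
}
```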
Hi @yfarmad, Nomad itself does not do ordered jobs/groups. For these use cases I think it’s common to use something like Apache Airflow to issue parameterized jobs.
Thank you @shoenig for the tip.
I looked a bit at Airflow, but my main goal is to distribute batch processing across 10-100 nodes, and it seems Airflow is less about that (correct me if I’m wrong).
Airflow seems to be more about sequential data flow. It does support parallelism, but it’s less accessible: I can’t just mark a task with count = 25 to run it distributed as easily as Nomad does.
Hello! @yfarmad, you are right that Airflow is more for directed acyclic graph (DAG) ordering of processes. What you are looking for is two sequential phases (map → reduce).
You can do phase ordering within a task group with the task lifecycle stanza. Add `lifecycle { hook = "prestart" }` to the map task, and it will run before the reduce task. All prestart tasks run to completion before any of the main tasks (tasks without a lifecycle stanza) are started.
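For example, a single task group where the map task is marked as a prestart task (task names and images are illustrative):

```hcl
group "map-reduce" {
  count = 1

  # Prestart task: runs and must complete before any main task starts.
  # sidecar defaults to false, so Nomad expects this task to exit.
  task "map" {
    lifecycle {
      hook = "prestart"
    }
    driver = "docker"
    config {
      image = "example/mapper:latest" # placeholder image
    }
  }

  # Main task (no lifecycle stanza): starts only after all
  # prestart tasks have completed.
  task "reduce" {
    driver = "docker"
    config {
      image = "example/reducer:latest" # placeholder image
    }
  }
}
```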
Quick question about the data architecture: are the map & reduce steps downloading data into the task group allocation before processing? In that case, it would make more sense to combine the map & reduce steps into a single task group so they share a filesystem on the same node. Then you don’t have to send data over the network between the map and reduce steps.
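Tasks in the same group share the allocation directory, so the map task can write its intermediate output where the reduce task reads it. A sketch, assuming `exec`-driver tasks and hypothetical `run-map`/`run-reduce` commands:

```hcl
# Prestart map task writes results into the shared alloc dir
task "map" {
  lifecycle {
    hook = "prestart"
  }
  driver = "exec"
  config {
    command = "/bin/sh"
    args    = ["-c", "run-map > ${NOMAD_ALLOC_DIR}/data/map-output.json"]
  }
}

# Main reduce task reads the intermediate results from the same dir
task "reduce" {
  driver = "exec"
  config {
    command = "/bin/sh"
    args    = ["-c", "run-reduce < ${NOMAD_ALLOC_DIR}/data/map-output.json"]
  }
}
```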