Multi-node batch MPI jobs (Slurm style)

I was able to hack together a multi-node MPI batch job with Nomad. It works decently, but there are some missing features and a lot of rough edges. In my opinion, Nomad could become a strong drop-in replacement for Slurm for research computing/HPC if just a bit more support were added.

The main things I had to hack together were:

  1. nodelist discovery, so that mpirun knows where to run (see template stanza of jobspec)
  2. minimal cluster functionality so that mpirun is only executed once for the multi-node “cluster” (see mpirun wrapper script) and the entire “cluster” terminates when mpirun ends on the head node.

If there are better ways to achieve these functions in Nomad, please share them! I am also wondering about the following:

  • Is it possible to discover the relevant Nomad service endpoint from within the jobspec or allocation? Currently I am hardcoding it (NOMAD_ADDR). I am running Nomad on top of Consul but didn’t find any immediate answers in the Consul CLI.
  • Is it possible to pass the task group count value to the allocations, so that the head-node allocation could set mpirun parameters dynamically? (A rough sketch of what I have in mind for both questions follows below.)
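
For reference, here is the kind of thing I have in mind for both questions. It is an untested sketch: it assumes the default Consul integration, where Nomad servers register a “nomad” service with an “http”-tagged instance, and it guesses the group count by counting the allocations attached to this allocation’s evaluation.

#!/bin/bash
# Sketch: look up the Nomad HTTP API endpoint from the local Consul agent
# instead of hardcoding NOMAD_ADDR (assumes the default "nomad" service name
# and an "http"-tagged server instance).
CONSUL_ADDR=${CONSUL_ADDR:-http://127.0.0.1:8500}
read -r NOMAD_HOST NOMAD_PORT < <(curl -s "$CONSUL_ADDR/v1/catalog/service/nomad?tag=http" \
        | jq -r '.[0] | "\(.Address) \(.ServicePort)"')
NOMAD_ADDR="http://${NOMAD_HOST}:${NOMAD_PORT}"

# Sketch: derive the task group count from the number of allocations
# belonging to this allocation's evaluation.
EVAL_ID=$(curl -s "$NOMAD_ADDR/v1/allocation/$NOMAD_ALLOC_ID" | jq -r .EvalID)
GROUP_COUNT=$(curl -s "$NOMAD_ADDR/v1/evaluation/$EVAL_ID/allocations" | jq 'length')
echo "Nomad at $NOMAD_ADDR, allocations in this evaluation: $GROUP_COUNT"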

I ran this with the raw_exec driver, but I think the same approach will work for isolated workloads, given a correct network setup.

In an ideal world, there might be a supercomputer driver for Nomad that would implement some of the features I have hacked together here.

Here is the jobspec and the mpirun wrapper script:

my-mpi-job.nomad
job "my-mpi-job" {
  datacenters = ["dc1"]

  type = "batch"

  constraint {
    attribute = "${attr.kernel.name}"
    value = "linux"
  }

  group "my-mpi-group" {
    count = 2

    restart {
      attempts = 0
    }

    network {
      mode = "host"
    }

    task "my-mpi-task" {
      driver = "raw_exec"

      config {
        command = "/vagrant/run.sh"
        args = []
      }

      resources {
        cpu = 100 # MHz
        memory = 128 # MB
      }

      template {
        # Begin heredoc
        data = <<EOH
#!/bin/bash
# Rendered by Nomad's template stanza. Builds an MPI hostfile by listing the
# allocations of this allocation's evaluation via the Nomad HTTP API and
# looking up each node's IP address.
NOMAD_ADDR=http://172.20.20.10:4646   # hardcoded for now, see questions above
HOSTFILE={{ env "NOMAD_ALLOC_DIR" }}/hostfile

NOMAD_EVAL_ID=$(curl -s "$NOMAD_ADDR/v1/allocation/{{ env "NOMAD_ALLOC_ID" }}" | jq -r .EvalID)
NOMAD_NODE_IDS=$(curl -s "$NOMAD_ADDR/v1/evaluation/$NOMAD_EVAL_ID/allocations" | jq -r '.[].NodeID')
rm -f "$HOSTFILE"
for id in $NOMAD_NODE_IDS
do
        curl -s "$NOMAD_ADDR/v1/node/$id" | jq -r '.Attributes."unique.network.ip-address"' >> "$HOSTFILE"
done
EOH
        # End heredoc
        destination = "generate_hostfile.sh"
      }
    }
  }
}
run.sh
#!/bin/bash
# Task entrypoint: allocation index 0 acts as the MPI head node; every other
# allocation idles until the head node signals completion.
HOSTFILE="${NOMAD_ALLOC_DIR}/hostfile"
DONEFILE="/tmp/.NOMAD-${NOMAD_JOB_ID}-DONE"

if [ "${NOMAD_ALLOC_INDEX}" -eq 0 ]
then
        echo "I'm the head node; running MPI..."
        bash generate_hostfile.sh
        mpirun.openmpi --hostfile "${HOSTFILE}" -mca btl_tcp_if_include eth1 -x UCX_NET_DEVICES=eth1 /vagrant/hello-mpi -mpi
        # Tell every node in the hostfile (including this one) to stop.
        while read -r host
        do
                echo "Telling $host to stop"
                ssh "$host" touch "${DONEFILE}"
        done <"${HOSTFILE}"
else
        echo "I'm not the head node; waiting for a signal to stop..."
        while [ ! -f "${DONEFILE}" ]; do date; sleep 1; done
        echo "Got the signal! Stopping..."
fi

rm -f "${DONEFILE}"

Thanks!


I have been pondering this for a while. I have a long background in scientific HPC and distributed computing (think: grid), and Nomad has several design features that traditional HPC tooling has always been missing. However, my impression at first glance has always been that some pieces are missing for massively distributed computation.

In my mind, the way to go about dropping MPI applications into Nomad was to create something similar to the typical multi-tier web processing application, with frontend and backend tiers. The frontend in this case would be the cluster head node and the backend of course the workers.

I felt that connecting the compute nodes with Consul would be much easier than what is currently done, and would make HPC centres far more productive by discovering workloads dynamically rather than hardcoding them or using custom hacked-together solutions which aren’t portable from centre to centre. (This consideration would be more important in the case of a large federation, for example).

But I have the feeling that the work should be done more on the MPI side than the Nomad side (not that modifications to Nomad couldn’t help). I might be out of date, but I recall that MPI uses a single controller, which is a single point of failure, and in my experience its failure could slow a run down considerably.

In my mind, a user would declare the topology of their cluster (so many controllers, so many workers) and submit that to Nomad. Nomad would start the nodes up and connect them using Consul. Instead of using a hosts file, it could use the service discovery mechanism itself. If a node failed, the workload would be rescheduled on a new instance, for example.
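
As a half-baked illustration of the “service discovery instead of a hosts file” idea: if each worker allocation registered a service in Consul through Nomad’s service stanza (the “my-mpi-workers” name below is made up for the example), the head node could assemble its machine list straight from the Consul catalog, something like:

#!/bin/bash
# Sketch: build the MPI machine list from Consul's catalog rather than a
# static hostfile. Assumes workers register a hypothetical "my-mpi-workers"
# service and that the local Consul agent listens on port 8500.
CONSUL_ADDR=${CONSUL_ADDR:-http://127.0.0.1:8500}
HOSTFILE="${NOMAD_ALLOC_DIR}/hostfile"

curl -s "$CONSUL_ADDR/v1/catalog/service/my-mpi-workers" \
        | jq -r '.[].Address' | sort -u > "$HOSTFILE"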

This is admittedly just some musing for now, but reading your post made me think it could be taken seriously. I would appreciate further discussion :)

What are your opinions on how this kind of massively-parallel computation should be done with Hashi tools?

Thanks
Bruce


Thanks for the thoughtful response.

My first priority is accommodating brownfield multi-node MPI workloads, i.e. the stuff that on-prem researchers are running via Slurm right now: batch jobs with a static, flat, N>=1 node topology that can simply be resubmitted if they fail; no fancy persistent elastic/hierarchical cluster features (yet).

The first piece would be a simple Nomad “supercomputer” task driver. On top of the exec drivers, it would add the following config options (a sketch of how a task would consume them follows the list):

  1. Job/Group conf: Hostfile destination (default: ${NOMAD_ALLOC_DIR}/hostfile, can be disabled)
  2. Job/Group conf: Nodelist env var (default: NOMAD_NODELIST, can be disabled)
  3. Group conf: Number of head nodes (default: 1, remainder will be workers)
  4. Task conf: Runs on head? (default: true)
  5. Task conf: Runs on worker? (default: false)
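
To make that concrete, here is a rough sketch of what the user-facing head-node task could shrink to if the driver populated those values. Everything here is hypothetical: the hostfile path and the NOMAD_NODELIST variable are the proposed defaults above, not anything Nomad provides today.

#!/bin/bash
# Hypothetical head-node task under the proposed "supercomputer" driver.
# HOSTFILE and NOMAD_NODELIST are assumed to be injected by the driver;
# neither exists in Nomad today.
HOSTFILE="${HOSTFILE:-${NOMAD_ALLOC_DIR}/hostfile}"

# Hostfile style:
mpirun --hostfile "$HOSTFILE" /path/to/my-mpi-binary

# ...or comma-separated nodelist style:
# mpirun --host "$NOMAD_NODELIST" /path/to/my-mpi-binary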

In essence, this is quite similar to the ongoing work on adding MPI support to Kubernetes via Kubeflow’s MPI Operator. There is growing interest in multi-node AI training over MPI as ML architectures get more complex, so this isn’t just for the traditional “legacy” HPC niche.

Ideally, this supercomputer driver would let a user submit a multi-node MPI job with minimal extra configuration: all they need to do is put their mpirun command and args in the usual task config stanza, just like any other idiomatic Nomad batch jobspec. Additionally, Nomad’s task group feature would let them run any necessary data transfer/processing steps as separate tasks before and after the main mpirun execution.

Unfortunately, I don’t think it’s quite as simple as submitting a community task driver plugin because of Nomad’s scheduler architecture, namely, that there is always exactly 1 node per Allocation. From a cursory glance at the Task Driver Plugin docs and source, task drivers are only given information about the task’s Allocation. What we really want is information about the task’s Evaluation, because for a static multi-node batch job, we only care about the multiple Allocations within this initial Evaluation. Fortunately, I was able to get all this information by curl'ing the excellent Nomad server HTTP API, but AFAIK there is no reason this info shouldn’t be directly exposed to the driver.

The second little piece for this driver would be tweaking the Nomad clients so that worker Allocations don’t stop and clean up until they’ve received confirmation that (all) head Allocations have stopped. This should be managed by the Nomad server to prevent dangling allocations if a head Allocation fails catastrophically.
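
Until the server manages this, a slightly less hacky stop-gap than my ssh/DONEFILE trick might be for each worker task to poll the Nomad API and exit once the head allocation (group index 0) reports a terminal client status. An untested sketch, reusing the hardcoded NOMAD_ADDR from above:

#!/bin/bash
# Sketch: a worker waits until the allocation named "...[0]" in the same
# evaluation reaches a terminal client status, then exits.
NOMAD_ADDR=${NOMAD_ADDR:-http://172.20.20.10:4646}
EVAL_ID=$(curl -s "$NOMAD_ADDR/v1/allocation/$NOMAD_ALLOC_ID" | jq -r .EvalID)

while true
do
        STATUS=$(curl -s "$NOMAD_ADDR/v1/evaluation/$EVAL_ID/allocations" \
                | jq -r '.[] | select(.Name | endswith("[0]")) | .ClientStatus')
        case "$STATUS" in
                complete|failed|lost) break ;;
        esac
        sleep 5
done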

Beyond a supercomputer driver, there is room for some extra scheduling features:

  • Usage quota policy: Have “counter” quotas in addition to the current “gauge” quotas, e.g. a user has 20,000 core-hours in their account that they can spend at any rate they choose, and jobs submitted beyond that quota are denied or run with a degraded/preemptible quality of service.
  • Resource constraints: Syntax for requesting entire nodes rather than just cpu/cores/mem/etc.

But to be clear, these are just nice extras that could be added at any point or via external tools. What is important is having something resembling the supercomputer driver I describe above.
