Multi-node batch MPI jobs (Slurm style)

I was able to hack together a multi-node MPI batch job with Nomad. It works decently, but there are some missing features and a lot of rough edges. In my opinion, Nomad could become a strong drop-in replacement for Slurm for research computing/HPC if just a bit more support were added.

The main things I had to hack together were:

  1. nodelist discovery, so that mpirun knows where to run (see template stanza of jobspec)
  2. minimal cluster functionality so that mpirun is only executed once for the multi-node “cluster” (see mpirun wrapper script) and the entire “cluster” terminates when mpirun ends on the head node.

If there are better ways to achieve these functions in Nomad, please share them! I am also wondering about the following:

  • Is it possible to discover the relevant Nomad service endpoint from within the jobspec or allocation? Currently I am hardcoding it (NOMAD_ADDR). I am running Nomad on top of Consul but didn’t find any immediate answers in the Consul CLI.
  • Is it possible to pass the task group count value to the allocations, so that the head-node allocation could set mpirun parameters dynamically? (A rough sketch of what I have in mind for both questions follows below.)
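
For reference, here is the kind of thing I have in mind for both questions. It is an untested sketch: it assumes the default Consul integration, where Nomad servers register a “nomad” service with an “http”-tagged instance, and it guesses the group count by counting the allocations attached to this allocation’s evaluation.

#!/bin/bash
# Sketch: look up the Nomad HTTP API endpoint from the local Consul agent
# instead of hardcoding NOMAD_ADDR (assumes the default "nomad" service name
# and an "http"-tagged server instance).
CONSUL_ADDR=${CONSUL_ADDR:-http://127.0.0.1:8500}
read -r NOMAD_HOST NOMAD_PORT < <(curl -s "$CONSUL_ADDR/v1/catalog/service/nomad?tag=http" \
        | jq -r '.[0] | "\(.Address) \(.ServicePort)"')
NOMAD_ADDR="http://${NOMAD_HOST}:${NOMAD_PORT}"

# Sketch: derive the task group count from the number of allocations
# belonging to this allocation's evaluation.
EVAL_ID=$(curl -s "$NOMAD_ADDR/v1/allocation/$NOMAD_ALLOC_ID" | jq -r .EvalID)
GROUP_COUNT=$(curl -s "$NOMAD_ADDR/v1/evaluation/$EVAL_ID/allocations" | jq 'length')
echo "Nomad at $NOMAD_ADDR, allocations in this evaluation: $GROUP_COUNT"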

I ran this with the raw_exec driver, but I think the same approach will work for isolated workloads, given a correct network setup.

In an ideal world, there might be a supercomputer driver for Nomad that would implement some of the features I have hacked together here.

Here is the jobspec and the mpirun wrapper script:

my-mpi-job.nomad
job "my-mpi-job" {
  datacenters = ["dc1"]

  type = "batch"

  constraint {
    attribute = "${attr.kernel.name}"
    value = "linux"
  }

  group "my-mpi-group" {
    count = 2

    restart {
      attempts = 0
    }

    network {
      mode = "host"
    }

    task "my-mpi-task" {
      driver = "raw_exec"

      config {
        command = "/vagrant/run.sh"
        args = []
      }

      resources {
        cpu = 100 # MHz
        memory = 128 # MB
      }

      template {
        # Begin heredoc
        data = <<EOH
#!/bin/bash
# Rendered by Nomad's template stanza. Builds an MPI hostfile by listing the
# allocations of this allocation's evaluation via the Nomad HTTP API and
# looking up each node's IP address.
NOMAD_ADDR=http://172.20.20.10:4646   # hardcoded for now, see questions above
HOSTFILE={{ env "NOMAD_ALLOC_DIR" }}/hostfile

NOMAD_EVAL_ID=$(curl -s "$NOMAD_ADDR/v1/allocation/{{ env "NOMAD_ALLOC_ID" }}" | jq -r .EvalID)
NOMAD_NODE_IDS=$(curl -s "$NOMAD_ADDR/v1/evaluation/$NOMAD_EVAL_ID/allocations" | jq -r '.[].NodeID')
rm -f "$HOSTFILE"
for id in $NOMAD_NODE_IDS
do
        curl -s "$NOMAD_ADDR/v1/node/$id" | jq -r '.Attributes."unique.network.ip-address"' >> "$HOSTFILE"
done
EOH
        # End heredoc
        destination = "generate_hostfile.sh"
      }
    }
  }
}
run.sh
#!/bin/bash
# Task entrypoint: allocation index 0 acts as the MPI head node; every other
# allocation idles until the head node signals completion.
HOSTFILE="${NOMAD_ALLOC_DIR}/hostfile"
DONEFILE="/tmp/.NOMAD-${NOMAD_JOB_ID}-DONE"

if [ "${NOMAD_ALLOC_INDEX}" -eq 0 ]
then
        echo "I'm the head node; running MPI..."
        bash generate_hostfile.sh
        mpirun.openmpi --hostfile "${HOSTFILE}" -mca btl_tcp_if_include eth1 -x UCX_NET_DEVICES=eth1 /vagrant/hello-mpi -mpi
        # Tell every node in the hostfile (including this one) to stop.
        while read -r host
        do
                echo "Telling $host to stop"
                ssh "$host" touch "${DONEFILE}"
        done <"${HOSTFILE}"
else
        echo "I'm not the head node; waiting for a signal to stop..."
        while [ ! -f "${DONEFILE}" ]; do date; sleep 1; done
        echo "Got the signal! Stopping..."
fi

rm -f "${DONEFILE}"

Thanks!


I have been pondering this for a while. I have a long background in scientific HPC and distributed computing (think: grid), and Nomad has several design features that traditional HPC tooling has always been missing. However, my impression at first glance has always been that some pieces are missing for massively distributed computation.

In my mind, the way to go about dropping MPI applications into Nomad was to create something similar to the typical multi-tier web processing application, with frontend and backend tiers. The frontend in this case would be the cluster head node and the backend of course the workers.

I felt that connecting the compute nodes with Consul would be much easier than what is currently done, and would make HPC centres far more productive by discovering workloads dynamically rather than hardcoding them or using custom hacked-together solutions which aren’t portable from centre to centre. (This consideration would be more important in the case of a large federation, for example).

But I have the feeling that the work should be done more on the MPI side than the Nomad side (not that modifications to Nomad couldn’t help). I might be out of date, but I recall that MPI uses a single controller, which is a single point of failure, and in my experience its failure could slow a run down considerably.

In my mind, a user would declare the topology of their cluster (so many controllers, so many workers) and submit that to Nomad. Nomad would start the nodes up and connect them using Consul. Instead of using a hosts file, it could use the service discovery mechanism itself. If a node failed, the workload would be rescheduled on a new instance, for example.
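
As a half-baked illustration of the “service discovery instead of a hosts file” idea: if each worker allocation registered a service in Consul through Nomad’s service stanza (the “my-mpi-workers” name below is made up for the example), the head node could assemble its machine list straight from the Consul catalog, something like:

#!/bin/bash
# Sketch: build the MPI machine list from Consul's catalog rather than a
# static hostfile. Assumes workers register a hypothetical "my-mpi-workers"
# service and that the local Consul agent listens on port 8500.
CONSUL_ADDR=${CONSUL_ADDR:-http://127.0.0.1:8500}
HOSTFILE="${NOMAD_ALLOC_DIR}/hostfile"

curl -s "$CONSUL_ADDR/v1/catalog/service/my-mpi-workers" \
        | jq -r '.[].Address' | sort -u > "$HOSTFILE"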

This is admittedly just some musing for now, but reading your post made me think it could be taken seriously. I would appreciate further discussion :)

What are your opinions on how this kind of massively-parallel computation should be done with Hashi tools?

Thanks
Bruce


Thanks for the thoughtful response.

My first priority is accommodating brownfield multi-node MPI workloads, i.e. the stuff that on-prem researchers are running via Slurm right now: batch jobs with a static, flat, N>=1 node topology that can simply be resubmitted if they fail; no fancy persistent elastic/hierarchical cluster features (yet).

The first piece would be a simple Nomad “supercomputer” task driver. On top of the exec drivers, it would add the following config options (a sketch of how a task would consume them follows the list):

  1. Job/Group conf: Hostfile destination (default: ${NOMAD_ALLOC_DIR}/hostfile, can be disabled)
  2. Job/Group conf: Nodelist env var (default: NOMAD_NODELIST, can be disabled)
  3. Group conf: Number of head nodes (default: 1, remainder will be workers)
  4. Task conf: Runs on head? (default: true)
  5. Task conf: Runs on worker? (default: false)
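
To make that concrete, here is a rough sketch of what the user-facing head-node task could shrink to if the driver populated those values. Everything here is hypothetical: the hostfile path and the NOMAD_NODELIST variable are the proposed defaults above, not anything Nomad provides today.

#!/bin/bash
# Hypothetical head-node task under the proposed "supercomputer" driver.
# HOSTFILE and NOMAD_NODELIST are assumed to be injected by the driver;
# neither exists in Nomad today.
HOSTFILE="${HOSTFILE:-${NOMAD_ALLOC_DIR}/hostfile}"

# Hostfile style:
mpirun --hostfile "$HOSTFILE" /path/to/my-mpi-binary

# ...or comma-separated nodelist style:
# mpirun --host "$NOMAD_NODELIST" /path/to/my-mpi-binary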

In essence, this is quite similar to the ongoing work on adding MPI support to Kubernetes via Kubeflow’s MPI Operator. There is growing interest in multi-node AI training over MPI as ML architectures get more complex, so this isn’t just for the traditional “legacy” HPC niche.

Ideally, this supercomputer driver would let a user submit a multi-node MPI job with minimal extra configuration: all they need to do is put their mpirun command and args in the usual task config stanza, just like any other idiomatic Nomad batch jobspec. Additionally, Nomad’s task group feature would let them run any necessary data transfer/processing steps as separate tasks before and after the main mpirun execution.

Unfortunately, I don’t think it’s quite as simple as submitting a community task driver plugin because of Nomad’s scheduler architecture, namely, that there is always exactly 1 node per Allocation. From a cursory glance at the Task Driver Plugin docs and source, task drivers are only given information about the task’s Allocation. What we really want is information about the task’s Evaluation, because for a static multi-node batch job, we only care about the multiple Allocations within this initial Evaluation. Fortunately, I was able to get all this information by curl'ing the excellent Nomad server HTTP API, but AFAIK there is no reason this info shouldn’t be directly exposed to the driver.

The second little piece for this driver would be tweaking the Nomad clients so that worker Allocations don’t stop and clean up until they’ve received confirmation that (all) head Allocations have stopped. This should be managed by the Nomad server to prevent dangling allocations if a head Allocation fails catastrophically.
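
Until the server manages this, a slightly less hacky stop-gap than my ssh/DONEFILE trick might be for each worker task to poll the Nomad API and exit once the head allocation (group index 0) reports a terminal client status. An untested sketch, reusing the hardcoded NOMAD_ADDR from above:

#!/bin/bash
# Sketch: a worker waits until the allocation named "...[0]" in the same
# evaluation reaches a terminal client status, then exits.
NOMAD_ADDR=${NOMAD_ADDR:-http://172.20.20.10:4646}
EVAL_ID=$(curl -s "$NOMAD_ADDR/v1/allocation/$NOMAD_ALLOC_ID" | jq -r .EvalID)

while true
do
        STATUS=$(curl -s "$NOMAD_ADDR/v1/evaluation/$EVAL_ID/allocations" \
                | jq -r '.[] | select(.Name | endswith("[0]")) | .ClientStatus')
        case "$STATUS" in
                complete|failed|lost) break ;;
        esac
        sleep 5
done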

Beyond a supercomputer driver, there is room for some extra scheduling features:

  • Usage quota policy: Have “counter” quotas in addition to the current “gauge” quotas, e.g. a user has 20,000 core-hours in their account that they can spend at any rate they choose, and jobs submitted beyond that quota are denied or run with a degraded/preemptible quality of service.
  • Resource constraints: Syntax for requesting entire nodes rather than just cpu/cores/mem/etc.

But to be clear, these are just nice extras that could be added at any point or via external tools. What is important is having something resembling the supercomputer driver I describe above.
