I was able to hack together a multi-node MPI batch job with Nomad. It works decently, but there are some missing features and a lot of rough edges. In my opinion, Nomad could become a strong drop-in replacement for Slurm for research computing/HPC if just a bit more support were added.
The main things I had to hack together were:

- nodelist discovery, so that `mpirun` knows where to run (see the `template` stanza of the jobspec)
- minimal cluster functionality, so that `mpirun` is executed only once for the multi-node "cluster" (see the `mpirun` wrapper script) and the entire "cluster" terminates when `mpirun` ends on the head node
If there are better ways to achieve these functions in Nomad, please share them! I am also wondering about the following:
- Is it possible to discover the relevant Nomad service endpoint within the jobspec or allocation? Currently I am hardcoding this (`NOMAD_ADDR`). I am running the Nomad service on top of Consul but didn't find any immediate answers in the Consul CLI.
- Is it possible to send the task group `count` value to the allocations, so that the head-node allocation could dynamically set `mpirun` parameters?
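For what it's worth, here is the direction I have been experimenting with for both questions (untested sketch, not a confirmed answer): Nomad servers register a `nomad` service in Consul by default, so the Consul catalog API can replace the hardcoded `NOMAD_ADDR`, and the group `count` can be read back out of the job definition via `/v1/job/:job_id`. The service name and `http` tag below are Nomad's defaults; adjust them if your cluster renames them, and note that `ServiceAddress` may be empty if the service was registered without an explicit address.

```shell
# Sketch, assuming a local Consul agent on 127.0.0.1:8500 and Nomad's default
# Consul registration (service "nomad" tagged "http" on the HTTP port).
NOMAD_ADDR="http://$(curl -s 'http://127.0.0.1:8500/v1/catalog/service/nomad?tag=http' \
  | jq -r '.[0] | "\(.ServiceAddress):\(.ServicePort)"')"

# Read the group's count back from the job definition; NOMAD_JOB_ID and
# NOMAD_GROUP_NAME are injected into the task environment by Nomad.
COUNT=$(curl -s "$NOMAD_ADDR/v1/job/$NOMAD_JOB_ID" \
  | jq -r --arg g "$NOMAD_GROUP_NAME" '.TaskGroups[] | select(.Name == $g) | .Count')
echo "group count: $COUNT"
```

The head node could then pass `$COUNT` to `mpirun -np`, but I have not wired that up yet.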
I ran this with the `raw_exec` driver, but I think the same methods will work for isolated workloads, given a correct network setup.
In an ideal world, there might be a supercomputer driver for Nomad which would implement some of these features I have hacked together.
Here is the jobspec and the `mpirun` wrapper script:
my-mpi-job.nomad

```hcl
job "my-mpi-job" {
  datacenters = ["dc1"]
  type        = "batch"

  constraint {
    attribute = "${attr.kernel.name}"
    value     = "linux"
  }

  group "my-mpi-group" {
    count = 2

    restart {
      attempts = 0
    }

    network {
      mode = "host"
    }

    task "my-mpi-task" {
      driver = "raw_exec"

      config {
        command = "/vagrant/run.sh"
        args    = []
      }

      resources {
        cpu    = 100 # MHz
        memory = 128 # MB
      }

      # Renders a script that writes the IP address of every node in this
      # job's evaluation to a hostfile for mpirun.
      template {
        data = <<EOH
#!/bin/bash
NOMAD_ADDR=http://172.20.20.10:4646
HOSTFILE="{{ env "NOMAD_ALLOC_DIR" }}/hostfile"
NOMAD_EVAL_ID=$(curl -s "$NOMAD_ADDR/v1/allocation/{{ env "NOMAD_ALLOC_ID" }}" | jq -r .EvalID)
NOMAD_NODE_IDS=$(curl -s "$NOMAD_ADDR/v1/evaluation/$NOMAD_EVAL_ID/allocations" | jq -r '.[].NodeID')
rm -f "$HOSTFILE"
for id in $NOMAD_NODE_IDS; do
  curl -s "$NOMAD_ADDR/v1/node/$id" | jq -r '.Attributes."unique.network.ip-address"' >> "$HOSTFILE"
done
EOH
        destination = "generate_hostfile.sh"
      }
    }
  }
}
```
run.sh

```bash
#!/bin/bash
HOSTFILE="${NOMAD_ALLOC_DIR}/hostfile"
DONEFILE="/tmp/.NOMAD-${NOMAD_JOB_ID}-DONE"

if [ "${NOMAD_ALLOC_INDEX}" -eq 0 ]; then
  echo "I'm the head node; running MPI..."
  bash generate_hostfile.sh
  mpirun.openmpi --hostfile "${HOSTFILE}" -mca btl_tcp_if_include eth1 \
    -x UCX_NET_DEVICES=eth1 /vagrant/hello-mpi -mpi
  # -n keeps ssh from eating the loop's stdin (the hostfile).
  while read -r host; do
    echo "Telling $host to stop"
    ssh -n "$host" touch "${DONEFILE}"
  done < "${HOSTFILE}"
else
  echo "I'm not the head node; waiting for a signal to stop..."
  while [ ! -f "${DONEFILE}" ]; do date; sleep 1; done
  echo "Got the signal! Stopping..."
fi
rm -f "${DONEFILE}"
```
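A variant I have been considering for the shutdown signal (untested sketch, so treat it as an assumption): rather than ssh-ing a DONEFILE onto each host, the worker allocations could poll the Nomad API until the head allocation leaves the `running` state, relying on the fact that allocation names have the form `<job>.<group>[<index>]`.

```shell
# Sketch: workers wait for allocation [0] of this evaluation to finish.
# Assumes NOMAD_ADDR is reachable from every node (hardcoded as above).
NOMAD_ADDR=${NOMAD_ADDR:-http://172.20.20.10:4646}

wait_for_head() {
  local eval_id head_status
  eval_id=$(curl -s "$NOMAD_ADDR/v1/allocation/$NOMAD_ALLOC_ID" | jq -r .EvalID)
  while true; do
    # Pick out the head allocation ("...[0]") and check its client status.
    head_status=$(curl -s "$NOMAD_ADDR/v1/evaluation/$eval_id/allocations" \
      | jq -r '.[] | select(.Name | endswith("[0]")) | .ClientStatus')
    if [ "$head_status" = "complete" ] || [ "$head_status" = "failed" ]; then
      break
    fi
    sleep 1
  done
}
```

The `else` branch of run.sh would then call `wait_for_head` instead of watching for the DONEFILE, which would also remove the need for passwordless ssh between nodes.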
Thanks!
