Question about group `count` and client allocation

Hi Nomad experts,

I’m new to Nomad and I have a question about the count parameter in the group stanza and how clients are allocated to run the tasks.

The context is that I want to prewarm the environment and reduce startup latency by submitting a dummy batch job that has each client node in the cluster download a fairly large Docker image in advance.

Say I have a cluster with 50 client nodes. If I specify count = 50 in the group stanza, will Nomad allocate each task to a different node? Or are the allocations not necessarily deterministic?

In addition, what’s the best way to make sure each node in the cluster preloads the image in advance?

Thanks!

For the pre-cache Docker image download job, and assuming you’re using Nomad 1.2.0 or newer, you could use the sysbatch scheduler, which allows a job to run on all nodes (within constraints) until completion without restarting it after it exits. Docs for this: Schedulers | Nomad by HashiCorp
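If it helps, here’s a rough sketch of what such a sysbatch pre-cache job might look like; the job name, datacenter, and image are placeholders, and the task just runs a no-op command so the container exits once the image has been pulled:

job "prewarm-image" {
  datacenters = ["dc1"]
  type        = "sysbatch"

  group "prewarm" {
    task "pull" {
      driver = "docker"

      config {
        # pulling the image is the whole point; the command just exits
        image   = "registry.example.com/big-image:1.0"
        command = "true"
      }
    }
  }
}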

I’m not sure whether running a sysbatch job beforehand to get the Docker images cached on the local machines is the best way, so I’ll let someone else chime in on that question. But if you do end up doing this, make sure the Docker plugin on your clients has its garbage collection set with a bit of extra time so that those lovely cached images don’t get collected before you actually deploy your job (Drivers: Docker | Nomad by HashiCorp, see gc under plugin options) :slightly_smiling_face:. Also, don’t use the ‘latest’ tag on the images, as this implies a force pull when the job starts (Drivers: Docker | Nomad by HashiCorp, see force_pull).
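For reference, that gc tuning lives in the docker plugin block of the client configuration; something like the below, where the delay value is purely illustrative and should be set to fit your own deploy cadence:

plugin "docker" {
  config {
    gc {
      # keep unused images around longer than the default
      # so the pre-pulled image survives until the real job runs
      image       = true
      image_delay = "72h"
    }
  }
}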

As for the count on the group stanza: if you use the spread stanza, you can define how those allocations will be spread across nodes. For example, using

spread {
  attribute = "${node.datacenter}"
  weight    = 100
}

will attempt to spread the workload evenly over the nodes in the datacenter. Docs for this: spread Stanza - Job Specification | Nomad by HashiCorp

As far as I understand it, Nomad deploys to the nodes that rank highest after taking the scheduler, constraints, affinities, capacity, and current load (among other things) into account, so placements will not, AFAIK, be deterministic. However, you can customise how and where things are deployed by using the constraint, affinity, spread, and a couple of other stanzas in your job files.
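For example (the attribute values below are made up, just to show the shape of these blocks), a constraint hard-filters which nodes are eligible, while an affinity only nudges the scheduler’s ranking:

constraint {
  # only consider Linux clients
  attribute = "${attr.kernel.name}"
  value     = "linux"
}

affinity {
  # prefer, but don't require, a particular node class
  attribute = "${node.class}"
  value     = "high-memory"
  weight    = 50
}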

Hope that helps a tad :smiley:


Thank you so much @CarbonCollins, it helps immensely!

Unfortunately I’m using Nomad 1.1.0, so I can’t use the sysbatch scheduler yet, but while looking at the schedulers I found the system scheduler, which I can probably look into for my current use case…

Just another noob question about system jobs: if one exits/fails, will Nomad still try to restart it (assuming I’ve set the restart attempts to zero)?

Thanks!

The system scheduler has some special behaviours compared to the service or batch schedulers. It’s designed to run a workload on every node that matches the constraints, and it also re-evaluates when new nodes join the cluster. It is also intended to run until explicitly stopped, either by an operator or by preemption, so it seems that if a job using the system scheduler exits on its own it is marked as a failure, but if your restart stanza says no more attempts, then that should be the end of the job.
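So, as a rough sketch (job name, datacenter, and image are again placeholders), a system job with restarts disabled might look like this; once the task exits, the allocation is left as failed rather than retried:

job "prewarm-image" {
  datacenters = ["dc1"]
  type        = "system"

  group "prewarm" {
    restart {
      # don't retry once the pull-and-exit task finishes
      attempts = 0
      mode     = "fail"
    }

    task "pull" {
      driver = "docker"

      config {
        image   = "registry.example.com/big-image:1.0"
        command = "true"
      }
    }
  }
}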

I’ve never used system jobs in this way, and if I’m honest it does not feel quite like the correct scheduler for this style of job due to the failure on exit :sweat_smile:. Batch would probably be more appropriate for the one-time pre-caching job (assuming you can’t upgrade to 1.2 and use sysbatch), but you might have issues guaranteeing it runs on every node (you need to know how many nodes there are and add the spread stanza, AFAIK).
