Nomad Autoscaler for heterogeneous batch workloads

hntrmrrs · May 25, 2022, 3:04pm

We’ve been experimenting with Nomad Autoscaler strategies for different workload categories.

One interesting set of workloads is something I will call “immediate batch”, which consists of a heterogeneous collection of batch jobs submitted on an ad hoc basis throughout the day. We call it “immediate” because ideally there would always be sufficient client resources to schedule the jobs as soon as they are submitted. Because these workloads are very large and typically very bursty, we want to control costs by not leaving clients sitting idle.

It feels like this should be achievable with the pass-through strategy by adding up the resource requirements of all allocated and blocked jobs and converting the total into units of capacity, like so (this assumes a unit of capacity is 32 CPU cores at 3.4 GHz and 256 GB of RAM):

scaling "batch-highmem" {
  enabled = true
  min     = 0
  max     = 50

  policy {
    check "estimated-capacity" {
      source = "prometheus"
      # "A > B or B" yields the larger of A (CPU-based units) and B (memory-based units).
      query  = <<EOF
ceil(sum(nomad_client_allocated_cpu{node_class=~"batch:highmem"}) / (32 * 3400)
  + sum(nomad_nomad_blocked_evals_cpu{node_class=~"batch:highmem"}) / (32 * 3400))
> ceil(sum(nomad_client_allocated_memory{node_class=~"batch:highmem"}) / (256 * 1024)
  + sum(nomad_nomad_blocked_evals_memory{node_class=~"batch:highmem"}) / (256 * 1024))
or ceil(sum(nomad_client_allocated_memory{node_class=~"batch:highmem"}) / (256 * 1024)
  + sum(nomad_nomad_blocked_evals_memory{node_class=~"batch:highmem"}) / (256 * 1024))
EOF

      strategy "pass-through" {}
    }
  }
}
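
For anyone checking the arithmetic, here is a minimal Python sketch of what the query above is meant to compute: the larger of the CPU-based and memory-based unit counts, each rounded up. It assumes CPU is reported in MHz and memory in MB, which is what the `32 * 3400` and `256 * 1024` divisors imply. The function name `estimated_units` and its parameters are hypothetical and are not part of Nomad or the Autoscaler.

```python
import math

# Unit of capacity taken from the post: 32 cores at 3.4 GHz and 256 GB of RAM.
UNIT_CPU_MHZ = 32 * 3400
UNIT_MEM_MB = 256 * 1024


def estimated_units(allocated_cpu_mhz, blocked_cpu_mhz, allocated_mem_mb, blocked_mem_mb):
    """Return how many units of capacity hold both allocated and blocked work."""
    # Units needed if CPU is the limiting resource, rounded up.
    cpu_units = math.ceil((allocated_cpu_mhz + blocked_cpu_mhz) / UNIT_CPU_MHZ)
    # Units needed if memory is the limiting resource, rounded up.
    mem_units = math.ceil((allocated_mem_mb + blocked_mem_mb) / UNIT_MEM_MB)
    # The query's "A > B or B" pattern picks the larger of the two counts.
    return max(cpu_units, mem_units)


# One unit of CPU allocated, half a unit of CPU blocked, one unit of memory allocated.
print(estimated_units(108800, 54400, 262144, 0))  # → 2
```

With nothing allocated and nothing blocked the function returns 0, which lines up with `min = 0` in the policy.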

It also feels super clunky.

Does anyone have a better suggestion for estimating the capacity needed to run all of the submitted jobs for a given node class immediately by scaling out?

pdilyard · October 17, 2023, 5:16pm

I have this exact same scenario. Did you ever find a better solution? This is much better supported by the AWS EKS autoscaler, but we’re trying to move our workloads to Nomad instead.