We’ve been experimenting with Nomad Autoscaler strategies for different workload categories.
One interesting set of workloads is something I will call “immediate batch”, which consists of a heterogeneous collection of batch jobs submitted on an ad hoc basis throughout the day. I call it “immediate” because ideally there would be sufficient client resources to schedule the jobs as soon as they are submitted. Because these workloads are very large and typically very bursty, we want to control costs by avoiding clients sitting idle.
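It feels like it should be achievable with the pass-through strategy by calculating the total resource requirements of the jobs that are already allocated plus those that are blocked, expressed in units of capacity, and taking the larger of the CPU-based and memory-based figures, like so (this assumes a unit of capacity is 32 CPU cores at 3.4 GHz and 256 GB of RAM):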
scaling "batch-highmem" {
  enabled = true
  min     = 0
  max     = 50

  policy {
    check "estimated-capacity" {
      source = "prometheus"
      # Units needed = max(CPU-based estimate, memory-based estimate).
      # In PromQL, "a > b or b" yields a when a > b and b otherwise.
      query = <<EOF
ceil(sum(nomad_client_allocated_cpu{node_class=~"batch:highmem"}) / (32 * 3400)
  + sum(nomad_nomad_blocked_evals_cpu{node_class=~"batch:highmem"}) / (32 * 3400))
> ceil(sum(nomad_client_allocated_memory{node_class=~"batch:highmem"}) / (256 * 1024)
  + sum(nomad_nomad_blocked_evals_memory{node_class=~"batch:highmem"}) / (256 * 1024))
or ceil(sum(nomad_client_allocated_memory{node_class=~"batch:highmem"}) / (256 * 1024)
  + sum(nomad_nomad_blocked_evals_memory{node_class=~"batch:highmem"}) / (256 * 1024))
EOF
      strategy "pass-through" {}
    }
  }
}
It also feels super clunky.
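To spell out what the query is meant to compute, here is a rough Python sketch of the same arithmetic. The function name, constant names, and sample numbers are made up for illustration; only the unit size (32 cores at 3.4 GHz, 256 GB of RAM) comes from the policy above, and the inputs are assumed to be in MHz and MB like the Nomad metrics.

```python
import math

# Size of one unit of capacity (one client), as assumed in the policy above.
CPU_PER_CLIENT_MHZ = 32 * 3400   # 32 cores at 3.4 GHz
MEM_PER_CLIENT_MB = 256 * 1024   # 256 GB of RAM


def estimated_clients(allocated_cpu, blocked_cpu, allocated_mem, blocked_mem):
    """Hypothetical helper mirroring the PromQL query.

    Sums allocated and blocked resources, converts each dimension to a
    whole number of clients, and returns the larger of the two estimates.
    """
    by_cpu = math.ceil((allocated_cpu + blocked_cpu) / CPU_PER_CLIENT_MHZ)
    by_mem = math.ceil((allocated_mem + blocked_mem) / MEM_PER_CLIENT_MB)
    return max(by_cpu, by_mem)


if __name__ == "__main__":
    # Made-up sample values: memory is the binding dimension here.
    print(estimated_clients(200_000, 150_000, 500_000, 900_000))  # prints 6
```

In the sample call, CPU alone would need 4 clients and memory would need 6, so the estimate is 6. Because this only sums totals, it says nothing about how individual allocations actually pack onto clients, so I would treat the result as a floor rather than a guarantee.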
Does anyone have a better suggestion for estimating the capacity needed to run all of the submitted jobs for a given node class immediately by scaling out?