We have a cluster in which developers can launch arbitrary batch jobs. The jobs require varying resources, and we want to autoscale up and down based on instances that best fit these resources.
In EKS, this worked because the autoscaler could see both the CPU and memory requested by each job and the CPU and memory provided by the instance type in the ASG. It would then scale up by the number of instances required to schedule all the jobs. For example, say you submitted 4 jobs requesting 8 CPUs each. The autoscaler could add a single 32 CPU instance to schedule all 4 jobs.
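To make the behaviour I'm after concrete, here is a rough sketch of the calculation, not the actual EKS autoscaler logic. It uses a simple first-fit-decreasing packing; the names `Resources` and `instances_needed` and the memory figures are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    cpus: float
    memory_gb: float


def instances_needed(jobs, instance):
    """Estimate how many identical instances are needed to place all jobs.

    Uses first-fit decreasing: largest jobs first, each placed on the first
    instance that still has enough CPU and memory for it.
    """
    free = []  # remaining [cpus, memory_gb] on each instance opened so far
    ordered = sorted(jobs, key=lambda j: (j.cpus, j.memory_gb), reverse=True)
    for job in ordered:
        if job.cpus > instance.cpus or job.memory_gb > instance.memory_gb:
            raise ValueError(f"{job} cannot fit on {instance}")
        for slot in free:
            if slot[0] >= job.cpus and slot[1] >= job.memory_gb:
                slot[0] -= job.cpus
                slot[1] -= job.memory_gb
                break
        else:
            # No existing instance has room, so open a new one.
            free.append([instance.cpus - job.cpus,
                         instance.memory_gb - job.memory_gb])
    return len(free)


# The example above: 4 jobs of 8 CPUs each on a 32 CPU instance type.
# The memory values are assumptions, chosen so that CPU is the limit.
instance = Resources(cpus=32, memory_gb=128)
jobs = [Resources(cpus=8, memory_gb=16)] * 4
print(instances_needed(jobs, instance))  # 1
```

The point is that the input is the resources requested by jobs that have not been placed yet, not the utilisation of nodes that already exist.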
In Nomad, I’ve tried two ways of achieving this, but neither of them seems to work for my use case:
- Following the “On-demand Batch Job Cluster Autoscaling” documentation. The problem is that this creates a 1:1 mapping between the number of instances and the number of jobs - which is not what I’m looking for.
- Autoscaling based on CPU allocated vs. CPU available. The problem with this is that the amount of CPU allocated doesn’t change until an allocation is created, which means that my queued jobs either stay queued or, at best, the cluster scales up only 1 instance at a time. In our scenario, developers might submit 50 jobs simultaneously that require scaling the cluster out to 1, 25, or 50 instances depending on resource requirements.
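As a rough illustration of the gap in the second approach, the sketch below compares the two signals for 50 identical queued jobs. The function names, the 32 CPU instance size (carried over from the earlier example), and the per-job CPU values are assumptions picked to reproduce the 1, 25, and 50 cases; none of this is Nomad Autoscaler code.

```python
import math


def allocated_cpu_ratio(allocated_cpus, total_cpus):
    """Utilisation-style signal: it only sees CPU that is already allocated,
    so queued jobs with no allocation do not move it."""
    return allocated_cpus / total_cpus


def instances_for_identical_jobs(job_count, cpus_per_job, instance_cpus):
    """Demand-style signal: instances needed for job_count identical jobs."""
    jobs_per_instance = int(instance_cpus // cpus_per_job)
    if jobs_per_instance < 1:
        raise ValueError("a single job does not fit on one instance")
    return math.ceil(job_count / jobs_per_instance)


# One idle 32 CPU node with 50 jobs queued: the ratio is 0.0 whatever
# the queued jobs request, because nothing has been allocated yet.
print(allocated_cpu_ratio(allocated_cpus=0, total_cpus=32))  # 0.0

# The same 50 jobs need very different cluster sizes depending on their size.
for cpus_per_job in (0.5, 16, 32):
    print(cpus_per_job, instances_for_identical_jobs(50, cpus_per_job, 32))
# 0.5 1
# 16 25
# 32 50
```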
Is there another way to approach this?