Mixed spot/ondemand autoscaling with Nomad

I’m considering a migration from EKS to Nomad for a number of reasons, but there’s one specific use case that I can’t find an obvious way of supporting. Basically, we want to:

  • Use a mixture of spot and ondemand autoscaling groups
  • Launch batch jobs that will prefer to use spot instances
  • If there is no spot capacity across any of the instance types in the ASG for a given (preferably configurable) period, e.g. 5 minutes, automatically fall back to using ondemand instances.

So basically, if there is spot capacity, use it, but fall back to ondemand if none has been available for the length of the timeout.

In Kubernetes terms, this would be usually implemented using node affinity, though practically it mostly has to be implemented in the autoscaler there as well.

I’m not averse to building strategy/target plugins to implement this behaviour if it isn’t doable with the existing Nomad, but it’s not clear to me if this is even possible within the current plugin architecture.

Can anyone provide guidance/suggestions as to how to achieve this?


Hi @conorcurlett,

Thanks for using Nomad!

I’m meeting with the lead of the autoscaler project tomorrow, and I’ve added your question to our agenda. I’ll get back to you with some feedback after that, so stay tuned.

Thanks again for being part of the community!

@DerekStrickland and the Nomad Team

Hi @conorcurlett,

Thanks for using Nomad!

After discussing this with the autoscaler team lead, we think you can get part of
the way there with the current feature set. What you can do right now is define
multiple affinity stanzas with different weights that instruct Nomad to prefer
spot instances, based on metadata you define on the nodes. Here is an example of
how you might do that.

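# Strongest preference: nodes whose metadata marks them as spot.
affinity {
    attribute = "${meta.instance-type}"
    value     = "spot"
    weight    = 100
}

# Weaker preference: nodes whose metadata marks them as on-demand.
affinity {
    attribute = "${meta.instance-type}"
    value     = "on-demand"
    weight    = 50
}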

If you’ve applied this metadata during provisioning, this will cause Nomad to try
to schedule on spot instances first. If that is good enough, great. If you really
need the 5-minute threshold, and I can imagine why you would, Nomad would have to
be modified. If you do need this feature, please raise an issue on GitHub, and
we’ll triage it.
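
For reference, here is a minimal sketch of how that node metadata might be set in the Nomad client configuration on each instance. It assumes you generate a different agent config per autoscaling group (for example from the launch template's user data). The "instance-type" key is just the name used in the affinity example above, and the values have to match the affinity values exactly.

```hcl
# Client agent config for nodes in the spot ASG (illustrative sketch).
client {
  enabled = true

  # Arbitrary key/value metadata, exposed to jobs as ${meta.<key>}.
  meta {
    "instance-type" = "spot"
  }
}

# For nodes in the on-demand ASG, the same block would instead set:
#   "instance-type" = "on-demand"
```

Once a node has registered, the output of `nomad node status -verbose <node-id>` should include its metadata, which is a quick way to check that the values match what the affinity stanzas expect.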

Thanks for saying you were open to contributing a PR. If you decide to try
implementing this feature, please let us know on the GitHub issue. The guidance I
can give you is this: the changes needed for some sort of timeout failover logic
would have to be made in the core scheduler of Nomad itself, not in any plugin.
I’d suggest looking at the scheduler package and searching for affinity. From
there, hopefully, you can find the appropriate place to add this logic.

Hope that helps. Let me know which way you decide to go.

Cheers!

@DerekStrickland and the Nomad Team
