Understanding the differences in scheduling approaches between Nomad and Kubernetes

hi there.

Based on a few earlier conversations I’ve had about Nomad, Google Borg, and now Kubernetes, I’ve been prompted to read this paper from the ACM - Designing Cluster Schedulers for Internet-Scale Services. I think one of the authors originally worked on Nomad too, and I had a couple of questions as a result of reading it:


One key point detailed in the paper is how Nomad and Borg seem to use an event-based model for reconciliation, rather than constantly polling to check the state of the cluster:

Schedulers usually track the cluster state and maintain an internal finite-state machine for all the cluster objects they manage, such as clusters, nodes, jobs, and tasks. The two main ways of cluster state reconciliation are level- and edge-triggered mechanisms. The former is employed by schedulers such as Kubernetes, which periodically looks for unplaced work and tries to schedule that work. These kinds of schedulers often suffer from having a fixed baseline latency for reconciliation.

Edge-triggered scheduling is more common. Most schedulers, such as Mesos and Nomad, work on this model. Events are generated when something changes in the cluster infrastructure, such as a task failing, node failing, node joining, etc. Schedulers must react to these events, updating the finite state machine of the cluster objects and modifying the cluster state accordingly.
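To make the two models concrete, here’s a toy sketch of the difference (all names like `find_unplaced_work` are made up for illustration, not real Kubernetes or Nomad APIs): a level-triggered scheduler periodically scans for unplaced work, while an edge-triggered one reacts to each event as it arrives.

```python
import queue


class FakeCluster:
    """Stand-in for a cluster: tracks unplaced and placed work."""

    def __init__(self, unplaced):
        self.unplaced = list(unplaced)
        self.placed = []

    def find_unplaced_work(self):
        work, self.unplaced = self.unplaced, []
        return work

    def schedule(self, task):
        self.placed.append(task)


def level_triggered_tick(cluster):
    """One pass of a level-triggered loop (Kubernetes-style): scan, then
    place. In a real system this runs on a timer, so new work can wait up
    to one full interval before it is noticed - the fixed baseline latency
    the paper mentions."""
    for task in cluster.find_unplaced_work():
        cluster.schedule(task)


def edge_triggered_drain(cluster, events: queue.Queue):
    """Edge-triggered handling (Nomad/Mesos-style): react to each event
    as it arrives, with no polling latency - but every event must be
    processed, or the cluster state diverges."""
    while not events.empty():
        task = events.get()
        cluster.schedule(task)
```

The polling version trades latency for simplicity; the event version trades simplicity for responsiveness, which is exactly the tension the paper explores next.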

I’m trying to form a rough mental model of these differences - does Nomad still use this event-based, almost ‘callback’-like model, and does Kubernetes use (for want of a better term) ‘polling’ in this way?

If so, where ought I look to read a bit more about these?

I know about this page here, which explains the benefits of one over the other:

But the paper above outlines the tradeoffs made, and what steps are taken to mitigate the downsides:

While event-driven schedulers are faster and more responsive in practice, guaranteeing correctness can be harder since the schedulers have no room to drop or miss the processing of an event. Dropping cluster events will result in the cluster not converging to the right state; jobs might not be in their expected state or have the right number of tasks running. Schedulers usually deal with such problems by making the agents or the source of the cluster event resend the event until they get an acknowledgment from the consumer that the events have persisted.
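The “resend until acknowledged” pattern the paper describes can be sketched roughly like this - a hypothetical agent keeps each event in an outbox until the scheduler confirms it has persisted it, and the scheduler deduplicates by event id so redelivery is harmless (none of this is real Nomad code, just the at-least-once idea):

```python
class Agent:
    """Source of cluster events; keeps events until they are acked."""

    def __init__(self):
        self.outbox = {}      # event_id -> payload, not yet acknowledged
        self._next_id = 0

    def emit(self, payload):
        self._next_id += 1
        self.outbox[self._next_id] = payload
        return self._next_id

    def pending(self):
        return list(self.outbox.items())

    def ack(self, event_id):
        self.outbox.pop(event_id, None)


class Scheduler:
    """Consumer of events; persisting is idempotent per event id."""

    def __init__(self):
        self.persisted = {}   # event_id -> payload

    def receive(self, event_id, payload):
        self.persisted.setdefault(event_id, payload)  # dedup on redelivery
        return event_id       # the acknowledgment


def deliver(agent, scheduler):
    """One delivery round: resend everything pending, ack on success."""
    for event_id, payload in agent.pending():
        acked = scheduler.receive(event_id, payload)
        agent.ack(acked)
```

Because the scheduler’s `receive` is idempotent, the agent can safely resend an event any number of times until it sees the acknowledgment, which is how the dropped-event problem gets mitigated.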

Thanks in advance!

Hi @mrchrisadams

Nomad does still use this approach. It’s part of what allows it to scale so effectively.

For a good overview of the Nomad scheduler at a high level, I’d start with this article.

It explains the server-side scheduling process pretty well. I’m not sure if there is a similar client-side write-up, but I’ll look/ask around. Of course, you can always read the source :laughing:

Do you have any specific questions, or are you just looking for foundational knowledge?

Ah, I had read that before, but hadn’t realised the significance - thanks @DerekStrickland.

So this bit here:

An evaluation is created any time the external state, either desired or emergent, changes.

Is essentially referring to the ‘sort of callback’ I mentioned before. And this bit below (emphasis mine) describes the different kinds of state change you might need to account for when creating new allocations of resources, to end up with the cluster working how you want it to.

The desired state is based on jobs, meaning the desired state changes if a new job is submitted, an existing job is updated, or a job is deregistered. The emergent state is based on the client nodes, and so we must handle the failure of any clients in the system. These events trigger the creation of a new evaluation, as Nomad must evaluate the state of the world and reconcile it with the desired state.
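The way both kinds of change funnel into evaluations might be sketched like this (illustrative only - `Evaluation` and `EvalBroker` here are toy stand-ins, not Nomad’s real types):

```python
from dataclasses import dataclass


@dataclass
class Evaluation:
    trigger: str   # what changed, e.g. "job-register" or "node-failure"
    subject: str   # the job or node the change concerns


class EvalBroker:
    """Queue of evaluations for scheduler workers to pick up."""

    def __init__(self):
        self.queue = []

    def create(self, trigger, subject):
        self.queue.append(Evaluation(trigger, subject))


def on_job_register(broker, job_id):
    """Desired state changed: a new or updated job was submitted."""
    broker.create("job-register", job_id)


def on_node_failure(broker, node_id):
    """Emergent state changed: a client node failed."""
    broker.create("node-failure", node_id)
```

The point being that the scheduler never polls: both desired-state and emergent-state changes arrive as evaluations on the same queue.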

That’s really helpful for my understanding, and I think I know where to look further.

Why I was asking

If it’s interesting, my use case was understanding how something like Nomad might be able to run loads of lower priority, pre-emptible jobs that respond to local grid or energy availability conditions, and dynamically reschedule them in a way that optimises for the lowest carbon intensity.

I’ve written a bit about this, below, but it wasn’t so clear to me how to make something like Nomad support this use case:

That is a bit clearer now though - if you’re “just” responding to changes, then presumably it would be a case of a scheduler having some knowledge of:

  • what the different resource pools look like in terms of disk, RAM, etc.
  • what the projected carbon intensity of the electricity over the next 12-24 hrs would be for each resource pool

If you know that, I think you could come up with some kind of (for want of a better term) burn profile for running a job in a given resource pool, rank them, and schedule.
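As a very rough sketch of that ranking idea (all field names and numbers here are made up - this assumes you can estimate a pool’s free resources and its forecast grid carbon intensity in gCO2/kWh over the job’s expected runtime):

```python
def burn_profile(pool, job_kwh):
    """Estimated grams of CO2 emitted by running this job in this pool,
    using the average of the pool's carbon-intensity forecast."""
    forecast = pool["forecast_gco2_kwh"]
    avg_intensity = sum(forecast) / len(forecast)
    return job_kwh * avg_intensity


def rank_pools(pools, job):
    """Keep pools that fit the job's resources, then rank them by the
    lowest projected carbon burn."""
    feasible = [
        p for p in pools
        if p["free_ram_gb"] >= job["ram_gb"]
        and p["free_disk_gb"] >= job["disk_gb"]
    ]
    return sorted(feasible, key=lambda p: burn_profile(p, job["kwh"]))
```

So given two otherwise-equivalent pools, one on a low-carbon grid and one on a high-carbon grid, the scheduler would always prefer the cleaner one for work that can run anywhere.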

A friend and I have tried looking through the code to figure out how you’d actually implement this, in the issue below, but I think I’ll have to update parts of it now I know a bit more. The link points to the specific part of the issue where we discuss how you might do this:

If you had this, you could essentially have something like a cross between the Solar Protocol and Fly.io in Nomad. That would be pretty neat.

Related - this new paper looks into this field too - they reckon you can get 20% carbon savings through thoughtful scheduling of work that doesn’t need to be done right away.

Anyway - that whole other part, while neat, is probably off-topic. I think my original question has been answered now.

Ok! Glad to hear your original question got answered.

In terms of the general idea of making Nomad carbon-aware, we thought about this a bit as a team and came up with a list of ideas very similar to what you discussed in the GH issue link you posted.

These are the possible vectors to pursue that we have thought of so far. We’d love to hear if there are more that we are missing, or if any of these are misguided. We realize we are brand new to thinking about this problem domain.

NOTE: this is a thought experiment, not a roadmap commitment :smiling_face:

  • Adding carbon_score to node fingerprinting based on factors like cloud provider, datacenter location, time of day, instance size, etc.
  • Periodically re-calculate the carbon_score fingerprint.
  • Adding a configuration to the scheduler config that incorporates :point_up: when ranking nodes.
  • Use constraints to wait to schedule work until a node is available below a desired carbon_score threshold, if the work doesn’t need to be scheduled immediately.
    • Batch jobs seem like the most obvious use case.
    • This could already work if the carbon_score is added to node attributes during fingerprinting.
  • Emitting job metrics via the Prometheus exporter.
  • Emitting metrics for Nomad itself to see if tuning configuration can change your carbon impact.
  • Monitoring jobs for rescheduling to clients with a currently better carbon_score.
  • Use emitted metrics to:
    • Calculate a client’s current carbon load.
    • Optimize job placement or reschedule based on :point_up:.
  • Monitoring clients for auto-scaling opportunities based on node carbon_score.
  • Think about energy more generally and look for other meaningful attributes to add or metrics to emit.
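The threshold idea for deferrable batch work might look something like this sketch - to be clear, `carbon_score` as a node attribute is purely hypothetical here, not current Nomad behaviour:

```python
def place_or_defer(nodes, job):
    """Return the chosen node for the job, or None to keep it queued.

    Non-deferrable work takes the lowest-carbon node available right now;
    deferrable (e.g. batch) work waits until some node drops below the
    job's carbon_score threshold."""
    if not job.get("deferrable"):
        return min(nodes, key=lambda n: n["carbon_score"], default=None)
    eligible = [n for n in nodes
                if n["carbon_score"] <= job["max_carbon_score"]]
    return min(eligible, key=lambda n: n["carbon_score"], default=None)
```

If the function returns None, the job simply stays in the queue and gets re-evaluated when node fingerprints change - which fits naturally with the evaluation model discussed earlier in the thread.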

I’m sure there’s more. This is a really meaningful idea you have here. Thanks for bringing it to the community.


There are loads of other signals you might apply, depending on how much access to information you have from the environment - electricity grids have capacity limits just like network links do, and if you’re able to shape your own usage profile, that can be valuable in different ways. I’ve outlined a gentle introduction below, aimed at developers:

There’s also a concept called value stacking, which refers to how, if you’re able to respond to grid conditions, you can be compensated for making it easier for others to access a reliable source of energy.

This is most commonly discussed in the context of batteries, but in many cases, because datacentres use so much power in such a small space, being able to adjust power draw from the facility can work like a battery here - simply ‘treading more lightly’ on the grid frees up capacity for others to use.

However, I’d recommend starting with carbon as you have outlined, as in many ways it can act as a proxy for the others anyway. Having a carbon_score exposed at node level would enable quite a lot of powerful use cases.