Hi folks.
This might be a bit of a weird query.
Does Nomad offer any access to some of the power management features on hosts with the Nomad client installed? I’m specifically thinking about stuff like controlling the number of cores available or frequency scaling on a host.
If it helps, I’m asking in the context of this paper here, which has some really interesting ideas that I think might be relevant to orchestration with Nomad:
This podcast here, with the founder of EmeraldAI, a company commercialising this idea, is of interest too:
Anyway, here’s the abstract from the paper “Turning AI Data Centers into Grid-Interactive Assets: Results from a Field Demonstration in Phoenix, Arizona” that got me thinking about the applicability of Nomad in this way.
The paper talks about using scheduling control alone to meaningfully reduce the power used by a cluster of servers, so that when a signal came in from the grid asking for a load reduction, it was possible to meet the request using only software:
Conducted at a 256-GPU cluster running representative AI workloads within a commercial, hyperscale cloud data center in Phoenix, Arizona, the trial achieved a 25% reduction in cluster power usage for three hours during peak grid events while maintaining AI quality of service (QoS) guarantees. By orchestrating AI workloads based on real-time grid signals without hardware modifications or energy storage, this platform reimagines data centers as grid-interactive assets that enhance grid reliability, advance affordability, and accelerate AI’s development.
One of the key ideas here is that flexible use of compute makes it easier to accommodate new loads on the grid, without needing to rely on building more gas turbines.
This is good from a climate perspective (my interest), but given that there is a years-long backlog for new gas turbines, it could also mean faster deployment for people trying to get capacity online.
Here’s the relevant part in the paper:
Recent studies demonstrate that load flexibility for AI data centers to reduce power use by roughly 25% for up to 200 hours a year, or far less than 1% of the time, could unlock up to 100 GW of new data center capacity in the U.S. without requiring extensive new generation or transmission infrastructure [11, 13]–enough to meet projected AI growth for the next decade.
I think the extra capacity this would free up is comparable to 2x all active data center capacity in the USA in 2024 (!).
How to do this with Nomad?
From what I can tell, the approach is based on classifying jobs by how tolerant they are to host-level changes that affect how quickly a job runs. The scheduler then makes runtime changes to machines running tolerant jobs, by scaling back the clock frequency or the number of cores in use. From memory, I think there were some rescheduling and checkpointing decisions made too.
From the paper:
Our workload tagging schema classifies jobs into flexibility tiers based on user tolerance for runtime or throughput deviations.
And then later, here’s how it uses this classification:
(the conductor) dynamically schedules jobs, modifies resource allocations for each job, and applies power-limiting techniques such as GPU frequency scaling. To guide its decisions, Conductor uses the Emerald Simulator, a system-level model trained to predict the power-performance behavior of AI jobs. The simulator evaluates the trade-offs of various orchestration strategies under operational constraints and grid needs, recommending an orchestration strategy to assure AI workload QoS guarantees while also meeting power grid response commitments.
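To make that concrete, here’s a very rough Go sketch of how I imagine the core decision loop. To be clear, this is my own strawman, not the paper’s algorithm or anything Nomad provides: the tier values, the power estimates, and the shed-to-cap logic are all invented for illustration.

```go
// Hypothetical sketch of a tier-aware power-shedding loop.
// Nothing here is from the paper or from Nomad itself; tiers,
// the power figures, and the shedding rule are all made up.
package main

import (
	"fmt"
	"sort"
)

type Job struct {
	ID       string
	Tier     int     // 0 = fully flexible ... 3 = latency-critical
	EstWatts float64 // current estimated draw
	MinWatts float64 // estimated draw when fully throttled
}

// shedToCap picks jobs to throttle so the cluster's estimated
// draw fits inside capWatts, sacrificing the most flexible
// tiers first.
func shedToCap(jobs []Job, capWatts float64) []string {
	var total float64
	for _, j := range jobs {
		total += j.EstWatts
	}
	// Most flexible (lowest tier) first.
	sort.Slice(jobs, func(a, b int) bool { return jobs[a].Tier < jobs[b].Tier })

	var throttled []string
	for _, j := range jobs {
		if total <= capWatts {
			break
		}
		// Assume full throttle per job, for simplicity.
		total -= j.EstWatts - j.MinWatts
		throttled = append(throttled, j.ID)
	}
	return throttled
}

func main() {
	jobs := []Job{
		{ID: "batch-train", Tier: 0, EstWatts: 400, MinWatts: 150},
		{ID: "fine-tune", Tier: 1, EstWatts: 300, MinWatts: 120},
		{ID: "inference-api", Tier: 3, EstWatts: 250, MinWatts: 200},
	}
	fmt.Println("throttle:", shedToCap(jobs, 700))
}
```

The real system clearly does much more (rescheduling, checkpointing, a learned power-performance model), but even this naive version shows where the flexibility tags would earn their keep.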
I think this seems doable with Nomad jobs, using the ‘meta’ tagging feature linked below:
https://developer.hashicorp.com/nomad/docs/job-specification/meta
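As a strawman, the tiering might look like this in a jobspec. The `power_flexibility` and `max_slowdown_pct` keys are made up by me; Nomad treats `meta` as opaque key/value pairs, so an external controller would have to read and act on them via the API:

```hcl
job "llm-batch-train" {
  type = "batch"

  # Made-up tagging convention: Nomad just stores this as opaque
  # metadata; an external controller would read it via the API and
  # decide which jobs can tolerate frequency/core throttling.
  meta {
    power_flexibility = "flexible" # e.g. flexible | degradable | critical
    max_slowdown_pct  = "50"
  }

  group "trainers" {
    task "train" {
      driver = "docker"
      config {
        image = "example/trainer:latest"
      }
    }
  }
}
```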
But to do this, I think you would need to be able to make live changes to hosts in response to signals like grid demand or grid carbon intensity.
In the paper, I think there are two main mechanisms used to keep power demand inside a power envelope that changes with demand on the grid (a rough host-level sketch follows the list):
- how many cores / GPUs are available at a given time, and
- the frequency those cores run at (I think this is done through dynamic frequency scaling)
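As far as I know, neither of those is something Nomad exposes today, but both map onto standard Linux and NVIDIA interfaces. Here’s a rough Go sketch of what a host-level agent might do; the sysfs paths and the nvidia-smi `--lock-gpu-clocks` flag are real, while the agent structure around them is just illustrative:

```go
// Rough sketch of host-level power actuation. The sysfs paths and
// the nvidia-smi flag are standard Linux/NVIDIA interfaces; the
// functions wrapping them are invented for illustration. Needs root.
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// setCPUMaxFreq caps one core's max frequency (in kHz) via cpufreq.
func setCPUMaxFreq(cpu, khz int) error {
	path := fmt.Sprintf("/sys/devices/system/cpu/cpu%d/cpufreq/scaling_max_freq", cpu)
	return os.WriteFile(path, []byte(fmt.Sprintf("%d", khz)), 0644)
}

// setCPUOnline takes a core on or off line (cpu0 usually can't be offlined).
func setCPUOnline(cpu int, online bool) error {
	v := "0"
	if online {
		v = "1"
	}
	path := fmt.Sprintf("/sys/devices/system/cpu/cpu%d/online", cpu)
	return os.WriteFile(path, []byte(v), 0644)
}

// lockGPUClocks pins a GPU's clock range with nvidia-smi
// (--lock-gpu-clocks takes min,max in MHz).
func lockGPUClocks(gpu, minMHz, maxMHz int) error {
	cmd := exec.Command("nvidia-smi",
		"-i", fmt.Sprintf("%d", gpu),
		fmt.Sprintf("--lock-gpu-clocks=%d,%d", minMHz, maxMHz))
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Run()
}

func main() {
	// Example response to a "reduce load" grid signal:
	_ = setCPUMaxFreq(4, 1_200_000) // cap cpu4 at 1.2 GHz
	_ = setCPUOnline(7, false)      // take cpu7 offline
	_ = lockGPUClocks(0, 300, 900)  // cap GPU 0 between 300 and 900 MHz
}
```

In practice I imagine you’d run something like this as a Nomad system job on each client, with the grid-signal controller deciding when and how hard to throttle.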
I know a while back there was an experimental fork of Nomad that had some of these scheduling ideas built in:
And here’s the README for it:
Does anyone know if further work with Nomad has been carried out, or if there have been any further experiments since then?