Hi folks.
This might be a bit of a weird query.
Does Nomad offer any access to some of the power management features on hosts with the Nomad client installed? I’m specifically thinking about stuff like controlling the number of cores available or frequency scaling on a host.
If it helps, I’m asking in the context of this paper here, which has some really interesting ideas that I think might be relevant to orchestration with Nomad:
This podcast here, with the founder of EmeraldAI, a company commercialising this idea, is of interest too:
Anyway, here’s the abstract from the paper “Turning AI Data Centers into Grid-Interactive Assets: Results from a Field Demonstration in Phoenix, Arizona” that got me thinking about the applicability of Nomad in this way.
The paper talks about using scheduling control alone to meaningfully reduce the power used by a cluster of servers, so that when a signal came in from the grid asking for a load reduction, it was possible to meet the request using only software:
Conducted at a 256-GPU cluster running representative AI workloads within a commercial, hyperscale cloud data center in Phoenix, Arizona, the trial achieved a 25% reduction in cluster power usage for three hours during peak grid events while maintaining AI quality of service (QoS) guarantees. By orchestrating AI workloads based on real-time grid signals without hardware modifications or energy storage, this platform reimagines data centers as grid-interactive assets that enhance grid reliability, advance affordability, and accelerate AI’s development.
One of the key ideas here is that flexible use of compute makes it easier to accommodate new loads on the grid, without needing to rely on building more gas turbines.
This is good from a climate perspective (my interest), but given that there is a years-long backlog for new gas turbines, it could also mean faster deployment for people trying to get capacity online.
Here’s the relevant part in the paper:
Recent studies demonstrate that load flexibility for AI data centers to reduce power use by roughly 25% for up to 200 hours a year, or far less than 1% of the time, could unlock up to 100 GW of new data center capacity in the U.S. without requiring extensive new generation or transmission infrastructure [11, 13]–enough to meet projected AI growth for the next decade.
I think the extra capacity this would free up is comparable to 2x all active data center capacity in the USA in 2024 (!).
How to do this with Nomad?
From what I can tell, the approach is based on classifying jobs by how tolerant they are to host-level changes that affect how quickly a job runs. The scheduler then makes runtime changes to machines running tolerant jobs, by scaling back the clock frequency or the number of cores in use. From memory, I think there were some rescheduling and checkpointing decisions made too.
From the paper:
Our workload tagging schema classifies jobs into flexibility tiers based on user tolerance for runtime or throughput deviations.
And then later, here’s how it uses this classification:
(the conductor) dynamically schedules jobs, modifies resource allocations for each job, and applies power-limiting techniques such as GPU frequency scaling. To guide its decisions, Conductor uses the Emerald Simulator, a system-level model trained to predict the power-performance behavior of AI jobs. The simulator evaluates the trade-offs of various orchestration strategies under operational constraints and grid needs, recommending an orchestration strategy to assure AI workload QoS guarantees while also meeting power grid response commitments.
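To make that concrete, here’s a very rough Go sketch of how I imagine the core decision loop. To be clear, this is my own strawman, not the paper’s algorithm or anything Nomad provides: the tier values, the power estimates, and the shed-to-cap logic are all invented for illustration.

```go
// Hypothetical sketch of a tier-aware power-shedding loop.
// Nothing here is from the paper or from Nomad itself; tiers,
// the power figures, and the shedding rule are all made up.
package main

import (
	"fmt"
	"sort"
)

type Job struct {
	ID       string
	Tier     int     // 0 = fully flexible ... 3 = latency-critical
	EstWatts float64 // current estimated draw
	MinWatts float64 // estimated draw when fully throttled
}

// shedToCap picks jobs to throttle so the cluster's estimated
// draw fits inside capWatts, sacrificing the most flexible
// tiers first.
func shedToCap(jobs []Job, capWatts float64) []string {
	var total float64
	for _, j := range jobs {
		total += j.EstWatts
	}
	// Most flexible (lowest tier) first.
	sort.Slice(jobs, func(a, b int) bool { return jobs[a].Tier < jobs[b].Tier })

	var throttled []string
	for _, j := range jobs {
		if total <= capWatts {
			break
		}
		// Assume full throttle per job, for simplicity.
		total -= j.EstWatts - j.MinWatts
		throttled = append(throttled, j.ID)
	}
	return throttled
}

func main() {
	jobs := []Job{
		{ID: "batch-train", Tier: 0, EstWatts: 400, MinWatts: 150},
		{ID: "fine-tune", Tier: 1, EstWatts: 300, MinWatts: 120},
		{ID: "inference-api", Tier: 3, EstWatts: 250, MinWatts: 200},
	}
	fmt.Println("throttle:", shedToCap(jobs, 700))
}
```

The real system clearly does much more (rescheduling, checkpointing, a learned power-performance model), but even this naive version shows where the flexibility tags would earn their keep.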
I think this seems doable with Nomad jobs, using the ‘meta’ tagging feature linked below:
https://developer.hashicorp.com/nomad/docs/job-specification/meta
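As a strawman, the tiering might look like this in a jobspec. The `power_flexibility` and `max_slowdown_pct` keys are made up by me; Nomad treats `meta` as opaque key/value pairs, so an external controller would have to read and act on them via the API:

```hcl
job "llm-batch-train" {
  type = "batch"

  # Made-up tagging convention: Nomad just stores this as opaque
  # metadata; an external controller would read it via the API and
  # decide which jobs can tolerate frequency/core throttling.
  meta {
    power_flexibility = "flexible" # e.g. flexible | degradable | critical
    max_slowdown_pct  = "50"
  }

  group "trainers" {
    task "train" {
      driver = "docker"
      config {
        image = "example/trainer:latest"
      }
    }
  }
}
```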
But to do this, I think you would need to be able to make live changes to hosts in response to signals like grid demand or grid carbon intensity.
In the paper, I think there are two main mechanisms used to keep power demand inside a power envelope that changes with demand on the grid (a rough host-level sketch follows the list):
- how many cores / GPUs are available at a given time, and
- the frequency those cores run at (I think this is done through dynamic frequency scaling)
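As far as I know, neither of those is something Nomad exposes today, but both map onto standard Linux and NVIDIA interfaces. Here’s a rough Go sketch of what a host-level agent might do; the sysfs paths and the nvidia-smi `--lock-gpu-clocks` flag are real, while the agent structure around them is just illustrative:

```go
// Rough sketch of host-level power actuation. The sysfs paths and
// the nvidia-smi flag are standard Linux/NVIDIA interfaces; the
// functions wrapping them are invented for illustration. Needs root.
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// setCPUMaxFreq caps one core's max frequency (in kHz) via cpufreq.
func setCPUMaxFreq(cpu, khz int) error {
	path := fmt.Sprintf("/sys/devices/system/cpu/cpu%d/cpufreq/scaling_max_freq", cpu)
	return os.WriteFile(path, []byte(fmt.Sprintf("%d", khz)), 0644)
}

// setCPUOnline takes a core on or off line (cpu0 usually can't be offlined).
func setCPUOnline(cpu int, online bool) error {
	v := "0"
	if online {
		v = "1"
	}
	path := fmt.Sprintf("/sys/devices/system/cpu/cpu%d/online", cpu)
	return os.WriteFile(path, []byte(v), 0644)
}

// lockGPUClocks pins a GPU's clock range with nvidia-smi
// (--lock-gpu-clocks takes min,max in MHz).
func lockGPUClocks(gpu, minMHz, maxMHz int) error {
	cmd := exec.Command("nvidia-smi",
		"-i", fmt.Sprintf("%d", gpu),
		fmt.Sprintf("--lock-gpu-clocks=%d,%d", minMHz, maxMHz))
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Run()
}

func main() {
	// Example response to a "reduce load" grid signal:
	_ = setCPUMaxFreq(4, 1_200_000) // cap cpu4 at 1.2 GHz
	_ = setCPUOnline(7, false)      // take cpu7 offline
	_ = lockGPUClocks(0, 300, 900)  // cap GPU 0 between 300 and 900 MHz
}
```

In practice I imagine you’d run something like this as a Nomad system job on each client, with the grid-signal controller deciding when and how hard to throttle.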
I know a while back there was an experimental fork of Nomad that had some of these scheduling ideas built in:
And here’s the README for it:
Does anyone know if further work with Nomad has been carried out, or if there have been any further experiments since then?