So the past few days I’ve been noticing some awfully odd behaviour with Nomad: we’ve had entire clients crashing left, right, and center, and after a bit of poking it seems Nomad is apparently very content to just allocate all the things. For instance, we have a node with 128 GB of memory, and if I search for that node in the topology view, it shows this: 175.01 GiB / 125.7 GiB.

Also, if we look at the allocations, that node is running 26 allocations (26 copies of the same job, with 3 tasks per job) at a total of 7136 MB of memory per allocation. Now, 26 × 7136 MB comes to roughly 181 GiB, which should not be happening: Nomad should not have placed this many allocations on this node, because they physically won’t fit. Granted, the jobs don’t actually use the entire 7136 MB, so we’ve been lucky so far that none of them have, but we’re starting to see more and more OOM errors.
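For context, the 7136 MB per allocation is just the sum of the three tasks’ memory reservations in the job spec. The task names, images, and the per-task split below are made up, but the shape of the job is roughly this:

```hcl
job "example" {
  group "app" {
    # Three tasks whose memory reservations add up to 7136 MB per allocation.
    # Names, images, and the per-task split are illustrative, not our real spec.
    task "api" {
      driver = "docker"
      config {
        image = "example/api:latest"
      }
      resources {
        memory = 4096 # MB
      }
    }

    task "worker" {
      driver = "docker"
      config {
        image = "example/worker:latest"
      }
      resources {
        memory = 2048 # MB
      }
    }

    task "sidecar" {
      driver = "docker"
      config {
        image = "example/sidecar:latest"
      }
      resources {
        memory = 992 # MB → 4096 + 2048 + 992 = 7136 MB reserved per allocation
      }
    }
  }
}
```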
The Nomad server is currently at 1.6.3 (yes, outdated, I know, but we can’t update due to… reasons); the clients are at 1.6.5 (same here, same reasons).
Output from the scheduler config:
nomad operator scheduler get-config -region=xxxxxx
Scheduler Algorithm = spread
Memory Oversubscription = false
Reject Job Registration = false
Pause Eval Broker = false
Preemption System Scheduler = true
Preemption Service Scheduler = false
Preemption Batch Scheduler = false
Preemption SysBatch Scheduler = false
Modify Index = 13117658
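For reference, this is roughly how I’ve been cross-checking what the scheduler thinks it has handed out versus what the allocations actually use (IDs redacted, substitute your own):

```shell
# What the node reports: its total resources and what the scheduler has allocated on it
nomad node status -verbose <node-id>

# Per-allocation view: reserved memory vs. actual usage for one of the 26 allocations
nomad alloc status -stats <alloc-id>
```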
So I guess the big question is: why is Nomad ignoring the actual physical memory size and allocating more than would actually fit on the machine?
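For what it’s worth, as far as I understand it the scheduler works from the memory the client fingerprints, minus anything reserved, and both of those can be overridden in the client config. The sketch below (with made-up values) is just to show the settings I mean; if `memory_total_mb` were set higher than the physical RAM, I’d expect exactly this kind of overcommit:

```hcl
client {
  # Overrides the fingerprinted total memory (in MB) when set.
  memory_total_mb = 131072

  reserved {
    # Memory (in MB) held back from scheduling for the OS and Nomad itself.
    memory = 1024
  }
}
```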