Is there a way to specify the kill_signal used when a task is OOM (Out of Memory)?

axsuul · September 29, 2022, 2:43pm

I have background job tasks that need to exit cleanly when they are stopped. These tasks can sometimes grow in memory and should be restarted once they cross a threshold. However, when a task goes OOM, Nomad kills it with SIGKILL instead of the configured kill_signal. This happens with either of these configurations:

resources {
  memory = 1000
}

resources {
  memory = 500
  memory_max = 1000
}

Any suggestions on how I can get Nomad to kill a task gracefully once it approaches a memory threshold? Thanks

seth.hoenig · September 29, 2022, 3:18pm

Hi @axsuul, the OOM handling is implemented by Linux cgroups.

In the outgoing cgroups v1, what you’re asking for may have been plausible through clever use of memory.oom_control and having Nomad register a per-task watching routine issuing your custom signal.

In the new cgroups v2 world, I don’t think there is an equivalent functionality. You can read about the tools we have to work with in the Control Group v2 section of the Linux kernel documentation.

The closest I can think of off the top of my head would be to monitor memory.events.local["max"] and send a signal if that value changes, but that’s not the same as actually entering an OOM event.
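To make that polling idea concrete, here is a minimal sketch of such a watcher. It is not something Nomad provides; the function names, the polling interval, and the choice of SIGTERM are all assumptions for illustration, and the path to the task's memory.events.local file has to be supplied by the caller. As noted above, a rising "max" counter only means the cgroup hit its memory limit, not that an OOM kill happened.

```python
import os
import signal
import time


def parse_memory_events(text):
    """Parse the 'key value' lines of a cgroup v2 memory.events file into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def max_increased(previous, current):
    """Return True if the 'max' counter grew between two snapshots."""
    return current.get("max", 0) > previous.get("max", 0)


def watch(events_path, pid, sig=signal.SIGTERM, interval=1.0):
    """Poll events_path and send sig to pid the first time 'max' increases.

    events_path and pid are supplied by the caller (hypothetical wiring);
    sig stands in for whatever kill_signal the task expects.
    """
    with open(events_path) as f:
        previous = parse_memory_events(f.read())
    while True:
        time.sleep(interval)
        with open(events_path) as f:
            current = parse_memory_events(f.read())
        if max_increased(previous, current):
            os.kill(pid, sig)
            return current
        previous = current
```

Polling keeps the sketch simple; there is an unavoidable window between the counter changing and the signal being delivered, so the task can still be OOM-killed before it reacts.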

axsuul · September 29, 2022, 3:42pm

Thanks for your reply! Doing it the cgroups way sounds like it could cause some conflicts and race conditions. Would it be better to set a very high memory_max on the job and monitor memory usage with a custom script instead?

seth.hoenig · September 29, 2022, 4:10pm

If you own the source of the app, I’d probably try to implement some kind of in-process watcher, e.g.

via the runtime package in Go,
or Runtime (Java Platform SE 7) in Java,
etc.

But failing that, a sidecar that monitors memory usage would probably work too.
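As a rough illustration of the in-process watcher idea, here is a minimal Python sketch (the Go and Java runtimes mentioned above offer their own memory statistics instead). It assumes Linux, reads the process's resident set size from /proc/self/status, and sends the process its own graceful-shutdown signal once a threshold is crossed. The function names, the default interval, and the use of SIGTERM are assumptions, not anything prescribed by Nomad.

```python
import os
import signal
import threading


def rss_bytes(status_text):
    """Extract the resident set size in bytes from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1]) * 1024  # the file reports the value in kB
    raise ValueError("VmRSS not found")


def over_threshold(status_text, limit_bytes):
    """Return True once resident memory has reached limit_bytes."""
    return rss_bytes(status_text) >= limit_bytes


def start_watcher(limit_bytes, sig=signal.SIGTERM, interval=5.0):
    """Start a daemon thread that signals this process when memory is too high.

    Returns an Event; setting it stops the watcher.
    """
    stop = threading.Event()

    def loop():
        # wait() returns False on timeout, so the body runs once per interval
        while not stop.wait(interval):
            with open("/proc/self/status") as f:
                if over_threshold(f.read(), limit_bytes):
                    os.kill(os.getpid(), sig)
                    return

    threading.Thread(target=loop, daemon=True).start()
    return stop
```

With a task limit of memory = 1000, something like start_watcher(900 * 1024 * 1024) would trigger the application's normal shutdown handler with some headroom before the cgroup limit is reached; the exact margin is a judgment call.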

axsuul · September 29, 2022, 5:36pm

Thanks.

Is there a good way to get memory usage metrics from Nomad itself? I have tried querying /metrics and /allocation/<alloc-id> endpoints but they don’t return that info. Or what way would you recommend to get memory usage if I’m going to be doing the sidecar method?
