I’m trying to understand why allocations stay in the
Client Status = pending state for 1-2 minutes. The allocations are for dispatched jobs, they use the
This is a sample from the client TRACE log:
2022-07-18T06:34:47.232Z [TRACE] client.cpuset.v2: add allocation: name=pm-job-1/dispatch-1658126074-a8a797ce.main id=8aa0ddd3-114e-6556-4725-0c428da894ea
2022-07-18T06:36:52.203Z [DEBUG] client.alloc_runner.task_runner: lifecycle start condition has been met, proceeding: alloc_id=8aa0ddd3-114e-6556-4725-0c428da894ea task=main
So more than 2 minutes elapsed between "add allocation" and "lifecycle start condition has been met". How can I check what Nomad was waiting for in that period?
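To see whether this gap is typical across many allocations rather than a one-off, the two log messages can be paired per allocation ID. The following is only a rough sketch: the regular expressions are inferred from the two sample lines quoted above (with the full four-digit year in the timestamp), and `startup_gaps` is a hypothetical helper name, not part of Nomad.

```python
import re
from datetime import datetime

# Patterns are inferred from the two sample log lines only; other Nomad
# versions may word these messages differently.
TS = r"(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})Z"
ADD_RE = re.compile(
    TS + r" \[TRACE\] client\.cpuset\.v2: add allocation: .*\bid=(?P<id>[0-9a-f-]+)"
)
START_RE = re.compile(
    TS + r" \[DEBUG\] client\.alloc_runner\.task_runner: "
    r"lifecycle start condition has been met.*\balloc_id=(?P<id>[0-9a-f-]+)"
)


def parse_ts(text):
    """Parse a timestamp such as 2022-07-18T06:34:47.232 (the trailing Z is stripped by the regex)."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%f")


def startup_gaps(lines):
    """Return {alloc_id: seconds from 'add allocation' to 'start condition has been met'}."""
    added = {}
    gaps = {}
    for line in lines:
        match = ADD_RE.search(line)
        if match:
            # Keep the first 'add allocation' timestamp seen for each allocation.
            added.setdefault(match["id"], parse_ts(match["ts"]))
            continue
        match = START_RE.search(line)
        if match and match["id"] in added and match["id"] not in gaps:
            delta = parse_ts(match["ts"]) - added[match["id"]]
            gaps[match["id"]] = delta.total_seconds()
    return gaps
```

Passing an open log file to `startup_gaps` yields one entry per allocation; for the two lines quoted above the gap comes out at about 125 seconds. Sorting the result by value shows which allocations waited longest, which can then be lined up against other client log entries from the same time window.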
The client is oversubscribed with CPU and runs about 250 allocs, but my understanding is that once the allocation is created, the resources are assigned and there are no further checks of real CPU usage etc.?
Disk I/O is not exhausted, and the Docker containers start quickly when launched with the docker command (the Nomad logs also indicate that once the start condition has been met, a container is created quickly). The nomad process CPU usage is about 60% of a vCPU.
My guess is that this was caused by overloading the cgroups manager on the kernel side. I captured a nomad client profile, and almost all of the time was spent writing to cgroups files during alloc cleanups. It also looks like the throughput of cgroups operations that the Linux kernel is able to sustain is generally rather low (from a few to tens of operations per second).
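A crude way to sanity-check that throughput figure on a given host is to time how fast child cgroups can be created and removed. This is a minimal sketch under a few assumptions: `cgroup_churn_rate` is a hypothetical helper, it only exercises mkdir/rmdir (not the file writes seen in the profile), and running it against a real cgroup directory requires root and a parent cgroup that allows child creation.

```python
import os
import time


def cgroup_churn_rate(parent, count=50):
    """Create and remove `count` child directories under `parent`.

    Returns the observed rate in operations per second, counting each
    mkdir and each rmdir as one operation.
    """
    start = time.perf_counter()
    for i in range(count):
        path = os.path.join(parent, f"churn-test-{os.getpid()}-{i}")
        os.mkdir(path)  # inside a cgroup filesystem this creates a child cgroup
        os.rmdir(path)  # and this removes it again
    elapsed = time.perf_counter() - start
    return (2 * count) / elapsed if elapsed > 0 else float("inf")
```

Comparing the rate on a loaded client with the rate on an idle machine of the same type gives a rough idea of whether cgroup operations themselves have become the bottleneck.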
My fix was reducing the CPU overallocation (to reduce the number of allocs running) and adding cgroup.memory=nokmem,nosocket to the kernel parameters to make cgroups operations faster. So far it looks to be working well: after about 24h, the start times are 1-2 seconds.
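After changing kernel parameters (on Ubuntu this is typically done through GRUB_CMDLINE_LINUX in /etc/default/grub followed by update-grub and a reboot), it is worth confirming that the running kernel actually received them. A small sketch, assuming the parameters are visible in /proc/cmdline; the function names are hypothetical:

```python
def cgroup_memory_options(cmdline):
    """Return the set of options passed via cgroup.memory= on a kernel command line."""
    options = set()
    for token in cmdline.split():
        if token.startswith("cgroup.memory="):
            value = token.split("=", 1)[1]
            options.update(opt for opt in value.split(",") if opt)
    return options


def running_kernel_cgroup_memory_options(path="/proc/cmdline"):
    """Read the booted kernel's command line and extract its cgroup.memory options."""
    with open(path) as handle:
        return cgroup_memory_options(handle.read())
```

On a client booted with the setting described above, `running_kernel_cgroup_memory_options()` should contain both `nokmem` and `nosocket`; an empty set means the parameter did not make it onto the command line.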
Container creation was still getting slower after 24h+ of uptime, and after about 48h it resulted in dockerd hanging and being unable to launch any new containers. This is the GH issue: Dockerd slows down and finally hangs · Issue #43870 · moby/moby · GitHub. So what worked for me was downgrading from Ubuntu 22.04 + kernel 5.15-aws + cgroups_v2 to Ubuntu 20.04 + kernel 5.4-aws + cgroups_v1.
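When comparing hosts before and after such a downgrade, it helps to confirm which cgroup hierarchy each one is really using. A minimal sketch, relying on the fact that a unified cgroups v2 mount exposes a cgroup.controllers file at its root; `cgroup_mode` is a hypothetical helper and deliberately does not try to distinguish pure v1 from hybrid setups:

```python
import os


def cgroup_mode(root="/sys/fs/cgroup"):
    """Classify the cgroup layout mounted at `root`."""
    if os.path.isfile(os.path.join(root, "cgroup.controllers")):
        # The unified (v2) hierarchy has this file at its root.
        return "v2"
    if os.path.isdir(root):
        # Mounted, but not unified: legacy v1 or a hybrid layout.
        return "v1 or hybrid"
    return "unknown"
```

Calling `cgroup_mode()` on each client gives a quick check that the Ubuntu 20.04 hosts are not running the unified hierarchy, while the 22.04 hosts are.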
I also noticed that the nomad process uses far less CPU on this downgraded setup: previously it hovered around 60-120% of a vCPU, and now it’s 15-40%.
Wow interesting, thanks for investigating and sticking with this, @aartur.
Are any of your Tasks making use of resources.cores? On the Nomad side, the way the cpuset subsystem is managed did change significantly to support cgroups V2, but I’m unsure if there would be a noticeable impact without setting that option.