Long start times of allocations

Hi,
I’m trying to understand why allocations stay in the Client Status = pending state for 1-2 minutes. The allocations are for dispatched jobs and use the docker driver.
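For context, this is roughly the shape of the job (a minimal sketch; job/task names, image, and resource values are placeholders, not the real spec):

```hcl
# Hypothetical dispatch job sketch; names and image are placeholders.
job "pm-job" {
  datacenters = ["dc1"]
  type        = "batch"

  # Makes the job dispatchable via `nomad job dispatch`.
  parameterized {
    meta_required = ["input_id"]
  }

  group "main" {
    task "main" {
      driver = "docker"

      config {
        image = "example/worker:latest"
      }

      resources {
        cpu    = 500   # MHz share
        memory = 256
      }
    }
  }
}
```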

This is sample client TRACE log:

2022-07-18T06:34:47.232Z [TRACE] client.cpuset.v2: add allocation: name=pm-job-1/dispatch-1658126074-a8a797ce.main[0] id=8aa0ddd3-114e-6556-4725-0c428da894ea

2022-07-18T06:36:52.203Z [DEBUG] client.alloc_runner.task_runner: lifecycle start condition has been met, proceeding: alloc_id=8aa0ddd3-114e-6556-4725-0c428da894ea task=main

So more than 2 minutes elapsed between "add allocation" and "lifecycle start condition has been met". How can I check what Nomad was waiting for during that period?

The client is oversubscribed on CPU and runs about 250 allocs, but my understanding is that once an allocation is created its resources are already assigned, and there are no further checks of actual CPU usage, etc.?

Disk I/O is not exhausted, Docker containers start quickly when launched directly with the docker command (Nomad logs also indicate that once the start condition has been met, the container is created quickly), and the nomad process uses about 60% of a vCPU.
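To rule out the Docker daemon itself, I compared against launching a container by hand, something along the lines of (image is just an example):

```sh
# Containers started by hand come up quickly, so dockerd itself is not the bottleneck
time docker run --rm alpine:latest true
```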

My guess is this was caused by overloading the cgroups manager on the kernel side. I captured a nomad client profile and almost all the time was spent writing to cgroups files during alloc cleanups. It also looks like the throughput of cgroups operations the Linux kernel can sustain is generally rather low (from a few to tens of operations per second).
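For anyone wanting to reproduce the measurement: I grabbed the profile from the client's agent pprof endpoint (this requires enable_debug = true, or a suitable ACL token if ACLs are enabled; the address is the default local one):

```sh
# 30-second CPU profile of the local Nomad client agent
curl -s -o nomad-client.prof \
  "http://127.0.0.1:4646/v1/agent/pprof/profile?seconds=30"

# Top functions by CPU time; cgroup file writes dominated in my case
go tool pprof -top nomad-client.prof
```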

My fix was reducing the CPU overallocation (to lower the number of running allocs) and adding cgroup.memory=nokmem,nosocket to the kernel parameters to speed up cgroups operations. So far, after about 24h, it looks to be working well: start times are 1-2 seconds.
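In case it's useful to others, this is how the kernel parameter was applied on an Ubuntu/GRUB setup (stock paths assumed):

```sh
# 1. Append the options in /etc/default/grub, e.g.:
#      GRUB_CMDLINE_LINUX_DEFAULT="... cgroup.memory=nokmem,nosocket"
# 2. Regenerate the GRUB config and reboot:
sudo update-grub
sudo reboot

# 3. Verify after reboot that the options are active:
cat /proc/cmdline
```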

Container creation was still getting slower after 24h+ of uptime, and after about 48h dockerd hung and was unable to launch any new container. This is the GH issue: Dockerd slows down and finally hangs · Issue #43870 · moby/moby · GitHub. What worked for me was downgrading from Ubuntu 22.04 + kernel 5.15-aws + cgroups v2 to Ubuntu 20.04 + kernel 5.4-aws + cgroups v1.
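For completeness: you can check which cgroup hierarchy a host is on, and on systemd distros cgroups v1 can in principle be forced with a boot parameter instead of a full OS downgrade. I went with the downgrade, so treat the boot parameter as an untested alternative:

```sh
# "cgroup2fs" means cgroups v2 (unified hierarchy); "tmpfs" means cgroups v1
stat -fc %T /sys/fs/cgroup

# Untested alternative to downgrading: boot a systemd distro with cgroups v1
# by adding this to the kernel command line:
#   systemd.unified_cgroup_hierarchy=0
```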

I also noticed that the nomad process uses far less CPU on this downgraded setup; previously it hovered around 60-120% of a vCPU and now it's 15-40%.


Wow interesting, thanks for investigating and sticking with this, @aartur.

Are any of your tasks making use of resources.cores? On the Nomad side, the way the cpuset subsystem is managed changed significantly to support cgroups v2, but I'm unsure whether there would be a noticeable impact without setting that option.
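For reference (an illustrative snippet, not the original job), the option in question looks like this; cores reserves whole CPUs managed through the cpuset controller, whereas cpu is an MHz share:

```hcl
task "main" {
  driver = "docker"

  config {
    image = "example/worker:latest"   # placeholder image
  }

  resources {
    # cpu  = 500   # MHz share
    cores = 1      # reserves a dedicated core, managed via cpusets
  }
}
```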