Nomad v0.12.6 repeatedly killed by oom_reaper after a few thousand completed batch jobs

Using 1 Nomad server (4GB memory) and 2 Nomad clients (2GB memory each) in a Vagrant VirtualBox cluster (ubuntu/focal64 boxes), I'm experimenting with a batch job of 50,000 alpine echo tasks. The cluster makes steady progress until roughly 4,000 allocations have completed. After that, progress grinds to a halt, with each Nomad client process repeatedly getting killed by the oom_reaper and restarting.
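
For reference, the two client VMs run the Nomad agent with a configuration along these lines. This is only a minimal sketch of a typical client setup; the data_dir and the server address are placeholders, not values copied from my cluster:

# client.hcl (minimal sketch; data_dir and the server address are placeholders)
data_dir = "/opt/nomad/data"

client {
    enabled = true
    servers = ["192.168.56.10:4647"]  # assumed host-only network address of the server VM (Nomad RPC port)
}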

I was hopeful that pull request #9093 would help here, since the problem it was reported to solve sounds very similar. However, v0.12.6 should already include that PR, and it did not help with this issue.

Is there some additional management that has to be done to complete a large batch like this, or does this look like a bug or limitation in Nomad?
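
For concreteness, the only kind of "management" I can think of is the client-side garbage collection of terminal allocations, which is tunable in the agent configuration. A minimal sketch, with purely illustrative values rather than anything I have tuned:

# Client GC settings (illustrative values, not a recommendation)
client {
    gc_interval              = "1m"   # how often the client reaps terminal allocations
    gc_max_allocs            = 50     # allocations tracked on a node before GC of terminal allocations is forced
    gc_parallel_destroys     = 2      # concurrent allocation destroys
    gc_disk_usage_threshold  = 80     # percent disk usage that triggers GC
    gc_inode_usage_threshold = 70     # percent inode usage that triggers GC
}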

The oom_reaper logs end like this:

Oct 22 01:26:14 client-two kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/nomad-client.service,task=nomad,pid=214571,uid=0
Oct 22 01:26:14 client-two kernel: Out of memory: Killed process 214571 (nomad) total-vm:1894584kB, anon-rss:499036kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:1372kB oom_score_adj:0
Oct 22 01:26:14 client-two kernel: oom_reaper: reaped process 214571 (nomad), now anon-rss:48kB, file-rss:0kB, shmem-rss:0kB
Oct 22 01:26:14 client-two sh[214570]: Killed
Oct 22 01:26:14 client-two systemd[1]: nomad-client.service: Main process exited, code=exited, status=137/n/a
Oct 22 01:26:14 client-two systemd[1]: nomad-client.service: Failed with result 'exit-code'.

This is the job file:

# alpineBatch.hcl
job "alpineBatch-50K" {
    datacenters = ["dc1"]
    type = "batch"
    group "alpines" {
        count = 50000
            volume "data" {
            type      = "host"
            read_only = false
            source    = "alpine"
        }
        task "alpineExample" {
            driver = "docker"
            config {
                image = "alpine:3"
                command = "sh"
                args = [ "-c", "echo `adjtimex | awk '/(time.tv_sec|time.tv_usec)/ { printf(\"%06d\", $2) }'` ${node.unique.name} ${env["NOMAD_ALLOC_INDEX"]} >> /data/alpineExample.log" ]
            }
            resources {
                memory = 64
            }
            volume_mount {
                volume      = "data"
                destination = "/data"
                read_only   = false
            }
        }
    }
}
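
One thing worth noting about the job file: the volume "data" block refers to a host volume named "alpine", so each client's agent configuration also needs a matching host_volume stanza. A minimal sketch, where the path is a placeholder rather than my actual mount point:

# In each client's agent configuration (path is a placeholder)
client {
    host_volume "alpine" {
        path      = "/srv/alpine-data"
        read_only = false
    }
}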

Hi @johnp789, just a heads up: v0.12.6 was a security bugfix release and doesn't contain the fix for #9093. That fix will be coming very soon in Nomad's v1.0 release.


Thanks for pointing that out.

After building Nomad v1.0.0-dev (8a90b7eb161a151875ca82050d2f028aa80c904a+CHANGES), I started the 50,000-task batch experiment again last night. This time the clients seemed to stay healthy up until around allocation index 9,620, but by the time I came back to the machine in the morning the Nomad server VM had gone to a load average above 28, with top showing 100% wa (I/O wait). The server VM was totally unresponsive and had to be forcibly shut down.

The allocation index written to alpineExample.log as a function of time looks like this. Something changes substantially at around 5,000 completed tasks.

[Plot "nomad-batch": allocation index in alpineExample.log vs. time, with a marked change around 5,000 completed tasks]

I've tried the experiment again today with Nomad v1.0.0-beta2 (3acb12bb809712ab2e63f0adc4a6422c2ade27da) and count = 15000. It still hit the same problem: the Nomad client processes consumed all the memory on the client nodes after about 10,000 completed allocations and were killed by the oom_reaper.