Infinite memory growth on Nomad Client


I was quite confused to discover that it is reasonably easy to make a Nomad Agent allocate an unbounded amount of memory (and consequently do the same to the Nomad Server). Here is the scenario I am running, with links to the code backing my claims and some measurements.

Assume a simple job with one task group and one task. Assume that this job is updated every minute and otherwise does the equivalent of sleep infinity. The job update takes an idempotency token to ensure that we’re not creating new instances.

As a result of such updates we start to accumulate allocs, one per update. Within a day we will accumulate some 1440 of them. This bloats the memory of both the server and the client. My preliminary measurement shows roughly 1GB of RSS, as reported by nomad.client.nomad.runtime.alloc_bytes, for around 2k allocations.
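
To put those numbers in perspective, here is a rough back-of-the-envelope projection. It uses only the figures quoted above (one update per minute, roughly 1GB for about 2k allocations) and assumes that memory grows linearly with the number of retained allocs, which I have not verified. I also treat 1GB as 2^30 bytes for simplicity. The function names are mine, purely for illustration.

```python
# Rough projection from the figures in this post. Assumption: memory grows
# linearly with the number of retained allocs (not verified).
MEASURED_ALLOCS = 2_000          # "around 2k allocations"
MEASURED_BYTES = 1 * 1024**3     # "roughly 1GB", taken here as 2^30 bytes


def allocs_per_day(interval_minutes=1):
    """Number of job updates (and hence new allocs) in 24 hours."""
    return (24 * 60) // interval_minutes


def bytes_per_alloc():
    """Average memory attributed to one retained alloc."""
    return MEASURED_BYTES / MEASURED_ALLOCS


def projected_gib(days, interval_minutes=1):
    """Projected memory after `days` of updates, in GiB."""
    total = days * allocs_per_day(interval_minutes) * bytes_per_alloc()
    return total / 1024**3


for days in (1, 7, 30):
    count = days * allocs_per_day()
    print(f"{days:>2} day(s): {count:>6} allocs, ~{projected_gib(days):.2f} GiB")
```

Under these assumptions that is roughly 0.5MB per alloc and about 0.72 GiB of growth per day for a single job.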

Now, one might think that these allocs are cleaned up due to the Nomad Client settings:

However, this knob only controls the garbage collection of on-disk artefacts associated with the allocations and the metadata. The GC is here:

And the metadata cleanup is here:

removeAlloc does not have other callsites and may only be triggered by one of three events:

  1. The scheduler tells the agent that the allocs no longer exist (because the job was GCed)
  2. Someone manually purges the allocs via nomad system gc
  3. The Nomad Agent is lost and the new Agent does not receive the history
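
To make the distinction concrete, here is a toy model of the behaviour I am describing. It is not Nomad code: the ToyClient class and its methods are hypothetical stand-ins, and the only name borrowed from the real code base is the idea of a removeAlloc-style call. It only illustrates my claim that cleaning up on-disk data leaves the in-memory metadata untouched until one of the events above happens.

```python
class ToyClient:
    """Hypothetical stand-in for a client's alloc bookkeeping (not Nomad code)."""

    def __init__(self):
        self.metadata = {}    # alloc_id -> state, held in memory
        self.on_disk = set()  # alloc_ids whose data still lives on disk

    def add_alloc(self, alloc_id):
        self.metadata[alloc_id] = "running"
        self.on_disk.add(alloc_id)

    def stop_alloc(self, alloc_id):
        self.metadata[alloc_id] = "terminal"

    def gc_disk(self):
        # Models the on-disk GC: frees data of terminal allocs only.
        # The metadata entry is deliberately left in place.
        for alloc_id, state in self.metadata.items():
            if state == "terminal":
                self.on_disk.discard(alloc_id)

    def remove_alloc(self, alloc_id):
        # Models the removeAlloc-style cleanup that drops the metadata.
        self.metadata.pop(alloc_id, None)
        self.on_disk.discard(alloc_id)


def simulate_updates(client, updates=1440):
    """One new alloc per update; the previous alloc becomes terminal."""
    previous = None
    for i in range(updates):
        if previous is not None:
            client.stop_alloc(previous)
        previous = f"alloc-{i}"
        client.add_alloc(previous)
        client.gc_disk()


client = ToyClient()
simulate_updates(client)
print(len(client.metadata), len(client.on_disk))  # 1440 1

# Only an explicit removal (e.g. the job being GCed) shrinks the metadata.
for alloc_id in list(client.metadata):
    client.remove_alloc(alloc_id)
print(len(client.metadata))  # 0
```

In this model the disk usage stays flat while the metadata table grows by one entry per update, which matches what I observe.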

Ideally, I would like to be able to control the garbage collection of alloc metadata as well as data within the Nomad Agent. I do not want it to be possible for someone to cause unbounded memory growth using nothing but correct APIs. Hence my two questions:

  1. What motivates the decision to keep an unbounded alloc history?
  2. Is there a setting I am missing which would constrain the absolute history per Nomad Client or per Job?

If the answer is “avoid idempotency tokens”, that would be valuable information as well.

It seems that maybe the intention here was to clean up such allocs:

But submitting jobs with new versions isn’t actually moving the create index at all, so the allocs are not cleaned up.