Docker images not Garbage Collected

Hi! I have been using Nomad for some time now, and so far so good. Still, I have some questions regarding Nomad GC.

Currently, I have long-running jobs in my cluster with applications that get deployed by changing their Docker image tag. After the change, Nomad schedules a rolling deployment, and once it finishes I have the new image running.

My issue comes with garbage collection. I have been reading the documentation and, if I am not wrong, Nomad will only GC dead/stopped jobs, and given that my jobs are always running, old images won't get cleaned up. Is there a way to fix this? I would like to delete old deployed images, as they are taking up too much disk space. I would also like to avoid adding a cron job on the servers.

Best regards, Jose Fernandez.

Hi @magec! Each client keeps track of which Docker images its running jobs use. Each task launched increments the count, and each task that stops decrements the count. When the count reaches 0, the image is removed. So when you launch a new version of a job with an incremented image tag, Nomad should be cleaning up the old ones once those tasks have exited.

The docs about garbage collection are referring to the server’s view of the world, which doesn’t know anything about Docker images.
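To make the reference-counting behavior concrete, here is a minimal Python sketch of the idea. All class and method names here are illustrative, not Nomad's actual code; the real driver calls the Docker API where this toy version just records the removal.

```python
# Toy model of the per-client image reference counting described above.
# Names are hypothetical; the real driver removes images via the Docker API.

class ImageRefCounter:
    def __init__(self):
        self._counts = {}   # image id/tag -> number of running tasks using it
        self._removed = []  # images "removed" (stand-in for the Docker API call)

    def task_started(self, image: str) -> None:
        """A task using this image was launched: increment its count."""
        self._counts[image] = self._counts.get(image, 0) + 1

    def task_stopped(self, image: str) -> None:
        """A task using this image exited: decrement, remove the image at zero."""
        self._counts[image] -= 1
        if self._counts[image] == 0:
            del self._counts[image]
            self._removed.append(image)  # real driver calls RemoveImage here


counter = ImageRefCounter()
# Two allocations of app:v1, then a rolling update to app:v2.
counter.task_started("app:v1")
counter.task_started("app:v1")
counter.task_started("app:v2")   # new version comes up
counter.task_stopped("app:v1")   # old allocations drain
counter.task_stopped("app:v1")
print(counter._removed)          # ['app:v1'] -- old image cleaned up once unused
```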

See here for more:

@magec Just a generic Docker question… if your images are incremental versions of the same image, they will be sharing layers, so the “cleanup” would be only a soft cleanup, without much disk space being reclaimed.

At this point I am not sure if by “not cleaned up”, you mean they are visible in “docker images”, or if “not much disk is freed up”.

Just as an experiment, could you try launching two completely different images on a particular agent and confirm that the disk space of the first image is indeed being reclaimed?
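To illustrate why layer sharing matters here, below is a toy Python model of reclaimable space. The layer names and sizes are made up for illustration; real sizes come from `docker system df` or the Docker API.

```python
# Toy model of Docker layer sharing: an image's size is the sum of its layers,
# but layers shared with other images are NOT freed when one image is removed.
# All names and sizes are illustrative.

def reclaimable_bytes(image_layers, other_images):
    """Bytes freed by deleting an image: only layers no other image references."""
    shared = set().union(*other_images) if other_images else set()
    return sum(size for layer, size in image_layers.items() if layer not in shared)

# app:v1 and app:v2 share a large base layer; only the small top layer differs.
v1 = {"base": 800_000_000, "app-v1": 20_000_000}
v2_layers = {"base", "app-v2"}

# Removing app:v1 while app:v2 exists frees only v1's unique 20 MB layer.
print(reclaimable_bytes(v1, [v2_layers]))  # 20000000
```

This is why two completely different images make a better experiment: with no shared layers, the full size of the first image should be reclaimed.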
Ok, so as @tgross said, the expected behavior is that the image is removed after the container is destroyed. That was one of the things I wanted to confirm; I think it is not clear in the documentation (neither the general docs nor the Docker driver ones).

Still, this does not seem to be happening for me, so given that it is the expected behavior, I am gathering some more information and will enable debug logging on some nodes to see whether I can figure out the issue.

@shantanugadgil by ‘not cleaned up’, yes, I mean they are visible in `docker images`. The layering helps, but does not solve it completely. I will try to debug this a little to see what is wrong.

Thanks for the help!

My finding so far is that Nomad is logging:

Apr 08 06:57:54 FILTERED nomad[32612]:     2020-04-08T06:57:54.924Z [WARN]  client.driver_mgr.docker: RemoveImage on non-referenced counted image id: driver=docker image_id=sha256:00e4ff2bfd2ec30be2f9c70f8d645f9d174f566e85155de071a5de78232386eb

But when I run `docker images` I still see 00e4ff2bfd2e. That image is a candidate for deletion; it is what we had before deploying. Maybe I am doing something wrong in the deployment? I simply change the tag and submit the change.

Thanks in advance, Jose Fernandez

That’s interesting… that error message happens when Nomad tries to remove an image that it no longer has a count for. Do you have debug-level logs you could share from that Nomad client, specifically the ones tagged client.driver_mgr.docker, so we can see what’s happening before and after we get into this state?

In the end it was my fault: at the time Nomad’s Docker driver tried to GC the image, it was being used by another process, which prevented the image from being deleted. Nomad tries once and forgets (which is convenient), so in the end the image is left behind.
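The “try once and forget” behavior described above can be sketched like this. The exception class and function are illustrative stand-ins, not Nomad’s actual API:

```python
# Sketch of best-effort image removal: if the image is in use by another
# container/process, the removal fails and is NOT retried later.
# ImageInUseError and remove_image are hypothetical names for illustration.

class ImageInUseError(Exception):
    pass

def remove_image(image_id, in_use_images, removed):
    """Best-effort removal: warn-and-continue on failure rather than retrying."""
    try:
        if image_id in in_use_images:
            raise ImageInUseError(image_id)
        removed.append(image_id)
    except ImageInUseError:
        # Nomad logs a warning here and moves on; the image stays on disk.
        pass

removed = []
remove_image("sha256:00e4ff2bfd2e", in_use_images={"sha256:00e4ff2bfd2e"},
             removed=removed)
print(removed)  # [] -- image left behind because another process was using it
```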

It works as expected then, no bug there. Thanks!