Hi! I have been using Nomad for some time now, and so far so good. Still, I have some questions regarding Nomad GC.
Currently, I have long-running jobs in my cluster, with applications that get deployed by changing their Docker image tag. After the change, Nomad schedules a rollout deployment, finishes it, and I have the new image running.
My issue comes with garbage collection. I have been reading the documentation and, if I am not wrong, Nomad will only GC dead/stopped jobs. Given that my jobs are always running, old images won't get cleaned up. Is there a way to fix this? I would like to delete old deployed images, as they are using too much space on my disks. I would also like to avoid adding a cron job on the servers.
Hi @magec! Each client keeps track of which Docker images its running jobs use. Each task launched increments the count, and each task that stops decrements the count. When the count reaches 0, the image is removed. So when you launch a new version of a job with an incremented image tag, Nomad should be cleaning up the old ones once those tasks have exited.
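To make the counting described above concrete, here is a minimal sketch in Python. It is a toy model only, not Nomad's actual implementation: the class name `ImageRefCounter`, its methods, and the `remove_image` callback are all hypothetical, and the image tags are made up.

```python
class ImageRefCounter:
    """Toy model of per-client image reference counting (not Nomad's code)."""

    def __init__(self, remove_image):
        self.counts = {}                  # image -> number of tasks using it
        self.remove_image = remove_image  # hypothetical callback that deletes an image

    def task_started(self, image):
        # Each launched task increments the count for its image.
        self.counts[image] = self.counts.get(image, 0) + 1

    def task_stopped(self, image):
        # An image with no count cannot be decremented; treat it as an error.
        if image not in self.counts:
            raise KeyError(f"no reference count for image {image!r}")
        self.counts[image] -= 1
        # When the last task using the image stops, the image is removed.
        if self.counts[image] == 0:
            del self.counts[image]
            self.remove_image(image)


removed = []
tracker = ImageRefCounter(remove_image=removed.append)

tracker.task_started("app:v1")   # two tasks running the old tag
tracker.task_started("app:v1")
tracker.task_started("app:v2")   # rollout starts a task on the new tag
tracker.task_stopped("app:v1")   # old tasks exit one by one
tracker.task_stopped("app:v1")   # count hits 0, so app:v1 is removed
```

In this model the old image only goes away once every task that used it has stopped, which is why a rollout that replaces all tasks should eventually free it.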
The docs about garbage collection are referring to the server’s view of the world, which doesn’t know anything about Docker images.
@magec Just a generic Docker question… if your images are incrementing versions of the same image, they would be sharing layers, so the “cleanup” would be just a soft cleanup without much disk space being reclaimed.
At this point I am not sure if by “not cleaned up”, you mean they are visible in “docker images”, or if “not much disk is freed up”.
Just as an experiment, could you try launching two completely different images on a particular agent and confirm that the disk space of the first image is indeed being reclaimed?
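As a rough illustration of the layer-sharing point, the following Python sketch estimates how much space removing one image would free when other images still use some of its layers. The layer names, sizes, and the helper `reclaimable_mb` are invented for the example; real numbers would come from Docker itself.

```python
def reclaimable_mb(image, images, layer_sizes):
    """Space freed by removing `image`: only layers no other image still uses."""
    still_used = set()
    for name, layers in images.items():
        if name != image:
            still_used.update(layers)
    return sum(layer_sizes[layer] for layer in images[image] if layer not in still_used)


# Made-up layer sizes in MB; two versions of the same application.
layer_sizes = {"base": 200, "deps": 150, "app-v1": 5, "app-v2": 6}
images = {
    "app:v1": ["base", "deps", "app-v1"],
    "app:v2": ["base", "deps", "app-v2"],
}

freed = reclaimable_mb("app:v1", images, layer_sizes)
```

With these assumed numbers, removing `app:v1` frees only its one unique layer, because `base` and `deps` are still needed by `app:v2`.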
OK, so as @tgross said, the expected behavior is that the image is removed after the container is destroyed. That was one of the things I wanted to confirm. I think it is not clear in the documentation (neither in the general docs nor in the Docker driver docs).
Still, this does not seem to be what is happening. Given that removal is the expected behavior, I am gathering some more information and will set up debug logging on some nodes to see whether I can figure out the issue.
@shantanugadgil yes, by ‘not cleaned up’ I mean they are still visible in docker images. The whole layering thing helps, but does not fix it completely. I will try to debug this a little bit to see what is wrong.
But when I run docker images I still see 00e4ff2bfd2e. That image is a candidate for deletion; it is what we had running before deploying. Maybe I am doing something wrong in the deployment? I simply change the tag and submit the change.
That’s interesting… that error message happens when Nomad tries to remove an image that it no longer has a count of. Do you have debug-level logs you could share from that Nomad client, specifically the ones tagged for client.driver_mgr.docker so we can see what’s happening before and after we get into this state?
In the end it was my fault. At the time Nomad's Docker driver tried to GC the image, it was being used by another process, which prevented the image from being deleted. Nomad tries once and forgets (which is convenient), so in the end the image is left behind.
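The "try once and forget" behavior can be sketched as follows. This is an illustrative Python model, not Nomad's code: `remove_once`, `fake_remove`, and the error text are hypothetical, and the only detail taken from the thread is the image ID.

```python
def remove_once(image, remove_image, log):
    """Attempt a single removal; on failure, log it and give up (no retry)."""
    try:
        remove_image(image)
        return True
    except RuntimeError as err:
        log.append(f"failed to remove {image}: {err}")
        return False


on_disk = {"00e4ff2bfd2e", "new-image"}
in_use_elsewhere = {"00e4ff2bfd2e"}  # assumed: held by a process outside Nomad


def fake_remove(image):
    # Stand-in for the real deletion call; refuses images that are still in use.
    if image in in_use_elsewhere:
        raise RuntimeError("image is in use")
    on_disk.discard(image)


log = []
ok = remove_once("00e4ff2bfd2e", fake_remove, log)
```

Because nothing retries after the failed attempt, the old image stays on disk until something else removes it, which matches what I was seeing.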