Alloc directory is getting removed immediately after Workload is completed/failed

Is it expected that the alloc directory is cleaned up immediately after the workload completes or fails?
When testing new job configurations, some of the jobs fail, and we cannot triage them due to a lack of alloc logs. Clicking the alloc files in the UI takes me to a 404 page. Losing access to all of the log files makes it difficult to triage the issue.
This gets even worse when restart & reschedule attempts are set to 0.

So, is there a way to preserve the alloc dir after the alloc ends? Or is this a misconfiguration on the client?



Hi @krundru, the cleanup of allocation sandboxes is tunable via client configuration; look for the options starting with gc_ [1].
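
For reference, a minimal sketch of those gc_ options in the client block (the values shown are the defaults as I understand them, included here only for illustration):

client {
  enabled = true

  # How often the client checks whether terminal allocations should be GC'd.
  gc_interval = "1m"

  # Start GC'ing terminal allocations once disk usage exceeds this percentage.
  gc_disk_usage_threshold = 80

  # Start GC'ing terminal allocations once inode usage exceeds this percentage.
  gc_inode_usage_threshold = 70

  # Maximum number of allocations kept on the client before GC kicks in
  # regardless of disk usage.
  gc_max_allocs = 50

  # Number of allocations that may be garbage collected in parallel.
  gc_parallel_destroys = 2
}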

However, considering your allocations are so recent, I'm wondering if it's actually a problem with setting up the sandbox in the first place - e.g. failing to download an image or something like that. You can look at the Recent Events of the allocation to see what's going on there, e.g.

➜ nomad alloc status 26 | grep -A5 "Recent Events"
Recent Events:
Time                       Type             Description
2022-11-07T09:46:37-06:00  Alloc Unhealthy  Unhealthy because of failed task
2022-11-07T09:46:33-06:00  Not Restarting   Exceeded allowed attempts 2 in interval 30m0s and mode is "fail"
2022-11-07T09:46:33-06:00  Driver Failure   Failed to pull `shoenig/simple-http:does-not-exist`: API error (404): manifest for shoenig/simple-http:does-not-exist not found: manifest unknown: manifest unknown
2022-11-07T09:46:32-06:00  Driver           Downloading image

[1] client Stanza - Agent Configuration | Nomad | HashiCorp Developer

Thanks @seth.hoenig for responding to this issue.

A couple of things:
This is a real Nomad cluster with the client running on an AWS EC2 instance, and we didn't specify any gc configurations for the client.

When I looked at the disk usage, I found 92% used with only 650MB of free space left. Do you think this is the reason?

Hi Krundru,

That is exactly the issue. By default, Nomad will start garbage collecting allocations immediately if the client's disk usage is above 80%.

You’ll see a log entry similar to this when it’s happening:

client.gc: garbage collecting allocation: alloc_id=118a22a8-c186-546e-0f84-a1eb46a5d9d4 reason="disk usage of 83 is over gc threshold of 80"

The threshold for this can be managed using client Stanza - Agent Configuration | Nomad | HashiCorp Developer
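
For example, to have GC kick in later you could raise the disk usage threshold in the client block - the 90 below is just an illustrative value, not a recommendation:

client {
  # Only start GC'ing terminal allocations once disk usage exceeds 90%.
  gc_disk_usage_threshold = 90
}

That said, with the disk already at 92% used, freeing up space on the client is probably needed as well.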

Hope that helps.