Alloc directory is getting removed immediately after Workload is completed/failed

Is it expected that the alloc directory is cleaned up immediately after the workload completes or fails?
When testing new job configurations, some of the jobs fail, and we cannot triage them due to a lack of alloc logs. Clicking the alloc files in the UI takes me to a 404 page. Losing access to all of the log files makes it difficult to triage the issue.
This gets even worse when restart & reschedule attempts are set to 0.

So, is there a way to preserve the alloc dir after the alloc ends? Or is this a misconfiguration on the client?



Hi @krundru, the cleanup of allocation sandboxes is tunable via client configuration; look for the options starting with gc_ [1].
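
For reference, a minimal sketch of those gc_ options in the client block (the values shown are the defaults as I understand them, included here only for illustration):

client {
  enabled = true

  # How often the client checks whether terminal allocations should be GC'd.
  gc_interval = "1m"

  # Start GC'ing terminal allocations once disk usage exceeds this percentage.
  gc_disk_usage_threshold = 80

  # Start GC'ing terminal allocations once inode usage exceeds this percentage.
  gc_inode_usage_threshold = 70

  # Maximum number of allocations kept on the client before GC kicks in
  # regardless of disk usage.
  gc_max_allocs = 50

  # Number of allocations that may be garbage collected in parallel.
  gc_parallel_destroys = 2
}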

However, considering your allocations are so recent, I'm wondering if it's actually a problem with setting up the sandbox in the first place - e.g. failing to download an image or something like that. You can look at the Recent Events of the allocation to see what's going on there, e.g.

➜ nomad alloc status 26 | grep -A5 "Recent Events"
Recent Events:
Time                       Type             Description
2022-11-07T09:46:37-06:00  Alloc Unhealthy  Unhealthy because of failed task
2022-11-07T09:46:33-06:00  Not Restarting   Exceeded allowed attempts 2 in interval 30m0s and mode is "fail"
2022-11-07T09:46:33-06:00  Driver Failure   Failed to pull `shoenig/simple-http:does-not-exist`: API error (404): manifest for shoenig/simple-http:does-not-exist not found: manifest unknown: manifest unknown
2022-11-07T09:46:32-06:00  Driver           Downloading image

[1] client Stanza - Agent Configuration | Nomad | HashiCorp Developer

Thanks @seth.hoenig for responding to this issue.

A couple of things:
This is a real Nomad cluster with the client running on an AWS EC2 instance, and we didn't specify any gc configurations for the client.

When I looked at the disk usage, I found 92% used with only 650MB of free space left. Do you think this is the reason?

Hi Krundru,

That is exactly the issue. By default, Nomad will start garbage collecting allocations immediately if the client's disk usage is above 80%.

You’ll see a log entry similar to this when it’s happening:

client.gc: garbage collecting allocation: alloc_id=118a22a8-c186-546e-0f84-a1eb46a5d9d4 reason="disk usage of 83 is over gc threshold of 80"

The threshold for this can be managed using client Stanza - Agent Configuration | Nomad | HashiCorp Developer
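
For example, to have GC kick in later you could raise the disk usage threshold in the client block - the 90 below is just an illustrative value, not a recommendation:

client {
  # Only start GC'ing terminal allocations once disk usage exceeds 90%.
  gc_disk_usage_threshold = 90
}

That said, with the disk already at 92% used, freeing up space on the client is probably needed as well.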

Hope that helps.