I’ve been thinking a lot about this one as it’s something we run into quite a bit. Today, prestart (and poststop) tasks require dedicated resources, as one might expect. However, these resources stay allocated even after the task has run. For sidecars this makes sense, but for one-shot tasks it creates a scenario where node resources are allocated but will never be used unless the group restarts.
When there are a lot of groups/jobs, each with their own prestart task, this can add up substantially. I was thinking that unless it’s a sidecar task, the scheduler should release these resources (for prestart at least) back to the pool for allocation. If the container needs a restart and there aren’t enough resources, then it’s fine (imo) that it moves nodes.
I’d be curious what the general opinion is around this.
Note: we use HARD CPU limits to prevent containers from spiking and starving host-level processes, so a soft limit isn’t really an option for us. We need to protect the host CPU.
Let’s say I have a container that, on startup, may be configured to do some temporary but intensive work. Typically the container is allocated 128 CPU, but in this case we need a temporary burst of 512 CPU. We also have to use hard CPU limits to prevent host VM abuse.
So now we have a few choices (yes, DAS would help with some of these, but let’s say that’s not an option for now):
Over-provision the container to always be 512 CPU. Not terrible, until we consider that we may have 50 containers, all with the same over-provisioning. That gets expensive quickly.
OR
Use a prestart task with the 512 allocation, and then run the regular container at the 128 allocation.
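As an illustration, that second option might look something like this (task names, images, and exact values here are hypothetical; `cpu_hard_limit` is the Docker driver option we use to enforce hard CPU caps):

```hcl
group "app" {
  task "warmup" {
    driver = "docker"

    lifecycle {
      hook    = "prestart"
      sidecar = false # one-shot: runs to completion before "main" starts
    }

    config {
      image          = "example/warmup:latest" # hypothetical image
      cpu_hard_limit = true                    # hard-cap CPU to protect the host
    }

    resources {
      cpu = 512 # temporary burst for the intensive startup work
    }
  }

  task "main" {
    driver = "docker"

    config {
      image          = "example/app:latest" # hypothetical image
      cpu_hard_limit = true
    }

    resources {
      cpu = 128 # steady-state allocation
    }
  }
}
```

The problem described above is that the 512 from `warmup` stays reserved for the life of the alloc even after the prestart task has exited.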
Hope this sheds some light/context for discussion
Ian
Is this the right forum for this kind of question, or is it better to open a GitHub issue with some sort of “design/discussion” flag?
This is a great place for them. And if a discussion generates a bug report or feature request we can always open a GitHub issue for it. I think you’ll find the Nomad engineering team isn’t quite as aggressive at answering questions here as we are on GitHub issues, just because we want to leave room for folks from the community to participate here. We also have a question label in GitHub, so whichever works best for you.
On to the issue at hand…
The scheduler should be “doing the right thing” inasmuch as it should be allocating the minimum amount of resources required for the entire allocation, taking into account what tasks are running concurrently due to lifecycle. So for an example using RAM resources:
| Prestart Task | Main Task | Allocated |
|---|---|---|
| 100MB (sidecar) | 200MB | 300MB |
| 100MB (no sidecar) | 200MB | 200MB |
| 200MB (sidecar) | 100MB | 300MB |
| 200MB (no sidecar) | 100MB | 200MB |
It looks like the last line in that table is the unfortunate case you’re running into?
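To make the rule explicit, here’s a toy model of that sizing logic (this is just a sketch of the behavior in the table, not Nomad’s actual code):

```python
def allocated_mb(prestart_mb: int, sidecar: bool, main_mb: int) -> int:
    """Toy model of how an allocation is sized for a prestart task
    plus a main task (illustrative only, not Nomad's actual code)."""
    if sidecar:
        # A sidecar prestart keeps running alongside the main task,
        # so both sets of resources are needed at once.
        return prestart_mb + main_mb
    # A one-shot prestart finishes before the main task starts, so the
    # allocation is sized to the peak phase -- but that peak stays
    # reserved for the life of the alloc, which is the issue here.
    return max(prestart_mb, main_mb)

# Reproduces the table above:
assert allocated_mb(100, True, 200) == 300
assert allocated_mb(100, False, 200) == 200
assert allocated_mb(200, True, 100) == 300
assert allocated_mb(200, False, 100) == 200
```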
I suspect that when we were designing this, there was an assumption that in the common case the main task would require more resources. And it seems that we’re accounting for the entire allocation restarting, even though the only way that typically happens is if a user runs `nomad alloc restart` – the `restart` block of the jobspec controls the restart of tasks, not the whole alloc.
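For reference, that task-level restart behavior is what the `restart` block configures (values here are illustrative, not defaults):

```hcl
restart {
  attempts = 2     # restarts the failed task in place...
  interval = "30m"
  delay    = "15s"
  mode     = "fail"
}
```

None of this re-runs the whole allocation, which is why the “entire alloc restarts” case the sizing accounts for is comparatively rare.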
I pulled up the docs for `lifecycle`, the Learn Guide for Task Dependencies, and also the docs for `resources`, and I see we’re definitely missing a description of the intention around prestart resources and the behavior of prestart tasks when the main task restarts. So I’ll open an issue for that documentation item for sure.
In this case it’s minor that 64 is allocated and never freed up, but the scenario we’re exploring involves prestart tasks that need more resources than the main tasks, so the allocated-but-unused resources matter much more.
EDIT: Posted more details on the GitHub issue (with screenshots), so I’ll move over to that forum for now instead of double-spamming you.