Prestart tasks (resource allocation)

I’ve been thinking a lot about this one, as it’s something we run into quite a bit. Today, prestart (and poststop) tasks require dedicated resources, as one might expect. However, these resources stay allocated even after the task has run. For sidecars this makes sense, but for one-time actions it creates a scenario where node resources are allocated but won’t ever be used unless the group restarts.

When there are a lot of groups/jobs, each with their own prestart task, this can add up substantially. I was thinking that, unless it’s a sidecar task, the scheduler should release these resources (for prestart at least) back to the pool for allocation. If the container needs a restart and there aren’t enough resources, then it’s fine that it moves nodes (imo).

I’d be curious what the general opinion is on this.

Note that we use HARD CPU limits to prevent containers from spiking and starving host-level processes, so a soft limit isn’t really an option for us. We need to protect the host CPU.

Thanks!
Ian

Expanding on the use case a little.

Let’s say I have a container that, on startup, may be configured to do some temporary but intensive tasks. Typically the container is allocated 128 CPU, but in this case we need a temporary burst of 512 CPU. We also have to use hard CPU allocations to prevent host VM abuse.

So now we have a few choices (yes, DAS would help with some of these, but let’s say that’s not an option for now):

  • Over-provision the container to always be 512 CPU. Not terrible, until we consider that we may have 50 containers all with the same over-provisioning. That starts to get expensive quickly.
    OR
  • Use a prestart task with the 512 allocation, and then run the regular container at the 128 allocation.
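
To put rough numbers on the trade-off, here is a quick back-of-the-envelope sketch using only the figures from this post (50 containers, 128 steady, 512 burst). The "released" case is hypothetical: it assumes the prestart reservation is handed back once the prestart task finishes, which is the behavior being asked for here, not something Nomad is confirmed to do.

```python
# Back-of-the-envelope arithmetic; all numbers come from the post above.
# CPU values are in the same units the post uses.
CONTAINERS = 50
STEADY_CPU = 128   # what the container needs once it is running
BURST_CPU = 512    # what the one-time startup work needs


def fleet_reservation(per_container_cpu: int, containers: int = CONTAINERS) -> int:
    """Total CPU reserved across the fleet for a given per-container size."""
    return per_container_cpu * containers


# Option 1: over-provision every container at the burst size.
over_provisioned = fleet_reservation(BURST_CPU)

# Hypothetical: the prestart reservation is released after it finishes,
# so only the steady-state size stays reserved.
released_after_prestart = fleet_reservation(STEADY_CPU)

# CPU that stays reserved but idle when the burst size is never released.
idle_reservation = over_provisioned - released_after_prestart

print(over_provisioned, released_after_prestart, idle_reservation)
```

The gap between the two totals is the reservation that sits idle after startup in the over-provisioned case.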

Hope this sheds some light/context for discussion :slight_smile:
Ian

@tgross - QQ - is this the right forum for these kinds of questions, or is it better to make a GitHub issue with some sort of “design/discussion” flag?

It’s the discuss forum, so it might be the right place. :blush:

True - there are also a LOT of unanswered items here :slight_smile:

I think the team was on a well-deserved Christmas vacation. Your questions will certainly be picked up and answered. Just have some patience.

Um - I think you misunderstand my post but thanks for your feedback.

Nothing negative intended toward anyone, just a discussion about the right place for design topics.

Hi @idrennanvmware!

is this the right forum for these kinds of questions, or is it better to make a GitHub issue with some sort of “design/discussion” flag?

This is a great place for them. And if a discussion generates a bug report or feature request we can always open a GitHub issue for it. I think you’ll find the Nomad engineering team isn’t quite as aggressive at answering questions here as we are on GitHub issues, just because we want to leave room for folks from the community to participate here. We also have a question label in GitHub, so whichever works best for you.

On to the issue at hand…

The scheduler should be “doing the right thing” inasmuch as it should be allocating the minimum amount of resources required for the entire allocation, taking into account what tasks are running concurrently due to lifecycle. So for an example using RAM resources:

| Prestart Task | Main Task | Allocated |
|---|---|---|
| 100MB (sidecar) | 200MB | 300MB |
| 100MB (no sidecar) | 200MB | 200MB |
| 200MB (sidecar) | 100MB | 300MB |
| 200MB (no sidecar) | 100MB | 200MB |

It looks like the last line in that table is the unfortunate case you’re running into?
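
The rule that table implies can be written down in a few lines. This is only a sketch of my reading of the table, not Nomad's scheduler code, and the function name `allocated_mb` is made up for illustration: a sidecar prestart task runs alongside the main task, so the two add up, while a non-sidecar prestart task never overlaps with the main task, so only the larger of the two is needed.

```python
def allocated_mb(prestart_mb: int, prestart_is_sidecar: bool, main_mb: int) -> int:
    """Memory needed for one allocation under the rule implied by the table.

    Assumption: one prestart task and one main task, as in the table rows.
    """
    if prestart_is_sidecar:
        # Sidecar keeps running next to the main task: both are live at once.
        return prestart_mb + main_mb
    # Non-sidecar prestart finishes before main starts: only the peak matters.
    return max(prestart_mb, main_mb)


# The four rows of the table: (prestart MB, is sidecar, main MB)
rows = [
    (100, True, 200),
    (100, False, 200),
    (200, True, 100),
    (200, False, 100),
]

results = [allocated_mb(*row) for row in rows]
for row, total in zip(rows, results):
    print(row, "->", total)
```

In the last row the peak is set by the prestart task, so the allocation stays sized for work that has already finished.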

I suspect that when we were designing this, there was an assumption that in the common case the main task would require more resources. And it seems that we’re accounting for the entire allocation restarting, even though the only way that typically happens is if a user runs nomad alloc restart. The restart block of the jobspec controls the restart of tasks, not the whole alloc.

I pulled up the docs for lifecycle, the Learn Guide for Task Dependencies, and also the docs for resources and I see we’re definitely missing a description of the intention around prestart resources or the behavior of prestart tasks when the main task restarts. So I’ll open an issue for that documentation item for sure.

Opened “document resource scheduling for prestart tasks” (Issue #9725, hashicorp/nomad on GitHub).

Also, “docs: more documentation for lifecycle stanza” by cgbaker (Pull Request #9693, hashicorp/nomad on GitHub) has more documentation on the lifecycle behaviors around restarts, but that hasn’t been pushed to the website yet.

Thanks @tgross! So here’s a real example of what we see (using the UI).

We have a Zookeeper deployment that consists of the following (all in the same group):

1x Prestart Task NO sidecar (64 CPU)
1x Task (64 CPU)
1x Task (100 CPU)
1x Task (512 CPU)

Nomad UI reports reserved CPU: 740

In this case it’s minor that 64 is allocated and not freed up, but the scenario we are exploring involves prestart tasks that need more resources than the main tasks, so the resources left allocated matter more.
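
For reference, here is the arithmetic behind those numbers as a small sketch. The "lifecycle-aware" figure assumes the rule from the table earlier in the thread (a non-sidecar prestart task only contributes if it is larger than the main tasks combined); that is my reading of the intended behavior, not a confirmed description of the scheduler.

```python
# CPU figures from the Zookeeper group described above.
prestart_cpu = 64              # non-sidecar prestart task
main_cpus = [64, 100, 512]     # the three main tasks

# Plain sum of every task in the group.
simple_sum = prestart_cpu + sum(main_cpus)

# Assumed lifecycle-aware rule: the prestart task has finished before the
# main tasks start, so only the larger of the two phases is needed.
lifecycle_aware = max(prestart_cpu, sum(main_cpus))

# CPU that stays reserved for the finished prestart task.
held_back = simple_sum - lifecycle_aware

print(simple_sum, lifecycle_aware, held_back)
```

The plain sum is the figure that matches the 740 the UI reports.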

EDIT: Posted more details on the GitHub issue (with screenshots), so I’ll move over to that forum for now instead of double spamming you :slight_smile: