Understanding the autoscaler: how do evaluation_interval and cooldown play together?

Hello all,

I want to understand the best settings for evaluation_interval and cooldown for an autoscaler that schedules a lot of batch jobs.

In the batch job example, I can see that evaluation_interval is 10s and cooldown is 1m. Does this allow for better conflict resolution? Checks | Nomad | HashiCorp Developer

I imagine that several evaluations are “merged” during the cooldown period. So when a burst of requests arrives, is a single merged request sent to the plugin rather than one per evaluation?

We use the pass-through plugin to create one instance per pending allocation. But during some bursts of requests, the autoscaler starts timing out when draining some instances. That is strange, since there are a lot of pending jobs.

My hypothesis is that my current configuration, with cooldown == evaluation_interval, breaks the autoscaler.

Hi @aleperalta,

The evaluation_interval determines how often the autoscaling policy is executed. If this configuration parameter is set to 10s, the policy will be processed by a worker every 10 seconds to determine whether any scaling action is required to reconcile the state based on the input metrics, current allocation number, and strategy.

When a scaling action is triggered and the count of a task group is changed, the autoscaling policy is put into cooldown. If the policy cooldown is set to 1m, the Nomad Autoscaler will not evaluate that scaling policy for at least 1 minute, and for at most 1 minute and 10 seconds. This is used to avoid flapping and to ensure new allocations have the required time to start/stop and for the metrics used during scaling decisions to be updated.
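
As a purely illustrative sketch of where the two settings live (the names, limits, and query below are placeholders, not a recommendation), a cluster scaling policy might look like this:

```hcl
# Illustrative cluster scaling policy; names and limits are placeholders.
scaling "batch_cluster_policy" {
  enabled = true
  min     = 1
  max     = 20

  policy {
    # Evaluate this policy every 10 seconds.
    evaluation_interval = "10s"

    # After any scaling action, skip evaluations for at least 1 minute
    # (and at most 1 minute plus one evaluation interval).
    cooldown = "1m"

    check "pending_batch_allocs" {
      source = "prometheus"
      query  = "<your metric query>"

      # Use the metric value directly as the desired count.
      strategy "pass-through" {}
    }

    # Cluster target block (aws-asg, azure-vmss, etc.) omitted for brevity.
  }
}
```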

cooldown == evaluation_interval breaks the autoscaler

Although that is not a configuration I have seen before, I don’t expect it to break anything, as the two options do not have a direct impact on each other.

during some bursts of requests the autoscaler starts timing out when draining some instances

If you are able to provide more context on this and any debug information, I am keen to help resolve the issue you’re seeing.

Thanks,
jrasell and the Nomad team

Hi jrasell,

Thank you for your response.

I’ll look into providing more debug information. Could you point me to a doc or link on how to add more debugging info? Or are some logs enough?

Our task group fits into one client only, due to resource constraints. We count pending allocations and start a new client using Datadog. One allocation per client. We use pass-through on the metrics.

Our current configuration works, except when there’s a burst of 100+ requests. We spin up 100+ instances. At that moment the autoscaler starts timing out when communicating with the agent while draining instances.

We have cooldown set to 30s and evaluation_interval set to 30s.
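
To make that concrete, our policy has roughly this shape (the metric name, tags, and limits below are simplified placeholders rather than our exact values, and the target block is omitted):

```hcl
scaling "batch_clients" {
  enabled = true
  min     = 1
  max     = 200

  policy {
    cooldown            = "30s"
    evaluation_interval = "30s"

    # Pass-through: the Datadog query result (pending allocations) is
    # used directly as the desired number of clients, one per allocation.
    check "pending_allocs" {
      source = "datadog"
      query  = "sum:nomad.nomad.job_summary.queued{job:batch-workload}"

      strategy "pass-through" {}
    }

    # Cluster target block (instance provisioning and drain settings)
    # omitted for brevity.
  }
}
```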

Thanks again.

Another question: what do you mean by flapping, and how are the metrics updated?

If we scale on pending allocations, say we have 4 pending and then 2 more arrive during the cooldown, are we going to scale by 6, or are there two scale-out events?

Or are some logs enough?

This would be a good start, ideally at debug level to provide additional context around the important log lines.

what do you mean by flapping and how are the metrics updated?

Essentially, the cooldown ensures all metrics have updated to account for the new state and that new jobs/servers have had a chance to start and emit their metrics. If you’re using Prometheus, for example, you would need to account for the scrape interval; otherwise the autoscaler might act on stale data and flap between states and numbers of jobs/servers.
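
As an illustrative sketch only (the address and intervals here are made up): if Prometheus scrapes every 15 seconds, you want the cooldown comfortably longer than that, so the next evaluation sees metrics that already include the allocations that just started or stopped.

```hcl
# Nomad Autoscaler agent configuration; illustrative values only.
apm "prometheus" {
  driver = "prometheus"
  config = {
    address = "http://prometheus.example.com:9090"
  }
}

policy {
  # Assuming a 15s Prometheus scrape_interval, a 1m default cooldown
  # leaves room for the metrics to catch up before a policy is
  # evaluated again after a scaling action.
  default_cooldown            = "1m"
  default_evaluation_interval = "10s"
}
```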

Flapping in the sense where jobs/servers are continually scaled up and down because of strategy calculations is something different and not relevant here as you’re using absolute values.

Thanks,
jrasell and the Nomad team

Hi jrasell,

I’m sorry I couldn’t reply with logs.
Your answer really did help me understand the autoscaler.

We did some experiments tweaking those values, and we came to a conclusion that turned into a question: do the autoscaler and the scheduler have a race condition such that the autoscaler can set a node to drain just as a client picks up an allocation to run on it?

We are scaling out or in based on pending + running allocations, just like the example here: On-demand Batch Job Cluster Autoscaling | Nomad | HashiCorp Developer. But we have seen a job’s allocation placed on a node that is immediately set to drain.

I just wanted to confirm whether that is expected.
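
For context, the drain-related part of our target block follows the tutorial’s shape (the names and values here are placeholders, not our exact settings):

```hcl
target "aws-asg" {
  aws_asg_name = "batch-clients-asg"
  node_class   = "batch"

  # How long the autoscaler waits for allocations on a draining node to
  # complete before the node is removed when scaling in.
  node_drain_deadline = "5m"
}
```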