Hi folks!
We are running Nomad Autoscaler (v0.3.5) and using the Target Value Strategy plugin as part of our scaling policy.
In particular, our scaling policy looks like this:
scaling "cluster_policy" {
enabled = true
min = 3
max = ...
policy {
cooldown = "2m"
evaluation_interval = "1m"
on_check_error = "fail"
check "cpu_allocated_percentage" {
source = "nomad-apm"
query = "percentage-allocated_cpu"
strategy "target-value" {
target = 70
}
}
check "mem_allocated_percentage" {
source = "nomad-apm"
query = "percentage-allocated_memory"
strategy "target-value" {
target = 70
}
}
}
}
We are thus using the default threshold value (0.01) for the Target Value Strategy plugin.
As such, I understand the Nomad Autoscaler will scale up when the factor rises above 1.01 (1 + 0.01) and scale down when it falls below 0.99 (1 - 0.01).
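In other words (if I read the plugin options correctly), our strategy block is equivalent to spelling the default out explicitly:

```hcl
strategy "target-value" {
  target    = 70
  # threshold defaults to 0.01, i.e. no scaling while the factor stays within [0.99, 1.01];
  # written out here only to make the band explicit
  threshold = 0.01
}
```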
We noticed in our logs that the Autoscaler “quickly” scaled back down just 3 minutes after scaling up:
```
2023-03-18T01:00:35.690Z [INFO] policy_eval.worker: scaling target: id=xxx policy_id=xxx queue=cluster target=aws-asg from=3 to=4 reason="scaling up because factor is 1.311829" meta=map[nomad_policy_id:xxx]
2023-03-18T01:00:46.210Z [INFO] internal_plugin.aws-asg: successfully performed and verified scaling out: action=scale_out asg_name=xxx desired_count=4
2023-03-18T01:03:35.789Z [INFO] policy_eval.worker: scaling target: id=xxx policy_id=xxx queue=cluster target=aws-asg from=4 to=3 reason="capped count from 1 to 3 to stay within limits" meta="map[nomad_autoscaler.count.capped:true nomad_autoscaler.count.original:1 nomad_autoscaler.reason_history:[scaling down because factor is 0.215126] nomad_policy_id:xxx]"
```
I understand that, due to the cooldown (2m) in our policy, policy evaluations are suspended until 2 minutes have passed.
The scale-down then happened in the 3rd minute, i.e. 1 minute after the cooldown ended, since our evaluation_interval is set to 1m.
I also understand the scale-down happened because our factor was calculated to be 0.215.
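(If I read the target-value strategy correctly, the desired count is roughly the current count multiplied by the factor, so 4 × 0.215 ≈ 0.86, which rounds up to 1 and is then capped to our min of 3; that would match the "capped count from 1 to 3" message in the log.)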
At this point, everything looks as expected; here is the timeline as I visualize it:
| Time | Event | Remarks |
|---|---|---|
| 2023-03-18T01:00:35 | Scaling up from 3 → 4 due to factor 1.311 | allocated CPU and/or memory is at ~92% (factor = 91.8 / 70 = 1.311) |
| 2023-03-18T01:00:46 | Successfully scaled out | - |
| 2023-03-18T01:02:35 | Cooldown (2m) ends | cooldown = 2m |
| 2023-03-18T01:03:35 | Scaling down from 4 → 3 (min) due to factor 0.215 | allocated CPU and/or memory is at ~15% (factor = 15.05 / 70 = 0.215) |
My question: why did the factor drop so drastically, from 1.311 → 0.215, by the very next evaluation?
I understand we are using the default query_window of 1m.
If I understand correctly, the drop happened because the results over that 1m query_window showed low CPU / memory allocated.
My current (uninformed) hypothesis is that:
- the new (4th) Nomad node that was spun up has little to no CPU and memory allocated, which drags the cluster-wide allocation metrics down (rough arithmetic below)
Would that be plausible?
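For rough intuition (my own back-of-the-envelope arithmetic, assuming equally sized nodes and otherwise unchanged allocations): going from 3 to 4 nodes grows the allocatable capacity by a third while the allocated amount stays the same, so a ~92% reading would drop to roughly 92 × 3/4 ≈ 69% from the extra capacity alone.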
As a possible solution, would increasing the query_window to something > 1m (say 5m) be a better way to evaluate the CPU & memory metrics?
I am hoping I understand correctly that increasing this query_window can help reduce the “aggressiveness” of the scaling here.
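For concreteness, here is a sketch of what I have in mind, with query_window set per check (please correct me if this is not the right place for it):

```hcl
check "cpu_allocated_percentage" {
  source = "nomad-apm"
  query  = "percentage-allocated_cpu"

  # default is 1m; widening the window to smooth out short-lived dips
  query_window = "5m"

  strategy "target-value" {
    target = 70
  }
}
```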
Thank you so much folks!