Clarifying Nomad Autoscaler's Target Value Strategy Plugin behaviour

Hi folks! :wave:

We are running Nomad Autoscaler (v0.3.5) with the Target Value strategy plugin as part of our scaling policy.

In particular, our scaling policy looks like this:

    scaling "cluster_policy" {
      enabled = true
      min = 3
      max = ...

      policy {
        cooldown = "2m"
        evaluation_interval = "1m"
        on_check_error = "fail"
        check "cpu_allocated_percentage" {
          source = "nomad-apm"
          query  = "percentage-allocated_cpu"
          strategy "target-value" {
            target = 70
          }
        }
        check "mem_allocated_percentage" {
          source = "nomad-apm"
          query = "percentage-allocated_memory"
          strategy "target-value" {
            target = 70
          }
        }
      }
    }

We are thus using the default threshold value (0.01) for this Target Value Strategy plugin.

As such, I understand the Nomad Autoscaler will only scale up if the factor is above 1.01 (1 + 0.01), or scale down if the factor is below 0.99 (1 - 0.01); a factor within that band leaves the count unchanged.
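
For what it's worth, if I'm reading the plugin docs correctly, we could also set that threshold explicitly in a check block, e.g. (with 0.01 being the documented default):

    check "cpu_allocated_percentage" {
      source = "nomad-apm"
      query  = "percentage-allocated_cpu"
      strategy "target-value" {
        target    = 70
        # default value; my understanding is that no scaling action is taken
        # while the factor stays within 1 ± threshold
        threshold = 0.01
      }
    }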

We noticed in our logs that the Autoscaler scaled back down just 3 minutes after scaling up:

    2023-03-18T01:00:35.690Z [INFO]  policy_eval.worker: scaling target: id=xxx policy_id=xxx queue=cluster target=aws-asg from=3 to=4 reason="scaling up because factor is 1.311829" meta=map[nomad_policy_id:xxx]
    2023-03-18T01:00:46.210Z [INFO]  internal_plugin.aws-asg: successfully performed and verified scaling out: action=scale_out asg_name=xxx desired_count=4
    2023-03-18T01:03:35.789Z [INFO]  policy_eval.worker: scaling target: id=xxx policy_id=xxx queue=cluster target=aws-asg from=4 to=3 reason="capped count from 1 to 3 to stay within limits" meta="map[nomad_autoscaler.count.capped:true nomad_autoscaler.count.original:1 nomad_autoscaler.reason_history:[scaling down because factor is 0.215126] nomad_policy_id:xxx]"

I understand that, due to the cooldown (2m) in our policy, policy evaluations were suspended until 2 minutes had passed.
In the 3rd minute I can see the scale-down happened, since that is 1 minute after the cooldown ended and our evaluation_interval is set to 1m.

I also understand that the scale-down happened because our factor was calculated to be 0.215.

At this point, everything looks as expected. Here is the timeline as I visualize it:

| Time | Event | Remarks |
| --- | --- | --- |
| 2023-03-18T01:00:35 | Scaling up from 3 → 4 due to factor 1.311 | Allocated CPU and/or memory is at ~92% (factor = 91.8 / 70 = 1.311) |
| 2023-03-18T01:00:46 | Successfully scaled out | - |
| 2023-03-18T01:02:35 | Cooldown ends | cooldown = 2m |
| 2023-03-18T01:03:35 | Scaling down from 4 → 3 (min) due to factor 0.215 | Allocated CPU and/or memory is at ~15% (factor = 15.05 / 70 = 0.215) |

My question: why did the factor drop so drastically, from 1.311 → 0.215, by the next evaluation?

I understand we are using the default query_window of 1m.

If I understand correctly, this is because the results from the query_window (1m) showed low CPU / memory allocated.

My current (uninformed) hypothesis is:

  • The newly spun-up (4th) Nomad node has little to no CPU and memory allocated, which drags the cluster-wide allocation percentages down.

Would that be plausible? I tried a rough back-of-envelope check below.
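
To sanity-check that, here is a rough calculation. I am assuming the percentage-allocated queries report total allocated over total allocatable resources across all eligible nodes, that all 4 nodes are identically sized, and that the existing allocations did not change between the two evaluations:

    before scale-out: ~91.8% allocated across 3 nodes            ->  factor = 91.8 / 70 ≈ 1.31
    after an empty 4th node joins (same allocations, 4/3 the capacity):
                       91.8 × 3/4 ≈ 68.9% allocated              ->  factor = 68.9 / 70 ≈ 0.98

If those assumptions hold, dilution from the empty node alone would only bring the factor down to roughly 0.98, so the observed 0.215 (~15% allocated) makes me think the allocations themselves may also have dropped between the two evaluations, though I could be misunderstanding how these queries are computed.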

As a possible solution, would increasing the query_window to something > 1m (say, 5m) be better for evaluating the CPU & memory metrics?
I am hoping I understand correctly that increasing this query_window would help reduce the “aggressiveness” of the scaling here, along the lines of the sketch below.
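
For concreteness, a sketch of what I have in mind (only the CPU check is shown; query_window is the only change from our current policy, and 5m is just a first guess on my part):

    check "cpu_allocated_percentage" {
      source       = "nomad-apm"
      query        = "percentage-allocated_cpu"
      # widen the metrics window from the default 1m; 5m is an arbitrary starting point
      query_window = "5m"
      strategy "target-value" {
        target = 70
      }
    }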

Thank you so much folks!