Why did Nomad Autoscaler (v0.3.5) scale down unexpectedly?

Hi folks!

We observed our Nomad Autoscaler (v0.3.5) scaling down unexpectedly during a brief period.
Each time, roughly two minutes before the scale-down, the logs showed a warning: failed to ACK policy evaluation.

Here is a snippet of our logs:

2023-01-12T16:13:33.420Z [WARN]  policy_eval.worker: failed to ACK policy evaluation: eval_id=<redacted> eval_token=<redacted> id=<redacted> policy_id=<redacted> queue=cluster error="evaluation ID not found"
2023-01-12T16:15:34.302Z [INFO]  policy_eval.worker: scaling target: id=<redacted> queue=cluster target=aws-asg from=16 to=14 reason="scaling down because factor is 0.866009" meta=map[nomad_policy_id:<redacted>]

...

2023-01-12T16:22:12.792Z [WARN]  policy_eval.worker: failed to ACK policy evaluation: eval_id=<redacted> eval_token=<redacted> id=<redacted> policy_id=<redacted> queue=cluster error="evaluation ID not found"
2023-01-12T16:24:13.795Z [INFO]  policy_eval.worker: scaling target: id=<redacted> policy_id=<redacted> queue=cluster target=aws-asg from=14 to=13 reason="scaling down because factor is 0.862636" meta=map[nomad_policy_id:<redacted>]
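As a sanity check on the numbers in those two scaling lines, here is a rough sketch of the arithmetic. It assumes (we have not confirmed this against the Autoscaler source) that the target-value strategy computes factor = metric / target and rounds current count × factor up to the next integer. The function name `target_value_count` is ours, purely for illustration.

```python
import math

TARGET = 70  # the "target" value from our policy


def target_value_count(current: int, factor: float) -> int:
    """Assumed target-value behaviour: new count = ceil(current * factor)."""
    return math.ceil(current * factor)


# (from, factor, to) copied from the two "scaling target" log lines
events = [(16, 0.866009, 14), (14, 0.862636, 13)]

for current, factor, logged_to in events:
    computed = target_value_count(current, factor)
    # Assumes factor = metric / target, so metric = factor * target
    implied_metric = factor * TARGET
    print(f"from={current} factor={factor} computed={computed} "
          f"logged={logged_to} implied_metric={implied_metric:.2f}")
```

Under those assumptions both scale-downs are consistent with a check reporting roughly 60% allocation against our target of 70, so the counts themselves look like a normal target-value result rather than an arbitrary value.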

Our current hypothesis is:

  • the failed to ACK policy evaluation error causes the policy's checks to fail
  • the failed checks in turn cause the Autoscaler to scale down; this is the v0.3.5 behaviour (discussed here in a GitHub issue), which predates the on_error configuration added in v0.3.6 (GitHub PR)

However, we have only a limited understanding of how the Autoscaler works under the hood. Is this hypothesis correct?

If it is correct, what is the significance of the failed to ACK policy evaluation error, and how can we prevent it?
Would you suggest upgrading to the latest version (v0.3.7)?

Some details:

  • Nomad Autoscaler version: v0.3.5
  • scaling policy:
      policy {
        cooldown = "2m"
        evaluation_interval = "1m"
        check "cpu_allocated_percentage" {
          source = "nomad-apm"
          query  = "percentage-allocated_cpu"
          strategy "target-value" {
            target = 70
          }
        }
        check "mem_allocated_percentage" {
          source = "nomad-apm"
          query  = "percentage-allocated_memory"
          strategy "target-value" {
            target = 70
          }
        }
      }  
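To make our hypothesis concrete, here is a minimal sketch of how we understand the two checks in this policy to be combined. It assumes the behaviour described in the docs as we read them: with multiple checks, the Autoscaler picks the safest result (scale up over no action over scale down, and the higher count when directions match). The `CheckResult` type, the `pick_safest` helper, and the example directions and counts are all hypothetical; they are not taken from our logs or from the Autoscaler code.

```python
from dataclasses import dataclass

# Assumed safety ranking: scaling up beats doing nothing beats scaling down
RANK = {"up": 2, "none": 1, "down": 0}


@dataclass
class CheckResult:
    name: str
    direction: str  # "up", "none", or "down"
    count: int      # node count this check would ask for


def pick_safest(results):
    """Return the result with the safest direction, then the highest count."""
    return max(results, key=lambda r: (RANK[r.direction], r.count))


# Hypothetical values: CPU asks to scale down, memory asks for no change
both = [
    CheckResult("cpu_allocated_percentage", "down", 14),
    CheckResult("mem_allocated_percentage", "none", 16),
]
print(pick_safest(both).name)

# If one check failed and was simply dropped, the other would decide alone
only_cpu = both[:1]
print(pick_safest(only_cpu).count)
```

If a failed check is silently skipped like this in v0.3.5, a single remaining check asking for a scale-down would win by default, which is the situation we suspect we hit.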

Thank you folks! :bowing_man: