Hi folks!
We observed our Nomad Autoscaler (v0.3.5) scaling down unexpectedly during a brief period.
We also saw that right before each scale-down, there were warning messages: "failed to ACK policy evaluation".
A snippet of our logs can be found here:
2023-01-12T16:13:33.420Z [WARN] policy_eval.worker: failed to ACK policy evaluation: eval_id=<redacted> eval_token=<redacted> id=<redacted> policy_id=<redacted> queue=cluster error="evaluation ID not found"
2023-01-12T16:15:34.302Z [INFO] policy_eval.worker: scaling target: id=<redacted> queue=cluster target=aws-asg from=16 to=14 reason="scaling down because factor is 0.866009" meta=map[nomad_policy_id:<redacted>]
...
2023-01-12T16:22:12.792Z [WARN] policy_eval.worker: failed to ACK policy evaluation: eval_id=<redacted> eval_token=<redacted> id=<redacted> policy_id=<redacted> queue=cluster error="evaluation ID not found"
2023-01-12T16:24:13.795Z [INFO] policy_eval.worker: scaling target: id=<redacted> policy_id=<redacted> queue=cluster target=aws-asg from=14 to=13 reason="scaling down because factor is 0.862636" meta=map[nomad_policy_id:<redacted>]
Our current hypothesis is that:
- this "failed to ACK policy evaluation" error causes checks to fail, and in v0.3.5 a failed check results in the Autoscaler scaling down (discussed here in a GitHub issue). This behavior predates the on_error configuration added in v0.3.6 (GitHub PR).
However, our understanding of how the Autoscaler works under the hood is limited. Is this hypothesis correct?
If it is, what is the significance of the "failed to ACK policy evaluation" error, and how can one prevent it?
Would you suggest upgrading to the latest release (v0.3.7, for instance)?
Some details:
- Nomad Autoscaler version: v0.3.5
- scaling policy:
policy {
  cooldown            = "2m"
  evaluation_interval = "1m"

  check "cpu_allocated_percentage" {
    source = "nomad-apm"
    query  = "percentage-allocated_cpu"

    strategy "target-value" {
      target = 70
    }
  }

  check "mem_allocated_percentage" {
    source = "nomad-apm"
    query  = "percentage-allocated_memory"

    strategy "target-value" {
      target = 70
    }
  }
}
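If the hypothesis above holds, this is roughly what we would try after upgrading. It is an untested sketch of the same policy, assuming that v0.3.6+ accepts an on_error attribute on each check and an on_check_error attribute on the policy, and that "ignore" is a valid value for both (the alternative being "fail"). That is only our reading of the change, so please correct us if the names or values are wrong.

```hcl
policy {
  cooldown            = "2m"
  evaluation_interval = "1m"

  # Assumption: policy-wide default for checks that return an error.
  # "ignore" is meant to skip the failed check instead of acting on it.
  on_check_error = "ignore"

  check "cpu_allocated_percentage" {
    source = "nomad-apm"
    query  = "percentage-allocated_cpu"

    # Assumption: per-check override of the policy-wide default.
    on_error = "ignore"

    strategy "target-value" {
      target = 70
    }
  }

  check "mem_allocated_percentage" {
    source = "nomad-apm"
    query  = "percentage-allocated_memory"

    # Assumption: same per-check setting as the CPU check.
    on_error = "ignore"

    strategy "target-value" {
      target = 70
    }
  }
}
```

We are unsure whether ignoring errors on both checks is wise, since the policy would then have no usable check result if both fail at once. Setting it on only one check, or using "fail" to halt the evaluation, may be the safer choice.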
Thank you folks!