Hi,
We are trying to find the best approach to scaling our client nodes based on CPU utilization and allocation, and we are having trouble setting proper bounds for this scaling. We are looking into using the nomad_nomad_job_summary_queued metric, but we see it doesn't carry any data about the node_class the queued allocations belong to.
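For context, this is roughly the kind of check we had in mind for that metric. The check name, the sum() aggregation and the bound/delta values are only a sketch on our side, and since the metric has no node_class label the sum can only be cluster-wide:

check "queued-allocations" {
  source = "prometheus"
  # Cluster-wide count of queued allocations; we cannot narrow this down
  # to a single node_class because the metric does not carry that label.
  query  = "sum(nomad_nomad_job_summary_queued)"

  strategy "threshold" {
    # Scale out by one node whenever at least one allocation is queued.
    lower_bound = 1
    delta       = 1
  }
}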
Right now we have this basic scaling rule:
check "high-cpu-allocated" {
group = "cpu-allocated"
source = "prometheus"
query = "${cpu_query_allocated}"
strategy "threshold" {
lower_bound = 80
delta = 2
}
}
check "low-cpu-allocated" {
group = "cpu-allocated"
source = "prometheus"
query = "${cpu_query_allocated}"
strategy "threshold" {
upper_bound = 70
lower_bound = 60
delta = -1
}
}
Our problem is when the allocated CPU ends up at 79.8 or 79.9. We understand this is below the lower bound of the scale-out check, so it will not scale, but we are trying to find a better way to calculate the allocated CPU for each deployed job so that we do not end up in this odd state. When we get there, we start to see a lot of allocations stuck in pending because there is no free CPU left to allocate, and at the same time the policy does not scale out because the value sits between the bounds of the two checks.
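One workaround we considered is simply narrowing that dead zone, e.g. lowering the scale-out bound so capacity is added before the nodes are completely packed (the 75 below is only an example value we have not validated):

check "high-cpu-allocated" {
  group  = "cpu-allocated"
  source = "prometheus"
  query  = "${cpu_query_allocated}"

  strategy "threshold" {
    # Trigger scale-out earlier, before allocated CPU gets close to the
    # point where new allocations start to queue.
    lower_bound = 75
    delta       = 2
  }
}

But this feels like it only moves the threshold around rather than fixing the underlying issue, which is why we were hoping the queued metric could act as a backstop.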
Are there any other metrics we could use to improve this query, or should we follow a different approach?
Thanks,