Recently I have been experimenting with the Nomad Autoscaler and encountered some unexpected behavior around CPU autoscaling. I deployed a test job with a scaling policy that used the Nomad APM plugin and a target value of 80. I then ran a load test against the job, expecting the autoscaler to scale up the number of allocations, since the Nomad UI reported that the job's allocations were using an average of 1500/500 MHz of CPU. No scaling actions occurred. The autoscaler logs showed that the CPU usage percentage never broke 40%, despite the Nomad UI showing 300% utilization.
After looking through the autoscaler docs I noticed that the Nomad APM plugin uses the nomad.client.allocs.cpu.total_percent metric for CPU. For the docker driver, which I am using for tests, this metric appears to report the allocation's container CPU usage as a percentage of the host's system CPU usage. This is the same approach used by the docker stats command. In other words, this metric shows the container's usage relative to the host's total resources rather than relative to the resources it was scheduled with.
My question is: is my understanding of this metric, as it relates to the docker driver, accurate? If so, it does not seem like an ideal metric to scale on, since its value will differ based on the hardware configuration of the client the allocation is scheduled on.
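To make the concern concrete, here is a rough sketch of the arithmetic using the 1500/500 MHz figures from my test. The host capacities and client names are made up for illustration, and the formula is a simplification: the docker driver's actual calculation may differ (for example, in how it accounts for multiple cores).

```python
# Simplified illustration only: compares "percent of allocation" with
# "percent of host" for the same absolute CPU usage.
# The host capacities below are hypothetical, not measured values.

def percent_of_allocation(used_mhz: float, allocated_mhz: float) -> float:
    """CPU usage relative to what the allocation was scheduled with."""
    return 100.0 * used_mhz / allocated_mhz


def percent_of_host(used_mhz: float, host_total_mhz: float) -> float:
    """CPU usage relative to the whole client's capacity."""
    return 100.0 * used_mhz / host_total_mhz


used_mhz = 1500.0       # usage reported by the Nomad UI in my test
allocated_mhz = 500.0   # CPU reserved for the task in the job spec

# Hypothetical clients with different total CPU capacity (MHz).
hypothetical_hosts = {"small-client": 4000.0, "large-client": 16000.0}

alloc_pct = percent_of_allocation(used_mhz, allocated_mhz)
host_pcts = {
    name: percent_of_host(used_mhz, total_mhz)
    for name, total_mhz in hypothetical_hosts.items()
}

print(f"relative to allocation: {alloc_pct:.1f}%")
for name, pct in host_pcts.items():
    print(f"relative to {name}: {pct:.1f}%")
```

With these made-up capacities the same workload is at 300% of its allocation but well under a target of 80 on either client, and the value changes depending on which client it lands on.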
Yes, I think the percentage is relative to the total host CPU, not to the allocated CPU value, which is what the UI displays.
The Nomad APM plugin in the Autoscaler is pretty bare-bones and not really the best for real usage. I would recommend trying to use Prometheus if possible. You should get the same result as the UI using a query like:
Awesome, thanks for confirming and opening that issue! In the meantime, calculating the utilization with the above approach will work great for my use case. I am not currently using Prometheus, but I am using Datadog, and it seems I should be able to get the metrics I need using the datadog-apm plugin.
Just trying to understand these metrics better - according to https://www.nomadproject.io/docs/operations/metrics nomad_client_allocs_cpu_total_ticks is an integer while nomad_client_allocs_cpu_allocated is a percentage. If they are different units, doesn’t that mean we can’t compare them like you suggest?
Also it doesn’t say what nomad_client_allocs_cpu_allocated is a percentage of…presumably the total CPU available on the host?