Clarification on the nomad.client.allocs.cpu.total_percent metric for Docker driver

tyler-domitrovich · November 3, 2020, 10:17pm

Hello,

Recently I have been experimenting with the Nomad Autoscaler and encountered some unexpected behavior around CPU autoscaling. I deployed a test job with a scaling policy that used the Nomad APM plugin and a target value set to 80. I then ran a load test against my test job, expecting the autoscaler to scale up the number of allocations since the Nomad UI reported that allocations in my test job were using an average of 1500/500 MHz of CPU, but no scaling actions occurred. Autoscaler logs revealed the CPU usage percentage never broke 40% despite the Nomad UI showing 300% utilization.

After looking through the autoscaler docs I noticed that the nomad APM plugin uses the nomad.client.allocs.cpu.total_percent metric for CPU. For the docker driver, which I am using for tests, this metric appears to give the percentage of the host’s system CPU usage the allocation’s container used. This is the same approach used by the docker stats command. In other words, this metric shows the container’s usage of the host’s total resources instead of just the resources it is scheduled for.
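To make the difference concrete, here is a rough sketch of the two ways the same usage can be expressed as a percentage. The 1500 MHz and 500 MHz figures are the ones from my test job; the 4000 MHz host capacity is purely an assumed value for illustration, not something I measured, and the function names are my own.

```python
def percent_of_allocation(used_mhz: float, allocated_mhz: float) -> float:
    """Usage relative to what the allocation was scheduled for (what the UI shows)."""
    return 100.0 * used_mhz / allocated_mhz


def percent_of_host(used_mhz: float, host_total_mhz: float) -> float:
    """Usage relative to the whole host (how total_percent appears to behave)."""
    return 100.0 * used_mhz / host_total_mhz


used_mhz = 1500.0        # from the Nomad UI during the load test
allocated_mhz = 500.0    # CPU resources requested by the task
host_total_mhz = 4000.0  # ASSUMPTION: hypothetical client capacity

print(percent_of_allocation(used_mhz, allocated_mhz))  # 300.0
print(percent_of_host(used_mhz, host_total_mhz))       # 37.5
```

With those numbers the host-relative value stays under 40%, so a target of 80 would never be reached even though the allocation is at 300% of its reservation.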

My question is, is my understanding of this metric as it relates to the docker driver accurate? If so, it does not seem like an ideal metric to scale off of since it will differ based on the hardware configuration of the client the allocation is scheduled to.

lgfa29 · November 5, 2020, 2:40am

Hi @tyler-domitrovich

Yes, I think the percentage is a factor of the total host CPU, not the allocated CPU value, which is what the UI displays.

The Nomad APM plugin in the Autoscaler is pretty bare bones and not really the best for real usage. I would recommend trying Prometheus if possible. You should get the same result as the UI using a query like:

nomad_client_allocs_cpu_total_ticks/nomad_client_allocs_cpu_allocated

You might need to filter by job and group and sum the value for all tasks and allocations.
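As a rough illustration of that "sum, then divide" step, here is a small Python sketch that works on already-fetched samples. The dictionary keys (`job`, `group`, `ticks_mhz`, `allocated_mhz`) and the sample values are made up for the example and are not the actual label names Nomad emits; check your own metrics for those.

```python
from collections import defaultdict


def group_utilization(samples):
    """Return CPU utilization (%) per (job, group), summed over tasks and allocations."""
    totals = defaultdict(lambda: [0.0, 0.0])  # key -> [ticks sum, allocated sum]
    for s in samples:
        key = (s["job"], s["group"])
        totals[key][0] += s["ticks_mhz"]
        totals[key][1] += s["allocated_mhz"]
    # Divide the sums rather than averaging per-task ratios, and skip
    # groups with no allocated CPU to avoid dividing by zero.
    return {k: 100.0 * t / a for k, (t, a) in totals.items() if a > 0}


# Hypothetical samples: one entry per task per allocation.
samples = [
    {"job": "web", "group": "app", "ticks_mhz": 1500.0, "allocated_mhz": 500.0},
    {"job": "web", "group": "app", "ticks_mhz": 900.0, "allocated_mhz": 500.0},
    {"job": "batch", "group": "worker", "ticks_mhz": 100.0, "allocated_mhz": 400.0},
]

result = group_utilization(samples)
print(result[("web", "app")])       # 240.0
print(result[("batch", "worker")])  # 25.0
```

Summing both sides before dividing gives a group-wide utilization that is weighted by how much CPU each task reserved, which is usually what a scaling target wants.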

But I think you make a good point and it’s something we should be able to support in the Nomad target. I went ahead and created an issue in our repo to track this.

tyler-domitrovich · November 5, 2020, 10:47pm

Awesome, thanks for confirming and opening that issue! In the meantime calculating the utilization with the above approach will work great for my use case. I am not currently using prometheus, but I am using Datadog and it seems I should be able to get the metrics I need using the datadog-apm plugin.

glennschmidt · November 23, 2020, 3:03am

Just trying to understand these metrics better - according to https://www.nomadproject.io/docs/operations/metrics, nomad_client_allocs_cpu_total_ticks is an integer while nomad_client_allocs_cpu_allocated is a percentage. If they are different units, doesn’t that mean we can’t compare them like you suggest?

Also it doesn’t say what nomad_client_allocs_cpu_allocated is a percentage of…presumably the total CPU available on the host?

lgfa29 · November 23, 2020, 3:28pm

Thank you for pointing this out. The doc page is actually wrong: the unit for nomad_client_allocs_cpu_allocated is MHz, so it’s the same as nomad_client_allocs_cpu_total_ticks.

I opened a PR to correct this.
