Recently I have been experimenting with the Nomad Autoscaler and encountered some unexpected behavior around CPU autoscaling. I deployed a test job with a scaling policy that used the Nomad APM plugin and a target value of 80, then ran a load test against the job. I expected the autoscaler to scale up the number of allocations, since the Nomad UI reported that allocations in the job were using an average of 1500/500 MHz of CPU, but no scaling actions occurred. The autoscaler logs revealed that the CPU usage percentage never broke 40%, even though the Nomad UI showed 300% utilization.
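For context, the scaling policy I tested was roughly like the following. This is a sketch from memory rather than my exact config; the `min`/`max` bounds and the check name are placeholders, but the `nomad-apm` source, `avg_cpu` query, and target of 80 match what I described above:

```hcl
scaling {
  enabled = true
  min     = 1
  max     = 10

  policy {
    check "avg_cpu" {
      source = "nomad-apm"
      query  = "avg_cpu"

      strategy "target-value" {
        target = 80
      }
    }
  }
}
```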
After looking through the autoscaler docs, I noticed that the Nomad APM plugin uses the `nomad.client.allocs.cpu.total_percent` metric for CPU. For the Docker driver, which I am using in my tests, this metric appears to report the percentage of the host's total CPU that the allocation's container used, which is the same approach the `docker stats` command takes. In other words, the metric reflects the container's usage of the host's total resources rather than of just the resources the allocation was scheduled with.
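To make the discrepancy concrete, here is a small sketch of the two ways of computing the percentage. The host capacity of 4000 MHz is an assumption I am making for illustration; the 1500 MHz usage and 500 MHz reservation are the numbers from my test job above:

```python
# Assumed host capacity for illustration: 4 cores at 1000 MHz each.
host_total_mhz = 4 * 1000

alloc_reserved_mhz = 500   # cpu = 500 in the task's resources stanza
alloc_used_mhz = 1500      # usage reported by the Nomad UI

# What the Nomad UI shows: usage relative to the allocation's reservation.
pct_of_reservation = alloc_used_mhz / alloc_reserved_mhz * 100

# What nomad.client.allocs.cpu.total_percent reports, per my reading:
# usage relative to the host's total CPU capacity.
pct_of_host = alloc_used_mhz / host_total_mhz * 100

print(pct_of_reservation)  # 300.0 -- matches the "300%" in the UI
print(pct_of_host)         # 37.5  -- consistent with the autoscaler never seeing >40%
```

On a beefier host, say 16000 MHz total, the same 1500 MHz of usage would report as only ~9%, which is what worries me about scaling on this metric.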
My question is: is my understanding of this metric, as it relates to the Docker driver, accurate? If so, it does not seem like an ideal metric to scale on, since the same workload will report a different percentage depending on the hardware of the client the allocation happens to be scheduled on.