After upgrading Nomad from v0.10.5 to 1.6.1, I've recently seen messages like the following on several hosts (Nomad clients):
2023-08-15T14:17:12.275Z [ERROR] agent: Attempting to increment Prometheus counter nomad_client_host_cpu_total_ticks_count with value negative value -2242.5742
I also sometimes get output similar to
Host Resource Utilization
CPU               Memory           Disk
-472/120000 MHz   6.8 GiB/376 GiB  (/dev/mapper/encryptedvol)
when running
nomad node status -self
Has anyone experienced similar behaviour?
It doesn’t seem to matter whether there are workloads on the host, as our monitoring shows these error messages across several hosts.
All hosts are running
# uname -a
Linux <hostname> 4.19.0-20-amd64 #1 SMP Debian 4.19.235-1 (2022-03-17) x86_64 GNU/Linux
I traced through Nomad’s code at tag 1.6.1 and found that these values come from reads of /proc/stat on Linux, with some code that calculates percentages from the change in jiffies (https://github.com/hashicorp/nomad/blob/515895c7690cdc72278018dc5dc58aca41204ccc/client/stats/cpu.go#L133). This code has been moved in a recent commit, but I believe it is functionally the same.
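To illustrate the kind of calculation I mean, here is a minimal Python sketch. It is not Nomad's actual code: the function names are my own, and treating everything except the idle column as "busy" is an assumption made for illustration. It shows how a percentage derived from the change in cumulative jiffies could come out negative if any counter in /proc/stat goes backwards between two samples.

```python
# Illustrative sketch only; NOT Nomad's implementation.
# Column order follows the "cpu" lines of /proc/stat as described in proc(5).
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Parse one 'cpu...' line of /proc/stat into (name, {field: jiffies})."""
    parts = line.split()
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return parts[0], dict(zip(FIELDS, values))


def read_proc_stat(path="/proc/stat"):
    """Return {cpu_name: {field: jiffies}} for every 'cpu' line (Linux only)."""
    with open(path) as f:
        return dict(parse_cpu_line(l) for l in f if l.startswith("cpu"))


def busy_percent(prev, cur):
    """Busy share of the interval, from the change in cumulative jiffies.

    Assumption: "busy" means every column except idle.
    """
    delta_total = sum(cur.values()) - sum(prev.values())
    delta_idle = cur["idle"] - prev["idle"]
    if delta_total == 0:
        return 0.0
    return 100.0 * (delta_total - delta_idle) / delta_total


def decreased_fields(prev, cur):
    """Names of the counters that went backwards between two samples."""
    return [name for name in FIELDS if cur[name] < prev[name]]
```

Calling read_proc_stat() twice a few seconds apart on an affected host and passing each CPU's two samples to decreased_fields() should show whether any counter is actually decreasing there, which would at least narrow down where the negative values originate.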
Any thoughts or suggestions are welcome!