Clarifications on nomad metrics values

AAverin · April 11, 2024, 8:04am

I am looking into Nomad metrics and so far I find them quite confusing.
Here are my questions:

1. What is the difference between “system space” and “user space”, specifically metrics nomad.client.allocs.cpu.user and nomad.client.allocs.cpu.system? How do I know if my job is in system or in user? Do I need to sum them to get the CPU consumption by allocation?
2. nomad.client.allocs.cpu.allocated is in MHz, but all other allocs.cpu metrics are in Percentage, according to the documentation. How can I build an alarm that triggers when allocation CPU usage crosses available allocation CPU? This would be an indicator that a task consumes more CPU than it was given.
3. I was expecting that nomad.client.host.cpu.total_percent = nomad.client.host.cpu.system + nomad.client.host.cpu.user, but on my single-core CPU it is never the case. Why is it like that? What am I missing?
4. I expected that the sum of nomad_client_allocs_memory_usage over all allocations would equal nomad_client_host_memory_used, but it is also never the case. If I add the allocated memory per allocation to the calculation, it also doesn’t add up really.
5. What metric can I use to see nomad internal client/host processes CPU/memory usage?

AAverin · April 11, 2024, 8:22am

A more specific example: my grafanaagent task is constantly consuming more than it was given. I want to see the same in Grafana and have an alert for such cases.

Running the query

nomad_client_allocs_cpu_system{instance="Monitoring", task="grafanaagent"} + nomad_client_allocs_cpu_user{instance="Monitoring", task="grafanaagent"}

gives me a result that doesn’t match the Nomad UI.

AAverin · April 11, 2024, 8:42am

Another example: [screenshot] vs [screenshot]

How does this work?

Kamilcuk · April 23, 2024, 9:01am

hi!

what is the difference between “system space” and “user space”, specifically metric

The difference is the same as on Linux. See, for example, “User CPU time vs System CPU time?” on Stack Overflow. Research Linux CPU usage metrics.
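To make the user/system split concrete, here is a minimal Python sketch (not Nomad code; the helper names cpu_time_delta, busy_loop and many_syscalls are made up for illustration). It measures how much user and system CPU time the current process accrues: pure computation shows up mostly as user time, while work that enters the kernel also adds system time. A process does not live "in" one or the other, and its total CPU time at the OS level is the sum of both.

```python
import os


def cpu_time_delta(work):
    """Run work() and return the (user, system) CPU seconds this process used."""
    before = os.times()
    work()
    after = os.times()
    return after.user - before.user, after.system - before.system


def busy_loop():
    # Pure computation: executes in user space.
    total = 0
    for i in range(2_000_000):
        total += i * i


def many_syscalls():
    # Every os.stat() call enters the kernel, so this also accrues system time.
    for _ in range(50_000):
        os.stat(".")


if __name__ == "__main__":
    for name, work in (("busy_loop", busy_loop), ("many_syscalls", many_syscalls)):
        user, system = cpu_time_delta(work)
        print(f"{name}: user={user:.3f}s system={system:.3f}s total={user + system:.3f}s")
```

The exact numbers depend on the machine, so none are shown here.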

but all other allocs.cpu metrics are in Percentage

What about nomad.client.allocs.cpu.total_ticks?

How can I build an alarm that triggers when allocation CPU usage crosses available allocation CPU?

I use Prometheus and alert when the following expression is greater than 100 (percent):

nomad_client_allocs_cpu_total_ticks{namespace=~"$namespace",instance=~"$client",exported_job=~${job:doublequote},task_group=~"$group",task=~"$task",alloc_id=~"$alloc_id"} * 100
/
nomad_client_allocs_cpu_allocated{namespace=~"$namespace",instance=~"$client",exported_job=~${job:doublequote},task_group=~"$group",task=~"$task",alloc_id=~"$alloc_id"}
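The arithmetic behind that expression, as a small Python sketch. It assumes, as the query implies, that total_ticks and allocated are in the same unit (MHz); the function names and the sample numbers are hypothetical, not taken from a real cluster.

```python
def cpu_usage_percent(total_ticks_mhz: float, allocated_mhz: float) -> float:
    """CPU usage as a percentage of the allocation: total_ticks * 100 / allocated."""
    if allocated_mhz <= 0:
        raise ValueError("allocated_mhz must be positive")
    return total_ticks_mhz * 100 / allocated_mhz


def over_allocation(total_ticks_mhz: float, allocated_mhz: float,
                    threshold_percent: float = 100.0) -> bool:
    """True when the task uses more CPU than it was allocated."""
    return cpu_usage_percent(total_ticks_mhz, allocated_mhz) > threshold_percent


if __name__ == "__main__":
    # Hypothetical sample: a task allocated 100 MHz that is using 250 MHz.
    print(cpu_usage_percent(250, 100))  # 250.0
    print(over_allocation(250, 100))    # True
    print(over_allocation(80, 100))     # False
```

In an alert you would normally also require the condition to hold for some time, so a short spike does not fire it.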

Why is it like that? What am I missing?

See Linux CPU usage metrics. This is not specific to Nomad. See man proc, in particular the /proc/stat documentation.
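A rough Python sketch of why user + system does not have to equal the total: the aggregate cpu line in /proc/stat has more time buckets than those two (nice, iowait, irq, softirq, steal). The two sample lines are invented, and "non_idle" is this sketch's own definition (everything except idle); how Nomad itself computes total_percent is not shown here.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a dict of counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    values += [0] * (len(FIELDS) - len(values))  # older kernels report fewer fields
    return dict(zip(FIELDS, values))


def percentages(prev, curr):
    """Share of each bucket between two samples, in percent of elapsed CPU time."""
    delta = {k: curr[k] - prev[k] for k in FIELDS}
    total = sum(delta.values())
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    pct = {k: 100.0 * v / total for k, v in delta.items()}
    pct["non_idle"] = 100.0 - pct["idle"]  # this sketch's definition of "busy"
    return pct


if __name__ == "__main__":
    # Invented samples; on a real host read the first line of /proc/stat twice.
    prev = parse_cpu_line("cpu  1000 50 400 8000 300 20 30 0 0 0")
    curr = parse_cpu_line("cpu  1300 70 500 8500 360 30 40 0 0 0")
    pct = percentages(prev, curr)
    print(pct["user"] + pct["system"])  # 40.0
    print(pct["non_idle"])              # 50.0
```

With these samples user + system is 40% while non-idle time is 50%; the difference is nice, iowait, irq and softirq time.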

it also doesn’t add up really.

What about the kernel? What about I/O device buffers? Consider researching Linux memory.

What metric can I use to see nomad internal client/host processes CPU/memory usage?

I do not understand the question: what are “internal client” and “internal host” processes, and how do they differ from “external” ones? You might be interested in Zabbix, Prometheus, or Nagios.

To monitor the Go runtime “internals” of the Nomad process itself (see the runtime/metrics package on Go Packages), I use nomad_runtime_alloc_bytes and nomad_runtime_heap_objects.

AAverin · April 23, 2024, 11:59am

Thanks a lot for the detailed reply. I will make sure to read through the references and see if they resolve all my questions.
