Clarifications on Nomad metrics values

I am looking into Nomad metrics and so far I find them quite confusing.
Here are the questions I have:

  • What is the difference between “system space” and “user space”, specifically for the metrics nomad.client.allocs.cpu.user and nomad.client.allocs.cpu.system? How do I know whether my job runs in system or in user space? Do I need to sum them to get the CPU consumption per allocation?

  • nomad.client.allocs.cpu.allocated is in MHz, but all other allocs.cpu metrics are percentages, according to the documentation. How can I build an alarm that triggers when an allocation’s CPU usage exceeds the CPU allocated to it? This would indicate that the task consumes more CPU than it was given.

  • I was expecting that:
    nomad.client.host.cpu.total_percent = nomad.client.host.cpu.system + nomad.client.host.cpu.user
    but on my single-core CPU this is never the case. Why is that? What am I missing?

  • I expected that:
    the sum of nomad_client_allocs_memory_usage over all allocations would equal nomad_client_host_memory_used, but that is also never the case. Even if I add the allocated memory per allocation to the calculation, it still doesn’t add up.

  • What metric can I use to see nomad internal client/host processes CPU/memory usage?

A more specific example: my grafanaagent task constantly consumes more than it was given. I want to see the same in Grafana and have an alert for such cases.

Running a query nomad_client_allocs_cpu_system{instance="Monitoring", task="grafanaagent"} + nomad_client_allocs_cpu_user{instance="Monitoring", task="grafanaagent"} gives me:

[screenshot: query result]

That doesn’t match Nomad UI

Another example:

[screenshot] vs [screenshot]

How does this work?

Hi!

what is the difference between “system space” and “user space”, specifically metric

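The difference is the same as on Linux in general: user time is CPU time spent running the process’s own code, and system time is CPU time the kernel spends on its behalf (system calls, I/O and so on). A job is not “in” one or the other; it normally accumulates both. See “User CPU time vs System CPU time?” on Stack Overflow, and read up on Linux CPU usage metrics.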

but all other allocs.cpu metrics are in Percentage

What about nomad.client.allocs.cpu.total_ticks?

How can I build an alarm that triggers when allocation CPU usage crosses available allocation CPU?

I use Prometheus and alert when the following expression is greater than 100 (percent):

nomad_client_allocs_cpu_total_ticks{namespace=~"$namespace",instance=~"$client",exported_job=~${job:doublequote},task_group=~"$group",task=~"$task",alloc_id=~"$alloc_id"} * 100
/
nomad_client_allocs_cpu_allocated{namespace=~"$namespace",instance=~"$client",exported_job=~${job:doublequote},task_group=~"$group",task=~"$task",alloc_id=~"$alloc_id"}
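
The arithmetic behind that expression can be sketched in a few lines of Python. This is only an illustration: it assumes total_ticks is on the same MHz scale as allocated (which is what the query above relies on), and the function names and sample values are made up.

```python
def cpu_usage_percent(total_ticks_mhz: float, allocated_mhz: float) -> float:
    """CPU used by an allocation, as a percentage of what it was allocated."""
    if allocated_mhz <= 0:
        raise ValueError("allocated_mhz must be positive")
    return total_ticks_mhz * 100.0 / allocated_mhz


def over_allocation(total_ticks_mhz: float, allocated_mhz: float,
                    threshold_percent: float = 100.0) -> bool:
    """True when usage exceeds the threshold, i.e. when the alert should fire."""
    return cpu_usage_percent(total_ticks_mhz, allocated_mhz) > threshold_percent


if __name__ == "__main__":
    # Hypothetical sample: a task allocated 100 MHz that is using 150 MHz.
    print(cpu_usage_percent(150.0, 100.0))
    print(over_allocation(150.0, 100.0))
```

Dividing by the allocated value is what makes the result comparable across tasks with different CPU reservations.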

Why is it like that? What am I missing?

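This is nothing specific to Nomad. On Linux, total CPU time also includes categories other than user and system, such as nice, iowait, irq, softirq and steal, so user + system is usually less than the total. See man proc and the /proc/stat documentation.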
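
To see why user + system falls short of the total, here is a small Python sketch that computes the shares from two samples of the aggregate “cpu” line in /proc/stat. The two sample lines are invented, and treating everything except idle as “busy” is my assumption; I have not checked exactly which fields Nomad counts.

```python
# Field order of the aggregate "cpu" line in /proc/stat (see man proc).
FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]

# Hypothetical samples taken some seconds apart (values are cumulative ticks).
PREV = "cpu 1000 50 300 8000 100 20 30 0 0 0"
CURR = "cpu 1300 70 450 8450 150 30 45 5 0 0"


def cpu_shares(prev_line: str, curr_line: str) -> dict:
    """Percentage of elapsed CPU time per category between two samples."""
    prev = [int(x) for x in prev_line.split()[1:9]]
    curr = [int(x) for x in curr_line.split()[1:9]]
    deltas = [c - p for p, c in zip(prev, curr)]
    total = sum(deltas)
    shares = {name: 100.0 * d / total for name, d in zip(FIELDS, deltas)}
    # Assumption: "busy" is everything that is not idle.
    shares["busy"] = 100.0 - shares["idle"]
    return shares


if __name__ == "__main__":
    s = cpu_shares(PREV, CURR)
    print(s["user"], s["system"], s["busy"])
```

With these made-up samples, user and system together come to less than the busy share; the rest is nice, iowait, irq, softirq and steal time.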

it also doesn’t add up really.

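What about the kernel? What about I/O device buffers? What about processes on the host that are not Nomad allocations? All of these use host memory too. Consider reading up on Linux memory accounting.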
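
As a rough illustration, the gap can be computed like this in Python. The function name and the numbers are hypothetical; the point is only that whatever is left after subtracting the allocations belongs to the kernel, buffers, the Nomad agent itself and any other host processes.

```python
MIB = 1024 * 1024


def unaccounted_host_memory(host_used_bytes: int, alloc_usage_bytes: list) -> int:
    """Host memory in use that is not attributed to any allocation."""
    return host_used_bytes - sum(alloc_usage_bytes)


if __name__ == "__main__":
    # Hypothetical values: 3 GiB used on the host, three allocations.
    host_used = 3072 * MIB
    allocs = [512 * MIB, 256 * MIB, 1024 * MIB]
    print(unaccounted_host_memory(host_used, allocs) // MIB, "MiB not in allocations")
```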

What metric can I use to see nomad internal client/host processes CPU/memory usage?

I do not understand the question. What are “internal client” and “internal host” processes, and how do they differ from “external” ones? You might be interested in Zabbix, Prometheus or Nagios.

To monitor the Go runtime “internals” of the Nomad process itself (see the runtime/metrics package on Go Packages), I use nomad_runtime_alloc_bytes and nomad_runtime_heap_objects.

Thanks a lot for the detailed reply. I will make sure to read through the references and see if that resolves all my questions.