Hi everyone,
I’ve been trying to set up a Grafana dashboard to show telemetry data for Nomad client nodes, but I am not sure my PromQL queries are correct. I followed the metrics reference documentation and a few other references, but I still have doubts because the CPU/memory stats shown in the Nomad UI differ from my Grafana panels.
Nomad version:
Nomad v1.8.2
BuildDate 2024-07-16T08:50:09Z
Revision 7f0822c1e4f25907d9f60e2d595411950dd1bd28
Nomad agent telemetry block:
telemetry {
  collection_interval        = "1s"
  disable_hostname           = true
  prometheus_metrics         = true
  publish_allocation_metrics = true
  publish_node_metrics       = true
}
I am using VictoriaMetrics to store the Prometheus-formatted data pushed by Nomad.
Can someone help me verify whether these queries are correct?
For reference, these are the dashboard variables:
- cluster: only one option, local
- instance: specifies the client node
Queries
1. Current CPU utilization of client node
sum by (cluster,instance) (nomad_client_host_cpu_total{cluster="$cluster",instance="$client_id"}) / (sum by (cluster, instance) (nomad_client_host_cpu_total{cluster="$cluster",instance="$client_id"}) + sum by (cluster, instance) (nomad_client_host_cpu_idle{cluster="$cluster",instance="$client_id"}))
2. Current memory utilization of client node
sum by (cluster, instance) (nomad_client_host_memory_used{cluster="$cluster", instance="$client_id"}) / (sum by (cluster, instance) (nomad_client_host_memory_total{cluster="$cluster", instance="$client_id"}))
3. Current disk utilization of client node
sum by (cluster, instance) (nomad_client_host_disk_used{cluster="$cluster", instance="$client_id"}) / (sum by (cluster, instance) (nomad_client_host_disk_size{cluster="$cluster", instance="$client_id"}))
4. % CPU shares allocated
This shows how much CPU is allocated to all jobs combined, as a share of the node's total.
sum by (cluster, instance) (nomad_client_allocated_cpu{cluster="$cluster", instance="$client_id"}) / (sum by (cluster, instance) (nomad_client_allocated_cpu{cluster="$cluster", instance="$client_id"}) + sum by (cluster, instance) (nomad_client_unallocated_cpu{cluster="$cluster", instance="$client_id"}))
5. % CPU utilization
This shows how much of the CPU allocated to all jobs combined is actually being used.
sum by (cluster,instance) (nomad_client_allocs_cpu_total_ticks{cluster="$cluster",instance="$client_id"}) / sum by (cluster, instance) (nomad_client_allocs_cpu_allocated{cluster="$cluster",instance="$client_id"})
A few refs: ref1, ref2
6. CPU allocated (MHz)
This is the CPU allocated to all the jobs.
sum by (cluster, instance) (nomad_client_allocated_cpu{cluster="$cluster", instance="$client_id"})
7. CPU utilization (MHz)
This is the CPU being used by all the jobs combined.
sum by (cluster,instance) (nomad_client_allocs_cpu_total_ticks{cluster="$cluster",instance="$client_id"})
8. % Memory allocated
sum by (cluster, instance) (nomad_client_allocated_memory{cluster="$cluster", instance="$client_id"}) / (sum by (cluster, instance) (nomad_client_allocated_memory{cluster="$cluster", instance="$client_id"}) + sum by (cluster, instance) (nomad_client_unallocated_memory{cluster="$cluster", instance="$client_id"}))
9. % Memory utilization
sum by (cluster, instance) (nomad_client_allocs_memory_usage{cluster="$cluster", instance="$client_id"}) / sum by (cluster, instance) (nomad_client_allocs_memory_allocated{cluster="$cluster", instance="$client_id"})
10. Memory allocated (Bytes)
This is the memory allocated to all the jobs combined
sum by (cluster, instance) (nomad_client_allocs_memory_allocated{cluster="$cluster", instance="$client_id"})
11. Memory utilization (Bytes)
This is the memory being used by all the jobs combined.
sum by (cluster, instance) (nomad_client_allocs_memory_usage{cluster="$cluster", instance="$client_id"})
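To spell out what I expect query 1 to compute, here is a minimal Python sketch of the arithmetic. It assumes nomad_client_host_cpu_total and nomad_client_host_cpu_idle arrive as one series per core, each holding a percentage, so that sum by (cluster, instance) adds the cores together. The function name and the sample values are made up for illustration and are not taken from my cluster.

```python
def cpu_utilization(total_per_core, idle_per_core):
    """Mirror query 1: sum(total) / (sum(total) + sum(idle))."""
    busy = sum(total_per_core)  # sum by (cluster, instance) of cpu_total
    idle = sum(idle_per_core)   # sum by (cluster, instance) of cpu_idle
    return busy / (busy + idle)


# Made-up samples for a hypothetical 2-core node (percent per core).
total = [25.0, 75.0]
idle = [75.0, 25.0]

print(cpu_utilization(total, idle))  # 0.5
```

If total and idle do not add up to 100 for each core (for example, because some time is accounted elsewhere), this ratio will not match a plain "100 minus idle" figure, which may be one source of the mismatch I describe below.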
I am seeing a discrepancy between the Nomad UI and my Grafana panels for the 1st and 2nd queries:
- Nomad UI
- Grafana panel
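To rule out Grafana itself (variable interpolation, panel units), one option would be to run the same query directly against the Prometheus-compatible /api/v1/query endpoint. Below is a rough sketch for query 2. The base URL assumes a single-node VictoriaMetrics on its default port 8428, and build_url, run_query and the node name are hypothetical names used only for this example.

```python
import json
import urllib.parse
import urllib.request
from string import Template

# Assumption: single-node VictoriaMetrics reachable on its default port.
BASE_URL = "http://localhost:8428"

# Query 2 from above, with the same $cluster / $client_id placeholders.
MEMORY_QUERY = Template(
    'sum by (cluster, instance) '
    '(nomad_client_host_memory_used{cluster="$cluster", instance="$client_id"}) '
    '/ sum by (cluster, instance) '
    '(nomad_client_host_memory_total{cluster="$cluster", instance="$client_id"})'
)


def build_url(base_url, cluster, client_id):
    """Fill in the dashboard variables and build an instant-query URL."""
    promql = MEMORY_QUERY.substitute(cluster=cluster, client_id=client_id)
    return base_url + "/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def run_query(url):
    """Send the instant query and return the 'result' list of the response."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)["data"]["result"]
```

Calling run_query(build_url(BASE_URL, "local", "my-client-node")) with a real instance label should return one series whose value can be compared with both the Grafana panel and the Nomad UI.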