Hi there,
I’m trying to sort out monitoring of individual allocation tasks being restarted under the hood, and I’m seeing some odd behaviour that I can’t quite pin down…
I’m trying to use the nomad_client_allocs_restart metric (which, by the way, is not documented), although it is clearly used in the code of my Nomad v1.4.2 cluster.
When I manually trigger a restart (or exec into a task and kill the process so it restarts), I can see this in my Prometheus metrics…
Question: if this is a counter (as it seems to be from the code), why does this metric disappear from the metrics endpoint after a few seconds? If I restart the same task again, it comes back with a value of 1 for a few seconds, and then disappears again…
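To make sure this isn’t just a scraping artefact, I’ve also been watching the raw metrics endpoint on a client directly, roughly like this (a quick sketch; the client address is a placeholder for one of my nodes):

import time
import urllib.request

NOMAD_CLIENT = "http://127.0.0.1:4646"  # placeholder: one of my Nomad client nodes
METRIC_NAME = "nomad_client_allocs_restart"

while True:
    # Hit the same endpoint Prometheus scrapes, in Prometheus text format
    with urllib.request.urlopen(f"{NOMAD_CLIENT}/v1/metrics?format=prometheus") as resp:
        body = resp.read().decode()
    # Keep only the sample lines for the restart metric (skip HELP/TYPE comments)
    matches = [line for line in body.splitlines()
               if METRIC_NAME in line and not line.startswith("#")]
    print(time.strftime("%H:%M:%S"), matches or "metric not exposed")
    time.sleep(1)

The metric shows up right after a restart and then stops being exposed a few seconds later, so it looks like it vanishes at the source, not only in Prometheus.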
I reckon I’ve obviously misconfigured something, but it seems like a very basic metric to scrape and use…
A few details about the setup.
This is a single cluster, very simple setup (3 servers / 5 clients):
Telemetry stanza on the Nomad clients:
telemetry {
  collection_interval        = "2s"
  disable_hostname           = true
  prometheus_metrics         = true
  publish_allocation_metrics = true
  publish_node_metrics       = true
}
Prometheus scrape config:
- job_name: 'nomad_servers'
  scrape_interval: 1s
  consul_sd_configs:
    - server: '{{ env "NOMAD_IP_prometheus_ui" }}:8500'
      services: ['nomad', 'nomad-client']
  relabel_configs:
    - source_labels: ['__meta_consul_tags']
      regex: '(.*)http(.*)'
      action: keep
  metrics_path: /v1/metrics
  params:
    format: ['prometheus']
I tried different scrape_interval / collection_interval combinations, but that doesn’t seem to be related.
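For what it’s worth, here is a minimal sketch of how I cross-checked what Prometheus itself currently sees for the metric, via its HTTP query API (the Prometheus address is a placeholder for my setup):

import json
import urllib.parse
import urllib.request

PROMETHEUS = "http://127.0.0.1:9090"  # placeholder: where my Prometheus is reachable
QUERY = "nomad_client_allocs_restart"

# Instant query against the standard Prometheus HTTP API
url = f"{PROMETHEUS}/api/v1/query?" + urllib.parse.urlencode({"query": QUERY})
with urllib.request.urlopen(url) as resp:
    result = json.load(resp)

# An empty result list means Prometheus currently has no series for the metric,
# which matches the "disappears after a few seconds" behaviour I'm seeing.
print(json.dumps(result["data"]["result"], indent=2))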
I’ve already spent way too long trying to figure this out and would definitely appreciate some insights! I’m going crazy!