How to monitor restarts with nomad_client_alloc_restarts

Hi there,

I’m trying to sort out monitoring of individual allocation tasks being restarted under the hood, and I’m seeing an odd behaviour that I can’t put my finger on…

I’m trying to use nomad_client_allocs_restart (which, by the way, is not documented), although it is clearly used in the code of my v1.4.2 Nomad cluster.

When I manually trigger a restart (or exec into a task and kill the process so it restarts), I can see this in my Prometheus metrics…

Question: if this is a counter (as it seems to be from the code), why does the metric disappear from the metrics endpoint after a few seconds? If I restart the same task again, it comes back with a value of 1 for a few seconds, then disappears again…
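To rule out Prometheus itself, I’ve been checking the raw endpoint directly. Here is a minimal sketch of the filtering I do — the here-doc stands in for the real response so it can be shown offline, and the `alloc_id`/`task` labels are made up for illustration (in practice I run `curl -s "http://<client>:4646/v1/metrics?format=prometheus"` against the client’s default HTTP port):

```shell
# Sample stand-in for the /v1/metrics?format=prometheus response;
# only the metric name is taken from my reading of the Nomad code,
# the labels and value are illustrative.
sample='# HELP nomad_client_allocs_restart nomad_client_allocs_restart
# TYPE nomad_client_allocs_restart counter
nomad_client_allocs_restart{alloc_id="abc123",task="web"} 1'

# Keep only the sample lines (not HELP/TYPE comments) for the counter.
printf '%s\n' "$sample" | grep '^nomad_client_allocs_restart{'
```

Right after a restart this grep matches, and a few seconds later the same grep against the live endpoint returns nothing.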

I reckon I’ve obviously misconfigured something, but this seems like a very basic metric to scrape and use…

A few things about the setup.
This is a single cluster with a very simple topology (3 servers / 5 clients).
Telemetry stanza on the Nomad clients:

telemetry {
  collection_interval        = "2s"
  disable_hostname           = true
  prometheus_metrics         = true
  publish_allocation_metrics = true
  publish_node_metrics       = true
}

Prometheus scrape config:

- job_name: 'nomad_servers'
  scrape_interval: 1s
  consul_sd_configs:
  - server: '{{ env "NOMAD_IP_prometheus_ui" }}:8500'
    services: ['nomad', 'nomad-client']
  relabel_configs:
  - source_labels: ['__meta_consul_tags']
    regex: '(.*)http(.*)'
    action: keep
  metrics_path: /v1/metrics
  params:
    format: ['prometheus']

I’ve tried different scrape_interval / collection_interval combinations, but it doesn’t seem to be related.
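For context, this is roughly the alerting rule I’m ultimately after — just a sketch, assuming the metric name stays as above and using `increase()` so counter resets between restarts don’t matter (Alertmanager wiring omitted):

```yaml
# Hypothetical Prometheus rule file; names are illustrative.
groups:
- name: nomad-allocs
  rules:
  - alert: AllocTaskRestarted
    # Fires if the restart counter grew at all over the last 5 minutes.
    expr: increase(nomad_client_allocs_restart[5m]) > 0
    labels:
      severity: warning
    annotations:
      summary: "A Nomad allocation task was restarted"
```

But with the metric vanishing a few seconds after each restart, a rule like this only fires if a scrape happens to land inside that window.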

I’ve already spent way too long trying to figure this out, and I’d definitely appreciate some insights! :wink: I’m going crazy!