Hi there,
I’ve spent more time than I expected trying to forward our exposed metrics to a Prometheus server. We use Nomad autodiscovery along with Traefik and Cloudflare for DNS management. Telemetry is currently enabled on our cluster, and I can see the metrics exposed at cluster.example.com/v1/metrics?format=prometheus
To my understanding, we need to discover the metrics per client/server and then forward them to Prometheus. Deploying a standalone Prometheus server seems like overkill since we just need to forward the metrics.
I’ve successfully deployed a Vector job following this community post: Nomad host logs and metrics using vector, Loki (Grafana cloud),
but as far as I can tell some metrics are missing, since Vector cannot discover the metrics exposed by the Nomad cluster.
As far as I can tell, if I want to use an agent, my options are the victoria-metrics agent or the Grafana agent.
So I’ve also tried deploying the victoria-metrics agent as a job with the Nomad autodiscovery configuration enabled. When the job is deployed, the agent does discover the members of the cluster, but scraping the metrics fails with a 404 error.
I’ve been pulling my hair out over this: I’ve tried different configurations for the past two days and can’t find a solution. It seems the agent tries to scrape the metrics from raw IP addresses which are not accessible.
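For what it’s worth, here is the kind of relabel rule I suspect might be needed, rewriting the scrape target from the discovered service address to the Nomad agent’s HTTP API. This is only a sketch, not something from my current config: `__meta_nomad_address` is the node address label documented for Prometheus-style `nomad_sd_configs`, and 4646 is the default Nomad HTTP port.

```yaml
relabel_configs:
  # Replace the discovered service address with the Nomad agent's
  # HTTP API on the same node (4646 is the default Nomad HTTP port).
  - source_labels: [__meta_nomad_address]
    regex: "(.+)"
    target_label: __address__
    replacement: "${1}:4646"
```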
Here is the job definition for reference:
job "vmagent" {
  datacenters = ["dc1"]
  type        = "service"

  group "vmagent" {
    network {
      port "api" {
        to = 8429
      }
    }

    ephemeral_disk {
      size   = 500 # 500 MB
      sticky = true
    }

    update {
      auto_revert = true
    }

    task "vmagent" {
      driver = "docker"

      config {
        image = "victoriametrics/vmagent:latest"
        ports = ["api"]
        args = [
          "-envflag.enable",
          "-promscrape.config.strictParse=false",
          "-promscrape.config=${NOMAD_TASK_DIR}/vm-config.yml",
          "-remoteWrite.maxDiskUsagePerURL=500MB",
          "-remoteWrite.url=https://<my-prometheus-instance>/api/v1/write",
          "-remoteWrite.basicAuth.username=admin",
          "-remoteWrite.basicAuth.password=password",
        ]
      }

      template {
        data        = file(abspath("./prometheus.tpl.yml"))
        destination = "local/vm-config.yml"
        change_mode = "restart"
      }

      service {
        provider = "nomad"
        port     = "api"
      }

      resources {
        cpu    = 256
        memory = 100
      }
    }
  }
}
and the template file:
global:
  scrape_interval: 2s
  evaluation_interval: 2s

scrape_configs:
  - job_name: "nomad_test"
    static_configs:
      - labels: { "cluster": "foo" }
    nomad_sd_configs:
      - server: "https://my-nomad-dashboard.com"
        authorization:
          credentials: "<nomad-token>"
        follow_redirects: true
        refresh_interval: 1m
        tls_config:
          insecure_skip_verify: true
    metrics_path: /v1/metrics
    # params:
    #   format: ["prometheus"]
    scrape_interval: 15s
    scrape_timeout: 5s
    relabel_configs:
      - source_labels: [__address__]
        target_label: environment
        replacement: "staging"