Hi there,
I’ve spent more time than I expected trying to forward our exposed metrics to a Prometheus server. We use Nomad autodiscovery along with Traefik and Cloudflare for DNS management. Telemetry is currently enabled on our cluster, and I can see the metrics exposed at cluster.example.com/v1/metrics?format=prometheus
To my understanding, we need to discover the metrics per client/server and then forward them to Prometheus. Deploying a standalone Prometheus server seems like overkill since we just need to forward the metrics.
I’ve successfully deployed a Vector job following this community post: Nomad host logs and metrics using vector, Loki (Grafana cloud),
but as far as I can tell some metrics are missing, since Vector cannot discover the metrics exposed by the Nomad cluster.
As far as I can tell, if I want to use an agent, my options are the victoria-metrics agent or the Grafana agent.
So I’ve also tried deploying the victoria-metrics agent as a job with the Nomad autodiscovery configuration enabled. When the job is deployed, the agent does discover the members of the cluster, but scraping the metrics fails with a 404 error.
I’ve been pulling my hair out over this: I’ve tried different configurations for the past two days and can’t find a solution. It seems the agent tries to scrape the metrics from raw IP addresses which are not accessible.
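For what it’s worth, here is the kind of relabel rule I suspect might be needed, rewriting the scrape target from the discovered service address to the Nomad agent’s HTTP API. This is only a sketch, not something from my current config: `__meta_nomad_address` is the node address label documented for Prometheus-style `nomad_sd_configs`, and 4646 is the default Nomad HTTP port.

```yaml
relabel_configs:
  # Replace the discovered service address with the Nomad agent's
  # HTTP API on the same node (4646 is the default Nomad HTTP port).
  - source_labels: [__meta_nomad_address]
    regex: "(.+)"
    target_label: __address__
    replacement: "${1}:4646"
```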
Here is the job definition for reference:
job "vmagent" {
  datacenters = ["dc1"]
  type        = "service"

  group "vmagent" {
    network {
      port "api" {
        to = 8429
      }
    }

    ephemeral_disk {
      size   = 500 # 500 MB
      sticky = true
    }

    update {
      auto_revert = true
    }

    task "vmagent" {
      driver = "docker"

      config {
        image = "victoriametrics/vmagent:latest"
        ports = ["api"]
        args = [
          "-envflag.enable",
          "-promscrape.config.strictParse=false",
          "-promscrape.config=${NOMAD_TASK_DIR}/vm-config.yml",
          "-remoteWrite.maxDiskUsagePerURL=500MB",
          "-remoteWrite.url=https://<my-prometheus-instance>/api/v1/write",
          "-remoteWrite.basicAuth.username=admin",
          "-remoteWrite.basicAuth.password=password",
        ]
      }

      template {
        data        = file(abspath("./prometheus.tpl.yml"))
        destination = "local/vm-config.yml"
        change_mode = "restart"
      }

      service {
        provider = "nomad"
        port     = "api"
      }

      resources {
        cpu    = 256
        memory = 100
      }
    }
  }
}
and the template file:
global:
  scrape_interval: 2s
  evaluation_interval: 2s

scrape_configs:
  - job_name: "nomad_test"
    static_configs:
      - labels: { "cluster": "foo" }
    nomad_sd_configs:
      - server: "https://my-nomad-dashboard.com"
        authorization:
          credentials: "<nomad-token>"
        follow_redirects: true
        refresh_interval: 1m
        tls_config:
          insecure_skip_verify: true
    metrics_path: /v1/metrics
    # params:
    #   format: ["prometheus"]
    scrape_interval: 15s
    scrape_timeout: 5s
    relabel_configs:
      - source_labels: [__address__]
        target_label: environment
        replacement: "staging"