Hi folks,
I’m coming back to some work with Nomad on our setup, which centralises logs by sending them to a Loki server. We use Promtail to ship logs from non-Nomad, non-Docker processes, and a Vector system job running on every host to annotate Docker logs. It looks like this:
// this job sets up vector on every node in the cluster,
// allowing all the docker logs to be collected and sent to Loki
job "vector" {
  datacenters = ["dc1"]
  # system job, runs on all nodes
  type = "system"
  update {
    min_healthy_time = "10s"
    healthy_deadline = "5m"
    progress_deadline = "10m"
    auto_revert = true
  }
  group "vector" {
    count = 1
    restart {
      attempts = 3
      interval = "10m"
      delay = "30s"
      mode = "fail"
    }
    network {
      // api is used for health checks
      port "api" {
        to = 8686
        host_network = "my-private-network"
      }
    }
    // declare a volume to mount the docker socket
    volume "docker-sock" {
      type = "host"
      source = "docker-sock-ro"
      read_only = true
    }
    // we only want a minimal size ephemeral disk for Vector's data
    // this stops us running out of space
    ephemeral_disk {
      size = 500
      sticky = true
    }
    task "vector" {
      // this task runs vector, listening to the docker socket for logs, and
      // annotates them with metadata so you can identify the jobs, tasks,
      // nodes and so on, then sends the enriched logs to loki
      driver = "docker"
      config {
        image = "timberio/vector:SOME_VERSION_PINNED"
        ports = ["api"]
      }
      # docker socket volume mount
      volume_mount {
        volume = "docker-sock"
        destination = "/var/run/docker.sock"
        read_only = true
      }
      # Vector won't start unless the configured sinks (backends) are healthy
      env {
        VECTOR_CONFIG = "local/vector.toml"
        VECTOR_REQUIRE_HEALTHY = "true"
      }
      # resource limits are a good idea because you don't want your log collection
      # to consume all resources available
      resources {
        cpu = 500 # 500 MHz
        memory = 256 # 256 MB
      }
      # template with Vector's configuration
      template {
        destination = "local/vector.toml"
        change_mode = "signal"
        change_signal = "SIGHUP"
        # overriding the delimiters to [[ ]] to avoid conflicts with Vector's native templating, which also uses {{ }}
        left_delimiter = "[["
        right_delimiter = "]]"
        data = <<EOH
        data_dir = "alloc/data/vector/"
        [api]
        enabled = true
        address = "0.0.0.0:8686"
        playground = true
        [sources.logs]
        type = "docker_logs"
        [sinks.out]
        type = "console"
        inputs = [ "logs" ]
        encoding.codec = "json"
        [sinks.loki]
        type = "loki"
        inputs = ["logs"]
        endpoint = "http://<LOKI_SERVER_PRIVATE_IP_ADDRESS>:3100"
        encoding.codec = "json"
        healthcheck.enabled = true
        # since . is used by Vector to denote a parent-child relationship, and Nomad's Docker labels contain ".",
        # we need to escape them twice, once for TOML, once for Vector
        labels.job = "{{ label.com\\.hashicorp\\.nomad\\.job_name }}"
        labels.task = "{{ label.com\\.hashicorp\\.nomad\\.task_name }}"
        labels.group = "{{ label.com\\.hashicorp\\.nomad\\.task_group_name }}"
        labels.namespace = "{{ label.com\\.hashicorp\\.nomad\\.namespace }}"
        labels.node = "{{ label.com\\.hashicorp\\.nomad\\.node_name }}"
        # remove fields that have been converted to labels to avoid having the field twice
        remove_label_fields = true
        EOH
      }
      service {
        provider = "nomad"
        check {
          port = "api"
          type = "http"
          path = "/health"
          interval = "30s"
          timeout = "5s"
        }
      }
      kill_timeout = "30s"
    }
  }
}
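For anyone trying to reproduce this, the job leans on a few pieces of Nomad client configuration that are not part of the job file. Below is a rough sketch of what that might look like, not our exact config. The names `docker-sock-ro` and `my-private-network` come from the job itself; the CIDR is a placeholder assumption, and the `extra_labels` list is simply the set of `com.hashicorp.nomad.*` labels that the Vector config references. As far as I know, the Docker driver only attaches the allocation ID label by default, so the others have to be switched on.

```hcl
# Nomad client agent configuration (a sketch, separate from the job file)
client {
  enabled = true

  # host volume that the job mounts as "docker-sock-ro"
  host_volume "docker-sock-ro" {
    path      = "/var/run/docker.sock"
    read_only = true
  }

  # host network referenced by host_network in the job's port block
  # (the CIDR below is a placeholder, adjust it to your private range)
  host_network "my-private-network" {
    cidr = "10.0.0.0/24"
  }
}

plugin "docker" {
  config {
    # ask the Docker driver to add these labels to every container,
    # so Vector can turn them into Loki labels
    extra_labels = ["job_name", "task_group_name", "task_name", "namespace", "node_name"]
  }
}
```

If a label is missing from `extra_labels`, the matching `labels.*` entry in the Loki sink has nothing to read, so that list and the Vector template need to stay in step.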
I have been using Nomad with Docker because I appreciate the logging. Back when I last looked, if I wanted to run jobs and have their logs labelled on the way into Loki, I had to use a custom Vector setup based around this software here:
There is a good writeup here from 2022 explaining how it all works:
Essentially it runs a special config that watches for new log files being created, and then uses Vector to relabel each line before sending it on to Loki.
That was good practice in 2022, but it seems weird to need a whole separate system job just to add these labels.
Has anything more native to Nomad appeared since then, so that I don’t need to run another system job just to give my allocations more meaningful labels?
If not, I know the existing approach works, so I can implement it, but I’m probably not the only person who would benefit from knowing whether there is an approach out there that uses fewer moving parts.
Thank you in advance, and for continued development on Nomad!