Using Nomad, and sending logs with job / task labels into Loki and friends

Hi folks,

I’m coming back to some work with Nomad on our setup, which centralises logs by sending them to a single Loki server. We use promtail to send logs for non-Nomad, non-Docker processes, and a Vector system job running on every host to annotate Docker logs. It looks like this:

// this job sets up vector on every node in the cluster,
// allowing all the docker logs to be collected and sent to Loki

job "vector" {
  datacenters = ["dc1"]
  # system job, runs on all nodes
  type = "system"

  // roll out changes conservatively and revert automatically on failure
  update {
    min_healthy_time = "10s"
    healthy_deadline = "5m"
    progress_deadline = "10m"
    auto_revert = true
  }
  group "vector" {
    count = 1
    restart {
      attempts = 3
      interval = "10m"
      delay = "30s"
      mode = "fail"
    }
    network {
      // api is used for health checks
      port "api" {
        to = 8686
        host_network = "my-private-network"
      }
    }

    // declare a volume to mount the docker socket
    volume "docker-sock" {
      type = "host"
      source = "docker-sock-ro"
      read_only = true
    }
    // we only want a minimal size ephemeral disk for Vector's data
    // this stops us running out of space
    ephemeral_disk {
      size    = 500
      sticky  = true
    }
    task "vector" {

      // this task runs vector, listening to the docker socket for logs, and
      // annotates them with metadata so you can identify the jobs, tasks,
      // nodes and so on, then sends the enriched logs to loki

      driver = "docker"
      config {
        image = "timberio/vector:SOME_VERSION_PINNED"
        ports = ["api"]
      }
      # docker socket volume mount
      volume_mount {
        volume = "docker-sock"
        destination = "/var/run/docker.sock"
        read_only = true
      }
      # Vector won't start unless the configured sinks (backends) are healthy
      env {
        VECTOR_CONFIG = "local/vector.toml"
        VECTOR_REQUIRE_HEALTHY = "true"
      }
      # resource limits are a good idea because you don't want your log collection
      # to consume all resources available
      resources {
        cpu    = 500 # 500 MHz
        memory = 256 # 256MB
      }
      # template with Vector's configuration
      template {
        destination = "local/vector.toml"
        change_mode   = "signal"
        change_signal = "SIGHUP"
        # overriding the delimiters to [[ ]] to avoid conflicts with Vector's native templating, which also uses {{ }}
        left_delimiter = "[["
        right_delimiter = "]]"
        data=<<EOH
data_dir = "alloc/data/vector/"
[api]
  enabled = true
  address = "0.0.0.0:8686"
  playground = true
[sources.logs]
  type = "docker_logs"
[sinks.out]
  type = "console"
  inputs = [ "logs" ]
  encoding.codec = "json"
[sinks.loki]
  type = "loki"
  inputs = ["logs"]
  endpoint = "http://<LOKI_SERVER_PRIVATE_IP_ADDRESS>:3100"
  encoding.codec = "json"
  healthcheck.enabled = true
  # since . is used by Vector to denote a parent-child relationship, and Nomad's Docker labels contain ".",
  # we need to escape them twice, once for TOML, once for Vector
  labels.job = "{{ label.com\\.hashicorp\\.nomad\\.job_name }}"
  labels.task = "{{ label.com\\.hashicorp\\.nomad\\.task_name }}"
  labels.group = "{{ label.com\\.hashicorp\\.nomad\\.task_group_name }}"
  labels.namespace = "{{ label.com\\.hashicorp\\.nomad\\.namespace }}"
  labels.node = "{{ label.com\\.hashicorp\\.nomad\\.node_name }}"
  # remove fields that have been converted to labels to avoid having the field twice
  remove_label_fields = true
        EOH
      }
      service {
        provider = "nomad"
        check {
          port     = "api"
          type     = "http"
          path     = "/health"
          interval = "30s"
          timeout  = "5s"
        }
      }
      kill_timeout = "30s"
    }
  }
}
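
For anyone reproducing the job above: it refers to a host volume called docker-sock-ro and a host network called my-private-network, and both have to be declared in the Nomad client configuration on each node. A minimal sketch of that client block follows. The socket path and the CIDR are assumptions for illustration, so adjust them to your hosts.

```hcl
# Nomad client (agent) configuration, not part of the job file
client {
  enabled = true

  # exposes the Docker socket read-only under the name the job's volume block expects
  host_volume "docker-sock-ro" {
    path      = "/var/run/docker.sock" # assumed default Docker socket location
    read_only = true
  }

  # defines the network that the job's port "api" is bound to
  host_network "my-private-network" {
    cidr = "10.0.0.0/24" # placeholder: replace with your private range
  }
}
```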

I have been using Nomad with Docker because I appreciate the logging. Back when I last looked, if I wanted to run any jobs and have them labelled going into Loki, I had to use some kind of custom Vector setup based around this software here:

There is a good writeup here from 2022 explaining how it all works:

Essentially, it runs a special config that watches for new log files being created, then uses Vector to relabel each line before sending it on to Loki.
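
To make the file-based approach concrete, here is a rough sketch of what such a config might look like as a Nomad template block, in the same style as the job above. It is not the config from the writeup. The source, transform and sink names are made up, the /opt/nomad/data path assumes that is the client's data_dir and that it is readable from inside the Vector task, and the regex assumes Nomad's usual <task>.stdout.N / <task>.stderr.N log file naming.

```hcl
template {
  destination     = "local/vector-files.toml"
  # same delimiter override as above, so Vector's {{ }} templates pass through untouched
  left_delimiter  = "[["
  right_delimiter = "]]"
  data = <<EOH
[sources.alloc_files]
  type = "file"
  # assumed data_dir; each allocation writes its task logs under alloc/logs
  include = [
    "/opt/nomad/data/alloc/*/alloc/logs/*.stdout.*",
    "/opt/nomad/data/alloc/*/alloc/logs/*.stderr.*",
  ]

[transforms.alloc_labels]
  type = "remap"
  inputs = ["alloc_files"]
  source = '''
    # the file source records the originating path in .file
    parsed, err = parse_regex(.file, r'/alloc/(?P<alloc_id>[^/]+)/alloc/logs/(?P<task>.+)\.(?P<stream>stdout|stderr)\.\d+$')
    if err == null {
      .alloc_id = parsed.alloc_id
      .task = parsed.task
      .stream = parsed.stream
    }
  '''

[sinks.loki_files]
  type = "loki"
  inputs = ["alloc_labels"]
  endpoint = "http://<LOKI_SERVER_PRIVATE_IP_ADDRESS>:3100"
  encoding.codec = "json"
  labels.alloc_id = "{{ alloc_id }}"
  labels.task = "{{ task }}"
  labels.stream = "{{ stream }}"
EOH
}
```

Note that the file path only yields the allocation ID, task name and stream. Job and group names are not part of the path, so getting them would need a lookup against the Nomad API, which is the part that calls for extra tooling.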

That was good practice in 2022, but it seems weird to need a whole separate system job just to add these labels.

Is there now anything more native to Nomad that means I don’t need to run another system job to give my allocations some more meaningful labels?

If not, I know the system works, so I can implement it, but I’m probably not the only person who would benefit from knowing whether there is an approach out there that uses fewer moving parts.

Thank you in advance, and for continued development on Nomad!

I can see there is a post here. This seems to be doing something similar to what I already have with Docker. I’m still looking for a more elegant approach to non-Nomad jobs though, so I’d welcome pointers.