Bootstrap an Alertmanager cluster on Nomad

Hi all,

I want to run an Alertmanager cluster on Nomad. For my single-instance setup I have this job definition (a Levant template):

job "alertmanager" {
  type = "service"
  datacenters = ["dc1"]
  constraint {
    attribute = "${node.class}"
    value = "app"
  }
  spread {
    attribute = "${node.unique.id}"
    weight    = 100
  }
  group "alertmanager" {
    network {
      port "http" {
        static = 80
        to = 9093
        host_network = "internal"
      }
    }
    task "alertmanager" {
      driver = "docker"
      config {
        image = "prom/alertmanager:v0.23.0"
        args = [
          "--config.file=/local/config.yml"
        ]
        force_pull = true
        ports = [
          "http",
        ]
      }
      vault {
        policies  = ["alertmanager"]
      }
      resources {
        memory = 1024
        cpu = 1000
      }
      template {
        data        = <<EOF
[[ fileContents "alertmanager/config.yml" ]]
EOF
        destination = "local/config.yml"
        change_mode = "signal"
        change_signal = "SIGHUP"
      }
    }
    count = 1
    service {
      port = "http"
      name = "alertmanager"
      check {
        type     = "http"
        protocol = "http"
        port     = "http"
        path     = "/-/healthy"
        interval = "10s"
        timeout  = "3s"
      }
    }
  }
}

In order to run Alertmanager as a three-instance cluster, my plan is to increase the group count to 3 and add the clustering flags to the Docker container args, as explained in the Alertmanager README: https://github.com/prometheus/alertmanager
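Roughly, the config block would have to end up looking like this (per the README's flags; the peer addresses are placeholders, and obtaining them is exactly my problem; 9094 is Alertmanager's default cluster gossip port, which my job above doesn't expose yet):

config {
  image = "prom/alertmanager:v0.23.0"
  args = [
    "--config.file=/local/config.yml",
    "--cluster.listen-address=0.0.0.0:9094",
    # Placeholder peers -- getting these addresses is the open question:
    "--cluster.peer=<peer-1>:9094",
    "--cluster.peer=<peer-2>:9094",
  ]
}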

The problem is that I can't get the IP/port of the other instances to pass them as arguments. I assume this is also hard to support in general, since allocations are scheduled independently and their IPs/ports change over time.

To work around that, my backup plan was to use the Consul DNS name alertmanager.service.consul for the --cluster.peer parameter. The problem with this approach is that a starting task is unable to resolve alertmanager.service.consul (Consul DNS only returns healthy instances, and at bootstrap none are passing their checks yet), so it fails, and the other instances fail the same way.
Maybe someone has an idea how to solve this chicken-and-egg issue 🙂
Thanks!

A method I often use is to add Consul node metadata, in the Consul agent configuration, on only the required nodes, specific to the service in question (here Alertmanager).
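For example, a sketch of the agent configuration on the nodes that should run Alertmanager (the key name alertmanager is arbitrary; the same thing can also be passed on the command line as -node-meta alertmanager:true):

# e.g. /etc/consul.d/alertmanager-meta.hcl on the designated nodes
node_meta {
  alertmanager = "true"
}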

Then use Nomad's template block to look up the IPs of the nodes carrying that metadata, so every member of the cluster gets the IPs of all the others.
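A minimal sketch of that lookup, assuming the node meta from above and the default cluster port 9094: it uses consul-template's nodes function to walk the node catalog, and renders a small wrapper script that replaces the image's entrypoint so the peer flags can be computed at start time:

task "alertmanager" {
  driver = "docker"
  config {
    image   = "prom/alertmanager:v0.23.0"
    command = "/local/start.sh"
    ports   = ["http"]
  }
  template {
    data        = <<EOF
#!/bin/sh
# Build the argument list, adding one --cluster.peer per node tagged in Consul.
set -- --config.file=/local/config.yml --cluster.listen-address=0.0.0.0:9094
{{- range nodes }}
{{- if eq (index .Meta "alertmanager") "true" }}
set -- "$@" --cluster.peer={{ .Address }}:9094
{{- end }}
{{- end }}
exec /bin/alertmanager "$@"
EOF
    destination = "local/start.sh"
    perms       = "755"
    change_mode = "noop"
  }
}

You'd also need to publish the gossip port 9094 (e.g. an additional static port in the group's network block) so the peers can reach each other.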

Remember to set the template's change_mode to noop (as in the sketch above); otherwise, if a machine is terminated and recreated, there would be a sudden slew of service restarts.

When you do end up recreating the underlying VM, you have to do a staggered restart (nomad alloc restart) manually so the older instances pick up the updated configuration.
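Something like this, one allocation at a time (the allocation IDs are placeholders):

nomad alloc restart <alloc-id-1>
# wait for the instance to rejoin the cluster, then proceed with the next one
nomad alloc restart <alloc-id-2>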