Understanding how Nomad does healthchecks and avoiding false-positive error logs

Hi there.

I’m trying to understand how Nomad’s healthchecks work when you have a running job, and I think I need some pointers from someone more familiar than me with Nomad.

I’m using RabbitMQ in a container, and I’m seeing connections that keep timing out. After asking in the RabbitMQ Discord, a kind soul pointed out that they’re happening every 20 seconds or so, which might be related to the healthchecks defined for a service.

Here’s what my jobfile looks like (lightly abridged):

job "rabbitmq_server" {
  datacenters = ["dc1"]

  type = "service"

  group "rabbit" {

    count = 1

    volume "persistent_data" {
      type      = "host"
      read_only = false
      source    = "persistent_data_rabbitmq"
    }

    network {
      mode = "host"

      port "rabbit" {
        static       = 5672
        host_network = "my-network"
      }

      # used by some CLI tools. See below for more:
      # https://rabbitmq.com/networking.html
      port "rabbit-epmd" {
        static       = 4369
        host_network = "my-network"
      }
      
      # used for management 
      # and for prometheus metrics
      # https://rabbitmq.com/management.html
      port "rabbit-dashboard" {
        static        = 15672
        host_network  = "my-network"
      }

    }

    task "rabbit" {
      driver = "docker"

      template {
        data        = file("./nomad/local/rabbitmq.enabled.plugins.tpl")
        destination = "/etc/rabbitmq/enabled_plugins"
      }

      config {
        image = "rabbitmq:3.12-management"
        ports = ["rabbit", "rabbit-dashboard", "rabbit-epmd"]
        hostname = "greenrabbit"
      }

      service {
        provider = "nomad"
        name = "rabbit-dashboard"
        port = "rabbit-dashboard"
      }

      service {
        provider = "nomad"
        name = "rabbitmq"
        port = "rabbit"

        # check every 10s that we can connect over tcp
        check {
           type     = "tcp"
           port     = "rabbit"
           interval = "10s"
           timeout  = "2s"
         }
      }
    
      resources {
        cpu    = 1000 # 1000 MHz
        memory = 1024 # 1024 MB
      }

      volume_mount {
        volume      = "persistent_data"
        destination = "/var/lib/rabbitmq"
        read_only   = false
      }
    }
  }
}

Here’s what the logs from RabbitMQ look like when they log the connections that keep timing out:

2023-08-28 08:59:04.172080+00:00 [info] <0.24441.0> accepting AMQP connection <0.24441.0> (10.0.0.8:43482 -> 172.17.0.3:5672)
2023-08-28 08:59:04.172440+00:00 [error] <0.24441.0> closing AMQP connection <0.24441.0> (10.0.0.8:43482 -> 172.17.0.3:5672):
2023-08-28 08:59:04.172440+00:00 [error] <0.24441.0> {handshake_timeout,handshake}
2023-08-28 08:59:14.174166+00:00 [info] <0.24446.0> accepting AMQP connection <0.24446.0> (10.0.0.8:48890 -> 172.17.0.3:5672)
2023-08-28 08:59:14.174574+00:00 [error] <0.24446.0> closing AMQP connection <0.24446.0> (10.0.0.8:48890 -> 172.17.0.3:5672):
2023-08-28 08:59:14.174574+00:00 [error] <0.24446.0> {handshake_timeout,handshake}
2023-08-28 08:59:34.175254+00:00 [info] <0.24456.0> accepting AMQP connection <0.24456.0> (10.0.0.8:43122 -> 172.17.0.3:5672)
2023-08-28 08:59:34.175601+00:00 [error] <0.24456.0> closing AMQP connection <0.24456.0> (10.0.0.8:43122 -> 172.17.0.3:5672):
2023-08-28 08:59:34.175601+00:00 [error] <0.24456.0> {handshake_timeout,handshake}
2023-08-28 08:59:54.178068+00:00 [info] <0.24467.0> accepting AMQP connection <0.24467.0> (10.0.0.8:34150 -> 172.17.0.3:5672)
2023-08-28 08:59:54.178354+00:00 [error] <0.24467.0> closing AMQP connection <0.24467.0> (10.0.0.8:34150 -> 172.17.0.3:5672):
2023-08-28 08:59:54.178354+00:00 [error] <0.24467.0> {handshake_timeout,handshake}
2023-08-28 09:00:14.180311+00:00 [info] <0.24481.0> accepting AMQP connection <0.24481.0> (10.0.0.8:35264 -> 172.17.0.3:5672)
2023-08-28 09:00:14.180522+00:00 [error] <0.24481.0> closing AMQP connection <0.24481.0> (10.0.0.8:35264 -> 172.17.0.3:5672):
2023-08-28 09:00:14.180522+00:00 [error] <0.24481.0> {handshake_timeout,handshake}
2023-08-28 09:00:24.181133+00:00 [info] <0.24484.0> accepting AMQP connection <0.24484.0> (10.0.0.8:48350 -> 172.17.0.3:5672)
2023-08-28 09:00:24.181351+00:00 [error] <0.24484.0> closing AMQP connection <0.24484.0> (10.0.0.8:48350 -> 172.17.0.3:5672):
2023-08-28 09:00:24.181351+00:00 [error] <0.24484.0> {handshake_timeout,handshake}
2023-08-28 09:00:44.183188+00:00 [info] <0.24497.0> accepting AMQP connection <0.24497.0> (10.0.0.8:54054 -> 172.17.0.3:5672)
2023-08-28 09:00:44.183590+00:00 [error] <0.24497.0> closing AMQP connection <0.24497.0> (10.0.0.8:54054 -> 172.17.0.3:5672):
2023-08-28 09:00:44.183590+00:00 [error] <0.24497.0> {handshake_timeout,handshake}

When I switched off the healthchecks, the failed connections stopped being logged, so I’m now fairly confident Nomad is causing them.

However, I’m not sure how best to deal with this.

I know I could filter them out in something like Loki, but I’d much rather they didn’t happen in the first place. For that, I think I either need to figure out how to tweak the healthcheck, or tell RabbitMQ not to log these probes as failed AMQP connection attempts.
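On the RabbitMQ side, the only lever I’ve found so far (assuming I’m reading the logging docs correctly) is the per-category log level in rabbitmq.conf. Since the handshake_timeout closes are logged at [error], the connection category would have to be raised all the way to critical to hide them, which would also hide genuinely failed client connections, so this feels like a last resort:

```ini
# rabbitmq.conf — raise the threshold for connection-related log messages.
# The handshake_timeout closes above are logged at [error], so they only
# disappear at `critical` (or `none`), which also hides real connection
# failures. Probably not what I want, but noting it for completeness.
log.connection.level = critical
```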

Is there any guidance on how to keep healthchecks from producing connection timeouts like the ones I’m seeing here, while still actually checking health?
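For reference, here’s the direction I’ve been considering on the Nomad side: switching the TCP check on the AMQP port to an HTTP check against the management plugin’s aliveness endpoint, so the probe speaks HTTP instead of opening a raw socket on 5672 and abandoning the AMQP handshake. The path and the Basic-auth header are my assumptions from the management API docs (the credentials would need to be a real user with access to the default vhost, not hard-coded like this), so treat this as a sketch rather than something I’ve verified:

```hcl
service {
  provider = "nomad"
  name     = "rabbitmq"
  port     = "rabbit"

  check {
    type     = "http"
    # probe the management plugin instead of the AMQP listener,
    # so no half-open AMQP handshake gets logged
    port     = "rabbit-dashboard"
    # %2F is the URL-encoded default vhost "/"
    path     = "/api/aliveness-test/%2F"
    interval = "10s"
    timeout  = "2s"
    # the aliveness endpoint requires authentication;
    # base64 of "guest:guest" here purely as a placeholder
    header {
      Authorization = ["Basic Z3Vlc3Q6Z3Vlc3Q="]
    }
  }
}
```

I haven’t confirmed whether this still exercises enough of the broker to count as a meaningful healthcheck, which is part of why I’m asking.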