Help understanding Nomad (script) health checks

Hello, I am having trouble understanding health checks in Nomad. I created a small example to show what I would like to do:

job "test" {
  datacenters = ["dc1"]
  type        = "service"

  group "test" {

    task "test" {
      driver = "podman"

      config {
        image    = "busybox"
        hostname = "busybox"
        command  = "sh"
        args     = ["-c", "while true; do sleep 3600; done"]
      }

      service {
        check {
          type     = "script"
          command  = "/bin/sh"
          args     = ["-c", "test -f /bin/tr"]
          interval = "10s"
          timeout  = "2s"
        }
      }

      resources {
        cpu    = 1000
        memory = 512
      }
    }
  }
}

The check should simply verify whether a file exists (here, as an example, /bin/tr, which exists in busybox), or run a script that carries out some checks. If the file is no longer available, the job should be stopped and started on a new client. How can I implement that, or where is my mistake?

When I run this job file as is, it never reaches the Healthy state. I hope someone can help or explain what is going on. I would be very grateful :smiley:

I had a similar experience running containers, where the service never came back healthy from a script health check. In my case, it was a mongo command to see if the database was healthy.

I naively added the script to check the status, and I could see from the container logs that the script was being executed, but the service never became healthy in Consul.

I dug a bit deeper in the docs, and on the Checks page I noticed:

In Consul 0.9.0 and later, script checks are not enabled by default.

After I enabled that, the service was healthy.
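For reference, a minimal sketch of what enabling this can look like in a Consul agent configuration file, assuming an HCL-format config. The Consul docs describe two options, `enable_script_checks` and `enable_local_script_checks`; which of the two was used in the setup above is not stated, so treat the choice here as an assumption.

```hcl
# Consul agent configuration (the file name and location are up to your setup).
# Assumption: the more restrictive option is enough for your case. It only
# allows script checks that are defined in local configuration files.
enable_local_script_checks = true

# Alternative (broader): also allows script checks registered via the HTTP API.
# enable_script_checks = true
```

The Consul agent has to be restarted or reloaded for the change to take effect.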

Hi. Thank you for answering my question. I haven’t used Consul at all. All I have is a Nomad agent configuration file to start my server/cluster:

data_dir = "/tmp/nomad"
bind_addr = "..."

server {
  enabled = true
  bootstrap_expect = 1

  server_join {
    retry_join = ["...", "...", "..."]
  }
}

client {
  enabled = true
} 

Do I need Consul? Nowhere in the documentation does it say that I do! The first sentence of the service stanza documentation (service Stanza - Job Specification | Nomad by HashiCorp) talks about registering a service "with the specified provider; Nomad or Consul".

So I assumed that Nomad itself also supports health checks. Is that not the case?

Hi!

Thanks for clarifying that you are not using Consul. I had just assumed you were since, to my knowledge as a user, that is the only mechanism for discovering services.

Getting back to your problem (the script health check never coming back healthy): I don’t see anything in Nomad that would prevent the check from running, so I’m assuming that Nomad is actually executing the probes but is not reporting the results anywhere. However, I must admit that this doesn’t fully convince me, and I stand to be corrected.

You should be able to deploy Nomad without Consul, but the documentation on services states:

The service stanza instructs Nomad to register a service with the specified provider; Nomad or Consul

In your case the service provider would be Nomad. However, the service discovery part says:

Nomad schedules workloads of various types across a cluster of generic hosts. Because of this, placement is not known in advance and you will need to use service discovery to connect tasks to other services deployed across your cluster. Nomad integrates with Consul to provide service discovery and monitoring.

So this takes me back to my understanding that, if you want to implement probes, they need to report their status to Consul for service discovery to work.

I do note, however, that there is a native service discovery feature in the Nomad 1.3 beta.

This might be what you are looking for.

Hope this helps, and I’d love to hear from someone from the product team to help us understand better :slight_smile:


@brucellino1 Thank you for the great explanation :slight_smile:. I’ll add that 1.3 does have native service discovery, but it doesn’t support health checks at this time.

@cmerbach To add a bit to your initial question:
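As a rough sketch of what that looks like in a job file, assuming Nomad 1.3 or later and reusing the service name `test` from the job above: the `provider` parameter selects the native service registry instead of the default, Consul. There is deliberately no `check` block, since checks are not supported with this provider in 1.3.

```hcl
# Goes inside the task (or group) block of the job.
service {
  name     = "test"   # illustrative name, taken from the example job
  provider = "nomad"  # register with Nomad's built-in service discovery
}
```

If `provider` is omitted, the service is registered with Consul, which would explain why the original job never becomes healthy on a cluster that has no Consul agent.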

The check should simply verify whether a file exists (here, as an example, /bin/tr, which exists in busybox), or run a script that carries out some checks. If the file is no longer available, the job should be stopped and started on a new client. How can I implement that, or where is my mistake?

The restart stanza docs and the reschedule stanza docs have good information on how job failure and client rescheduling work, and you can use both stanzas to customize how you want the failover process to behave (once you get service discovery configured :+1:).
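A minimal sketch of how the two stanzas could be combined for the example job, as a fragment of the group block. All values are illustrative assumptions, not recommendations. The `check_restart` block is not mentioned in this thread; it is included on the assumption that the check is registered with a provider that supports health checks (Consul, at the time of writing).

```hcl
group "test" {
  # Local restarts on the same client. With mode = "fail", the allocation is
  # marked as failed once the attempts within the interval are used up.
  restart {
    attempts = 2
    interval = "30m"
    delay    = "15s"
    mode     = "fail"
  }

  # Once the allocation has failed, place a replacement, possibly on
  # another client.
  reschedule {
    delay          = "30s"
    delay_function = "exponential"
    max_delay      = "1h"
    unlimited      = true
  }

  task "test" {
    # driver, config, and resources as in the original job

    service {
      check {
        type     = "script"
        command  = "/bin/sh"
        args     = ["-c", "test -f /bin/tr"]
        interval = "10s"
        timeout  = "2s"

        # Restart the task after 3 consecutive failed checks, ignoring
        # failures during the first 30s after the task starts.
        check_restart {
          limit = 3
          grace = "30s"
        }
      }
    }
  }
}
```

With this layout, a missing /bin/tr would first trigger local restarts through `check_restart` and `restart`, and only after those are exhausted would `reschedule` move the allocation.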
