Unmonitored Tasks

G’day. We’ve been running Nomad in a small 3-node cluster for over a year. Today I upgraded to 5 nodes, but I did screw up and the cluster lost its leader, then somehow managed to elect one of the new nodes… and I had to re-add all the tasks.

But that’s not the issue. We went from version 1.3.1 to version 1.3.3. We also run Server 2019 and Windows Containers.
Since the update, all of my Docker containers wait the 30 minutes, then report as unhealthy and initiate a kill and restart for the 2 times specified, then die after that.

We don’t run Consul, and all of these apps have no outward ports open for checking; they are all processing data from RabbitMQ.
So we really only need to know if a container falls over; the rest doesn’t matter.
I used the same config as always when I started them up, but now they are all just falling over every hour. It’s not because the containers are failing; Nomad is killing them.

At the moment I’ve started a manual Docker container for each of the affected services, but I can’t figure out how to have a task be unmonitored like this.
The docs are all focused on running Consul, which is overkill for what we are doing. If we need Consul, then maybe we need it, but we didn’t until today.
Here is an example config… It’s really basic, as we don’t need a ton of complexity for what we’re doing with it.

job "MasterScheduler" {
  
  #region = "global"  
  datacenters = ["MelbDC"]
  type = "service"
  constraint {
    distinct_hosts = true
   }

  update {
    stagger      = "30s"
    max_parallel = 1
  }

  group "Services" {
    # Specify the number of these tasks we want.
    count = 2
 
    task "MasterScheduler" {
           driver = "docker"

      # Configuration is specific to each driver.
      config {
        image = "<Name>.azurecr.io/services/masterscheduler"

        auth {
          username = "<RandomStuff>"
          password = "<OtherStuff>"
          server_address  = "<Name>.azurecr.io"
        }
       
      }
     

      # Specify the maximum resources required to run the task,
      # include CPU, memory, and bandwidth.
      resources {
        cpu    = 300 # MHz
        memory = 150 # MB

           }
    }
  }
}
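What I’m really after is just “restart the container if it exits”, with no health checks at all, so I assumed the only knob involved would be a restart block inside the group, something like this (a sketch only; the numbers are made up, not something we actually run):

restart {
  attempts = 3       # restarts allowed per interval
  interval = "10m"   # window the attempts are counted within
  delay    = "15s"   # pause between restarts
  mode     = "delay" # when attempts run out, wait for the next interval and
                     # keep restarting, instead of failing the task
}

Is that the right way to think about it when there’s no Consul and no service block at all?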

Any help would be greatly appreciated.

Hi @Biztactix-Ryan, could you clarify the requirement you are looking to solve, as I am not fully able to understand it? If you could also provide some logs showing the current process that is causing you issues, that would be useful.

Thanks,
jrasell and the Nomad team

I’ll run up some tasks again; they’ve all died, so I haven’t restarted them yet.
Basically, tasks that were running just fine in 1.3.1 are now running for 30 minutes, and then Nomad is killing them.
I’m thinking that perhaps some of the unspecified job parameters have had their defaults changed, so that now it’s trying to monitor the job and failing to do so.
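To rule that out, my plan is to spell the restart policy out explicitly in the group rather than leaning on the defaults. As far as I can tell from the docs, the service-job defaults are something like the values below, but treat that as my assumption rather than gospel:

restart {
  attempts = 2      # what I believe the default is for service jobs
  interval = "30m"  # which would line up with the ~30 minutes before the kill
  delay    = "15s"
  mode     = "fail" # give up once the attempts are used, which would explain them dying for good
}

If they still die on the same schedule with that pinned, at least I’ll know it isn’t a restart-policy default that changed.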

I’ve started some jobs up and I’ll copy out the allocation tasks and times to show how it’s going down. But it’s night over here, so I’ll have to wait until the morning to grab them.

I tell a lie. It lasted like 15 minutes, and weirdly, one of them has been running for 25 minutes.

And that’s dead too… 4/4


Found this too, not sure where else to grab from

Because I stayed up to grab those logs, I figured I’d go the next step, run the server from the command line, and watch the logs until the crash…
Looks like an RPC issue with Docker, possibly.

    2022-08-16T21:05:21.862+1000 [INFO]  client.driver_mgr.docker: stopped container: container_id=43e4dbb2e3000ab78eb7cd18f7316e439f04b8de7d9111aa24bbc2767cbd1b4f driver=docker
    2022-08-16T21:05:21.866+1000 [DEBUG] client.driver_mgr.docker.docker_logger.stdio: received EOF, stopping recv loop: driver=docker err="rpc error: code = Unavailable desc = error reading from server: read tcp 127.0.0.1:54859->127.0.0.1:10001: wsarecv: An existing connection was forcibly closed by the remote host."
    2022-08-16T21:05:21.880+1000 [DEBUG] client.driver_mgr.docker.docker_logger: plugin process exited: driver=docker path=c:\ProgramData\chocolatey\lib\nomad\tools\nomad.exe pid=10984
    2022-08-16T21:05:21.887+1000 [DEBUG] client.driver_mgr.docker.docker_logger: plugin exited: driver=docker
    2022-08-16T21:05:22.661+1000 [DEBUG] client.driver_mgr.docker: image id reference count decremented: driver=docker image_id=sha256:b8c8ca4eb020884d89c0074982a4b0f70dcf3f94b82b89e165fd9fdc8465cae7 references=0
    2022-08-16T21:05:24.251+1000 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon.stdio: received EOF, stopping recv loop: alloc_id=8bcce766-0ac3-8867-00dd-6e19c40e545e task=MasterScheduler err="rpc error: code = Unavailable desc = error reading from server: EOF"
    2022-08-16T21:05:24.276+1000 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=8bcce766-0ac3-8867-00dd-6e19c40e545e task=MasterScheduler path=c:\ProgramData\chocolatey\lib\nomad\tools\nomad.exe pid=784
    2022-08-16T21:05:24.280+1000 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin exited: alloc_id=8bcce766-0ac3-8867-00dd-6e19c40e545e task=MasterScheduler
    2022-08-16T21:05:24.281+1000 [DEBUG] client.alloc_runner.task_runner: task run loop exiting: alloc_id=8bcce766-0ac3-8867-00dd-6e19c40e545e task=MasterScheduler
    2022-08-16T21:05:24.573+1000 [DEBUG] client.gc: alloc garbage collected: alloc_id=8bcce766-0ac3-8867-00dd-6e19c40e545e

Any thoughts about what to do with this?

I couldn’t wait any longer, so I cleared all the server config and rebuilt the cluster from scratch… It’s been working fine for 12 hours so far.