Unmonitored Tasks

G’day. We’ve been running Nomad in a small 3-node cluster for over a year. Today I upgraded to 5 nodes, but I did screw up and the cluster lost its leader, then somehow managed to elect one of the new nodes… and I had to re-add all the tasks.

But that’s not the issue. We went from version 1.3.1 to version 1.3.3. We also run Server 2019 and Windows Containers.
Since the update, all of my Docker containers wait the 30 minutes, then report as unhealthy and initiate a kill and restart for the 2 times specified, then die after that.

We don’t run Consul, and all of these apps have no outward ports open for checking; they are all processing data from RabbitMQ.
So we really only need to know if a container falls over; the rest doesn’t matter.
I used the same config as always when I started them up, but now they are all just falling over every hour. It’s not because the containers are failing; Nomad is killing them.

At the moment I’ve started a manual Docker container for each of the affected services, but I can’t figure out how to have a task be unmonitored like this.
The docs are all focused on running Consul, which is overkill for what we are doing. If we need Consul, then maybe we need it, but we didn’t until today.
Here is an example config… It’s really basic, as we don’t need a ton of complexity for what we’re doing with it.

job "MasterScheduler" {
  
  #region = "global"  
  datacenters = ["MelbDC"]
  type = "service"
  constraint {
    distinct_hosts = true
   }

  update {
    stagger      = "30s"
    max_parallel = 1
  }

  group "Services" {
    # Specify the number of these tasks we want.
    count = 2
 
    task "MasterScheduler" {
           driver = "docker"

      # Configuration is specific to each driver.
      config {
        image = "<Name>.azurecr.io/services/masterscheduler"

        auth {
          username = "<RandomStuff>"
          password = "<OtherStuff>"
          server_address  = "<Name>.azurecr.io"
        }
       
      }
     

      # Specify the maximum resources required to run the task,
      # include CPU, memory, and bandwidth.
      resources {
        cpu    = 300 # MHz
        memory = 150 # MB

           }
    }
  }
}
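What I’m really after is just “restart the container if it exits”, with no health checks at all, so I assumed the only knob involved would be a restart block inside the group, something like this (a sketch only; the numbers are made up, not something we actually run):

restart {
  attempts = 3       # restarts allowed per interval
  interval = "10m"   # window the attempts are counted within
  delay    = "15s"   # pause between restarts
  mode     = "delay" # when attempts run out, wait for the next interval and
                     # keep restarting, instead of failing the task
}

Is that the right way to think about it when there’s no Consul and no service block at all?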

Any help would be greatly appreciated.

Hi @Biztactix-Ryan, could you clarify the requirement you are looking to solve, as I am not fully able to understand it? If you could also provide some logs showing the current process that is causing you issues, that would be useful.

Thanks,
jrasell and the Nomad team

I’ll run up some tasks again; they’ve all died, so I haven’t restarted them yet.
Basically, tasks that were running just fine in 1.3.1 are now running for 30 minutes, and then Nomad is killing them.
I’m thinking that perhaps some of the unspecified job parameters have had their defaults changed, so that now it’s trying to monitor the job and failing to do so.
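To rule that out, my plan is to spell the restart policy out explicitly in the group rather than leaning on the defaults. As far as I can tell from the docs, the service-job defaults are something like the values below, but treat that as my assumption rather than gospel:

restart {
  attempts = 2      # what I believe the default is for service jobs
  interval = "30m"  # which would line up with the ~30 minutes before the kill
  delay    = "15s"
  mode     = "fail" # give up once the attempts are used, which would explain them dying for good
}

If they still die on the same schedule with that pinned, at least I’ll know it isn’t a restart-policy default that changed.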

I’ve started some jobs up and I’ll copy out the allocation tasks and times to show how it’s going down. But it’s night over here, so I’ll have to wait until the morning to grab them.

I tell a lie. It lasted like 15 minutes, and weirdly, one of them has been running for 25 minutes.

And that’s dead too… 4/4


Found this too, not sure where else to grab from

Because I stayed up to grab those logs, I figured I’d go the next step, run the server from the command line, and watch the logs until the crash…
Looks like an RPC issue with Docker, possibly.

    2022-08-16T21:05:21.862+1000 [INFO]  client.driver_mgr.docker: stopped container: container_id=43e4dbb2e3000ab78eb7cd18f7316e439f04b8de7d9111aa24bbc2767cbd1b4f driver=docker
    2022-08-16T21:05:21.866+1000 [DEBUG] client.driver_mgr.docker.docker_logger.stdio: received EOF, stopping recv loop: driver=docker err="rpc error: code = Unavailable desc = error reading from server: read tcp 127.0.0.1:54859->127.0.0.1:10001: wsarecv: An existing connection was forcibly closed by the remote host."
    2022-08-16T21:05:21.880+1000 [DEBUG] client.driver_mgr.docker.docker_logger: plugin process exited: driver=docker path=c:\ProgramData\chocolatey\lib\nomad\tools\nomad.exe pid=10984
    2022-08-16T21:05:21.887+1000 [DEBUG] client.driver_mgr.docker.docker_logger: plugin exited: driver=docker
    2022-08-16T21:05:22.661+1000 [DEBUG] client.driver_mgr.docker: image id reference count decremented: driver=docker image_id=sha256:b8c8ca4eb020884d89c0074982a4b0f70dcf3f94b82b89e165fd9fdc8465cae7 references=0
    2022-08-16T21:05:24.251+1000 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon.stdio: received EOF, stopping recv loop: alloc_id=8bcce766-0ac3-8867-00dd-6e19c40e545e task=MasterScheduler err="rpc error: code = Unavailable desc = error reading from server: EOF"
    2022-08-16T21:05:24.276+1000 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=8bcce766-0ac3-8867-00dd-6e19c40e545e task=MasterScheduler path=c:\ProgramData\chocolatey\lib\nomad\tools\nomad.exe pid=784
    2022-08-16T21:05:24.280+1000 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin exited: alloc_id=8bcce766-0ac3-8867-00dd-6e19c40e545e task=MasterScheduler
    2022-08-16T21:05:24.281+1000 [DEBUG] client.alloc_runner.task_runner: task run loop exiting: alloc_id=8bcce766-0ac3-8867-00dd-6e19c40e545e task=MasterScheduler
    2022-08-16T21:05:24.573+1000 [DEBUG] client.gc: alloc garbage collected: alloc_id=8bcce766-0ac3-8867-00dd-6e19c40e545e

Any thoughts about what to do with this?

I couldn’t wait any longer, so I cleared all the server config and rebuilt the cluster from scratch… It’s been working fine for 12 hours so far.