Reason for a task being killed

We’re running Nomad in Windows. We’re running a number of jobs, each with 1 task which is started using command line:

"Tasks": [
        {
          "Name": "test",
          "Driver": "raw_exec",
          "Config": {
            "args": [
             ...
            ],
            "command": "run.exe"
          }
 ...
]

Also, an HTTP port is reserved:

      "Networks": [
        {
          ...
          "ReservedPorts": [
            {
              "Label": "HTTP",
              "Value": 15021,
              "To": 0,
              "HostNetwork": "default"
            }
          ],
          "DynamicPorts": null
        }
      ],

We often see that the task is killed by Nomad. The event says: “Sent interrupt. Waiting 21s before force killing”. We have no health checks defined.

The fact is that this seems to happens at the moment when we’re sending a lot of HTTP requests on this port. Still, I don’t think the task would actually suffer from this. But for some reason, Nomad decides to kill it.

My question is: how does Nomad decide whether to kill this job.

Hi @roelvz,

Do you have the task logs or client logs from Nomad? This would help identify what exactly is happening here.

Nomad would chose to kill a task potentially if a job update has been triggered which was a destructive update and required placing new allocations.

Thanks,
jrasell and the Nomad team

Hi @jrasell,

I have these logs from the Nomad agent using nomad monitor TRACE (I attached a file with all logs):

2023-09-13T21:11:27.718+0200 [TRACE] client.alloc_runner.task_runner: Kill requested: alloc_id=f11117e4-b06d-7421-ff07-aa32b1fa5180 task=APPR-1-gRxtQLLSOFjXDlEQXRaCFY9
2023-09-13T21:11:27.718+0200 [TRACE] client.alloc_runner.task_runner: Kill event: alloc_id=f11117e4-b06d-7421-ff07-aa32b1fa5180 task=APPR-1-gRxtQLLSOFjXDlEQXRaCFY9 event_type=Killing event_reason=""

A few seconds later I see this:

2023-09-13T21:11:29.335+0200 [DEBUG] client.driver_mgr.raw_exec.executor.stdio: received EOF, stopping recv loop: alloc_id=f11117e4-b06d-7421-ff07-aa32b1fa5180 driver=raw_exec task_name=APPR-1-gRxtQLLSOFjXDlEQXRaCFY9 err="rpc error: code = Unavailable desc = error reading from server: EOF"
2023-09-13T21:11:29.335+0200 [TRACE] client.driver_mgr.raw_exec: executor Stats stream closed: alloc_id=f11117e4-b06d-7421-ff07-aa32b1fa5180 driver=raw_exec task_name=APPR-1-gRxtQLLSOFjXDlEQXRaCFY9 msg="rpc error: code = Unavailable desc = error reading from server: EOF"
2023-09-13T21:11:29.385+0200 [DEBUG] client.driver_mgr.raw_exec.executor: plugin process exited: alloc_id=f11117e4-b06d-7421-ff07-aa32b1fa5180 driver=raw_exec task_name=APPR-1-gRxtQLLSOFjXDlEQXRaCFY9 path=C:\liveos\nomad\bin\nomad.exe pid=20684
2023-09-13T21:11:29.385+0200 [DEBUG] client.driver_mgr.raw_exec.executor: plugin exited: alloc_id=f11117e4-b06d-7421-ff07-aa32b1fa5180 driver=raw_exec task_name=APPR-1-gRxtQLLSOFjXDlEQXRaCFY9
2023-09-13T21:11:29.390+0200 [TRACE] client.alloc_runner: handling task state update: alloc_id=f11117e4-b06d-7421-ff07-aa32b1fa5180 done=false
2023-09-13T21:11:29.394+0200 [TRACE] client.alloc_runner.task_runner: setting task state: alloc_id=f11117e4-b06d-7421-ff07-aa32b1fa5180 task=APPR-1-gRxtQLLSOFjXDlEQXRaCFY9 state=dead
2023-09-13T21:11:29.398+0200 [TRACE] client.alloc_runner.task_runner: Kill requested: alloc_id=f11117e4-b06d-7421-ff07-aa32b1fa5180 task=APPR-1-gRxtQLLSOFjXDlEQXRaCFY9
2023-09-13T21:11:29.403+0200 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon.stdio: received EOF, stopping recv loop: alloc_id=f11117e4-b06d-7421-ff07-aa32b1fa5180 task=APPR-1-gRxtQLLSOFjXDlEQXRaCFY9 err="rpc error: code = Unavailable desc = error reading from server: EOF"
2023-09-13T21:11:29.437+0200 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=f11117e4-b06d-7421-ff07-aa32b1fa5180 task=APPR-1-gRxtQLLSOFjXDlEQXRaCFY9 path=C:\liveos\nomad\bin\nomad.exe pid=2956
2023-09-13T21:11:29.437+0200 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin exited: alloc_id=f11117e4-b06d-7421-ff07-aa32b1fa5180 task=APPR-1-gRxtQLLSOFjXDlEQXRaCFY9
2023-09-13T21:11:29.437+0200 [DEBUG] client.alloc_runner.task_runner: task run loop exiting: alloc_id=f11117e4-b06d-7421-ff07-aa32b1fa5180 task=APPR-1-gRxtQLLSOFjXDlEQXRaCFY9
2023-09-13T21:11:29.437+0200 [TRACE] client.alloc_runner.task_coordinator: state transition: alloc_id=f11117e4-b06d-7421-ff07-aa32b1fa5180 from=wait_alloc to=poststart
2023-09-13T21:11:29.438+0200 [INFO]  client.gc: marking allocation for GC: alloc_id=f11117e4-b06d-7421-ff07-aa32b1fa5180
2023-09-13T21:11:29.438+0200 [TRACE] client.alloc_runner: sending updated alloc: alloc_id=f11117e4-b06d-7421-ff07-aa32b1fa5180 client_status=complete desired_status=""
2023-09-13T21:11:29.438+0200 [TRACE] client.alloc_runner: handling task state update: alloc_id=f11117e4-b06d-7421-ff07-aa32b1fa5180 done=false
2023-09-13T21:11:29.438+0200 [TRACE] client.alloc_runner: sending updated alloc: alloc_id=f11117e4-b06d-7421-ff07-aa32b1fa5180 client_status=complete desired_status=""
2023-09-13T21:11:29.438+0200 [TRACE] client.alloc_runner: handling task state update: alloc_id=f11117e4-b06d-7421-ff07-aa32b1fa5180 done=true
2023-09-13T21:11:29.438+0200 [DEBUG] client.alloc_migrator: waiting for previous alloc to terminate: alloc_id=091d2192-748f-297f-9ba0-b78536434174 previous_alloc=f11117e4-b06d-7421-ff07-aa32b1fa5180
2023-09-13T21:11:29.438+0200 [TRACE] client.alloc_runner: sending updated alloc: alloc_id=f11117e4-b06d-7421-ff07-aa32b1fa5180 client_status=complete desired_status=""

So it’s being killed, but I see no reason why. I’m pretty sure no resource limits are violated, and there’s no health checks defined. The task is running fine as far as I can see. So I’m wondering what the reasons could be for a job/task to be killed (this is a job with a single task).

Thank you,
Roel

nomad_agent_log.txt (35.9 KB)

Hello @jrasell. I was looking in the source code to see if I can find the reason why a job would be killed, but don’t know where to start. Maybe you could give me some direction of where I could look for some place in the code where a windows process is killed?