roelvz
September 13, 2023, 6:42pm
1
We’re running Nomad in Windows. We’re running a number of jobs, each with 1 task which is started using command line:
"Tasks": [
{
"Name": "test",
"Driver": "raw_exec",
"Config": {
"args": [
...
],
"command": "run.exe"
}
...
]
Also, an HTTP port is reserved:
"Networks": [
{
...
"ReservedPorts": [
{
"Label": "HTTP",
"Value": 15021,
"To": 0,
"HostNetwork": "default"
}
],
"DynamicPorts": null
}
],
We often see that the task is killed by Nomad. The event says: “Sent interrupt. Waiting 21s before force killing”. We have no health checks defined.
The fact is that this seems to happens at the moment when we’re sending a lot of HTTP requests on this port. Still, I don’t think the task would actually suffer from this. But for some reason, Nomad decides to kill it.
My question is: how does Nomad decide whether to kill this job.
jrasell
September 14, 2023, 9:40am
2
Hi @roelvz ,
Do you have the task logs or client logs from Nomad? This would help identify what exactly is happening here.
Nomad would chose to kill a task potentially if a job update has been triggered which was a destructive update and required placing new allocations.
Thanks,
jrasell and the Nomad team
roelvz
September 14, 2023, 9:56am
3
Hi @jrasell ,
I have these logs from the Nomad agent using nomad monitor TRACE
(I attached a file with all logs):
2023-09-13T21:11:27.718+0200 [TRACE] client.alloc_runner.task_runner: Kill requested: alloc_id=f11117e4-b06d-7421-ff07-aa32b1fa5180 task=APPR-1-gRxtQLLSOFjXDlEQXRaCFY9
2023-09-13T21:11:27.718+0200 [TRACE] client.alloc_runner.task_runner: Kill event: alloc_id=f11117e4-b06d-7421-ff07-aa32b1fa5180 task=APPR-1-gRxtQLLSOFjXDlEQXRaCFY9 event_type=Killing event_reason=""
A few seconds later I see this:
2023-09-13T21:11:29.335+0200 [DEBUG] client.driver_mgr.raw_exec.executor.stdio: received EOF, stopping recv loop: alloc_id=f11117e4-b06d-7421-ff07-aa32b1fa5180 driver=raw_exec task_name=APPR-1-gRxtQLLSOFjXDlEQXRaCFY9 err="rpc error: code = Unavailable desc = error reading from server: EOF"
2023-09-13T21:11:29.335+0200 [TRACE] client.driver_mgr.raw_exec: executor Stats stream closed: alloc_id=f11117e4-b06d-7421-ff07-aa32b1fa5180 driver=raw_exec task_name=APPR-1-gRxtQLLSOFjXDlEQXRaCFY9 msg="rpc error: code = Unavailable desc = error reading from server: EOF"
2023-09-13T21:11:29.385+0200 [DEBUG] client.driver_mgr.raw_exec.executor: plugin process exited: alloc_id=f11117e4-b06d-7421-ff07-aa32b1fa5180 driver=raw_exec task_name=APPR-1-gRxtQLLSOFjXDlEQXRaCFY9 path=C:\liveos\nomad\bin\nomad.exe pid=20684
2023-09-13T21:11:29.385+0200 [DEBUG] client.driver_mgr.raw_exec.executor: plugin exited: alloc_id=f11117e4-b06d-7421-ff07-aa32b1fa5180 driver=raw_exec task_name=APPR-1-gRxtQLLSOFjXDlEQXRaCFY9
2023-09-13T21:11:29.390+0200 [TRACE] client.alloc_runner: handling task state update: alloc_id=f11117e4-b06d-7421-ff07-aa32b1fa5180 done=false
2023-09-13T21:11:29.394+0200 [TRACE] client.alloc_runner.task_runner: setting task state: alloc_id=f11117e4-b06d-7421-ff07-aa32b1fa5180 task=APPR-1-gRxtQLLSOFjXDlEQXRaCFY9 state=dead
2023-09-13T21:11:29.398+0200 [TRACE] client.alloc_runner.task_runner: Kill requested: alloc_id=f11117e4-b06d-7421-ff07-aa32b1fa5180 task=APPR-1-gRxtQLLSOFjXDlEQXRaCFY9
2023-09-13T21:11:29.403+0200 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon.stdio: received EOF, stopping recv loop: alloc_id=f11117e4-b06d-7421-ff07-aa32b1fa5180 task=APPR-1-gRxtQLLSOFjXDlEQXRaCFY9 err="rpc error: code = Unavailable desc = error reading from server: EOF"
2023-09-13T21:11:29.437+0200 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=f11117e4-b06d-7421-ff07-aa32b1fa5180 task=APPR-1-gRxtQLLSOFjXDlEQXRaCFY9 path=C:\liveos\nomad\bin\nomad.exe pid=2956
2023-09-13T21:11:29.437+0200 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin exited: alloc_id=f11117e4-b06d-7421-ff07-aa32b1fa5180 task=APPR-1-gRxtQLLSOFjXDlEQXRaCFY9
2023-09-13T21:11:29.437+0200 [DEBUG] client.alloc_runner.task_runner: task run loop exiting: alloc_id=f11117e4-b06d-7421-ff07-aa32b1fa5180 task=APPR-1-gRxtQLLSOFjXDlEQXRaCFY9
2023-09-13T21:11:29.437+0200 [TRACE] client.alloc_runner.task_coordinator: state transition: alloc_id=f11117e4-b06d-7421-ff07-aa32b1fa5180 from=wait_alloc to=poststart
2023-09-13T21:11:29.438+0200 [INFO] client.gc: marking allocation for GC: alloc_id=f11117e4-b06d-7421-ff07-aa32b1fa5180
2023-09-13T21:11:29.438+0200 [TRACE] client.alloc_runner: sending updated alloc: alloc_id=f11117e4-b06d-7421-ff07-aa32b1fa5180 client_status=complete desired_status=""
2023-09-13T21:11:29.438+0200 [TRACE] client.alloc_runner: handling task state update: alloc_id=f11117e4-b06d-7421-ff07-aa32b1fa5180 done=false
2023-09-13T21:11:29.438+0200 [TRACE] client.alloc_runner: sending updated alloc: alloc_id=f11117e4-b06d-7421-ff07-aa32b1fa5180 client_status=complete desired_status=""
2023-09-13T21:11:29.438+0200 [TRACE] client.alloc_runner: handling task state update: alloc_id=f11117e4-b06d-7421-ff07-aa32b1fa5180 done=true
2023-09-13T21:11:29.438+0200 [DEBUG] client.alloc_migrator: waiting for previous alloc to terminate: alloc_id=091d2192-748f-297f-9ba0-b78536434174 previous_alloc=f11117e4-b06d-7421-ff07-aa32b1fa5180
2023-09-13T21:11:29.438+0200 [TRACE] client.alloc_runner: sending updated alloc: alloc_id=f11117e4-b06d-7421-ff07-aa32b1fa5180 client_status=complete desired_status=""
So it’s being killed, but I see no reason why. I’m pretty sure no resource limits are violated, and there’s no health checks defined. The task is running fine as far as I can see. So I’m wondering what the reasons could be for a job/task to be killed (this is a job with a single task).
Thank you,
Roel
nomad_agent_log.txt (35.9 KB)
roelvz
September 22, 2023, 8:21am
4
Hello @jrasell . I was looking in the source code to see if I can find the reason why a job would be killed, but don’t know where to start. Maybe you could give me some direction of where I could look for some place in the code where a windows process is killed?