Disconnect nomad task from network at sigkill, not sigterm

mashton · June 26, 2024, 3:03pm

We have a service (an application) running in a Docker container in a Nomad cluster that needs to do some cleanup after receiving SIGTERM. This includes some outbound requests, both HTTP and SFTP.

Initially, Nomad was disconnecting the service from the network as soon as it was instructed to “Kill” (misleading in this case, since it sends SIGTERM). We’ve set kill_timeout to 30s, and eventually we found shutdown_delay, which we also set to 30s. The following is the behavior we’re hoping to achieve:

1. New jobspec is received
2. Nomad sends SIGTERM
3. Nomad does NOT disconnect the network
4. App exits with code zero, or 30s passes and Nomad sends SIGKILL
5. Nomad disconnects network

But with kill_timeout: 30s and shutdown_delay: 30s, here’s what’s happening (actual logs with timestamps here):

16:19:53 -0400 Waiting for shutdown delay Waiting for shutdown_delay of 30s before killing the task.

16:19:53 -0400 Killing Sent interrupt. Waiting 30s before force killing

16:20:23 -0400 [ App receives SIGTERM ] ← App log

16:20:27 -0400 Terminated Exit Code: 0

16:20:27 -0400 Killed Task successfully killed

As you can see, it says it’s waiting 30s shutdown_delay and says it’s sending SIGTERM at the same time (that would be great!), but the app doesn’t receive SIGTERM until after 30s.

I cannot tell if this is expected behavior. If it is, is there a configuration I’m missing that can get me what I need? If it isn’t (surely I haven’t found a bug…), could there be a missing configuration at the docker level?

Kamilcuk · June 26, 2024, 8:33pm

Hi. I started Nomad 1.8.1 in -dev mode. The following job specification:

job "test-sigterm" {
  type = "batch"
  meta {
    uuid = uuidv4()
  }
  group "example" {
    task "example" {
      driver = "docker"
      config {
        image = "bash"
        args = [
          "sh",
          "-xc",
          <<EOF
          trap 'echo SIGTERM; wait' SIGTERM
          trap 'echo EXIT' EXIT
          sleep infinity &
          wait
          EOF
        ]
      }
    }
  }
}

Shows that SIGTERM is printed by the shell script right after Nomad prints “Killing sent interrupt”. So I can confirm that SIGTERM is properly sent.

Consider concentrating instead on debugging and profiling your application. What is most probably happening, and it happens often in our environment, is that your application runs inside a shell script, which runs inside another shell script, and so on, and one of them doesn’t forward signals properly. You might be interested in docker run --init and the equivalent config { init = true } option in the Nomad job spec. It is also worth researching how signals work in Docker, in particular what tini is and how to send a signal to all processes in a process group (after creating one), which is what the TINI_KILL_PROCESS_GROUP environment variable is for.
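For illustration, a minimal sketch of those two settings at the task level. The image name is hypothetical, and whether TINI_KILL_PROCESS_GROUP is honored depends on the tini version Docker ships as its init binary, so treat that part as an assumption to verify:

```hcl
task "example" {
  driver = "docker"

  config {
    # Hypothetical image name, not taken from this thread.
    image = "example/app:latest"

    # Ask Docker to run its init process (tini) as PID 1, so that
    # signals are forwarded to the container's main process.
    init = true
  }

  env {
    # Assumption: the bundled tini reads this variable and signals the
    # child's whole process group instead of only the direct child.
    TINI_KILL_PROCESS_GROUP = "1"
  }
}
```

This only helps if the wrapper scripts are the problem. If the application itself is PID 1 and handles SIGTERM, it should make no difference.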

Alternatively, your application may be too slow to respond. “Killing sent interrupt” just means the signal was sent to the container through Docker; it is what it is. It is also possible that the output of your application or of the Docker log commands is fully buffered, so the log line shows up late. Is the timestamp printed by your application itself? There are many unknowns.

Additionally, it is unclear to me what “Nomad disconnects network” means. What network does Nomad “connect”, and how, such that it can later “disconnect” it?

mashton · July 1, 2024, 4:39pm

Thanks. When I say “Nomad disconnects from the network”, I’m talking about Nomad removing the Task’s service registrations, as per the docs on shutdown_delay.

And thanks for putting together that test. Good to confirm that under normal circumstances Nomad sends SIGTERM as expected. Your test doesn’t seem to take shutdown_delay into account. I may try to reproduce your test with that, but I’m not sure I would expect perfect production parity here locally.
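A sketch of what that variation might look like: the same test job with the two settings from my original question added at the task level. I have not run this. The job name is made up, and the date call in the trap is my addition so the signal arrival time can be compared against Nomad’s task events:

```hcl
job "test-sigterm-delay" {
  type = "batch"
  meta {
    uuid = uuidv4()
  }
  group "example" {
    task "example" {
      driver = "docker"

      # Values from the original question.
      shutdown_delay = "30s"
      kill_timeout   = "30s"

      config {
        image = "bash"
        args = [
          "sh",
          "-xc",
          <<EOF
          trap 'date; echo SIGTERM; wait' SIGTERM
          trap 'echo EXIT' EXIT
          sleep infinity &
          wait
          EOF
        ]
      }
    }
  }
}
```

If the timestamp printed by the trap lands about 30s after the “Waiting for shutdown delay” event, that would suggest the delay happens before the signal is sent, which is how I read the shutdown_delay docs.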
