How to track down what killed a task?

A strange thing happened this morning: all my long-running tasks (e.g. redis, traefik) were killed and restarted at the exact same time. All I’m seeing in the events, however, is “Sent interrupt. Waiting 5s before force killing”. I’ve looked in /var/log/syslog and I’m not seeing anything relevant. I’m on Ubuntu 22.04 and also using Consul.

I’m having a rough time figuring out why these tasks were killed. How can I track down the cause? Thanks
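For reference, this is roughly how I’ve been looking at the events so far (allocation ID redacted), in case I’m missing a better place to look:

# task/allocation events for one of the killed tasks
nomad alloc status -verbose <alloc-id>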

Do you have automated system updates enabled? This looks like what would happen if Docker was auto-updated and the daemon was restarted. Anything in /var/log/unattended-upgrades?
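If it helps, these are the spots I’d check first (assuming the standard Ubuntu log locations):

# did unattended-upgrades touch docker? (current and rotated logs)
grep -i docker /var/log/unattended-upgrades/unattended-upgrades.log
zgrep -i docker /var/log/unattended-upgrades/*.gz 2>/dev/null
# dpkg logs every package install/upgrade with a timestamp
grep -i docker /var/log/dpkg.log | tail -n 20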

Sorry, I assumed you were using Docker, but you don’t explicitly say so in your message. It could be the libc update that rolled out, which would have required restarting your tasks.

Thanks. Nothing in the /var/log/unattended-upgrades logs. Yep, I am using Docker. Is the libc update something that happened recently that wouldn’t appear in the unattended-upgrades logs? How can I check whether that was the culprit, and how can I disable it from happening in the future?
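Here’s what I’ve tried so far on that front, in case it’s useful (not sure I’m looking in the right places, and the blacklist bit is just my understanding):

# recent libc/docker package activity, with timestamps
grep -iE "libc6|docker" /var/log/dpkg.log | tail -n 20
grep -iE "libc6|docker" /var/log/apt/history.log
# to stop auto-upgrades for specific packages, I believe they can be added to
# Unattended-Upgrade::Package-Blacklist in /etc/apt/apt.conf.d/50unattended-upgrades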

I also ran ps -p <PID> -o etime on /usr/bin/dockerd and it returned:

65-14:41:02

So it doesn’t seem to have restarted (i.e. > 65 days uptime). Does that mean it’s not the culprit?
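For completeness, these are the kinds of checks I ran (dockerd PID looked up via pidof; that the daemon runs as a systemd unit named docker is my assumption):

# dockerd process uptime
ps -p "$(pidof dockerd)" -o etime=
# when systemd last (re)started the docker unit
systemctl show docker --property=ActiveEnterTimestamp
# when each running container was last started
docker ps -q | xargs docker inspect -f '{{.Name}} {{.State.StartedAt}}'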

My mistake: it looks like it was a kernel update that rolled out, and your uptime indicates that nothing was rebooted. I saw notices on a few of my machines about a libc update, but that appears to be just for the dev packages.
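A quick way to double-check the reboot/kernel angle, if you haven’t already (nothing fancy, just standard tooling):

# host uptime and reboot history
uptime
last reboot | head
# running kernel vs. installed kernels
uname -r
ls /boot/vmlinuz-*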

Anything in kern.log that might indicate the processes were killed for a reason? I think syslog would normally catch all kernel messages, but it never hurts to check the other logs.
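If the kernel did kill something (e.g. the OOM killer), it normally leaves a trace; something along these lines should surface it:

# kernel-initiated kills usually show up here (may need sudo)
dmesg -T | grep -iE "out of memory|oom|killed process"
journalctl -k | grep -iE "out of memory|oom|killed process"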

Makes sense to consider it, thanks!

In /var/log/kern.log I’m seeing only these entries, which are from the same time my tasks were killed:

Aug 16 08:09:33 ingress-7t6k kernel: [5660295.385198] docker0: port 1(veth7fc7fb8) entered disabled state
Aug 16 08:09:33 ingress-7t6k kernel: [5660295.386888] vethd091305: renamed from eth0
Aug 16 08:09:33 ingress-7t6k kernel: [5660295.441040] docker0: port 2(vethcce8523) entered blocking state
Aug 16 08:09:33 ingress-7t6k kernel: [5660295.441045] docker0: port 2(vethcce8523) entered disabled state
Aug 16 08:09:33 ingress-7t6k kernel: [5660295.441194] device vethcce8523 entered promiscuous mode
Aug 16 08:09:33 ingress-7t6k kernel: [5660295.441397] docker0: port 2(vethcce8523) entered blocking state
Aug 16 08:09:33 ingress-7t6k kernel: [5660295.441400] docker0: port 2(vethcce8523) entered forwarding state
Aug 16 08:09:33 ingress-7t6k kernel: [5660295.446265] docker0: port 1(veth7fc7fb8) entered disabled state
Aug 16 08:09:33 ingress-7t6k kernel: [5660295.447468] device veth7fc7fb8 left promiscuous mode
Aug 16 08:09:33 ingress-7t6k kernel: [5660295.447474] docker0: port 1(veth7fc7fb8) entered disabled state
Aug 16 08:09:34 ingress-7t6k kernel: [5660296.081478] eth0: renamed from vethee86baa
Aug 16 08:09:34 ingress-7t6k kernel: [5660296.097421] IPv6: ADDRCONF(NETDEV_CHANGE): vethcce8523: link becomes ready

Does that reveal anything?

Is this a VM of some kind? It looks like your primary network interface got reconfigured, either by the host or perhaps by being toggled on the vSwitch or physical switch. I’m not sure why Docker would restart the containers unless they were in host mode, but I admit I don’t know a lot about Docker networking.
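To see what actually touched the interface, I’d look at whatever is managing the network on that box plus Docker’s own daemon log around that time. Roughly (the unit names here are my guess, adjust for your setup):

# network management logs around the incident
journalctl -u systemd-networkd -u NetworkManager --since "today"
# docker daemon's view of the same window (assumes the systemd unit is named docker)
journalctl -u docker --since "today" | grep -iE "network|restart|shutdown"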

Yep, this is a Google Cloud instance. One of the tasks (traefik) is in host mode, but the other tasks that were also killed aren’t; they’re using bridge networking.

Gotcha, so does this appear to be a Docker issue then?

I asked around in the Docker community about those logs, and they mentioned that those entries are actually a result of the containers being killed, so I’m not sure it’s due to Docker.
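For what it’s worth, this is how I tried to confirm the ordering from Docker’s side (the relative time window is just an example):

# replay docker's container events, including die/kill events
docker events --since 24h --until 1h --filter 'type=container' --filter 'event=die'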

Is there no way to track down the reason for tasks being killed in Nomad?
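The only other things I’ve found to look at on the Nomad side are the client agent logs, and maybe bumping the log level for next time (assuming Nomad runs under systemd as the nomad unit, and that log_level is the right knob):

# client agent logs around the time of the kills
journalctl -u nomad --since "today" | grep -iE "kill|interrupt"
# for next time: raise verbosity in the agent config, e.g.
#   log_level = "DEBUG"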

Hard to say. If you can find other instances where your network interfaces got juggled and Docker didn’t restart your containers, then yeah, I’d say it’s a Docker issue. If you can’t find other instances, then it’s inconclusive. I could see Docker being designed to restart containers if the underlying network interfaces change, so it might have been working correctly.
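One way to look for earlier occurrences is to grep the rotated kernel logs for the same bridge/veth messages and see whether the containers restarted each time:

# same bridge/veth state changes in current and rotated logs
zgrep -E "entered disabled state|link becomes ready" /var/log/kern.log*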

Good luck!