How to track down what killed a task?

A strange thing happened this morning: all my long-running tasks (e.g. redis, traefik) were killed and restarted at the exact same time. All I’m seeing in the events, however, is “Sent interrupt. Waiting 5s before force killing”. I’ve looked in /var/log/syslog and I’m not seeing anything relevant. I’m on Ubuntu 22.04 and also using Consul.

I’m having a rough time figuring out why these tasks were killed. How can I track down the cause? Thanks
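For reference, this is roughly how I’ve been looking at the events so far (allocation ID redacted), in case I’m missing a better place to look:

# task/allocation events for one of the killed tasks
nomad alloc status -verbose <alloc-id>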

Do you have automated system updates enabled? This looks like what would happen if Docker was auto-updated and the daemon was restarted. Anything in /var/log/unattended-upgrades?
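If it helps, these are the spots I’d check first (assuming the standard Ubuntu log locations):

# did unattended-upgrades touch docker? (current and rotated logs)
grep -i docker /var/log/unattended-upgrades/unattended-upgrades.log
zgrep -i docker /var/log/unattended-upgrades/*.gz 2>/dev/null
# dpkg logs every package install/upgrade with a timestamp
grep -i docker /var/log/dpkg.log | tail -n 20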

Sorry, I assumed you were using Docker, but you don’t explicitly say so in your message. It could be the libc update that rolled out, which would have required restarting your tasks.

Thanks. Nothing in the /var/log/unattended-upgrades logs. Yep, I am using Docker. Is the libc update something that happened recently that wouldn’t appear in the unattended-upgrades logs? How can I check whether that was the culprit, and how can I disable it from happening in the future?
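Here’s what I’ve tried so far on that front, in case it’s useful (not sure I’m looking in the right places, and the blacklist bit is just my understanding):

# recent libc/docker package activity, with timestamps
grep -iE "libc6|docker" /var/log/dpkg.log | tail -n 20
grep -iE "libc6|docker" /var/log/apt/history.log
# to stop auto-upgrades for specific packages, I believe they can be added to
# Unattended-Upgrade::Package-Blacklist in /etc/apt/apt.conf.d/50unattended-upgrades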

I also ran ps -p <PID> -o etime on /usr/bin/dockerd and it returned:

65-14:41:02

So it doesn’t seem to have restarted (i.e. > 65 days uptime). Does that mean it’s not the culprit?
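For completeness, these are the kinds of checks I ran (dockerd PID looked up via pidof; that the daemon runs as a systemd unit named docker is my assumption):

# dockerd process uptime
ps -p "$(pidof dockerd)" -o etime=
# when systemd last (re)started the docker unit
systemctl show docker --property=ActiveEnterTimestamp
# when each running container was last started
docker ps -q | xargs docker inspect -f '{{.Name}} {{.State.StartedAt}}'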

My mistake: it looks like it was a kernel update that rolled out, and your uptime indicates that nothing was rebooted. I saw notices on a few of my machines about a libc update, but that appears to be just for the dev packages.
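A quick way to double-check the reboot/kernel angle, if you haven’t already (nothing fancy, just standard tooling):

# host uptime and reboot history
uptime
last reboot | head
# running kernel vs. installed kernels
uname -r
ls /boot/vmlinuz-*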

Anything in kern.log that might indicate the processes were killed for a reason? I think syslog would normally catch all kernel messages, but it never hurts to check the other logs.
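If the kernel did kill something (e.g. the OOM killer), it normally leaves a trace; something along these lines should surface it:

# kernel-initiated kills usually show up here (may need sudo)
dmesg -T | grep -iE "out of memory|oom|killed process"
journalctl -k | grep -iE "out of memory|oom|killed process"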

Makes sense to consider it, thanks!

In /var/log/kern.log I’m seeing only these entries, which are from the same time my tasks were killed:

Aug 16 08:09:33 ingress-7t6k kernel: [5660295.385198] docker0: port 1(veth7fc7fb8) entered disabled state
Aug 16 08:09:33 ingress-7t6k kernel: [5660295.386888] vethd091305: renamed from eth0
Aug 16 08:09:33 ingress-7t6k kernel: [5660295.441040] docker0: port 2(vethcce8523) entered blocking state
Aug 16 08:09:33 ingress-7t6k kernel: [5660295.441045] docker0: port 2(vethcce8523) entered disabled state
Aug 16 08:09:33 ingress-7t6k kernel: [5660295.441194] device vethcce8523 entered promiscuous mode
Aug 16 08:09:33 ingress-7t6k kernel: [5660295.441397] docker0: port 2(vethcce8523) entered blocking state
Aug 16 08:09:33 ingress-7t6k kernel: [5660295.441400] docker0: port 2(vethcce8523) entered forwarding state
Aug 16 08:09:33 ingress-7t6k kernel: [5660295.446265] docker0: port 1(veth7fc7fb8) entered disabled state
Aug 16 08:09:33 ingress-7t6k kernel: [5660295.447468] device veth7fc7fb8 left promiscuous mode
Aug 16 08:09:33 ingress-7t6k kernel: [5660295.447474] docker0: port 1(veth7fc7fb8) entered disabled state
Aug 16 08:09:34 ingress-7t6k kernel: [5660296.081478] eth0: renamed from vethee86baa
Aug 16 08:09:34 ingress-7t6k kernel: [5660296.097421] IPv6: ADDRCONF(NETDEV_CHANGE): vethcce8523: link becomes ready

Does that reveal anything?

Is this a VM of some kind? It looks like your primary network interface got reconfigured, either by the host or perhaps by being toggled on the vSwitch or physical switch. I’m not sure why Docker would restart the containers unless they were in host mode, but I admit I don’t know a lot about Docker networking.
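To see what actually touched the interface, I’d look at whatever is managing the network on that box plus Docker’s own daemon log around that time. Roughly (the unit names here are my guess, adjust for your setup):

# network management logs around the incident
journalctl -u systemd-networkd -u NetworkManager --since "today"
# docker daemon's view of the same window (assumes the systemd unit is named docker)
journalctl -u docker --since "today" | grep -iE "network|restart|shutdown"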

Yep, this is a Google Cloud instance. One of the tasks (traefik) is in host mode, but the other tasks that were also killed aren’t; they’re using bridge networking.

Gotcha, so does this appear to be a Docker issue then?

I asked around in the Docker community about those logs, and they mentioned that those entries are actually a result of the containers being killed, so I’m not sure it’s due to Docker.
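For what it’s worth, this is how I tried to confirm the ordering from Docker’s side (the relative time window is just an example):

# replay docker's container events, including die/kill events
docker events --since 24h --until 1h --filter 'type=container' --filter 'event=die'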

Is there no way to track down the reason for tasks being killed in Nomad?
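The only other things I’ve found to look at on the Nomad side are the client agent logs, and maybe bumping the log level for next time (assuming Nomad runs under systemd as the nomad unit, and that log_level is the right knob):

# client agent logs around the time of the kills
journalctl -u nomad --since "today" | grep -iE "kill|interrupt"
# for next time: raise verbosity in the agent config, e.g.
#   log_level = "DEBUG"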

Hard to say. If you can find other instances where your network interfaces got juggled and Docker didn’t restart your containers, then yeah, I’d say it’s a Docker issue. If you can’t find other instances, then it’s inconclusive. I could see Docker being designed to restart containers if the underlying network interfaces change, so it might have been working correctly.
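One way to look for earlier occurrences is to grep the rotated kernel logs for the same bridge/veth messages and see whether the containers restarted each time:

# same bridge/veth state changes in current and rotated logs
zgrep -E "entered disabled state|link becomes ready" /var/log/kern.log*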

Good luck!