Scheduling yum update and reboot with Nomad

I was hoping to use Nomad to run a monthly yum update and reboot.

I think I got close but I am running into an issue I need help with.

I have a shell script that runs the yum update; if the update succeeds, it reboots the machine (see script below).

Initially it seems to work, but after the reboot any subsequent yum update fails (which prevents future Nomad runs of the script from working). The error from the subsequent yum update is:

error: rpmdb: BDB0113 Thread/process 9972/140370306357056 failed: BDB1507 Thread died in Berkeley DB library

(This error and its remedy are described here: Fix "error: rpmdb: BDB0113 Thread/process - Thread died in Berkeley DB library | rpmdb open failed" - CentOS 7 - The Shell Guru)

Here’s my guess as to what is happening.

When the reboot command is given, Nomad thinks the task has been interrupted. So either as the computer shuts down or as it comes back online, Nomad tries to run the task again, perhaps at a point when the machine is not in a proper state for an update, and that corrupts the yum database.

The reason this is my guess is that, as you can see below, the task actually creates two allocations: the initial attempt at 10:23:17, and then a second attempt about 33 seconds later at 10:23:51.

Here’s the event list for the first allocation (8999a801) of the task. You can see that it gets “interrupted” (presumably because the instance is shutting down).

And here you can see that it sort of starts again (c3eeb637). It creates the new allocation at 10:23:51, 2 seconds after the first allocation ends, but then executes about 3 minutes later at 10:27:07 (presumably when the computer is back online). My guess is that this is where the yum db is getting corrupted.

I tried to solve this problem by adding the following:

restart {
  attempts = 0
}

to the task stanza, so that it wouldn’t try to create another allocation when it thinks the task has been interrupted. But adding this stanza didn’t seem to make a difference.

However, the restart stanza also seems like an improper workaround; it would be better if Nomad saw the task as successful and complete once the reboot command had been given, rather than simply being told not to restart when it thinks the task has failed.
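While digging through the Nomad docs I also noticed the reschedule stanza, which governs whether a lost or failed allocation gets placed again (batch jobs reschedule once within 24 hours by default). If that second allocation is a reschedule rather than a restart, maybe something like this at the job level is what is actually needed; I have not verified it, so treat it as a guess:

reschedule {
  attempts  = 0
  unlimited = false
}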

So perhaps my most concrete question is: what’s the proper way to submit a “reboot” task to Nomad?

Below I provide the script I want executed periodically (once a month), followed by the Nomad job file.

security-update.sh

#!/bin/bash

echo "scheduled security update; will reboot on successful update"
echo "current date is:"
date
echo "beginning minimal security update"
if yum update-minimal --security -y; then
   echo "update succeeded; initiating reboot"
   reboot
else
   echo "command did not succeed; no automatic reboot"
fi

security-update-09.nomad

job "security-update-09" {

  datacenters = ["dc1"]

  type = "batch"

  constraint {
    attribute = "${attr.unique.network.ip-address}"
    value = "<the-ip-of-the-instance-i'm-trying-to-update-and-reboot>"
  }

  periodic {
    // launch on the first day of the month
    cron = "0 0 1 * *"

    // Do not allow overlapping runs.
    prohibit_overlap = true
  }

  task "run-update-and-reboot" {

    driver = "raw_exec"

    restart {
      attempts = 0
    }

    config {
      command = "<path>/security-update.sh"
    }
  }

}

Many thanks for any ideas/help.

This is something I had done in the past; not exactly the same, but the basic idea was the same: “keep the machine packages always up to date”.

I don’t have the job script handy, but I will try to list the things my job/script did differently from yours above (a rough sketch follows the list):

  • my job was a system job, stuck in a while ((1)) loop, with a sleep delay of about 12 hours.
    This avoids the need to write a per-machine job; you still use constraints to opt machines in, e.g. node_class or some meta tag.

  • at startup, the script checked whether it was within a short window (~10 mins) of machine boot, and only ran the update if so.
    That way, if I made changes to the job and resubmitted it, the update wouldn’t run again.

  • I was deliberately NOT doing any reboot.
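Roughly, the script looked something like this (a from-memory sketch, not the original; the 10-minute window and 12-hour delay are the values I remember):

#!/bin/bash
# From-memory sketch of the system-job script, not the original.

# At startup, only run the update if we are within a short window
# (~10 minutes) of machine boot. This avoids re-running the update
# when the job is merely edited and resubmitted.
uptime_s=$(cut -d. -f1 /proc/uptime)
if (( uptime_s < 600 )); then
  echo "within 10 minutes of boot; running initial update"
  yum update -y
fi

# Then keep packages fresh roughly twice a day, forever.
while ((1)); do
  sleep 12h
  echo "periodic update at $(date)"
  yum update -y
done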

I think hitting the “Thread died in Berkeley DB library” error condition should be avoided altogether, as it sounds quite serious to me. At most I would expect the update to run but “do nothing”; I wouldn’t expect error messages.

Also, I didn’t know that the simple zero/non-zero return values of yum commands could be relied upon for specific update conditions. (TIL)

If I were to enhance my original script, I would check whether the yum update installed a newer kernel than the running one, and only then go for a reboot. (Off the top of my head, I might dig through the output of rpm -qa or something.)
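Something like this, perhaps (untested sketch; rpm -q --last kernel lists the most recently installed kernel package first):

# Untested sketch: compare the newest installed kernel against the
# one currently running, and only flag a reboot on mismatch.
running="kernel-$(uname -r)"
newest=$(rpm -q --last kernel | head -n 1 | awk '{print $1}')
if [ "$newest" != "$running" ]; then
  echo "newer kernel installed ($newest, running $running); reboot needed"
fi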

HTH.


Thanks @shantanugadgil, that’s really helpful. I will work through some of these suggestions.

But let’s suppose I separate the yum update from the reboot command entirely. In your opinion, is there a “correct” way to run a periodic reboot with Nomad, or is this a command that should not be scheduled with Nomad?

If, for example, I changed the cron script to simply say reboot, would Nomad recognize this as a completed task, or would it constantly think the task was being interrupted?

Many thanks for the help already provided. :slight_smile:

I haven’t done a “reboot” inside a cron job, but my guess is that Nomad wouldn’t see the exit 0 of the process and would mark it as a failure.
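One way you might sidestep that (an untested idea on my part): schedule the reboot slightly in the future so the script can exit cleanly before the machine goes down, e.g.:

# Untested idea: give the script time to exit 0 before the machine
# actually goes down, so Nomad records a clean exit rather than an
# interruption.
shutdown -r +1 "rebooting in one minute after security update"
exit 0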

Once you convert the job to a system job, which is basically a “never ending” job, you can put the update and reboot logic together.

Tweak the “restart” stanza of the system job to keep the restart attempts far apart, say 12 hours.
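Something along these lines (values off the top of my head; double-check against the restart stanza docs):

restart {
  attempts = 1
  interval = "24h"
  delay    = "12h"
  mode     = "delay"
}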

Just be sure to sync a couple of times before the reboot (for safety) :wink: :smiley:

Hi @jeffreycwitt :wave:

@shantanugadgil already gave you some great tips, but one thing came to mind that may be worth mentioning.

You can try splitting the “reboot” part of your script into a poststop task that runs after the main task, with the main task doing the update.

I don’t know if this would actually help, but maybe worth a try? :smile:
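I haven’t tried this myself, but the shape of the job group would be something like the following (the group/task names are mine, and the reboot path may differ on your distro):

group "update" {

  task "run-update" {
    driver = "raw_exec"

    config {
      # update only; no reboot inside the script
      command = "<path>/security-update.sh"
    }
  }

  task "reboot" {
    driver = "raw_exec"

    # poststop tasks run after the main task has finished
    lifecycle {
      hook = "poststop"
    }

    config {
      command = "/usr/sbin/reboot"
    }
  }
}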

I definitely like this idea. I tried it, but I think the poststop is still a Nomad “task”, so Nomad is still looking to “complete” it. And if the task is a reboot, it never registers as completed; when the server comes back online, Nomad tries it again.

I might not be explaining that correctly, but I tried it and really messed things up for a bit, as the server kept coming back online and then rebooting.