when updating a running binary to a new version, how long can i expect it to take nomad to stop the running process, and then reschedule the new updated binary?
i’m trying to estimate how long of a recent write log I need to be storing in my databases, to handle replication of missed writes during the above update process. don’t need exact numbers, obviously it varies, but I’m wondering whether the ballpark is… 10ms, 100ms, 1s, 3s…?
@victorstewart
The primary driver of restart speed is how quickly your workload stops itself. For example, some databases do an aggressive flush to disk as part of their shutdown process, which can take a significant amount of time.
Other factors that determine how quickly a task restarts during an orderly update:
- The task driver:
  - exec - The Nomad exec driver requires time to build the chroot. This time depends directly on how many files have to be linked (or copied) into the chroot. If you are not making significant changes to the filesystems that are included in the chroot, this should be fairly constant. However, it can take anywhere from seconds to minutes depending on your client node.
  - docker - The docker task driver’s restart time is dominated by the time needed to fetch the container image onto the node. You should be able to estimate this by timing a docker pull.
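One rough way to get that docker pull number is to time the command from a script. The sketch below is only an illustration: `time_command` is a hypothetical helper, not part of Nomad or Docker, and the image name in the comment is a placeholder. Keep in mind that a pull of an image already cached on the node will finish almost instantly, so measure on a node that does not have the image yet.

```python
import subprocess
import sys
import time


def time_command(argv):
    """Run argv to completion and return (elapsed_seconds, exit_code)."""
    start = time.perf_counter()
    result = subprocess.run(
        argv,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return time.perf_counter() - start, result.returncode


# Harmless demonstration that runs a no-op Python process.
# To time a real pull, pass something like:
#   ["docker", "pull", "example/app:1.2.3"]   # placeholder image name
elapsed, exit_code = time_command([sys.executable, "-c", "pass"])
print(f"command took {elapsed:.3f}s (exit code {exit_code})")
```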
Other job-related factors that will increase a job’s starting time include:
- The artifact stanza - The time it takes to fetch an artifact to the client node, and to unpack it if it is an archive, is part of the allocation start.
- The template stanza - Templates that depend on a Consul or Vault value can be delayed in rendering if there is an issue connecting to those services. This will delay an allocation from starting.
I say all of this to give you a less than satisfying “it depends”, but it should help you experimentally determine your own restart time given your circumstances. In a well-ordered replacement, the scheduler adds minimal time to a restart; however, there are instances where you might be unexpectedly delayed, for example when there are insufficient resources to start the replacement for a job with count=1. There is fairly useful timing information emitted in the Allocation Events that could help you time an upgrade to build your estimate.
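As a sketch of how the Allocation Events could be turned into a number, the snippet below computes the gap between the old allocation starting to stop and the new allocation reporting that it started. It assumes the JSON shape that `nomad alloc status -json <alloc_id>` returns as I understand it: a `TaskStates` map keyed by task name, each with an `Events` list whose entries carry a `Type` string and a `Time` in Unix nanoseconds. The event names "Killing" and "Started", the task name "db", and the sample timestamps are assumptions for illustration, so check them against your own output.

```python
NANOS_PER_SECOND = 1_000_000_000


def first_event_time(alloc, task, event_type):
    """Return the time (Unix ns) of the first event of event_type, or None."""
    for event in alloc["TaskStates"][task]["Events"]:
        if event["Type"] == event_type:
            return event["Time"]
    return None


def restart_gap_seconds(old_alloc, new_alloc, task):
    """Seconds between the old task being told to stop and the new task starting."""
    stopping = first_event_time(old_alloc, task, "Killing")
    started = first_event_time(new_alloc, task, "Started")
    if stopping is None or started is None:
        raise ValueError("expected events not found; inspect the raw event list")
    return (started - stopping) / NANOS_PER_SECOND


# Hypothetical sample data, shaped like parsed `nomad alloc status -json` output.
T0 = 1_700_000_000 * NANOS_PER_SECOND
old_alloc = {"TaskStates": {"db": {"Events": [
    {"Type": "Started", "Time": T0 - 60 * NANOS_PER_SECOND},
    {"Type": "Killing", "Time": T0},
    {"Type": "Killed", "Time": T0 + 2_500_000_000},
]}}}
new_alloc = {"TaskStates": {"db": {"Events": [
    {"Type": "Received", "Time": T0 + 2_600_000_000},
    {"Type": "Task Setup", "Time": T0 + 2_700_000_000},
    {"Type": "Started", "Time": T0 + 4_200_000_000},
]}}}

gap = restart_gap_seconds(old_alloc, new_alloc, "db")
print(f"restart gap: {gap:.1f}s")
```

Running this over several real upgrades and sizing the write log from the worst gap observed, plus a safety margin, is likely more reliable than any single ballpark figure.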
Lastly, this is a place where rolling upgrades with canaries would be exceptionally useful, since you could avoid taking your workload down to 0 instances in the process of upgrading versions.
Hope this helps you out!
Best,
Charlie Voiselle
Product Education Engineer, Nomad