We have recently started leveraging the `consul lock` command to ensure singleton behavior for certain jobs we are running in our Nomad cluster. This has been working really well for guaranteeing that exactly one execution is running at any given time.
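For context, our invocation has roughly the following shape (the KV prefix and the wrapped command here are placeholders, not our real names):

```shell
# Rough sketch of how we wrap a job with `consul lock`;
# "locks/my-batch-job" and the script path are illustrative.
# -n=1 limits the lock to a single holder, giving singleton behavior.
consul lock -n=1 locks/my-batch-job /usr/local/bin/run-batch-job.sh
```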
But we ran into an issue when we updated our Consul server nodes: jobs that held an active lock session were suddenly dropped, which also aggressively terminated the locked processes, even though the Consul cluster was available throughout the rolling upgrade.
Our hunch is that, because we were using a load-balanced endpoint, the `consul lock` process could not resolve the new Consul server nodes correctly (IP caching?) and therefore could not re-establish the connection.
This would therefore be resolved by using a local Consul agent instead, which would handle connecting to the new Consul nodes automatically.
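As a sketch, the agent-based setup we have moved to looks something like this (the address and retry count are illustrative):

```shell
# Point the lock at the local agent rather than the load-balanced
# server endpoint; the agent tracks server membership itself.
export CONSUL_HTTP_ADDR=127.0.0.1:8500

# -monitor-retry tells `consul lock` to retry transient errors while
# monitoring the lock, instead of giving up (and killing the child
# process) on the first failure during a server rollover.
consul lock -monitor-retry=10 locks/my-batch-job /usr/local/bin/run-batch-job.sh
```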
So I am looking for clarification on two points here:
- Is it possible for a `consul lock` session to be moved to a new Consul server node during a rolling update?
- If yes, what sort of timeouts are recommended to give enough time to prevent the process from being killed?
- Is my understanding of what caused the issue correct? We have already migrated to using Consul agents instead of the load-balanced endpoint, but I want to make sure the next rolling update won't break my jobs!
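In case it helps anyone answering: one thing I plan to look at in my tests is the session that `consul lock` creates, since I assume its TTL and lock-delay bound how long a rollover can take before the lock is lost. The session can be inspected through the HTTP API (local agent address assumed):

```shell
# List active sessions, including their TTL, LockDelay and Behavior
# fields, assuming a local agent on the default HTTP port.
curl -s http://127.0.0.1:8500/v1/session/list
```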
I am going to run my own tests in a test rig to check what actually happens in both cases, but since I could not find the answers in the documentation I wanted to pose the question anyway!