Hello,
I am looking for help here and would like to exchange some ideas. Maybe someone else has already run into this problem.
The idea is to run a Nomad cluster in AWS on EC2 instances (Auto Scaling Groups). We have a cluster of 3 servers and 3 clients (workers). When we want to upgrade Nomad (or patch the Linux machines), we replace the workers in the Auto Scaling Group, like this:
- Bring up 3 new Nomad EC2 instances with the new Nomad version
- Now we have 6 Nomad clients
- Drain the jobs from the old Nomad clients (note that we have jobs with count >= 2, and we do NOT want any downtime). By default the Nomad migrate stanza drains allocations one by one, which works for us.
- Terminate the old Nomad clients
- We now have a cluster of 3 Nomad clients running the new version of Nomad
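For reference, the drain step in the list above can be expressed against the Nomad HTTP API roughly as in the sketch below. This is not our exact script: the address, the 300-second deadline and the function names are placeholder assumptions, and a cluster with ACLs enabled would also need an X-Nomad-Token header.

```python
import json
import urllib.request

NOMAD_ADDR = "http://127.0.0.1:4646"  # assumption: placeholder Nomad address


def build_drain_request(node_id, deadline_seconds=300, nomad_addr=NOMAD_ADDR):
    """Build (but do not send) a drain request for one Nomad client node."""
    payload = {
        "DrainSpec": {
            # The Nomad API expects the drain deadline in nanoseconds.
            "Deadline": deadline_seconds * 1_000_000_000,
            "IgnoreSystemJobs": False,
        },
        # Keep the node ineligible for scheduling once the drain is set.
        "MarkEligible": False,
    }
    return urllib.request.Request(
        url=f"{nomad_addr}/v1/node/{node_id}/drain",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def drain_node(node_id, **kwargs):
    """Send the drain request and return the decoded JSON response."""
    request = build_drain_request(node_id, **kwargs)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read())
```

Calling drain_node("<node-id>") starts the drain for that one node; the migrate stanza of each job then decides how its allocations are moved.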
That process works well, but we want to drain the workers automatically with AWS Lambda (an Auto Scaling Group event triggers a Lambda function, which executes a drain on every Nomad client that is being terminated).
We managed to create the Lambda function with a Python script, and that also works well. Here is the problem:
- The 3 old Nomad clients terminate at the same time
- They send their events to Lambda at the same time
- I guess 3 parallel executions of the Lambda function happen
- Each execution makes a call to the Nomad API to drain its node
- The problem, as it seems, is that the Nomad API receives those requests at the same time and starts draining the jobs (count >= 2) at the same time, so we have downtime! It looks like it does not respect the migrate stanza and the health checks.
- Note that our job health checks are correct (this works with delayed execution, as described below)
I tested the Python code from my local computer and started the functions with a delay of 1-2 seconds between them, and we do NOT have downtime; it works as expected. Is this a problem with the Nomad API? Can it not handle such parallel requests?
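The local test amounts to something like the following sketch: one process issues the drains one after another with a short gap. The names drain_sequentially, drain_fn and sleep_fn are made up for illustration; drain_fn stands for whatever function actually calls the Nomad API for a single node.

```python
import time


def drain_sequentially(node_ids, drain_fn, gap_seconds=2.0, sleep_fn=time.sleep):
    """Call drain_fn for each node in order, pausing between calls."""
    results = []
    for index, node_id in enumerate(node_ids):
        if index > 0:
            # Pause only between nodes, not before the first one.
            sleep_fn(gap_seconds)
        results.append((node_id, drain_fn(node_id)))
    return results
```

With three node IDs and gap_seconds=2.0, the three drain calls reach the API about two seconds apart, which is the case that gives us no downtime.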
I tried putting a Python sleep in the function, but that does not help, as all the functions sleep at the same time and then call the Nomad API at the same time!
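A small arithmetic sketch of why the identical sleep changes nothing: if all three invocations start together and sleep for the same time, their API calls still land together, whereas a different offset per invocation spreads them out. The helper names and the numbers are illustrative assumptions, and the sketch does not solve how each Lambda invocation would learn its own offset.

```python
def call_times(start_times, delays):
    """Return the moment each invocation reaches the Nomad API."""
    return [start + delay for start, delay in zip(start_times, delays)]


def staggered_delays(count, gap_seconds):
    """Give invocation i a delay of i * gap_seconds."""
    return [index * gap_seconds for index in range(count)]


starts = [0.0, 0.0, 0.0]  # three invocations triggered at the same moment

same_sleep = call_times(starts, [5.0, 5.0, 5.0])
print(same_sleep)  # [5.0, 5.0, 5.0] -> still simultaneous

staggered = call_times(starts, staggered_delays(3, 2.0))
print(staggered)  # [0.0, 2.0, 4.0] -> two seconds apart
```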
Has anybody had the same or a similar problem? Any suggestions?
Kind Regards