AWS Lambda and Nomad Drain


I am looking for help here and would like to exchange some ideas :slight_smile:. Maybe someone else has already run into this problem.

The idea is to run a Nomad cluster in AWS on EC2 instances (Auto Scaling Groups). We have a cluster of 3 servers and 3 clients (workers). When we want to upgrade Nomad (or patch the Linux machines), we replace the workers in the Auto Scaling Group, like this:

  1. Bring up 3 new Nomad EC2 instances with the new Nomad version
  2. Now we have 6 Nomad clients
  3. Drain the jobs from the old Nomad clients (note that our jobs have a count of at least 2, and we do not want any downtime). By default the Nomad migrate stanza moves allocations one at a time, which works for us.
  4. Terminate the old Nomad clients
  5. We now have a cluster of 3 Nomad clients running the new version of Nomad
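
The drain in step 3 can be started through the Nomad HTTP API (`POST /v1/node/:node_id/drain`). Below is a minimal sketch of that call using only the standard library. The server address, the 5-minute deadline, and the function names are assumptions for illustration, not the script from this post, and ACL tokens are left out.

```python
import json
import urllib.request

# Assumption: address of a Nomad server reachable from where this runs.
NOMAD_ADDR = "http://127.0.0.1:4646"


def build_drain_request(node_id, deadline_seconds=300, nomad_addr=NOMAD_ADDR):
    """Build (but do not send) the HTTP request that starts a node drain."""
    payload = {
        "DrainSpec": {
            # The Nomad API expects the deadline in nanoseconds.
            "Deadline": deadline_seconds * 1_000_000_000,
            "IgnoreSystemJobs": False,
        },
        # Keep the node ineligible for scheduling after the drain.
        "MarkEligible": False,
    }
    return urllib.request.Request(
        url=f"{nomad_addr}/v1/node/{node_id}/drain",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def drain_node(node_id, **kwargs):
    """Send the drain request and return the decoded JSON response."""
    request = build_drain_request(node_id, **kwargs)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

Splitting the request construction from the network call makes it possible to inspect the payload without a running cluster.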

That procedure works well, but we want to drain the workers automatically with AWS Lambda (an Auto Scaling Group event triggers a Lambda function that drains every Nomad client being terminated).

We managed to create the Lambda function with a Python script, and that also works well. Here is the problem:

  1. The 3 old Nomad clients terminate at the same time
  2. They send their events to Lambda at the same time
  3. I guess 3 parallel executions of the Lambda function happen
  4. Each execution calls the Nomad API to start a drain
  5. The problem seems to be that the Nomad API receives those requests at the same time and starts draining the jobs (count of at least 2) on all nodes at once, so we have downtime! It looks as if it does not respect the migrate stanza and the health checks.
  6. Note that our job health checks are correct (the drain works with delayed execution, as described below)

I tested the Python code from my local computer and started the functions 1-2 seconds apart: we had no downtime and it worked as expected. Is this a problem with the Nomad API? Can it not handle such parallel requests?

I tried putting a Python sleep in the function, but that does not help: all the functions sleep for the same amount of time and then call the Nomad API at the same time!

Has anybody had the same or a similar problem? Any suggestions?

Kind Regards

OK, I think I found something that works for me:

  1. Just use a random value in the Python function and delay the function by that random number of seconds :slight_smile:
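
A minimal sketch of that random delay, assuming a `drain_fn` callable that performs the actual Nomad API call (the names and the 60-second upper bound are made up for illustration). Keep in mind that the Lambda timeout, which defaults to 3 seconds, has to be longer than the largest possible delay, and that a random delay only makes overlapping drains less likely; it does not rule them out.

```python
import random
import time


def jittered_delay(max_seconds=60.0, rng=random):
    """Pick a random delay between 0 and max_seconds."""
    return rng.uniform(0.0, max_seconds)


def drain_with_jitter(drain_fn, node_id, max_seconds=60.0, sleep=time.sleep):
    """Wait a random amount of time, then call drain_fn(node_id).

    drain_fn is a hypothetical callable that talks to the Nomad API;
    sleep is injectable so the function can be exercised without waiting.
    """
    sleep(jittered_delay(max_seconds))
    return drain_fn(node_id)
```

With three parallel Lambda executions, each one picks its own delay, so the drain requests usually reach Nomad a few seconds apart instead of all at once.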

I am not the best programmer out there and could not have come up with this solution on my own! :smiley:

Even though your question was asked some time ago and you have already found a solution: we implemented something similar a few weeks ago. Basically, our approach is to leverage the instance refresh option in AWS.
Each Auto Scaling Group has one lifecycle hook for terminating an instance and another one for when a new instance is started.
Termination triggers a Lambda function that looks up the instance's internal IP address based on the metadata passed with the hook. With the internal IP address we query the Nomad API for the node ID and drain that node. Afterwards the lifecycle hook is notified to continue.
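
A rough sketch of such a termination handler, not the code from our setup. It assumes the hook is delivered through EventBridge (so the fields sit under `event["detail"]`), that `NOMAD_ADDR` is set in the Lambda environment, that no ACL token is needed, and that a node whose `DrainStrategy` is empty has finished draining. The helper names are made up for illustration.

```python
import json
import os
import time
import urllib.request

# Assumption: Nomad server address supplied via the Lambda environment.
NOMAD_ADDR = os.environ.get("NOMAD_ADDR", "http://127.0.0.1:4646")


def nomad_request(path, payload=None):
    """GET when payload is None, otherwise POST the payload as JSON."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    req = urllib.request.Request(
        NOMAD_ADDR + path, data=data, method="GET" if data is None else "POST"
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def find_node_id(nodes, private_ip):
    """Return the ID of the node in a /v1/nodes listing with this address."""
    for node in nodes:
        if node.get("Address") == private_ip:
            return node["ID"]
    return None


def drain_finished(node):
    """Assumption: the node has no DrainStrategy once draining is over."""
    return node.get("DrainStrategy") is None


def handler(event, context):
    import boto3  # provided by the AWS Lambda Python runtime

    detail = event["detail"]
    instance_id = detail["EC2InstanceId"]

    # Resolve the private IP of the instance that is about to terminate.
    reservations = boto3.client("ec2").describe_instances(
        InstanceIds=[instance_id]
    )["Reservations"]
    private_ip = reservations[0]["Instances"][0]["PrivateIpAddress"]

    node_id = find_node_id(nomad_request("/v1/nodes"), private_ip)
    if node_id is not None:
        spec = {"Deadline": 300 * 10**9, "IgnoreSystemJobs": False}
        nomad_request(f"/v1/node/{node_id}/drain",
                      {"DrainSpec": spec, "MarkEligible": False})
        # Poll for up to about 5 minutes until the drain has completed.
        for _ in range(60):
            if drain_finished(nomad_request(f"/v1/node/{node_id}")):
                break
            time.sleep(5)

    # Tell the Auto Scaling Group it may continue terminating the instance.
    boto3.client("autoscaling").complete_lifecycle_action(
        LifecycleHookName=detail["LifecycleHookName"],
        AutoScalingGroupName=detail["AutoScalingGroupName"],
        LifecycleActionToken=detail["LifecycleActionToken"],
        LifecycleActionResult="CONTINUE",
        InstanceId=instance_id,
    )
```

The polling loop means the Lambda timeout has to be raised to more than five minutes, and the lifecycle hook's own timeout has to be at least as long.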
During startup of the new node we check that the agent has successfully registered in the Nomad cluster and then send the lifecycle hook notification to AWS to continue.
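
The registration check could look roughly like this. It is a sketch that assumes the node shows up in `GET /v1/nodes` with its private IP as `Address` and reaches the status `ready`; the function names, retry counts, and default address are illustrative.

```python
import json
import time
import urllib.request


def node_is_ready(nodes, private_ip):
    """True if a node with this address is listed with status 'ready'."""
    return any(
        n.get("Address") == private_ip and n.get("Status") == "ready"
        for n in nodes
    )


def wait_until_registered(fetch_nodes, private_ip, attempts=30,
                          interval=10.0, sleep=time.sleep):
    """Poll fetch_nodes() until the node is ready or attempts run out."""
    for _ in range(attempts):
        if node_is_ready(fetch_nodes(), private_ip):
            return True
        sleep(interval)
    return False


def fetch_nodes_from_nomad(nomad_addr="http://127.0.0.1:4646"):
    """Assumed default address; returns the decoded /v1/nodes listing."""
    with urllib.request.urlopen(f"{nomad_addr}/v1/nodes", timeout=10) as resp:
        return json.load(resp)
```

Depending on the result, the launch hook would then be completed with `CONTINUE` or `ABANDON`.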
This way the whole update process for our Nomad agents comes down to:

  1. Create new images via Packer
  2. Use Terraform to point the Auto Scaling Group at the new image
  3. Trigger an instance refresh on the Auto Scaling Group

@bogue1979 it would be helpful to see the Terraform ASG lifecycle hook example, if possible, of course! :slight_smile:

Sorry for the late reply, but I created a gist with the basic approach.
Note that the Lambda code itself is only a description of the steps we do.

I hope it is helpful:
