AWS Lambda and Nomad Drain

kereza · November 27, 2020, 2:40pm

Hello,

I am looking for some help here, and maybe to exchange some ideas. Maybe someone else has already run into this problem.

The idea is to run a Nomad cluster in AWS on EC2 instances (Auto Scaling Groups). We have a cluster of 3 servers and 3 clients (workers). When we want to upgrade Nomad (or patch the Linux machines), we replace the workers in the Auto Scaling Group, like this:

Bring up 3 NEW Nomad EC2 instances with the NEW Nomad version
Now we have 6 Nomad Clients
Drain the jobs from the OLD Nomad clients (note that we have jobs with count >= 2, and we DO NOT WANT any downtime). By default the Nomad migrate stanza migrates allocations ONE by ONE, which works for us.
Terminate the OLD Nomad clients
We now have a cluster of 3 Nomad clients with the new version of Nomad

That process works well, but we want to drain the workers AUTOMATICALLY with AWS Lambda (the Auto Scaling Group emits an event, and a Lambda executes a DRAIN on every Nomad client that is being terminated).

We managed to create the Lambda function with a Python script, and it also works well. Here is the problem:

The 3 OLD Nomad clients terminate at the SAME TIME
They send their events to Lambda at the same time
I guess 3 parallel executions of the Lambda happen
Each one makes an API call to the Nomad API for DRAIN
The problem, it seems, is that the Nomad API receives those requests at the same time and starts DRAINING the jobs (count >= 2) on all three nodes at once, so we get downtime! It is as if it does not respect the migrate stanza and the health checks!
NOTE that our job health checks are correct (this works with delayed execution, as you will see below)
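
For readers who want to reproduce this, here is a minimal sketch of the kind of drain call described above. It is not the original script, which was not posted. It uses Nomad's `POST /v1/node/:node_id/drain` endpoint with a `DrainSpec` body; the address `nomad.example.internal`, the 300-second deadline and the function names are assumptions for illustration only.

```python
import json
import urllib.request

# Hypothetical address; replace with your own Nomad server or load balancer.
NOMAD_ADDR = "http://nomad.example.internal:4646"


def build_drain_request(node_id, deadline_seconds=300, nomad_addr=NOMAD_ADDR):
    """Build (but do not send) the HTTP request that enables drain on one node."""
    body = {
        "DrainSpec": {
            # Nomad expects the deadline as a duration in nanoseconds.
            "Deadline": deadline_seconds * 1_000_000_000,
            "IgnoreSystemJobs": False,
        }
    }
    return urllib.request.Request(
        url=f"{nomad_addr}/v1/node/{node_id}/drain",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def drain_node(node_id, **kwargs):
    """Send the drain request; urllib raises HTTPError on an error status."""
    request = build_drain_request(node_id, **kwargs)
    with urllib.request.urlopen(request, timeout=10) as resp:
        return json.loads(resp.read() or b"{}")
```

Keep in mind that this call only starts the drain and returns right away; the migration itself continues in the background on the Nomad side.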

I tested running the Python code from a local computer, starting the functions 1-2 seconds apart, and we do NOT get downtime; it works as expected. Is this a problem in the Nomad API? Can it not handle such parallel requests?

I tried putting a Python sleep in the function. That does not help, because all the functions sleep for the same amount of time and then call the Nomad API at the same time!

Does anybody have the same or a similar problem? Any suggestions?

Kind Regards

kereza · November 27, 2020, 5:32pm

OK, I think I found something that works for me:

Just use random in the Python function and delay the drain call by a RANDOM number of seconds.
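
A minimal sketch of that random delay, not the exact code used here. The helper name `drain_with_jitter`, the 60-second default and the injected `drain_fn` callable (whatever function performs the actual Nomad drain call) are assumptions for illustration.

```python
import random
import time


def drain_with_jitter(drain_fn, node_id, max_delay=60.0, sleep=time.sleep, rng=random):
    """Wait a random 0..max_delay seconds, then call drain_fn(node_id).

    Returns the delay that was used together with drain_fn's result.
    """
    delay = rng.uniform(0.0, max_delay)
    sleep(delay)
    return delay, drain_fn(node_id)
```

Two things to keep in mind: the Lambda timeout has to be longer than `max_delay` (the default timeout is only 3 seconds), and a random delay only makes it unlikely that two drains start together; it does not rule it out.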

I am not the best programmer out there and could not have come up with this solution on my own!

bogue1979 · December 6, 2020, 5:56pm

I know your question is from some time ago and you have already found a solution, but we implemented something similar a few weeks ago. Basically, our approach is to leverage the instance refresh option in AWS.
Each Auto Scaling Group has one lifecycle hook for terminating an instance and another one for when a new instance is started.
Termination triggers a Lambda, which finds the instance's internal IP address based on the metadata given in the hook. With the internal IP address we can query the Nomad API for the ID of the node, which is then drained. Afterwards the lifecycle hook is notified to continue.
During startup of the new node we validate that the agent has successfully registered in the Nomad cluster, and finally send the lifecycle hook notification to AWS to continue.
This way the whole update process of our Nomad agents is a matter of:
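
The termination flow just described could be sketched roughly as follows. This is an illustration under stated assumptions, not the code from our Lambda: `ec2` and `autoscaling` are expected to be boto3 clients, while `list_nodes` (returning the parsed result of Nomad's `GET /v1/nodes`) and `drain_and_wait` (starting the drain and blocking until it has finished) are hypothetical callables you would supply yourself.

```python
import json


def parse_lifecycle_message(event):
    """Extract the lifecycle hook details from an SNS-triggered Lambda event."""
    return json.loads(event["Records"][0]["Sns"]["Message"])


def private_ip_of(ec2, instance_id):
    """Look up the instance's private IP via a boto3 EC2 client."""
    reply = ec2.describe_instances(InstanceIds=[instance_id])
    return reply["Reservations"][0]["Instances"][0]["PrivateIpAddress"]


def find_node_id(nodes, ip):
    """Match a Nomad node (entries from GET /v1/nodes) by its Address field."""
    for node in nodes:
        if node.get("Address") == ip:
            return node["ID"]
    return None


def handle_termination(event, ec2, autoscaling, list_nodes, drain_and_wait):
    """Drain the terminating instance's Nomad node, then release the hook."""
    msg = parse_lifecycle_message(event)
    if msg.get("LifecycleTransition") != "autoscaling:EC2_INSTANCE_TERMINATING":
        return None  # e.g. the test notification sent when a hook is created

    ip = private_ip_of(ec2, msg["EC2InstanceId"])
    node_id = find_node_id(list_nodes(), ip)
    if node_id is not None:
        drain_and_wait(node_id)

    # Tell the Auto Scaling Group it may go ahead and terminate the instance.
    autoscaling.complete_lifecycle_action(
        LifecycleHookName=msg["LifecycleHookName"],
        AutoScalingGroupName=msg["AutoScalingGroupName"],
        LifecycleActionToken=msg["LifecycleActionToken"],
        LifecycleActionResult="CONTINUE",
        InstanceId=msg["EC2InstanceId"],
    )
    return node_id
```

Passing the clients in as arguments keeps the handler easy to test with fakes; in a real Lambda they would typically be created once at module level with `boto3.client(...)`.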
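
A rough sketch of that startup check, again only as an illustration: `list_nodes` is a hypothetical callable returning the parsed result of Nomad's `GET /v1/nodes`, and the attempt count and interval are arbitrary. The idea is to poll until a node with the new instance's IP reports the status `ready`, and to pick the lifecycle result from the outcome.

```python
import time


def wait_until_ready(list_nodes, ip, attempts=30, interval=10.0, sleep=time.sleep):
    """Poll Nomad until a node with this address is 'ready'; return its ID or None."""
    for _ in range(attempts):
        for node in list_nodes():
            if node.get("Address") == ip and node.get("Status") == "ready":
                return node["ID"]
        sleep(interval)
    return None


def launch_result(node_id):
    """Lifecycle result to report: CONTINUE if the agent registered, else ABANDON."""
    return "CONTINUE" if node_id is not None else "ABANDON"
```

The returned string is what would go into `LifecycleActionResult` when completing the launch hook; with `ABANDON` the Auto Scaling Group terminates the instance instead of putting it into service.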

create new images via packer
terraform the autoscaling group to use the new image
trigger instance refresh on the autoscaling group
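
The last step can also be scripted. Below is a minimal sketch, assuming `autoscaling` is a boto3 Auto Scaling client; the function name and the preference values (90 percent minimum healthy, 300 seconds warm-up) are placeholders for illustration, not the settings we use.

```python
def trigger_instance_refresh(autoscaling, asg_name,
                             min_healthy_percentage=90, instance_warmup=300):
    """Start a rolling instance refresh and return its ID."""
    reply = autoscaling.start_instance_refresh(
        AutoScalingGroupName=asg_name,
        Strategy="Rolling",
        Preferences={
            # Share of the group that must stay in service during the refresh.
            "MinHealthyPercentage": min_healthy_percentage,
            # Seconds to wait after launch before an instance counts as ready.
            "InstanceWarmup": instance_warmup,
        },
    )
    return reply["InstanceRefreshId"]
```

Because the refresh replaces instances in batches while keeping the minimum healthy percentage, the lifecycle hooks fire for a few nodes at a time rather than for the whole group at once.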

shantanugadgil · December 7, 2020, 2:36pm

@bogue1979 it would be helpful to see the Terraform ASG lifecycle hook example, if possible, of course!

bogue1979 · December 19, 2020, 12:34pm

Sorry for the late reply, but I created a gist with the basic approach.
Note that the Lambda code in it is only a description of the steps we do.

I hope it is helpful:


https://gist.github.com/bogue1979/54726aa0d5a2a60f33efa21bac0e0a61

Nomad_Agent_Autoscaling.tf

# SNS topic lifecycle hooks are sent to
resource "aws_sns_topic" "nomad_graceful_termination_topic" {
  name = "${local.instance_prefix}-nomad_graceful_termination_topic"
}
resource "aws_sns_topic_policy" "nomad_graceful_termination_topic" {
  arn = aws_sns_topic.nomad_graceful_termination_topic.arn
  policy = data.aws_iam_policy_document.nomad_graceful_termination_topic_policy.json
}
data "aws_iam_policy_document" "nomad_graceful_termination_topic_policy" {
  policy_id = "__default_policy_ID"

(The file is truncated here; the full version is in the gist linked above.)
