Auto scaling group / terminating nomad servers

My Nomad servers are behind an Amazon auto scaling group. When I need to refresh them, I’ve simply been using the AWS CLI to do this:

aws autoscaling start-instance-refresh --auto-scaling-group-name nomad-servers

This initiates a replacement of each instance, one at a time. Per this AWS documentation, that is supposed to trigger a graceful shutdown of all services before terminating the instance.
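(Side note: the refresh progress can also be watched from the CLI, using the same group name as above, e.g.:)

aws autoscaling describe-instance-refreshes \
  --auto-scaling-group-name nomad-servers \
  --query 'InstanceRefreshes[0].[Status,PercentageComplete]'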

So why then is my nomad cluster left in a state where each (of the 3) servers think there are missing servers? If nomad was shut down gracefully, shouldn’t it have gracefully left the cluster?

I have systemd set up to manage Nomad using this service. When the instance launches, this is invoked:

sudo systemctl enable nomad
sudo systemctl start nomad

One thing I find questionable(?) about that service is KillSignal=SIGINT – would SIGTERM be more appropriate here?
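As a sanity check, this is roughly how to see what systemd will actually do on stop, and what Nomad logged around the last shutdown (assuming the unit is simply named nomad):

# what stop/kill behaviour is configured for the unit
sudo systemctl show nomad -p KillSignal -p KillMode -p TimeoutStopSec

# what the agent logged recently around shutdown
sudo journalctl -u nomad --since "30 minutes ago" | grep -iE 'leave|shutdown|signal'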

Here’s what my cluster looks like after this AWS instance refresh. All of these “left” servers are dead and gone at this point.

$ NOMAD_ADDR=http://10.30.11.6:4646 nomad server members
Name                                                  Address       Port  Status  Leader  Raft Version  Build  Datacenter  Region
ip-10-30-11-6.eu-central-1.compute.internal.global    10.30.11.6    4648  alive   true    3             1.3.5  dc1         global
ip-10-30-21-145.eu-central-1.compute.internal.global  10.30.21.145  4648  alive   false   3             1.3.5  dc1         global
ip-10-30-21-37.eu-central-1.compute.internal.global   10.30.21.37   4648  left    false   3             1.3.5  dc1         global
ip-10-30-31-168.eu-central-1.compute.internal.global  10.30.31.168  4648  left    false   3             1.3.5  dc1         global
ip-10-30-31-246.eu-central-1.compute.internal.global  10.30.31.246  4648  alive   false   3             1.3.5  dc1         global

$ NOMAD_ADDR=http://10.30.21.145:4646 nomad server members
Name                                                  Address       Port  Status  Leader  Raft Version  Build  Datacenter  Region
ip-10-30-11-6.eu-central-1.compute.internal.global    10.30.11.6    4648  alive   true    3             1.3.5  dc1         global
ip-10-30-21-145.eu-central-1.compute.internal.global  10.30.21.145  4648  alive   false   3             1.3.5  dc1         global
ip-10-30-31-246.eu-central-1.compute.internal.global  10.30.31.246  4648  alive   false   3             1.3.5  dc1         global

$ NOMAD_ADDR=http://10.30.31.246:4646 nomad server members
Name                                                  Address       Port  Status  Leader  Raft Version  Build  Datacenter  Region
ip-10-30-11-6.eu-central-1.compute.internal.global    10.30.11.6    4648  alive   true    3             1.3.5  dc1         global
ip-10-30-21-145.eu-central-1.compute.internal.global  10.30.21.145  4648  alive   false   3             1.3.5  dc1         global
ip-10-30-21-37.eu-central-1.compute.internal.global   10.30.21.37   4648  left    false   3             1.3.5  dc1         global
ip-10-30-31-246.eu-central-1.compute.internal.global  10.30.31.246  4648  alive   false   3             1.3.5  dc1         global

I investigated the SIGTERM vs SIGINT thing a bit. It seems like they’re handled the same.

I added the following to my configs, and it appears to have no effect. The cluster ends up in the same state:

leave_on_interrupt = true
leave_on_terminate = true
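For anyone copying those: they are top-level agent options, not part of the server block. One way to sanity-check the signal path independently of the ASG is to send the signal by hand on a non-leader server and watch the membership from another one, e.g.:

# on a non-leader server: send the same signal systemd would send on stop
sudo systemctl kill --kill-who=main --signal=SIGINT nomad

# from another server: see whether that member goes to "left" or "failed"
NOMAD_ADDR=http://10.30.11.6:4646 nomad server members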

Additionally, while this autoscaling replacement is happening, I get dropped connections to the web UI. Admittedly I’m just refreshing the /ui/servers endpoint, but I would expect a graceful rolling restart not to drop connections.

For just a tad more context, I have a load balancer in front of the auto scaling group. The rolling restart is supposed to remove/drain the instance from the load balancer. From the AWS side that happens in all other cases where we use load balancers and ASGs. So the issue feels like it’s on the nomad side.
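One thing worth double-checking on the AWS side is whether connection draining is actually kicking in for that target group, e.g. (the target group ARN is a placeholder):

# how long the ALB waits before cutting existing connections on deregistration
aws elbv2 describe-target-group-attributes \
  --target-group-arn <nomad-ui-target-group-arn> \
  --query 'Attributes[?Key==`deregistration_delay.timeout_seconds`].Value'

# which targets are healthy / draining while the refresh runs
aws elbv2 describe-target-health --target-group-arn <nomad-ui-target-group-arn>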

I think servers that have left are supposed to stay in the list of server members for some delay. I think it is 8 hours.
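To see whether the departed servers are still actual Raft peers (as opposed to just lingering in the gossip member list), and what autopilot is configured to clean up, something like this should show it:

NOMAD_ADDR=http://10.30.11.6:4646 nomad operator raft list-peers
NOMAD_ADDR=http://10.30.11.6:4646 nomad operator autopilot get-config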

I too see similar behavior when refreshing my ASG of Nomad servers

You see similar behavior of dropped connections to the web UI? There must be a better way?

I have never bothered much, because the browser disconnection only lasts a few seconds/minutes during the refresh.

I have not bothered to tweak the ALB settings much, as people browsing the GUI on the servers are fine with doing a CTRL+F5 (full refresh) in case the UI doesn’t load, or loads halfway.

As the real communication between the compute nodes and the servers is all via IPs and not the DNS endpoint, I haven’t thought about it much.

😇 🙂

this won’t answer your question @josh.m.sharpe as to the “why” – but we did face a similar issue, so we ended up ::

  • running a nomad server cluster of 3
  • creating a 1-node ASG for each nomad server, that declares a specific AZ and IP address, via a TF module
    • it seems like things are just easier when you maintain a consistent set of nomad server IPs
  • before “refreshing” a nomad server ::
    • deregister the target from the AWS ALB
    • make it ineligible
    • drain it of jobs
  • “refresh” the node
    • we are actually issuing a “terminate” command, but i assume the ASG refresh option would work

here’s a link to a hacky script we once used that will cycle thru our nomad servers in the way i just described …
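the script itself is linked above, but roughly, the per-server cycle looks something like this (instance id, node id and target group ARN are placeholders, and it assumes the box also runs a nomad client, since that’s what the drain applies to):

# pull the target out of the ALB so it stops taking UI/API traffic
aws elbv2 deregister-targets --target-group-arn <tg-arn> --targets Id=<instance-id>

# stop scheduling onto it and migrate its work
nomad node eligibility -disable <node-id>
nomad node drain -enable -yes <node-id>

# replace it; the ASG keeps desired capacity and brings up a fresh instance
aws autoscaling terminate-instance-in-auto-scaling-group \
  --instance-id <instance-id> --no-should-decrement-desired-capacity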

HTH