Here’s the deal: I have a server that I would like to remove from a Nomad server cluster and repurpose for another cluster. The removed server must keep the same IP address. It would seem that once a server is part of gossip, there is no way to remove it from the group. How do I get the other servers in the group to forget about my removed server? Every time I start the Nomad server agent at that IP address, the removed server appears in the list of server members again. “nomad server force-leave” does nothing.
Never done this, but seems legit:
Thanks for the response. Yeah, I tried that path, but “nomad operator raft list-peers” does not list the other servers in the cluster. I can only see those other servers via “nomad server members”, which is rather frustrating.
There is a “recover from outage” guide for Consul covering Raft and peers. Maybe it could help:
Are you shutting down the removed server? “nomad server force-leave” tells the remaining servers that the server is gone, but the server will simply try to rejoin unless it has been shut down and reconfigured to point to the new cluster.
You’ll need to delete the server’s state on disk in order for it to join an entirely new cluster, as well.
Thanks to you all for your responses. I may be doing something wrong (or just something stupid), but it does not seem that force-leave does quite what I had hoped. Let’s assume I have serverA and serverB, each running as a single-node server. While on serverA, I run “nomad server join serverB”. At this point I see both servers listed when I run “nomad server members”, as expected. And when I run “nomad server members” on serverB, I again see both servers listed, as expected.
Now, I would expect that if I’m on serverA and run “nomad server force-leave serverB”, the status would change to ‘leaving’ when I run “nomad server members” again, and in fact it does. To be safe, I go to serverB and do the opposite, running “nomad server force-leave serverA”, and then each server shows the other one with status ‘leaving’. This is all as expected.
Then I stop serverB, completely remove the Nomad data directory on serverB, and restart the Nomad server on serverB in debug mode. When serverB starts up, I can see in the debug output that it sees only itself in the server list, and indeed “nomad server members” on serverB shows only serverB. But a short time later (about 30 seconds), this message appears in the debug output: “[DEBUG] nomad: memberlist: Stream connection from=serverA”, and we end up right back where we started! If I run “nomad server members”, I see both servers again on both nodes, with “alive” status.
So as you can see, this is not my desired result.
Any ideas or thoughts are appreciated, I really need to understand how to correct this.
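To follow a single member’s status while reproducing the steps above (alive, leaving, left, failed), the Status column can be pulled out of “nomad server members”. This is a minimal sketch, assuming Status is the fourth whitespace-separated column of that output; the function name and the member name serverB.global are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read "nomad server members" output on stdin and print
# the Status of one member (e.g. alive, leaving, left, failed).
# Assumes Status is the 4th whitespace-separated column; skips the header row.
member_status() {
  awk -v name="$1" 'NR > 1 && $1 == name { print $4 }'
}

# Example (run on serverA; the member name is hypothetical):
#   nomad server members | member_status serverB.global
```

If the member is not listed at all, the helper prints nothing.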
Can you share your Nomad configuration from both nodes with us? If you wiped out the data dir, there must be something in the config telling Nomad to act like this.
Actually, it looks like I’ve got it! Thanks to tgross: something he said, “the server is gone”, triggered this for me. It comes down to how you think about how this works, specifically the order in which events happen.
The trick is that you need to shut down the server being removed first, and THEN issue the force-leave command on the server(s) that remain in the cluster. It seems that if the server being removed (in my case “serverB”) is still running, it has a chance to tell the rest of the cluster that it’s still alive, which negates any previous force-leave notification.
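The order that finally worked, combined with the earlier advice to delete the removed server’s state on disk, can be sketched as a small shell script. Everything specific in it is an assumption: the host names, the member name serverB.global (use whatever the Name column of “nomad server members” shows), serverB’s data directory path, Nomad running as a systemd unit called nomad, and driving both hosts over ssh. By default it only prints the steps.

```shell
#!/bin/sh
# Sketch of the removal order: stop the server FIRST, then force-leave it.
# All names and paths below are hypothetical; adjust them to your setup.
# DRY_RUN=1 (the default) only prints each step; DRY_RUN=0 runs it over ssh.
set -eu

DRY_RUN="${DRY_RUN:-1}"
REMOVED_HOST="serverB"                 # hypothetical: server being repurposed
REMAINING_HOST="serverA"               # hypothetical: a server that stays
MEMBER_NAME="serverB.global"           # hypothetical: Name in "nomad server members"
DATA_DIR="/mydir/nomad/data/server2"   # hypothetical: serverB's data_dir

# Run a command on a host via ssh, or just print it in dry-run mode.
on() {
  host="$1"
  shift
  if [ "$DRY_RUN" = "1" ]; then
    echo "[$host] $*"
  else
    ssh "$host" "$@"
  fi
}

remove_server() {
  # 1. Stop the agent on the server being removed, so it cannot gossip
  #    that it is still alive and undo the force-leave.
  on "$REMOVED_HOST" sudo systemctl stop nomad
  # 2. Only then tell the remaining cluster that the server is gone.
  on "$REMAINING_HOST" nomad server force-leave "$MEMBER_NAME"
  # 3. Wipe the removed server's state so it can join a different cluster.
  on "$REMOVED_HOST" sudo rm -rf "$DATA_DIR"
}

remove_server
```

Run it once as-is to review the printed steps, then with DRY_RUN=0 to execute them (which assumes ssh and sudo access on both hosts). Before starting serverB against the new cluster, its config should also point at that cluster rather than at the old one.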
Just for the record, here is a sample of the requested config; it is basically the same on both servers (server1.hcl):
# Increase log verbosity
log_level = "DEBUG"
data_dir = "/mydir/nomad/data/server1"
leave_on_interrupt = true
leave_on_terminate = true
# Enable the server
server {
  enabled = true
  server_join {
    retry_join = ["localhost"]
    retry_max = 3
    retry_interval = "15s"
  }
  # Self-elect; should be 3 or 5 for production
  bootstrap_expect = 1
}
And the command I run is as follows:
/mydir/nomad/nomad agent -consul-address=10.10.10.10:8500 -config /mydir/nomad/server1.hcl
Thanks to everyone for their help!