How to recover a single-node Nomad cluster after an IP change?

I run a single node of Consul and Nomad. I’m fine with the lack of redundancy given my use case (non-critical home lab stuff on Raspberry Pi).

Here are some details about my (simple) setup, for posterity:

Consul agent config:

data_dir = "/opt/consul"
server = true
bind_addr = "{{ GetPrivateIP }}"
addresses = {
  http = "{{ GetPrivateIP }}"
}
bootstrap_expect = 1

Nomad agent config:

data_dir  = "/opt/nomad/data"
bind_addr = "{{ GetPrivateIP }}"

server {
  enabled          = true
  bootstrap_expect = 1
}

client {
  enabled = true
  servers = ["{{ GetPrivateIP }}"]
}

I have also modified the following two lines in the Nomad service file:

Wants=consul.service
After=consul.service

Both Consul and Nomad were installed from the official APT repo as per the instructions at Downloads | Nomad by HashiCorp, and both are launched via systemd, i.e. systemctl start consul && systemctl start nomad.
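
Side note in case anyone copies this setup: rather than editing the packaged unit file directly, those two lines can also live in a drop-in created with sudo systemctl edit nomad, which writes /etc/systemd/system/nomad.service.d/override.conf and survives package upgrades:

[Unit]
Wants=consul.service
After=consul.service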


While I was working on my network, I had to replace the router, which meant that all devices leased new (different) IPs from the (new) DHCP server, including the RPi where both Consul and Nomad run.

Now, Consul was able to recover easily (internally, Raft elected a new leader); it was enough to just restart the service via systemctl restart consul and the cluster was available again under the new IP. See the full log at gist:ef8a77d24696ba449b4038a85874bdbb · GitHub

Nomad, however, got stuck in a state I don’t know how to recover from, even after reading Outage Recovery | Nomad - HashiCorp Learn.

This is the full log after the restart (after the new IP was leased and Consul was restarted and running): gist:f17444857718e6b847192d8c0bdfa464 · GitHub

It is repeatedly issuing this error:

nomad: failed to reconcile: error="error removing server with duplicate ID \"68f3e132-263c-a7ea-c6e8-54858f65bac9\": need at least one voter in configuration: {[]}"

For posterity, the old IP was 10.20.1.116 and new one 10.20.1.117.

$ nomad server members
Name                     Address      Port  Status  Leader  Raft Version  Build  Datacenter  Region
ubuntu.global  10.20.1.117  4648  alive   true    3             1.3.2  dc1         global
$ nomad operator raft list-peers
Node       ID                                    Address           State     Voter  RaftProtocol
(unknown)  68f3e132-263c-a7ea-c6e8-54858f65bac9  10.20.1.116:4647  follower  true   unknown
$ nomad operator raft remove-peer -peer-id=68f3e132-263c-a7ea-c6e8-54858f65bac9
Error removing peer: Unexpected response code: 500 (need at least one voter in configuration: {[]})

This leaves me with two questions:

  1. How do I now (manually) recover from this state?
  2. Is there any way I can configure the server such that it can recover automatically after restart, just like Consul does?

Related:

In the last one, @angrycub mentions the following:

If you have to change the IP address of your node (or it could be changed in a restart—less of a concern in a bare-metal situation), you will either have to do peers.json recovery or wipe your state. The IP address is a component of the member ID in the raft data.

but as someone new to Nomad, it’s not clear to me where to find this peers.json file. It’s also still not clear to me why Nomad doesn’t auto-recover the same way Consul does.

Perhaps you’re looking for something that you yourself have to create? The peers.json file is created by you, the operator, to define the desired state in the case of a recovery; it is placed in the data directory, read by Nomad on restart, and subsequently deleted.
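
With the config above it goes under the Raft directory inside data_dir, i.e. something like /opt/nomad/data/server/raft/peers.json. Assuming Raft protocol 3, a single-server file built from your raft list-peers output and the new IP would look roughly like this (on protocol 2 the file is instead just a JSON array of "ip:port" strings):

[
  {
    "id": "68f3e132-263c-a7ea-c6e8-54858f65bac9",
    "address": "10.20.1.117:4647",
    "non_voter": false
  }
]

The id should match what is in <data_dir>/server/node-id, the address is the server RPC address (port 4647), and Nomad reads the file once at startup, applies it as the new Raft membership, and then deletes it.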

Have you tried this?

Nomad keeps the node ID of the machine in the data directory, which could be confusing it into trying to recover the old node. If you remove it and place a peers.json there instead, you should be able to recover.

My Nomad data dir is /opt/nomad and the node-id file is kept in /opt/nomad/server/node-id. It’s a GUID which uniquely identifies that instance in Nomad’s Raft (I think).
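
Putting it together, the recovery would look roughly like this, a sketch assuming the /opt/nomad/data data_dir from the original config and keeping the existing node-id so that the id in peers.json still matches it:

sudo systemctl stop nomad

# copy in the peers.json sketched earlier (assumes it was saved as ./peers.json);
# make sure it ends up readable by whichever user the nomad service runs as
sudo cp peers.json /opt/nomad/data/server/raft/peers.json

sudo systemctl start nomad

# peers.json is read once at startup and then removed;
# the peer list should now show the new IP
nomad operator raft list-peers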