How to recover a single-node Nomad cluster after an IP change?

I run a single node of Consul and Nomad. I’m fine with the lack of redundancy given my use case (non-critical home lab stuff on Raspberry Pi).

Here are some details about my (simple) setup, for posterity.

Consul agent config:

data_dir = "/opt/consul"
server = true
bind_addr = "{{ GetPrivateIP }}"
addresses = {
  http = "{{ GetPrivateIP }}"
}
bootstrap_expect = 1

Nomad agent config:

data_dir  = "/opt/nomad/data"
bind_addr = "{{ GetPrivateIP }}"

server {
  enabled          = true
  bootstrap_expect = 1
}

client {
  enabled = true
  servers = ["{{ GetPrivateIP }}"]
}

I have also modified the following two lines in the Nomad service file:

Wants=consul.service
After=consul.service

Both Consul and Nomad were installed from the official APT repo per the instructions at Downloads | Nomad by HashiCorp, and both were launched via systemd, i.e. systemctl start consul && systemctl start nomad.


While I was working on my network, I had to replace the router, which means that all devices leased new (different) IPs from the (new) DHCP server, including the RPI, where both Consul and Nomad run.

Consul was able to recover easily (internally, Raft elected a new leader): it was enough to just restart the service via systemctl restart consul and the cluster was available again under the new IP. See full log at gist:ef8a77d24696ba449b4038a85874bdbb · GitHub

Nomad however got stuck in a state I don’t know how to recover from, even after reading Outage Recovery | Nomad - HashiCorp Learn

This is a full log after restart (after new IP was leased & after Consul was restarted and running): gist:f17444857718e6b847192d8c0bdfa464 · GitHub

It is repeatedly issuing this error:

nomad: failed to reconcile: error="error removing server with duplicate ID \"68f3e132-263c-a7ea-c6e8-54858f65bac9\": need at least one voter in configuration: {[]}"

For posterity, the old IP was 10.20.1.116 and new one 10.20.1.117.

$ nomad server members
Name                     Address      Port  Status  Leader  Raft Version  Build  Datacenter  Region
ubuntu.global  10.20.1.117  4648  alive   true    3             1.3.2  dc1         global
$ nomad operator raft list-peers
Node       ID                                    Address           State     Voter  RaftProtocol
(unknown)  68f3e132-263c-a7ea-c6e8-54858f65bac9  10.20.1.116:4647  follower  true   unknown
$ nomad operator raft remove-peer -peer-id=68f3e132-263c-a7ea-c6e8-54858f65bac9
Error removing peer: Unexpected response code: 500 (need at least one voter in configuration: {[]})

Which leaves me with two questions:

  1. How do I now (manually) recover from this state?
  2. Is there any way I can configure the server such that it can recover automatically after restart, just like Consul does?

Related:

In the last one @angrycub mentions the following

If you have to change the IP address of your node (or it could be changed in a restart—less of a concern in a bare-metal situation), you will either have to do peers.json recovery or wipe your state. The IP address is a component of the member ID in the raft data.

but for someone new to Nomad, it’s not clear to me where to find the peers.json file. It’s also still not clear to me why Nomad doesn’t auto-recover the same way Consul does.

Perhaps you’re looking for something that you yourself have to create? The peers.json file is created by you, the operator, to define the desired state in case of a recovery; it is placed in the data directory, read by Nomad when restarted, and subsequently deleted.

Have you tried this?

Nomad keeps the machine’s node ID in the data directory, which could be confusing it into trying to recover the old node. If you remove it and place peers.json there instead, you should be able to recover.

My Nomad data dir is /opt/nomad and the node-id file is kept in /opt/nomad/server/node-id. It’s a GUID which uniquely identifies that instance in Nomad’s Raft (I think).

Hey @radeksimko :wave:

Nomad creates a peers.info file in the ${NOMAD_DATA_DIR}/server/raft folder. You can use this file as the start of a peers.json file. It contains instructions for nodes configured with Raft protocol v2 or Raft protocol v3, so you will need to use the correct JSON format based on the Raft protocol that your cluster is using.

For Raft v3, the id field contains the value that is stored in the ${NOMAD_DATA_DIR}/server/node_id file on that node. The address should be the node’s IP address with the RPC port (4647 by default). The non_voter value should be false.

peers.json recovery is an offline technique that replaces the Raft suffrage information with the values provided in the file. So when recovering multiple nodes in a single cluster, you should stop Nomad on all of the servers, copy your created peers.json file into the ${NOMAD_DATA_DIR}/server/raft folder on each, and then restart Nomad on the servers. You will know that your peers.json recovery took effect if:

  • a log line is emitted telling you that it read the peers.json file
  • the peers.json file is deleted by the Nomad process once it has been ingested.
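Applied to the single-server case from the question, the steps above can be sketched as follows. This is an illustrative sketch, not a drop-in script: it writes into a throwaway local directory so the resulting file can be inspected, and it seeds the node-id file with the ID from the question purely so the example is self-contained. On the real host the node-id file already exists, and the peers.json would be written into /opt/nomad/data/server/raft/ while Nomad is stopped.

```shell
# Sketch: build a Raft protocol v3 peers.json for a single-node recovery.
# On the real host: `systemctl stop nomad` first, write the file into
# /opt/nomad/data/server/raft/, then `systemctl start nomad`.
DATA_DIR=./demo-nomad-data/server   # stand-in for /opt/nomad/data/server
mkdir -p "${DATA_DIR}/raft"

# On a real server this file already exists; the ID below is the one from
# the question, written here only to make the sketch self-contained.
echo "68f3e132-263c-a7ea-c6e8-54858f65bac9" > "${DATA_DIR}/node-id"

# Single entry: this server's own ID, the NEW IP address, the RPC port
# (4647 by default), and non_voter set to false.
NODE_ID=$(cat "${DATA_DIR}/node-id")
cat > "${DATA_DIR}/raft/peers.json" <<EOF
[
  {
    "id": "${NODE_ID}",
    "address": "10.20.1.117:4647",
    "non_voter": false
  }
]
EOF

cat "${DATA_DIR}/raft/peers.json"
```

After restarting Nomad, the two indicators above (the log line and the deletion of peers.json) confirm the file was ingested.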

I’ll have to look into the differences in how Consul manages the member list in cases where the suffrage information changes dramatically; however, this process can get a single-node cluster back up and running without wiping the state completely.

Hope this helps!
-cv


Charlie Voiselle (@angrycub)
Engineering - Nomad, HashiCorp