I run a single node of Consul and Nomad. I’m fine with the lack of redundancy given my use case (non-critical home lab stuff on Raspberry Pi).
Here are some details about my (simple) setup for posterity
Consul agent config:
data_dir = "/opt/consul"
server = true
bind_addr = "{{ GetPrivateIP }}"
addresses = {
http = "{{ GetPrivateIP }}"
}
bootstrap_expect = 1
Nomad agent config
data_dir = "/opt/nomad/data"
bind_addr = "{{ GetPrivateIP }}"
server {
enabled = true
bootstrap_expect = 1
}
client {
enabled = true
servers = ["{{ GetPrivateIP }}"]
}
I have also modified the following two lines in the Nomad service file:
Wants=consul.service
After=consul.service
both Consul and Nomad were installed from the official APT repo as per instructions at Install | Nomad | HashiCorp Developer
and also launched both via systemd, i.e. systemctl start consul && systemctl start nomad
While I was working on my network, I had to replace the router, which means that all devices leased new (different) IPs from the (new) DHCP server, including the RPI, where both Consul and Nomad run.
Now Consul was able to recover easily (and internally raft elected a new leader), it was enough to just restart the service via systemctl restart consul
and cluster was available again under the new IP. See full log at gist:ef8a77d24696ba449b4038a85874bdbb · GitHub
Nomad however got stuck in a state I don’t know how to recover from, even after reading Outage Recovery | Nomad | HashiCorp Developer
This is a full log after restart (after new IP was leased & after Consul was restarted and running): gist:f17444857718e6b847192d8c0bdfa464 · GitHub
It is repeatedly issuing this error:
nomad: failed to reconcile: error="error removing server with duplicate ID \"68f3e132-263c-a7ea-c6e8-54858f65bac9\": need at least one voter in configuration: {[]}
For posterity, the old IP was 10.20.1.116
and new one 10.20.1.117
.
$ nomad server members
Name Address Port Status Leader Raft Version Build Datacenter Region
ubuntu.global 10.20.1.117 4648 alive true 3 1.3.2 dc1 global
$ nomad operator raft list-peers
Node ID Address State Voter RaftProtocol
(unknown) 68f3e132-263c-a7ea-c6e8-54858f65bac9 10.20.1.116:4647 follower true unknown
$ nomad operator raft remove-peer -peer-id=68f3e132-263c-a7ea-c6e8-54858f65bac9
Error removing peer: Unexpected response code: 500 (need at least one voter in configuration: {[]})
Which leaves me with two questions:
- How do I now (manually) recover from this state?
- Is there any way I can configure the server such that it can recover automatically after restart, just like Consul does?
Related:
- Job persistence - single server cluster - #3 by XenoPhage
- Nomad stop working once DHCP lease of machine expires
- 1 Machine per Cluster - #3 by angrycub
In the last one @angrycub mentions the following
If you have to change the IP address of your node (or it could be changed in a restart—less of a concern in a bare-metal situation), you will either have to do peers.json recovery or wipe your state. The IP address is a component of the member ID in the raft data.
but for someone new to Nomad, it’s not clear to me where to find the peers.json
file? Also it’s still not clear to me why Nomad doesn’t auto-recover the same way Consul does.