I’m being asked to take our production clusters from 3 to 5 nodes for greater redundancy. The clusters were created with “bootstrap_expect”:3 and the setting still resides in the config.json. When I added a new server to the cluster it just kicked one of the other 3 out. Is it possible to gracefully add 2 more nodes without rebuilding the cluster? I’ve been unsuccessful in finding the steps to follow.
This is not adequately explained by bootstrap_expect being set to 3, since that only triggers bootstrapping of a new configuration, not eviction from an existing cluster, so the highest priority is to understand what happened there. This is not what I would expect to happen in a normally configured cluster.
Some useful things you could post to further that:
- Consul server configuration files
- Output of consul operator raft list-peers
- Consul server logging, from the leader node of the cluster, during one server kicking another out
consul --version
Consul v1.15.2
Revision 5e08e229
Build Date 2023-03-30T17:51:19Z
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)
/etc/consul/config.json
{
"advertise_addr": "192.168.23.30",
"bootstrap_expect": 3,
"client_addr": "0.0.0.0",
"data_dir": "/var/data/consul",
"datacenter": "my_dc",
"dns_config": {},
"enable_syslog": true,
"log_level": "INFO",
"retry_join": [
"192.168.23.30",
"192.168.23.31",
"192.168.23.32"
],
"retry_join_wan": [
"192.169.0.10",
"192.169.0.8",
"192.169.0.9",
"192.171.244.10",
"192.171.244.11",
"192.171.244.9",
"192.180.28.18",
"192.180.28.19",
"192.180.28.20",
"192.181.12.4",
"192.181.12.5",
"192.181.12.6",
"192.183.248.112",
"192.183.248.113",
"192.183.248.114",
"192.184.248.20",
"192.184.248.21",
"192.191.0.30",
"192.191.0.31",
"192.191.0.32",
"192.192.160.249",
"192.192.160.251",
"192.192.160.252",
"192.192.160.253",
"192.192.160.254",
"192.192.47.246",
"192.192.47.251",
"192.192.47.252",
"192.192.47.253",
"192.192.47.254",
"192.193.171.249",
"192.193.171.251",
"192.193.171.252",
"192.193.171.253",
"192.193.171.254",
"192.193.155.249",
"192.193.155.251",
"192.193.155.252",
"192.193.155.253",
"192.193.155.254",
"192.194.0.249",
"192.194.0.251",
"192.194.0.252",
"192.194.0.253",
"192.194.0.254",
"192.194.64.250",
"192.194.64.251",
"192.194.64.252",
"192.194.64.253",
"192.194.64.254"
],
"server": true,
"syslog_facility": "LOCAL3",
"telemetry": {
"disable_hostname": true,
"prometheus_retention_time": "120s"
},
"ui": true
}
consul operator raft list-peers
Node ID Address State Voter RaftProtocol
consul3.team.site.myorg.com 5482fe8a-7b4b-2196-d92d-f5567c3ac74c 192.168.23.30:8300 follower true 3
consul4.team.site.myorg.com 3f5af89d-fd54-07a6-3967-501efab5dadf 192.168.23.31:8300 leader true 3
consul5.team.site.myorg.com 5f24f173-440b-4865-3d89-43b19e044585 192.168.23.32:8300 follower true 3
=== consul monitor from leader (consul4) while turning on the new consul server (consul6.team.site.myorg.com 192.168.23.35)
2023-06-19T21:16:15.972Z [INFO] agent.server.raft: updating peer: peer=5f24f173-440b-4865-3d89-43b19e044585
2023-06-19T21:16:15.974Z [INFO] agent.server: member joined, marking health alive: member=consul6.team.site.myorg.com partition=default
2023-06-19T21:17:14.957Z [ERROR] agent.server.raft: failed to heartbeat to: peer=192.168.23.35:8300 backoff time=10ms error=EOF
2023-06-19T21:17:15.626Z [WARN] agent: error getting server health from server: server=consul6.team.site.myorg.com error="rpc error making call: stream closed"
2023-06-19T21:17:15.963Z [INFO] agent.server: member joined, marking health alive: member=consul6.team.site.myorg.com partition=default
2023-06-19T21:17:15.971Z [INFO] agent.server.raft: updating configuration: command=AddVoter server-id=5f24f173-440b-4865-3d89-43b19e044585 server-addr=192.168.23.32:8300 servers="[{Suffrage:Voter ID:5482fe8a-7b4b-2196-d92d-f5567c3ac74c Address:192.168.23.30:8300} {Suffrage:Voter ID:3f5af89d-fd54-07a6-3967-501efab5dadf Address:192.168.23.31:8300} {Suffrage:Voter ID:5f24f173-440b-4865-3d89-43b19e044585 Address:192.168.23.32:8300}]"
2023-06-19T21:17:15.974Z [INFO] agent.server.raft: updating peer: peer=5f24f173-440b-4865-3d89-43b19e044585
2023-06-19T21:17:15.977Z [INFO] agent.server: member joined, marking health alive: member=consul5.team.site.myorg.com partition=default
2023-06-19T21:17:16.623Z [WARN] agent: error getting server health from server: server=consul6.team.site.myorg.com error="context deadline exceeded"
2023-06-19T21:17:16.832Z [INFO] agent.server.serf.wan: serf: EventMemberUpdate: consul6.team.site.myorg.com.my_dc
2023-06-19T21:17:16.832Z [INFO] agent.server: Handled event for server in area: event=member-update server=consul6.team.site.myorg.com.my_dc area=wan
2023-06-19T21:18:15.965Z [INFO] agent.server.raft: updating configuration: command=AddVoter server-id=5f24f173-440b-4865-3d89-43b19e044585 server-addr=192.168.23.35:8300 servers="[{Suffrage:Voter ID:5482fe8a-7b4b-2196-d92d-f5567c3ac74c Address:192.168.23.30:8300} {Suffrage:Voter ID:3f5af89d-fd54-07a6-3967-501efab5dadf Address:192.168.23.31:8300} {Suffrage:Voter ID:5f24f173-440b-4865-3d89-43b19e044585 Address:192.168.23.35:8300}]"
2023-06-19T21:18:15.970Z [INFO] agent.server.raft: updating peer: peer=5f24f173-440b-4865-3d89-43b19e044585
2023-06-19T21:18:15.973Z [INFO] agent.server: member joined, marking health alive: member=consul6.team.site.myorg.com partition=default
2023-06-19T21:18:15.981Z [INFO] agent.server.raft: updating configuration: command=AddVoter server-id=5f24f173-440b-4865-3d89-43b19e044585 server-addr=192.168.23.32:8300 servers="[{Suffrage:Voter ID:5482fe8a-7b4b-2196-d92d-f5567c3ac74c Address:192.168.23.30:8300} {Suffrage:Voter ID:3f5af89d-fd54-07a6-3967-501efab5dadf Address:192.168.23.31:8300} {Suffrage:Voter ID:5f24f173-440b-4865-3d89-43b19e044585 Address:192.168.23.32:8300}]"
2023-06-19T21:18:16.089Z [INFO] agent.server.raft: updating peer: peer=5f24f173-440b-4865-3d89-43b19e044585
2023-06-19T21:18:16.094Z [INFO] agent.server: member joined, marking health alive: member=consul5.team.site.myorg.com partition=default
2023-06-19T21:18:45.284Z [ERROR] agent.server.raft: failed to heartbeat to: peer=192.168.23.32:8300 backoff time=10ms error=EOF
2023-06-19T21:18:45.596Z [WARN] agent: error getting server health from server: server=consul6.team.site.myorg.com error="rpc error making call: stream closed"
2023-06-19T21:18:46.596Z [WARN] agent: error getting server health from server: server=consul6.team.site.myorg.com error="context deadline exceeded"
2023-06-19T21:18:47.543Z [INFO] agent.server.serf.wan: serf: EventMemberUpdate: consul6.team.site.myorg.com.my_dc
2023-06-19T21:18:47.543Z [INFO] agent.server: Handled event for server in area: event=member-update server=consul6.team.site.myorg.com.my_dc area=wan
2023-06-19T21:18:51.208Z [INFO] agent.server.serf.lan: serf: EventMemberUpdate: consul6.team.site.myorg.com
2023-06-19T21:18:51.208Z [INFO] agent.server: Updating LAN server: server="consul6.team.site.myorg.com (Addr: tcp/192.168.23.35:8300) (DC: my_dc)"
2023-06-19T21:18:51.209Z [INFO] agent.server.raft: updating configuration: command=AddVoter server-id=5f24f173-440b-4865-3d89-43b19e044585 server-addr=192.168.23.35:8300 servers="[{Suffrage:Voter ID:5482fe8a-7b4b-2196-d92d-f5567c3ac74c Address:192.168.23.30:8300} {Suffrage:Voter ID:3f5af89d-fd54-07a6-3967-501efab5dadf Address:192.168.23.31:8300} {Suffrage:Voter ID:5f24f173-440b-4865-3d89-43b19e044585 Address:192.168.23.35:8300}]"
2023-06-19T21:18:51.214Z [INFO] agent.server.raft: updating peer: peer=5f24f173-440b-4865-3d89-43b19e044585
2023-06-19T21:18:51.217Z [INFO] agent.server: member joined, marking health alive: member=consul6.team.site.myorg.com partition=default
2023-06-19T21:19:15.966Z [INFO] agent.server.raft: updating configuration: command=AddVoter server-id=5f24f173-440b-4865-3d89-43b19e044585 server-addr=192.168.23.32:8300 servers="[{Suffrage:Voter ID:5482fe8a-7b4b-2196-d92d-f5567c3ac74c Address:192.168.23.30:8300} {Suffrage:Voter ID:3f5af89d-fd54-07a6-3967-501efab5dadf Address:192.168.23.31:8300} {Suffrage:Voter ID:5f24f173-440b-4865-3d89-43b19e044585 Address:192.168.23.32:8300}]"
2023-06-19T21:19:15.970Z [INFO] agent.server.raft: updating peer: peer=5f24f173-440b-4865-3d89-43b19e044585
2023-06-19T21:19:15.972Z [INFO] agent.server: member joined, marking health alive: member=consul5.team.site.myorg.com partition=default
2023-06-19T21:19:15.995Z [INFO] agent.server.raft: updating configuration: command=AddVoter server-id=5f24f173-440b-4865-3d89-43b19e044585 server-addr=192.168.23.35:8300 servers="[{Suffrage:Voter ID:5482fe8a-7b4b-2196-d92d-f5567c3ac74c Address:192.168.23.30:8300} {Suffrage:Voter ID:3f5af89d-fd54-07a6-3967-501efab5dadf Address:192.168.23.31:8300} {Suffrage:Voter ID:5f24f173-440b-4865-3d89-43b19e044585 Address:192.168.23.35:8300}]"
2023-06-19T21:19:16.045Z [INFO] agent.server.raft: updating peer: peer=5f24f173-440b-4865-3d89-43b19e044585
2023-06-19T21:19:16.049Z [INFO] agent.server: member joined, marking health alive: member=consul6.team.site.myorg.com partition=default
2023-06-19T21:20:15.968Z [INFO] agent.server.raft: updating configuration: command=AddVoter server-id=5f24f173-440b-4865-3d89-43b19e044585 server-addr=192.168.23.32:8300 servers="[{Suffrage:Voter ID:5482fe8a-7b4b-2196-d92d-f5567c3ac74c Address:192.168.23.30:8300} {Suffrage:Voter ID:3f5af89d-fd54-07a6-3967-501efab5dadf Address:192.168.23.31:8300} {Suffrage:Voter ID:5f24f173-440b-4865-3d89-43b19e044585 Address:192.168.23.32:8300}]"
2023-06-19T21:20:15.974Z [INFO] agent.server.raft: updating peer: peer=5f24f173-440b-4865-3d89-43b19e044585
2023-06-19T21:20:15.977Z [INFO] agent.server: member joined, marking health alive: member=consul5.team.site.myorg.com partition=default
2023-06-19T21:20:15.984Z [ERROR] agent.server.raft: failed to heartbeat to: peer=192.168.23.32:8300 backoff time=10ms error=EOF
2023-06-19T21:20:17.842Z [INFO] agent.server.serf.wan: serf: EventMemberUpdate: consul6.team.site.myorg.com.my_dc
2023-06-19T21:20:17.842Z [INFO] agent.server: Handled event for server in area: event=member-update server=consul6.team.site.myorg.com.my_dc area=wan
2023-06-19T21:20:21.596Z [WARN] agent: error getting server health from server: server=consul6.team.site.myorg.com error="rpc error making call: stream closed"
2023-06-19T21:20:22.596Z [WARN] agent: error getting server health from server: server=consul6.team.site.myorg.com error="context deadline exceeded"
2023-06-19T21:20:43.503Z [WARN] agent.dns: Skipping invalid node for NS records: node=consul3.team.site.myorg.com
2023-06-19T21:20:43.503Z [WARN] agent.dns: Skipping invalid node for NS records: node=consul4.team.site.myorg.com
2023-06-19T21:20:43.503Z [WARN] agent.dns: Skipping invalid node for NS records: node=consul5.team.site.myorg.com
2023-06-19T21:21:15.967Z [INFO] agent.server.raft: updating configuration: command=AddVoter server-id=5f24f173-440b-4865-3d89-43b19e044585 server-addr=192.168.23.35:8300 servers="[{Suffrage:Voter ID:5482fe8a-7b4b-2196-d92d-f5567c3ac74c Address:192.168.23.30:8300} {Suffrage:Voter ID:3f5af89d-fd54-07a6-3967-501efab5dadf Address:192.168.23.31:8300} {Suffrage:Voter ID:5f24f173-440b-4865-3d89-43b19e044585 Address:192.168.23.35:8300}]"
2023-06-19T21:21:15.972Z [INFO] agent.server.raft: updating peer: peer=5f24f173-440b-4865-3d89-43b19e044585
2023-06-19T21:21:15.975Z [INFO] agent.server: member joined, marking health alive: member=consul6.team.site.myorg.com partition=default
2023-06-19T21:21:16.140Z [INFO] agent.server.raft: updating configuration: command=AddVoter server-id=5f24f173-440b-4865-3d89-43b19e044585 server-addr=192.168.23.32:8300 servers="[{Suffrage:Voter ID:5482fe8a-7b4b-2196-d92d-f5567c3ac74c Address:192.168.23.30:8300} {Suffrage:Voter ID:3f5af89d-fd54-07a6-3967-501efab5dadf Address:192.168.23.31:8300} {Suffrage:Voter ID:5f24f173-440b-4865-3d89-43b19e044585 Address:192.168.23.32:8300}]"
2023-06-19T21:21:16.184Z [INFO] agent.server.raft: updating peer: peer=5f24f173-440b-4865-3d89-43b19e044585
2023-06-19T21:21:16.190Z [INFO] agent.server: member joined, marking health alive: member=consul5.team.site.myorg.com partition=default
2023-06-19T21:21:46.872Z [ERROR] agent.server.raft: failed to heartbeat to: peer=192.168.23.32:8300 backoff time=10ms error=EOF
2023-06-19T21:21:47.596Z [WARN] agent: error getting server health from server: server=consul6.team.site.myorg.com error="rpc error making call: stream closed"
2023-06-19T21:21:47.945Z [INFO] agent.server.serf.wan: serf: EventMemberUpdate: consul6.team.site.myorg.com.my_dc
2023-06-19T21:21:47.945Z [INFO] agent.server: Handled event for server in area: event=member-update server=consul6.team.site.myorg.com.my_dc area=wan
2023-06-19T21:21:48.596Z [WARN] agent: error getting server health from server: server=consul6.team.site.myorg.com error="context deadline exceeded"
2023-06-19T21:22:15.968Z [INFO] agent.server.raft: updating configuration: command=AddVoter server-id=5f24f173-440b-4865-3d89-43b19e044585 server-addr=192.168.23.35:8300 servers="[{Suffrage:Voter ID:5482fe8a-7b4b-2196-d92d-f5567c3ac74c Address:192.168.23.30:8300} {Suffrage:Voter ID:3f5af89d-fd54-07a6-3967-501efab5dadf Address:192.168.23.31:8300} {Suffrage:Voter ID:5f24f173-440b-4865-3d89-43b19e044585 Address:192.168.23.35:8300}]"
2023-06-19T21:22:15.972Z [INFO] agent.server.raft: updating peer: peer=5f24f173-440b-4865-3d89-43b19e044585
2023-06-19T21:22:15.974Z [INFO] agent.server: member joined, marking health alive: member=consul6.team.site.myorg.com partition=default
2023-06-19T21:22:24.002Z [INFO] agent.server.serf.lan: serf: EventMemberUpdate: consul6.team.site.myorg.com
2023-06-19T21:22:24.002Z [INFO] agent.server: Updating LAN server: server="consul6.team.site.myorg.com (Addr: tcp/192.168.23.35:8300) (DC: my_dc)"
2023-06-19T21:22:24.002Z [INFO] agent.server: member joined, marking health alive: member=consul6.team.site.myorg.com partition=default
2023-06-19T21:23:15.970Z [INFO] agent.server.raft: updating configuration: command=AddVoter server-id=5f24f173-440b-4865-3d89-43b19e044585 server-addr=192.168.23.32:8300 servers="[{Suffrage:Voter ID:5482fe8a-7b4b-2196-d92d-f5567c3ac74c Address:192.168.23.30:8300} {Suffrage:Voter ID:3f5af89d-fd54-07a6-3967-501efab5dadf Address:192.168.23.31:8300} {Suffrage:Voter ID:5f24f173-440b-4865-3d89-43b19e044585 Address:192.168.23.32:8300}]"
2023-06-19T21:23:15.975Z [INFO] agent.server.raft: updating peer: peer=5f24f173-440b-4865-3d89-43b19e044585
2023-06-19T21:23:15.978Z [INFO] agent.server: member joined, marking health alive: member=consul5.team.site.myorg.com partition=default
2023-06-19T21:23:16.842Z [ERROR] agent.server.raft: failed to heartbeat to: peer=192.168.23.32:8300 backoff time=10ms error=EOF
2023-06-19T21:23:18.438Z [INFO] agent.server.serf.lan: serf: EventMemberUpdate: consul6.team.site.myorg.com
2023-06-19T21:23:18.438Z [INFO] agent.server: Updating LAN server: server="consul6.team.site.myorg.com (Addr: tcp/192.168.23.35:8300) (DC: my_dc)"
2023-06-19T21:23:18.438Z [INFO] agent.server.raft: updating configuration: command=AddVoter server-id=5f24f173-440b-4865-3d89-43b19e044585 server-addr=192.168.23.35:8300 servers="[{Suffrage:Voter ID:5482fe8a-7b4b-2196-d92d-f5567c3ac74c Address:192.168.23.30:8300} {Suffrage:Voter ID:3f5af89d-fd54-07a6-3967-501efab5dadf Address:192.168.23.31:8300} {Suffrage:Voter ID:5f24f173-440b-4865-3d89-43b19e044585 Address:192.168.23.35:8300}]"
2023-06-19T21:23:18.444Z [INFO] agent.server.raft: updating peer: peer=5f24f173-440b-4865-3d89-43b19e044585
2023-06-19T21:23:18.447Z [INFO] agent.server: member joined, marking health alive: member=consul6.team.site.myorg.com partition=default
2023-06-19T21:23:18.586Z [INFO] agent.server.serf.wan: serf: EventMemberUpdate: consul6.team.site.myorg.com.my_dc
2023-06-19T21:23:18.587Z [INFO] agent.server: Handled event for server in area: event=member-update server=consul6.team.site.myorg.com.my_dc area=wan
2023-06-19T21:23:21.595Z [WARN] agent: error getting server health from server: server=consul6.team.site.myorg.com error="rpc error making call: stream closed"
2023-06-19T21:23:22.596Z [WARN] agent: error getting server health from server: server=consul6.team.site.myorg.com error="context deadline exceeded"
/usr/bin/consul operator raft list-peers
Node ID Address State Voter RaftProtocol
consul3.team.site.myorg.com 5482fe8a-7b4b-2196-d92d-f5567c3ac74c 192.168.23.30:8300 follower true 3
consul4.team.site.myorg.com 3f5af89d-fd54-07a6-3967-501efab5dadf 192.168.23.31:8300 leader true 3
consul6.team.site.myorg.com 5f24f173-440b-4865-3d89-43b19e044585 192.168.23.35:8300 follower true 3
Aha, that reveals the issue.
Each node in a Consul server cluster is identified by a node ID. This is the UUID showing up in the consul operator raft list-peers output and at various points in the log.
Somehow, both your old and new nodes have the same UUID, and this is causing Consul’s clustering to treat them as one node moving IP address.
The only way I can think of that this would occur is if you copied the old node’s data directory to the new node before starting it up. This isn’t necessary and, as observed, is actually harmful. A new Consul node can be started up with an empty data directory; it will find the cluster via the retry_join addresses in the configuration file and transfer a copy of the cluster data automatically over the network. There’s no need to pre-copy the cluster data.
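For anyone wanting to spot this condition mechanically, here is a minimal Python sketch that compares two list-peers snapshots and reports any node ID seen under more than one node name or address. The function names are my own, and the parser assumes the six-column layout shown in this thread; other Consul versions may print different columns.

```python
from collections import defaultdict


def parse_peers(output):
    """Parse rows of `consul operator raft list-peers` (six-column layout assumed)."""
    peers = []
    for line in output.strip().splitlines():
        fields = line.split()
        # Skip the header row and anything that is not a six-column peer row.
        if len(fields) != 6 or fields[0] == "Node":
            continue
        node, node_id, address = fields[0], fields[1], fields[2]
        peers.append({"node": node, "id": node_id, "address": address})
    return peers


def reused_ids(before, after):
    """Return node IDs that appear with more than one (node, address) pair."""
    seen = defaultdict(set)
    for peer in parse_peers(before) + parse_peers(after):
        seen[peer["id"]].add((peer["node"], peer["address"]))
    return {node_id: pairs for node_id, pairs in seen.items() if len(pairs) > 1}


# The two snapshots posted in this thread.
BEFORE = """
Node ID Address State Voter RaftProtocol
consul3.team.site.myorg.com 5482fe8a-7b4b-2196-d92d-f5567c3ac74c 192.168.23.30:8300 follower true 3
consul4.team.site.myorg.com 3f5af89d-fd54-07a6-3967-501efab5dadf 192.168.23.31:8300 leader true 3
consul5.team.site.myorg.com 5f24f173-440b-4865-3d89-43b19e044585 192.168.23.32:8300 follower true 3
"""

AFTER = """
Node ID Address State Voter RaftProtocol
consul3.team.site.myorg.com 5482fe8a-7b4b-2196-d92d-f5567c3ac74c 192.168.23.30:8300 follower true 3
consul4.team.site.myorg.com 3f5af89d-fd54-07a6-3967-501efab5dadf 192.168.23.31:8300 leader true 3
consul6.team.site.myorg.com 5f24f173-440b-4865-3d89-43b19e044585 192.168.23.35:8300 follower true 3
"""

for node_id, pairs in reused_ids(BEFORE, AFTER).items():
    print(node_id, sorted(pairs))
```

With the snapshots above, the only ID reported is the one shared by consul5 and consul6, which matches the AddVoter lines in the log alternating between 192.168.23.32 and 192.168.23.35.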
Thank you so much, @maxb! I had cloned one of the existing nodes to build a new one, and I blindly ran “rm -f /var/lib/consul/node-id” without realizing that this is not the configured data location for this node.
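To avoid removing the file from the wrong location on a cloned machine, the path can be derived from the agent's own configuration instead of being typed from memory. This is only a sketch: the helper name is hypothetical, and it assumes the node ID is persisted in a file named node-id directly under data_dir, which is consistent with the rm command above but should be checked on your own nodes.

```python
import json
from pathlib import PurePosixPath


def node_id_path(config_text):
    """Return the node-id file location implied by the agent's data_dir."""
    config = json.loads(config_text)
    # Assumption: the node ID lives in "<data_dir>/node-id".
    return PurePosixPath(config["data_dir"]) / "node-id"


# Trimmed copy of the config posted earlier; only data_dir matters here.
CONFIG = '{"data_dir": "/var/data/consul", "server": true}'

print(node_id_path(CONFIG))  # /var/data/consul/node-id
```

On the node in this thread that points at /var/data/consul rather than /var/lib/consul, which is why the original rm had no effect.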