Background
We have separate prod and pre Consul clusters, each running three server nodes. The Consul version is 1.4.1, deployed using `consul agent -server`, and there was no explicit join configuration between the two environments.
Issue Description
Recently, we deployed a new application in the prod environment and registered it with Consul. Afterwards, we noticed that `consul members` showed all six server nodes, from both environments, in a single cluster.
Current Observations
1. The `consul members` command shows all six server nodes in a single cluster:
Node Address Status Type Build Protocol DC Segment
pre-01 172.16.175.82:8301 alive server 1.4.1 2 dc1 <all>
pre-02 172.16.165.155:8301 alive server 1.4.1 2 dc1 <all>
pre-03 172.16.170.145:8301 alive server 1.4.1 2 dc1 <all>
prod-03 172.16.72.56:8301 alive server 1.4.1 2 dc1 <all>
prod-04 172.16.72.57:8301 alive server 1.4.1 2 dc1 <all>
prod-05 172.16.72.58:8301 alive server 1.4.1 2 dc1 <all>
2. However, the `consul agent` startup commands only specify `-join` within their respective environments, with no explicit cross-environment joins:
prod cluster
consul agent -server -config-dir=/etc/consul.d/ -data-dir=/opt/data -node=prod-03 -bind=172.16.72.56 -join=172.16.72.57 -client=0.0.0.0 -ui
consul agent -server -config-dir=/etc/consul.d/ -data-dir=/opt/data -node=prod-04 -bind=172.16.72.57 -join=172.16.72.56 -client=0.0.0.0 -ui
consul agent -server -config-dir=/etc/consul.d/ -data-dir=/opt/data -node=prod-05 -bind=172.16.72.58 -join=172.16.72.56 -client=0.0.0.0 -ui
pre cluster
consul agent -server -data-dir=/opt/consul -node=pre-01 -bind=172.16.175.82 -join=172.16.165.155 -client=0.0.0.0 -ui
consul agent -server -data-dir=/opt/consul -node=pre-02 -bind=172.16.165.155 -advertise=172.16.165.155 -join=172.16.175.82 -client=0.0.0.0 -ui
consul agent -server -data-dir=/opt/consul -node=pre-03 -bind=172.16.170.145 -join=172.16.165.155 -client=0.0.0.0 -ui
3. The `consul operator raft list-peers` command confirms that all six servers are now part of the same Raft cluster.
```
Node     ID                                    Address              State     Voter  RaftProtocol
pre-02   16180a1c-c80b-00f8-94a5-718880c355c7  172.16.165.155:8300  follower  true   3
pre-03   0f814cd0-ee8f-ce27-be2c-fcaab0c9874c  172.16.170.145:8300  leader    true   3
prod-03  c98100a7-a732-aa2e-4b1c-3d8ca66d0a44  172.16.72.56:8300    follower  false  3
prod-05  017e0e8b-8ac0-d1ef-d746-f059663bad13  172.16.72.58:8300    follower  false  3
prod-04  90490e02-8da6-dd2e-9b44-dccb48530a84  172.16.72.57:8300    follower  false  3
pre-01   bcbfa721-9c6a-6460-0cc2-dc855949d83a  172.16.175.82:8300   follower  true   3
```
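For anyone wanting to repeat this check, here is a minimal sketch that condenses the peer listing into a per-environment summary. The function name `summarize_raft_peers` is made up for illustration, and it assumes node names follow the `<env>-<nn>` pattern and that the output has the six-column layout shown above.

```shell
#!/bin/sh
# Hypothetical helper: summarise `consul operator raft list-peers` output
# read from stdin. Assumes the columns are:
#   Node  ID  Address  State  Voter  RaftProtocol
# and that node names look like "<env>-<nn>" (e.g. prod-03, pre-01).
summarize_raft_peers() {
  awk '
    $1 == "Node" && $2 == "ID" { next }      # skip the header row
    NF >= 6 {
      env = $1
      sub(/-[^-]*$/, "", env)                # "prod-03" -> "prod"
      total[env]++
      if ($5 == "true") voters[env]++
      if ($4 == "leader") leader = $1
    }
    END {
      for (e in total)
        printf "%s: %d servers, %d voters\n", e, total[e], voters[e] + 0
      if (leader != "") printf "leader: %s\n", leader
    }
  ' | LC_ALL=C sort                          # awk iteration order is unspecified
}
```

Usage would be `consul operator raft list-peers | summarize_raft_peers`. Fed the listing above, it should report `pre-03` as leader, three voting pre servers, and three prod servers with zero voters.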
4. The Consul logs show some `member-join` events, but there is no clear indication of how the servers from different environments joined.
```
2025/03/20 17:03:37 [INFO] serf: EventMemberJoin: crowdmark-5b574684db-z6lg9 172.16.72.130
2025/03/20 17:03:37 [INFO] serf: EventMemberJoin: pre-03.dc1 172.16.170.145
2025/03/20 17:03:37 [INFO] serf: EventMemberJoin: pre-02.dc1 172.16.165.155
2025/03/20 17:03:37 [INFO] serf: EventMemberJoin: pre-01.dc1 172.16.175.82
2025/03/20 17:03:37 [INFO] consul: Handled member-join event for server "pre-03.dc1" in area "wan"
2025/03/20 17:03:37 [INFO] consul: Handled member-join event for server "pre-02.dc1" in area "wan"
2025/03/20 17:03:37 [INFO] consul: Handled member-join event for server "pre-01.dc1" in area "wan"
2025/03/20 17:03:38 [DEBUG] raft-net: 172.16.72.56:8300 accepted connection from: 172.16.165.155:44887
2025/03/20 17:03:38 [WARN] raft: Failed to get previous log: 1702988 log not found (last: 27581655)
2025/03/20 17:03:38 [WARN] raft: Failed to get previous log: 1702987 log not found (last: 27581655)
2025/03/20 17:03:38 [WARN] raft: Failed to get previous log: 1702867 log not found (last: 27581655)
2025/03/20 17:03:38 [WARN] raft: Failed to get previous log: 1702866 log not found (last: 27581655)
2025/03/20 17:03:38 [INFO] serf: EventMemberJoin: pre-03 172.16.170.145
2025/03/20 17:03:38 [INFO] serf: EventMemberJoin: pre-02 172.16.165.155
2025/03/20 17:03:38 [INFO] serf: EventMemberJoin: gateway-5f8b84d7ff-vhs2m 172.16.207.126
2025/03/20 17:03:38 [INFO] consul: Adding LAN server pre-03 (Addr: tcp/172.16.170.145:8300) (DC: dc1)
2025/03/20 17:03:38 [INFO] consul: Adding LAN server pre-02 (Addr: tcp/172.16.165.155:8300) (DC: dc1)
```
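To compare these events across servers, a rough sketch like the following pulls the first `EventMemberJoin` line per member out of an agent log. The function name `first_joins` and the log path in the usage line are hypothetical, and it assumes the log line layout shown above (date, time, level, `serf:`, `EventMemberJoin:`, member name, address).

```shell
#!/bin/sh
# Hypothetical helper: print the first serf join event for every member
# whose name matches PATTERN (an awk regular expression).
# Reads a Consul agent log on stdin; assumes lines of the form
#   2025/03/20 17:03:37 [INFO] serf: EventMemberJoin: <name> <address>
first_joins() {
  pattern=$1
  awk -v pat="$pattern" '
    $4 == "serf:" && $5 == "EventMemberJoin:" && $6 ~ pat && !seen[$6]++ {
      print $1, $2, $6, $7                   # date, time, member, address
    }
  '
}
```

For example, `first_joins '^pre-' < /path/to/consul.log` on a prod server would list when each pre member first appeared there. Names with a `.dc1` suffix are the WAN pool entries and plain names are the LAN pool entries, as in the log above. Comparing the earliest timestamps across all six servers may help narrow down which node saw the other environment first, although it cannot by itself show what triggered the join.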
Help Needed
- How did Consul merge the two clusters, even though we never explicitly joined them?
- Besides using different `datacenter` names, how can we prevent Consul’s Serf gossip from discovering and merging nodes from different environments?
- Is there a way to trace the source of the `join` event that led to this merge?
- Other than using `consul operator raft remove-peer`, is there a safer way to separate the mistakenly merged servers from the cluster?