Two Consul clusters mistakenly merged into the same cluster

Background

We run separate prod and pre Consul clusters, each with three server nodes. Both run Consul 1.4.1, started with consul agent -server, and neither environment has any explicit join configuration pointing at the other.

Issue Description

Recently, we deployed a new application in the prod environment and registered it with Consul. Afterwards, we noticed that consul members listed all six server nodes, from both prod and pre, as members of a single cluster.

Current Observations

  1. The consul members command shows all six server nodes in a single cluster:
    Node     Address              Status  Type    Build  Protocol  DC   Segment
    pre-01   172.16.175.82:8301   alive   server  1.4.1  2         dc1  <all>
    pre-02   172.16.165.155:8301  alive   server  1.4.1  2         dc1  <all>
    pre-03   172.16.170.145:8301  alive   server  1.4.1  2         dc1  <all>
    prod-03  172.16.72.56:8301    alive   server  1.4.1  2         dc1  <all>
    prod-04  172.16.72.57:8301    alive   server  1.4.1  2         dc1  <all>
    prod-05  172.16.72.58:8301    alive   server  1.4.1  2         dc1  <all>
  2. However, the consul agent startup commands only specify -join within their respective environments, with no explicit cross-environment joins:

    prod cluster:
    consul agent -server -config-dir=/etc/consul.d/ -data-dir=/opt/data -node=prod-03 -bind=172.16.72.56 -join=172.16.72.57 -client=0.0.0.0 -ui
    consul agent -server -config-dir=/etc/consul.d/ -data-dir=/opt/data -node=prod-04 -bind=172.16.72.57 -join=172.16.72.56 -client=0.0.0.0 -ui
    consul agent -server -config-dir=/etc/consul.d/ -data-dir=/opt/data -node=prod-05 -bind=172.16.72.58 -join=172.16.72.56 -client=0.0.0.0 -ui

    pre cluster:
    consul agent -server -data-dir=/opt/consul -node=pre-01 -bind=172.16.175.82 -join=172.16.165.155 -client=0.0.0.0 -ui
    consul agent -server -data-dir=/opt/consul -node=pre-02 -bind=172.16.165.155 -advertise=172.16.165.155 -join=172.16.175.82 -client=0.0.0.0 -ui
    consul agent -server -data-dir=/opt/consul -node=pre-03 -bind=172.16.170.145 -join=172.16.165.155 -client=0.0.0.0 -ui
  3. The consul operator raft list-peers command confirms that all six servers are now part of the same Raft cluster:
```
Node     ID                                    Address              State     Voter  RaftProtocol
pre-02   16180a1c-c80b-00f8-94a5-718880c355c7  172.16.165.155:8300  follower  true   3
pre-03   0f814cd0-ee8f-ce27-be2c-fcaab0c9874c  172.16.170.145:8300  leader    true   3
prod-03  c98100a7-a732-aa2e-4b1c-3d8ca66d0a44  172.16.72.56:8300    follower  false  3
prod-05  017e0e8b-8ac0-d1ef-d746-f059663bad13  172.16.72.58:8300    follower  false  3
prod-04  90490e02-8da6-dd2e-9b44-dccb48530a84  172.16.72.57:8300    follower  false  3
pre-01   bcbfa721-9c6a-6460-0cc2-dc855949d83a  172.16.175.82:8300   follower  true   3
```
  4. The Consul logs show some member-join events, but there is no clear indication of how the servers from different environments joined:
```
2025/03/20 17:03:37 [INFO] serf: EventMemberJoin: crowdmark-5b574684db-z6lg9 172.16.72.130
2025/03/20 17:03:37 [INFO] serf: EventMemberJoin: pre-03.dc1 172.16.170.145
2025/03/20 17:03:37 [INFO] serf: EventMemberJoin: pre-02.dc1 172.16.165.155
2025/03/20 17:03:37 [INFO] serf: EventMemberJoin: pre-01.dc1 172.16.175.82
2025/03/20 17:03:37 [INFO] consul: Handled member-join event for server "pre-03.dc1" in area "wan"
2025/03/20 17:03:37 [INFO] consul: Handled member-join event for server "pre-02.dc1" in area "wan"
2025/03/20 17:03:37 [INFO] consul: Handled member-join event for server "pre-01.dc1" in area "wan"
2025/03/20 17:03:38 [DEBUG] raft-net: 172.16.72.56:8300 accepted connection from: 172.16.165.155:44887
2025/03/20 17:03:38 [WARN] raft: Failed to get previous log: 1702988 log not found (last: 27581655)
2025/03/20 17:03:38 [WARN] raft: Failed to get previous log: 1702987 log not found (last: 27581655)
2025/03/20 17:03:38 [WARN] raft: Failed to get previous log: 1702867 log not found (last: 27581655)
2025/03/20 17:03:38 [WARN] raft: Failed to get previous log: 1702866 log not found (last: 27581655)
2025/03/20 17:03:38 [INFO] serf: EventMemberJoin: pre-03 172.16.170.145
2025/03/20 17:03:38 [INFO] serf: EventMemberJoin: pre-02 172.16.165.155
2025/03/20 17:03:38 [INFO] serf: EventMemberJoin: gateway-5f8b84d7ff-vhs2m 172.16.207.126
2025/03/20 17:03:38 [INFO] consul: Adding LAN server pre-03 (Addr: tcp/172.16.170.145:8300) (DC: dc1)
2025/03/20 17:03:38 [INFO] consul: Adding LAN server pre-02 (Addr: tcp/172.16.165.155:8300) (DC: dc1)
```
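
To narrow down where the cross-environment join came from, a rough Python sketch like the following could sort the Serf join events by environment. It assumes the node-name prefixes prod- and pre- identify the environment, and that the excerpt comes from a prod server (the Raft listener in the log is 172.16.72.56, prod-03's address). The helper names (classify, join_events, first_foreign_join) are hypothetical, and the sample embeds only the EventMemberJoin lines from the log above.

```python
import re

# Matches Serf join lines such as:
#   2025/03/20 17:03:37 [INFO] serf: EventMemberJoin: pre-03.dc1 172.16.170.145
JOIN_RE = re.compile(
    r"^(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \[INFO\] serf: "
    r"EventMemberJoin: (?P<name>\S+) (?P<ip>\d+\.\d+\.\d+\.\d+)$"
)

SAMPLE_LOG = """
2025/03/20 17:03:37 [INFO] serf: EventMemberJoin: crowdmark-5b574684db-z6lg9 172.16.72.130
2025/03/20 17:03:37 [INFO] serf: EventMemberJoin: pre-03.dc1 172.16.170.145
2025/03/20 17:03:37 [INFO] serf: EventMemberJoin: pre-02.dc1 172.16.165.155
2025/03/20 17:03:37 [INFO] serf: EventMemberJoin: pre-01.dc1 172.16.175.82
2025/03/20 17:03:38 [INFO] serf: EventMemberJoin: pre-03 172.16.170.145
2025/03/20 17:03:38 [INFO] serf: EventMemberJoin: pre-02 172.16.165.155
2025/03/20 17:03:38 [INFO] serf: EventMemberJoin: gateway-5f8b84d7ff-vhs2m 172.16.207.126
"""


def classify(name, prefixes=("prod-", "pre-")):
    """Map a node name to an environment by prefix (assumed naming convention)."""
    for prefix in prefixes:
        if name.startswith(prefix):
            return prefix.rstrip("-")
    return "other"  # e.g. client agents whose names carry no environment prefix


def join_events(log_text):
    """Return (timestamp, name, ip, environment) for every EventMemberJoin line."""
    events = []
    for line in log_text.splitlines():
        match = JOIN_RE.match(line.strip())
        if match:
            name = match["name"]
            events.append((match["ts"], name, match["ip"], classify(name)))
    return events


def first_foreign_join(events, local_env):
    """Return the first join whose environment is known and differs from local_env."""
    for event in events:
        if event[3] not in (local_env, "other"):
            return event
    return None


if __name__ == "__main__":
    events = join_events(SAMPLE_LOG)
    print("first foreign join:", first_foreign_join(events, "prod"))
    # Agents matching neither prefix are listed separately so they can be checked by hand.
    for ts, name, ip, env in events:
        if env == "other":
            print("unclassified joiner:", ts, name, ip)
```

The unclassified joiners (crowdmark-… and gateway-… here) seem worth checking manually: if any such agent can reach servers in both environments and both pools share a datacenter name, a single join from it would be enough for gossip to merge the member lists. That is a hypothesis to verify, not a confirmed cause.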

Help Needed

  1. How did Consul merge the two clusters, even though we never explicitly joined them?
  2. Besides using different datacenter names, how can we prevent Consul's Serf gossip layer from discovering and merging nodes from different environments?
  3. Is there a way to trace the source of the join event that led to this merge?
  4. Other than using consul operator raft remove-peer, is there a safer way to separate the mistakenly merged servers from the cluster?
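
As context for question 2, one candidate approach (stated here as an assumption to be confirmed, not something verified on these clusters) is to give each environment its own gossip encryption key, since Serf rejects gossip from agents that do not share its key. The Python sketch below generates one key per environment and prints a JSON config fragment using the encrypt and retry_join agent options. The 16-byte key size and the function names (gossip_key, environment_config) are assumptions for illustration; the addresses are the server addresses listed earlier in this post.

```python
import base64
import json
import secrets


def gossip_key(num_bytes=16):
    """Return a random base64-encoded key (16 bytes assumed; check your Consul version)."""
    return base64.b64encode(secrets.token_bytes(num_bytes)).decode("ascii")


def environment_config(servers):
    """Build one config fragment shared by every agent in a single environment."""
    return {
        "encrypt": gossip_key(),       # same key on all agents of this environment
        "retry_join": list(servers),   # only servers of this environment
    }


CONFIGS = {
    "prod": environment_config(["172.16.72.56", "172.16.72.57", "172.16.72.58"]),
    "pre": environment_config(["172.16.175.82", "172.16.165.155", "172.16.170.145"]),
}

if __name__ == "__main__":
    for env, config in CONFIGS.items():
        print(f"# {env}: e.g. a JSON file in the agent's config directory")
        print(json.dumps(config, indent=2))
```

Each fragment would be distributed to all agents of its environment only. Rolling this out to clusters that are already running probably needs more than a config change, because agents persist their gossip keyring in the data directory, so the procedure should be checked against the Consul documentation for the version in use.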