Hi all,
My team and I are currently experiencing something very odd when a Consul agent tries to join a Consul cluster.
Situation:
We maintain multiple Consul datacenters in our infrastructure, and each datacenter has many local Consul agents joining it on a daily basis.
A week or two ago we upgraded to Consul 1.7.0, and since then we have been running into a strange problem that prevents any new Consul agents from joining (or retry_join-ing) a Consul cluster/datacenter.
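For context, each agent joins its local datacenter at boot roughly like the sketch below (the address, datacenter name and data dir are placeholders/assumptions, not our exact configuration):

# Rough sketch of how our agents join their local DC; <ip_of_consul_dc>,
# <DC> and the data dir are placeholders, not our exact setup.
consul agent \
  -datacenter=<DC> \
  -retry-join=<ip_of_consul_dc> \
  -data-dir=/opt/consul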
The problem:
From time to time a Consul agent has a node ID that conflicts with another member. When we checked which node was conflicting, using consul members, we discovered that the node was already in the left state.
# consul members | grep <ip_of_conflicting_node>
<ip_of_conflicting_node> <ip_of_conflicting_node> left client 1.7.0 2 <DC> <default>
<ip_of_conflicting_node> <ip_of_conflicting_node> alive client 1.7.0 2 <DC> <default>
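Since the error is specifically about the node ID, this is how we would check the ID a given agent is advertising (a sketch: the HTTP port is the default and the data-dir path is an assumption, adjust if customized):

# Node ID the local agent advertises, via the agent API (default HTTP port, jq for parsing).
curl -s http://127.0.0.1:8500/v1/agent/self | jq -r '.Config.NodeID'
# The same ID is persisted in the agent's data directory; the path below
# assumes -data-dir=/opt/consul.
cat /opt/consul/node-id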
But the problem persisted: all the other freshly spun-up Consul nodes could not join the Consul DC/cluster either, because the join always reported that the node had a conflicting node ID with another member. So the entire join process was blocked by that one conflicting node.
Failed to join <ip_of_consul_dc>: Member '<ip_of_conflicting_node>' has conflicting node ID <node_id> with member <ip_of_conflicting_node>
What we manually tried:
We tried to (force-)remove the conflicting node with the command:
consul force-leave <node>
But this did not work: it reported that the node could not be found.
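For completeness, the removal attempt looked roughly like this (placeholders again); as far as we understand, force-leave expects the node name from the first column of consul members rather than the address:

# Sketch of the cleanup attempt; the grep pattern and variable are only
# for illustration. force-leave takes the node name, not the IP.
NODE_NAME=$(consul members | grep <ip_of_conflicting_node> | awk '{print $1}' | head -n 1)
consul force-leave "$NODE_NAME"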
Could someone explain this?