Consul agent has conflicting node ID

Hi all,

My team and I are currently running into something very odd when a Consul agent tries to join a Consul cluster.

Situation:

We maintain multiple Consul datacenters in our infrastructure, and each datacenter has many local Consul agents joining it on a daily basis.

A week or two ago we upgraded to Consul 1.7.0, and since then we have been hitting a problem that prevents any new Consul agents from joining (or retry_join-ing) a Consul datacenter.
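For context, the agents join via retry_join; a minimal invocation looks roughly like this (the server address and data directory below are placeholders, not our actual values):

# consul agent -retry-join=<ip_of_consul_server> -data-dir=/opt/consul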

The problem:

From time to time a Consul agent has a conflicting node ID with another member. When we checked which node was conflicting, using consul members, we discovered that the conflicting node was already in the left state.

# consul members | grep  <ip_of_conflicting_node>
<ip_of_conflicting_node>           <ip_of_conflicting_node>    left    client  1.7.0  2         <DC>  <default>
<ip_of_conflicting_node>           <ip_of_conflicting_node>    alive   client  1.7.0  2         <DC>  <default>
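To cross-check which node IDs are actually registered in the catalog, something like the following can help (exact columns may differ slightly between versions):

# consul catalog nodes -detailed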

But the problem continued: all freshly spun-up Consul nodes could not join the Consul datacenter, because Consul kept reporting that a node had a conflicting node ID with another member.

So the entire joining process after that was blocked by that single conflicting node.

Failed to join <ip_of_consul_dc>: Member ' <ip_of_conflicting_node>' has conflicting node ID <node_id> with member <ip_of_conflicting_node>
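The node ID itself is persisted by the agent in a node-id file inside its data directory, so it can be inspected on the affected host. A minimal check, assuming a data_dir of /opt/consul (adjust to your own configuration):

# cat /opt/consul/node-id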

What we manually tried:

We tried to forcibly remove the conflicting node with the command:

consul force-leave <node_name_of_conflicting_node>

But this did not work! It reported that the node could not be found.
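For completeness: consul force-leave expects the node name rather than an address, and since Consul 1.6.2 there is also a -prune flag that removes a left or failed member from the member list entirely. We are not certain it applies in this situation, but roughly:

# consul force-leave -prune <node_name_of_conflicting_node>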

Could someone explain this?


Did you upgrade raft, too? Maybe this could help: https://learn.hashicorp.com/consul/day-2-operations/outage#failure-of-a-single-server-cluster-after-upgrading-raft-protocol

The problem is not in the Raft protocol. Communication between our Consul servers within the datacenter is fine; the problem is on the Consul client agents.

Some additional information: we upgraded from version 1.4.1 to 1.7.0.

I also have a question: why does the entire chain of Consul joins break when literally one agent is corrupt? Why can Consul not simply remove or ignore that node and proceed with the joins of the other Consul nodes?

Or am I missing something here?

As you can see in the output below, the Raft protocol we are using is the default, version 3.

# consul operator raft list-peers
Node                 ID         Address              State     Voter  RaftProtocol
<consul_server_1>  <node_id>  <ip_consul_server_1>  leader    true   3
<consul_server_2>  <node_id>  <ip_consul_server_2>  follower  true   3
<consul_server_3>  <node_id>  <ip_consul_server_3>  follower  true   3

  1. Do a consul leave on the alive agent.
  2. Wait 10 seconds to ensure the leave has propagated across the whole cluster.
  3. Restart the Consul agent on the affected node (a rough sketch of these steps follows below).
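A rough shell sketch of those steps, assuming the agent runs under a systemd unit named consul (adjust to your setup):

consul leave                     # 1. gracefully leave the cluster from the alive agent
sleep 10                         # 2. give the leave time to propagate through the cluster
sudo systemctl restart consul    # 3. restart the agent on the affected node so it rejoins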

We ended up opening an issue in the Consul repository: https://github.com/hashicorp/consul/issues/7396