Consul agent has conflicting node ID

Hi all,

My team and I are currently running into something very odd when a Consul agent tries to join a Consul cluster.

Situation:

We maintain multiple Consul datacenters in our infrastructure, and each datacenter has many local Consul agents joining it on a daily basis.

A week or two ago we upgraded to Consul 1.7.0, and since then we have been hitting a problem that prevents any new Consul agents from joining (or retry_join-ing) a Consul datacenter.
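For context, the agents join via retry_join; a minimal invocation looks roughly like this (the server address and data directory below are placeholders, not our actual values):

# consul agent -retry-join=<ip_of_consul_server> -data-dir=/opt/consul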

The problem:

From time to time a Consul agent has a conflicting node ID with another member. When we checked which node was conflicting, using consul members, we discovered that the conflicting node was already in the left state.

# consul members | grep  <ip_of_conflicting_node>
<ip_of_conflicting_node>           <ip_of_conflicting_node>    left    client  1.7.0  2         <DC>  <default>
<ip_of_conflicting_node>           <ip_of_conflicting_node>    alive   client  1.7.0  2         <DC>  <default>
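To cross-check which node IDs are actually registered in the catalog, something like the following can help (exact columns may differ slightly between versions):

# consul catalog nodes -detailed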

But the problem continued: all freshly spun-up Consul nodes could not join the Consul datacenter, because Consul kept reporting that a node had a conflicting node ID with another member.

So the entire joining process after that was blocked by that single conflicting node.

Failed to join <ip_of_consul_dc>: Member ' <ip_of_conflicting_node>' has conflicting node ID <node_id> with member <ip_of_conflicting_node>
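The node ID itself is persisted by the agent in a node-id file inside its data directory, so it can be inspected on the affected host. A minimal check, assuming a data_dir of /opt/consul (adjust to your own configuration):

# cat /opt/consul/node-id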

What we manually tried:

We tried to forcibly remove the conflicting node with the command:

consul force-leave <node_name_of_conflicting_node>

But this did not work! It reported that the node could not be found.
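For completeness: consul force-leave expects the node name rather than an address, and since Consul 1.6.2 there is also a -prune flag that removes a left or failed member from the member list entirely. We are not certain it applies in this situation, but roughly:

# consul force-leave -prune <node_name_of_conflicting_node>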

Could someone explain this?


Did you upgrade raft, too? Maybe this could help: https://learn.hashicorp.com/consul/day-2-operations/outage#failure-of-a-single-server-cluster-after-upgrading-raft-protocol

The problem is not in the Raft protocol. Communication between our Consul servers within the datacenter is fine; the problem is on the Consul client agents.

Some additional information: we upgraded from version 1.4.1 to 1.7.0.

I also have a question: why does the entire chain of Consul joins break when literally one agent is corrupt? Why can Consul not simply remove or ignore that node and proceed with the joins of the other Consul nodes?

Or am I missing something here?

As you can see in the output below, the Raft protocol we are using is the default, version 3.

# consul operator raft list-peers
Node                 ID         Address              State     Voter  RaftProtocol
<consul_server_1>  <node_id>  <ip_consul_server_1>  leader    true   3
<consul_server_2>  <node_id>  <ip_consul_server_2>  follower  true   3
<consul_server_3>  <node_id>  <ip_consul_server_3>  follower  true   3

  1. Do a consul leave on the alive agent.
  2. Wait 10 seconds to ensure the leave has propagated across the whole cluster.
  3. Restart the Consul agent on the affected node (a rough sketch of these steps follows below).
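A rough shell sketch of those steps, assuming the agent runs under a systemd unit named consul (adjust to your setup):

consul leave                     # 1. gracefully leave the cluster from the alive agent
sleep 10                         # 2. give the leave time to propagate through the cluster
sudo systemctl restart consul    # 3. restart the agent on the affected node so it rejoins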

We ended up opening an issue in the Consul repository: https://github.com/hashicorp/consul/issues/7396