Consul client Incarnation issue

Hello All,
I need your expertise to dig into an issue we are seeing.

On one of our servers, we see the errors below when we bring the Consul agent up.
None of the services we register with Consul from this node are getting registered.

On the client (affected node) we are seeing this error:
The node ID changes on every restart.

2021-01-04T06:55:54.705Z [ERROR] agent.anti_entropy: failed to sync remote state: error=“rpc error making call: rpc error making call: failed inserting node: Error while renaming Node ID: “1e9b74ee-45da-f5f5-ba74-50174f2dc808”: Node name XXXX is reserved by node f3c4e098-a13a-25fb-3993-821bc8ec3874 with name XXXX (xx.yy.aa.bb)”
2021-01-04T06:55:54.705Z [ERROR] agent.client: RPC failed to server: method=Catalog.Register server=aa.bb.cc.dd:8300 error=“rpc error making call: rpc error making call: failed inserting node: Error while renaming Node ID: “1e9b74ee-45da-f5f5-ba74-50174f2dc808”: Node name XXXX is reserved by node f3c4e098-a13a-25fb-3993-821bc8ec3874 with name XXXX (xx.yy.aa.bb)”

On Consul Server we are seeing this error:
The node ID in the server's messages keeps changing, and the server keeps printing the same message over and over. Maybe all the other nodes are notifying the server to rename the node ID.

2021-01-04T07:00:14.126Z [WARN] agent.fsm: EnsureRegistration failed: error=“failed inserting node: Error while renaming Node ID: “e0952d94-f83a-a801-def0-627a36e27f67”: Node name XXXX is reserved by node f3c4e098-a13a-25fb-3993-821bc8ec3874 with name XXXX (xx.yy.aa.bb)”
2021-01-04T07:00:11.870Z [WARN] agent.fsm: EnsureRegistration failed: error=“failed inserting node: Error while renaming Node ID: “6703e45f-a646-d0a3-4c6a-b585871d6227”: Node name XXXX is reserved by node f3c4e098-a13a-25fb-3993-821bc8ec3874 with name XXXX (xx.yy.aa.bb)”
2021-01-04T07:00:11.068Z [WARN] agent.fsm: EnsureRegistration failed: error=“failed inserting node: Error while renaming Node ID: “1e9b74ee-45da-f5f5-ba74-50174f2dc808”: Node name XXXX is reserved by node f3c4e098-a13a-25fb-3993-821bc8ec3874 with name XXXX (xx.yy.aa.bb)”
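
To see how many different node IDs are competing for the same node name, the messages above can be grouped with a short script. This is only a sketch: the regular expression is derived from the log lines quoted in this post, and the function name `summarize_conflicts` is made up for illustration, not part of Consul.

```python
import re

# Pattern derived from the "renaming Node ID" messages quoted in this thread.
# The quote characters are optional because the forum renders them as curly quotes.
PATTERN = re.compile(
    r'Error while renaming Node ID: ["“”]?(?P<new_id>[0-9a-f-]{36})["“”]?: '
    r'Node name (?P<name>\S+) is reserved by node (?P<owner_id>[0-9a-f-]{36})'
)


def summarize_conflicts(lines):
    """Group the node IDs that tried to claim each reserved node name."""
    conflicts = {}
    for line in lines:
        match = PATTERN.search(line)
        if match is None:
            continue  # not a node-rename error
        entry = conflicts.setdefault(
            match.group("name"),
            {"owner_id": match.group("owner_id"), "claimants": set()},
        )
        entry["claimants"].add(match.group("new_id"))
    return conflicts
```

Feeding it the server log gives, per node name, the ID that currently holds the name in the catalog and the set of IDs that were rejected. If that set grows on every agent restart, the agent is generating a fresh node ID each time instead of reusing a persisted one.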

We took a tcpdump to understand more and can see this message in a frame.
The node ID from each source keeps changing.
Dump frame output:

…>…=…E…T@.@.?../b m m…E…W…Addr…&'.Incarnation…+S.Meta…}…role.node.dc.dc25.id…$46901295-9710-8885-157e-60f9b27f4df9.vsn_min.2.acls.0.segment…vsn.2.vsn_max.3.build.1.7.0:95fb95bf.Node.XXXX.Port. m.Vsn…

To resolve this we already tried the steps below, and they did not work:

  1. Gracefully left the cluster from node XXXX (the command executed successfully).
  2. Deleted the Consul data directory to perform a full cleanup.
  3. Ran a force-leave with the -prune option from the Consul server. After running this, node XXXX no longer appeared in the consul members output.
    But after running consul catalog nodes, the node was still visible in the alive state.
  4. Monitored the logs in Splunk until the messages disappeared from all the nodes in the mesh. It took more than 15 minutes to notify all the nodes in the mesh.
  5. Ran consul members from a few nodes; the node had been removed on all of them.
  6. The node was still showing in the consul catalog nodes output, as mentioned above.
  7. After 15 minutes, all nodes started reporting that the node was not reachable, even though we could not see the node in consul members.
  8. Started the agent again, and the same issue came back.
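
The mismatch described in steps 3 to 6 (node gone from consul members but still listed by consul catalog nodes) can also be checked through the HTTP API. Below is a minimal sketch, assuming a local agent on the default address `http://127.0.0.1:8500` and no ACL token; `fetch_json` and `catalog_only_nodes` are illustrative helper names, not Consul functions. It reads `/v1/catalog/nodes` and `/v1/agent/members` and reports catalog entries that have no matching gossip member.

```python
import json
import urllib.request

# Assumption: the local agent listens on the default HTTP address without ACLs.
CONSUL_HTTP_ADDR = "http://127.0.0.1:8500"


def fetch_json(path, base=CONSUL_HTTP_ADDR):
    """GET a Consul HTTP API path and decode the JSON body."""
    with urllib.request.urlopen(base + path, timeout=5) as resp:
        return json.load(resp)


def catalog_only_nodes(catalog_nodes, members):
    """Return catalog entries whose node name is absent from the member list.

    catalog_nodes: decoded response of /v1/catalog/nodes (entries have "Node")
    members:       decoded response of /v1/agent/members (entries have "Name")
    """
    member_names = {m["Name"] for m in members}
    return [n for n in catalog_nodes if n["Node"] not in member_names]
```

Calling `catalog_only_nodes(fetch_json("/v1/catalog/nodes"), fetch_json("/v1/agent/members"))` on a server should list XXXX together with its old node ID if the catalog entry is stale. Consul also exposes `/v1/catalog/deregister` for removing such an entry, but this thread does not confirm whether that was part of the eventual fix.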

Please help with your recommendations on this issue.
Thanks in advance.

Hi, what is your Consul version?

Hi @sunny760408
Thanks for your reply. We managed to fix the issue.
It was a config issue on a few nodes that caused this.

Hi, can you please elaborate on the fix details? We are also facing the same issue.
