Consul debug hangs

version: 1.4.5

Consul server: 107/108/109

Fault phenomenon:

Running "consul operator raft list-peers" on each of the three nodes returns results:

107 is the leader and 108/109 are followers, but 109 shows "Voter=false".

Running "consul operator raft list-peers -stale" on node 109 hangs.

Running "consul debug" on node 109 also hangs.
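For hanging diagnostics like these, a small sketch using coreutils `timeout` keeps a stuck command from blocking the shell (the 10-second limit is an arbitrary assumption):

```shell
# Bound a potentially hanging diagnostic with coreutils timeout;
# timeout exits with code 124 if the command is killed after 10 seconds
timeout 10 consul operator raft list-peers -stale || echo "command timed out or failed"
```

This makes it easy to script repeated probes of a wedged node without each probe hanging forever.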

The Consul log on the leader node shows:


Mar 30 21:59:46 abcd-efg.test.com consul[29535]: consul: error getting server health from "prod-database-3": last request still outstanding

Mar 30 21:59:46 abcd-efg.test.com consul[29535]: 2022/03/30 21:59:46 [ERR] raft: Failed to get log at index 9243146: log not found

Mar 30 21:59:46 abcd-efg.test.com consul[29535]: raft: Failed to get log at index 9243146: log not found

Mar 30 21:59:46 abcd-efg.test.com consul[29535]: 2022/03/30 21:59:46 [ERR] raft: Failed to install snapshot 40487-9588053-1648643431750: read tcp xxx.xxx.xxx.108:65044->xxx.xxx.xxx.109:8300: read: connection reset by peer

Mar 30 21:59:46 abcd-efg.test.com consul[29535]: raft: Failed to install snapshot 40487-9588053-1648643431750: read tcp xxx.xxx.xxx.108:65044->xxx.xxx.xxx.109:8300: read: connection reset by peer

Mar 30 21:59:46 abcd-efg.test.com consul[29535]: 2022/03/30 21:59:46 [ERR] raft: Failed to send snapshot to {Nonvoter 8077e00d-277b-b6e0-580e-5b468eb1dc8c xxx.xxx.xxx.109:8300}: read tcp xxx.xxx.xxx.108:65044->xxx.xxx.xxx.109:8300: read: connection reset by peer

Mar 30 21:59:46 abcd-efg.test.com consul[29535]: raft: Failed to send snapshot to {Nonvoter 8077e00d-277b-b6e0-580e-5b468eb1dc8c xxx.xxx.xxx.109:8300}: read tcp xxx.xxx.xxx.108:65044->xxx.xxx.xxx.109:8300: read: connection reset by peer

Mar 30 21:59:48 abcd-efg.test.com consul[29535]: 2022/03/30 21:59:48 [WARN] consul: error getting server health from "prod-database-3" : last request still outstanding

Mar 30 21:59:48 abcd-efg.test.com consul[29535]: consul: error getting server health from "prod-database-3": last request still outstanding

The Consul log on node 109 continuously prints the following line:

DEBUG raft-net: xxx.xxx.xxx.109:8300 accepted connection from: xxx.xxx.xxx.108:xxxxx

Running "ss -antp | grep -c 8300" on node 109 shows the TCP connection count continuously increasing.

The TCP connections are initiated from 108 to 109:8300; viewed on node 109, they sit in the ESTAB state and keep accumulating until the process reports "too many open files".
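To quantify the leak while it is happening, a sketch like the following shows the connection count, the process's open descriptors, and the limit being hit (Linux assumed; looking up the agent via `pidof consul` is an assumption about how the service runs):

```shell
# Count ESTABLISHED connections involving the raft port 8300
# (tail -n +2 strips the ss header line)
ss -ant state established '( sport = :8300 or dport = :8300 )' | tail -n +2 | wc -l

# Open file descriptors held by the consul process (pidof lookup is an assumption)
CONSUL_PID=$(pidof consul | awk '{print $1}')
ls "/proc/$CONSUL_PID/fd" | wc -l

# The per-process limit whose exhaustion produces "too many open files"
grep 'Max open files' "/proc/$CONSUL_PID/limits"
```

Sampling these numbers every minute makes the growth rate, and how long until the fd limit, obvious.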

Restarting the Consul service on node 109, and even the operating system, did not solve the problem.

Note that the cluster had been running for 15 months before the problem appeared recently; neither the system nor the Consul configuration changed during that period.

What could be causing this problem, and how can the cluster be restored?

Thanks!

Is it possible that node 109 is having storage or memory issues? Something seems to be hanging the process.

Hi aram, thanks for your reply. Could you briefly explain how to troubleshoot storage and memory issues? Thanks.
Currently both system storage and memory have plenty of free capacity.

Both Vault and Consul are very sensitive to latency and timeouts, so even a small amount of network latency or memory contention can cause a lot of issues without the node exhausting its resources.
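A rough sketch of such resource checks (Linux assumed; `/var/lib/consul` as the data directory is an assumption about your install):

```shell
# Disk headroom on the Consul data directory (path is an assumption)
df -h /var/lib/consul

# Memory headroom and swap activity
free -m

# Recent kernel complaints about OOM kills or disk I/O errors (may need root)
dmesg | grep -iE 'out of memory|oom|i/o error' | tail

# Per-device I/O utilization and latency, if sysstat's iostat is installed
iostat -x 1 3
```

High `%util` or await times in the iostat output would point at storage latency even when free space and free memory both look fine.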

The problem has been solved. Investigation found that a firewall was intercepting and resetting the TCP connections from the leader node to node 109:8300.
The interception happened because a firewall rule that blocks repeated requests was matching the leader's repeated TCP connections to node 109 on port 8300.
Thank you for your help!
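For anyone hitting something similar: one way to confirm that a middlebox (rather than Consul itself) is resetting connections is to look for RST evidence on the affected node. A sketch, assuming Linux and that the capture interface is `eth0`:

```shell
# Kernel TCP reset counters (iproute2's nstat; -a dumps absolute values, -z includes zeros)
nstat -az | grep -i rst

# If tcpdump is available, capture RSTs on the raft port to see which host sends them
# (interface name and root privileges are assumptions):
# tcpdump -ni eth0 'tcp port 8300 and tcp[tcpflags] & tcp-rst != 0'
```

If the RSTs arrive with a TTL or timing inconsistent with the peer itself, a firewall in the path is the likely culprit, as it turned out to be here.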