Consul version: 1.4.5
Consul servers (last IP octets): 107, 108, 109
Symptoms:
All three nodes return results from "consul operator raft list-peers": 107 is the leader and 108/109 are followers, but 109 shows "Voter=false".
Running "consul operator raft list-peers -stale" on node 109 hangs.
Running "consul debug" on node 109 also hangs.
The Consul log on the leader node shows:
Mar 30 21:59:46 abcd-efg.test.com consul[29535]: consul: error getting server health from "prod-database-3": last request still outstanding
Mar 30 21:59:46 abcd-efg.test.com consul[29535]: 2022/03/30 21:59:46 [ERR] raft: Failed to get log at index 9243146: log not found
Mar 30 21:59:46 abcd-efg.test.com consul[29535]: raft: Failed to get log at index 9243146: log not found
Mar 30 21:59:46 abcd-efg.test.com consul[29535]: 2022/03/30 21:59:46 [ERR] raft: Failed to install snapshot 40487-9588053-1648643431750: read tcp xxx.xxx.xxx.108:65044->xxx.xxx.xxx.109:8300: read: connection reset by peer
Mar 30 21:59:46 abcd-efg.test.com consul[29535]: raft: Failed to install snapshot 40487-9588053-1648643431750: read tcp xxx.xxx.xxx.108:65044->xxx.xxx.xxx.109:8300: read: connection reset by peer
Mar 30 21:59:46 abcd-efg.test.com consul[29535]: 2022/03/30 21:59:46 [ERR] raft: Failed to send snapshot to {Nonvoter 8077e00d-277b-b6e0-580e-5b468eb1dc8c xxx.xxx.xxx.109:8300}: read tcp xxx.xxx.xxx.108:65044->xxx.xxx.xxx.109:8300: read: connection reset by peer
Mar 30 21:59:46 abcd-efg.test.com consul[29535]: raft: Failed to send snapshot to {Nonvoter 8077e00d-277b-b6e0-580e-5b468eb1dc8c xxx.xxx.xxx.109:8300}: read tcp xxx.xxx.xxx.108:65044->xxx.xxx.xxx.109:8300: read: connection reset by peer
Mar 30 21:59:48 abcd-efg.test.com consul[29535]: 2022/03/30 21:59:48 [WARN] consul: error getting server health from "prod-database-3" : last request still outstanding
Mar 30 21:59:48 abcd-efg.test.com consul[29535]: consul: error getting server health from "prod-database-3": last request still outstanding
The Consul log on node 109 repeatedly shows:
DEBUG raft-net: xxx.xxx.xxx.109:8300 accepted connection from: xxx.xxx.xxx.108:xxxxx
On node 109, "ss -antp | grep -c 8300" shows a TCP connection count that keeps climbing. The connections are initiated from 108 to 109:8300; viewed on node 109 they are all in the ESTAB state and keep accumulating until the process reports "too many open files".
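The leak described above can be quantified with standard Linux tooling; a diagnostic sketch for node 109 (port 8300 is Consul's Raft port, as in the report; everything else is stock Linux, and the exact filter syntax may vary with the iproute2 version):

```shell
# Count established TCP connections on the Raft port (8300).
# On the affected node this number climbs steadily instead of
# staying roughly constant at cluster size.
ss -ant state established '( sport = :8300 or dport = :8300 )' | wc -l

# The per-process open-file limit that the leak eventually exhausts
# ("too many open files"); 1024 is a common distro default.
ulimit -n
```

Running the count in a loop (e.g. under `watch -n 5`) makes the growth rate obvious and shows roughly how long until the file-descriptor limit is hit again after a restart.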
Restarting the Consul service on 109, and even rebooting the node, did not resolve the problem.
Note that the cluster has been running for 15 months and the problem only appeared recently; neither the operating system nor the Consul configuration was changed in that time.
What is the cause of this problem, and how can the cluster be restored?
Thanks