Consul debug hangs

version: 1.4.5

Consul server: 107/108/109

Fault phenomenon:

Running "consul operator raft list-peers" on each of the three nodes returns results:

107 is the leader and 108/109 are followers, but 109 shows "Voter=false".

Running "consul operator raft list-peers -stale" on node 109 hangs.

Running "consul debug" on node 109 also hangs.
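For hanging diagnostics like these, a small sketch using coreutils `timeout` keeps a stuck command from blocking the shell (the 10-second limit is an arbitrary assumption):

```shell
# Bound a potentially hanging diagnostic with coreutils timeout;
# timeout exits with code 124 if the command is killed after 10 seconds
timeout 10 consul operator raft list-peers -stale || echo "command timed out or failed"
```

This makes it easy to script repeated probes of a wedged node without each probe hanging forever.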

The Consul log on the leader node shows:


Mar 30 21:59:46 abcd-efg.test.com consul[29535]: consul: error getting server health from "prod-database-3": last request still outstanding

Mar 30 21:59:46 abcd-efg.test.com consul[29535]: 2022/03/30 21:59:46 [ERR] raft: Failed to get log at index 9243146: log not found

Mar 30 21:59:46 abcd-efg.test.com consul[29535]: raft: Failed to get log at index 9243146: log not found

Mar 30 21:59:46 abcd-efg.test.com consul[29535]: 2022/03/30 21:59:46 [ERR] raft: Failed to install snapshot 40487-9588053-1648643431750: read tcp xxx.xxx.xxx.108:65044->xxx.xxx.xxx.109:8300: read: connection reset by peer

Mar 30 21:59:46 abcd-efg.test.com consul[29535]: raft: Failed to install snapshot 40487-9588053-1648643431750: read tcp xxx.xxx.xxx.108:65044->xxx.xxx.xxx.109:8300: read: connection reset by peer

Mar 30 21:59:46 abcd-efg.test.com consul[29535]: 2022/03/30 21:59:46 [ERR] raft: Failed to send snapshot to {Nonvoter 8077e00d-277b-b6e0-580e-5b468eb1dc8c xxx.xxx.xxx.109:8300}: read tcp xxx.xxx.xxx.108:65044->xxx.xxx.xxx.109:8300: read: connection reset by peer

Mar 30 21:59:46 abcd-efg.test.com consul[29535]: raft: Failed to send snapshot to {Nonvoter 8077e00d-277b-b6e0-580e-5b468eb1dc8c xxx.xxx.xxx.109:8300}: read tcp xxx.xxx.xxx.108:65044->xxx.xxx.xxx.109:8300: read: connection reset by peer

Mar 30 21:59:48 abcd-efg.test.com consul[29535]: 2022/03/30 21:59:48 [WARN] consul: error getting server health from "prod-database-3" : last request still outstanding

Mar 30 21:59:48 abcd-efg.test.com consul[29535]: consul: error getting server health from "prod-database-3": last request still outstanding

The Consul log on node 109 continuously prints the following line:

DEBUG raft-net: xxx.xxx.xxx.109:8300 accepted connection from: xxx.xxx.xxx.108:xxxxx

Running "ss -antp | grep -c 8300" on node 109 shows the TCP connection count continuously increasing.

The TCP connections are initiated from 108 to 109:8300; viewed on node 109, they sit in the ESTAB state and keep accumulating until the process reports "too many open files".
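To quantify the leak while it is happening, a sketch like the following shows the connection count, the process's open descriptors, and the limit being hit (Linux assumed; looking up the agent via `pidof consul` is an assumption about how the service runs):

```shell
# Count ESTABLISHED connections involving the raft port 8300
# (tail -n +2 strips the ss header line)
ss -ant state established '( sport = :8300 or dport = :8300 )' | tail -n +2 | wc -l

# Open file descriptors held by the consul process (pidof lookup is an assumption)
CONSUL_PID=$(pidof consul | awk '{print $1}')
ls "/proc/$CONSUL_PID/fd" | wc -l

# The per-process limit whose exhaustion produces "too many open files"
grep 'Max open files' "/proc/$CONSUL_PID/limits"
```

Sampling these numbers every minute makes the growth rate, and how long until the fd limit, obvious.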

Restarting the Consul service on node 109, and even the operating system, did not solve the problem.

Note that the cluster had been running for 15 months before the problem appeared recently; neither the system nor the Consul configuration changed during that period.

What could be causing this problem, and how can the cluster be restored?

Thanks!

Is it possible that node 109 is having storage or memory issues? Something seems to be hanging the process.

Hi aram, thanks for your reply. Could you briefly explain how to troubleshoot storage and memory issues? Thanks.
Currently both system storage and memory have plenty of free capacity.

Both Vault and Consul are very sensitive to latency and timeouts, so even a small amount of network latency or memory contention can cause a lot of issues without the node exhausting its resources.
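A rough sketch of such resource checks (Linux assumed; `/var/lib/consul` as the data directory is an assumption about your install):

```shell
# Disk headroom on the Consul data directory (path is an assumption)
df -h /var/lib/consul

# Memory headroom and swap activity
free -m

# Recent kernel complaints about OOM kills or disk I/O errors (may need root)
dmesg | grep -iE 'out of memory|oom|i/o error' | tail

# Per-device I/O utilization and latency, if sysstat's iostat is installed
iostat -x 1 3
```

High `%util` or await times in the iostat output would point at storage latency even when free space and free memory both look fine.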

The problem has been solved. Investigation found that a firewall was intercepting and resetting the TCP connections from the leader node to node 109:8300.
The interception happened because a firewall rule that blocks repeated requests was matching the leader's repeated TCP connections to node 109 on port 8300.
Thank you for your help!
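For anyone hitting something similar: one way to confirm that a middlebox (rather than Consul itself) is resetting connections is to look for RST evidence on the affected node. A sketch, assuming Linux and that the capture interface is `eth0`:

```shell
# Kernel TCP reset counters (iproute2's nstat; -a dumps absolute values, -z includes zeros)
nstat -az | grep -i rst

# If tcpdump is available, capture RSTs on the raft port to see which host sends them
# (interface name and root privileges are assumptions):
# tcpdump -ni eth0 'tcp port 8300 and tcp[tcpflags] & tcp-rst != 0'
```

If the RSTs arrive with a TTL or timing inconsistent with the peer itself, a firewall in the path is the likely culprit, as it turned out to be here.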