We have 5 DCs running Consul 1.6.1 (5 servers per DC).
Last week we got an alert that Consul was down for about a minute, and after checking the dashboard we found that the leader was constantly being re-elected (it happens only in the one DC with the heaviest usage).
(the color indicates which node is the current Consul leader)
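For context, something like this is roughly how the leader flapping can be watched from the /v1/status/leader endpoint (a minimal sketch using the official Go API client; the server address is a placeholder):

```go
package main

import (
	"fmt"
	"time"

	"github.com/hashicorp/consul/api"
)

func main() {
	cfg := api.DefaultConfig()
	cfg.Address = "10.118.210.10:8500" // placeholder: HTTP address of one server
	client, err := api.NewClient(cfg)
	if err != nil {
		panic(err)
	}

	var last string
	for {
		// /v1/status/leader returns the raft address of the current leader,
		// or an empty string while there is no leader.
		leader, err := client.Status().Leader()
		if err != nil {
			fmt.Printf("%s error: %v\n", time.Now().Format(time.RFC3339), err)
		} else if leader != last {
			fmt.Printf("%s leader changed: %q -> %q\n",
				time.Now().Format(time.RFC3339), last, leader)
			last = leader
		}
		time.Sleep(5 * time.Second)
	}
}
```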
After checking the logs on the leader server, we found errors like these within less than a minute before each re-election:
2021/10/15 08:15:45 [ERROR] raft: Failed to make RequestVote RPC to {Voter d5832cb3-9950-6a8e-2f08-b985a146ff58 10.118.210.13:8300}: read tcp 10.118.210.10:34315->10.118.210.13:8300: i/o timeout
2021/10/15 14:54:26 [ERROR] raft: Failed to heartbeat to 10.118.210.11:8300: read tcp 10.118.210.12:48706->10.118.210.11:8300: i/o timeout
2021/10/15 14:54:28 [ERR] yamux: Failed to write body: write tcp 10.118.210.12:8300->10.118.210.11:41106: use of closed network connection
2021/10/15 14:54:35 [ERROR] raft: Failed to AppendEntries to {Voter 2cdc1d9e-64bc-02d4-9275-96e74de1eead 10.118.210.11:8300}: dial tcp 10.118.210.12:0->10.118.210.11:8300: i/o timeout
2021/10/15 14:54:37 [ERROR] raft: Failed to heartbeat to 10.118.210.11:8300: dial tcp 10.118.210.12:0->10.118.210.11:8300: i/o timeout
2021/10/15 14:54:42 [ERROR] raft: peer {Voter 2cdc1d9e-64bc-02d4-9275-96e74de1eead 10.118.210.11:8300} has newer term, stopping replication
2021/10/18 16:39:09 [ERROR] raft: Failed to pipeline AppendEntries to {Voter 2cdc1d9e-64bc-02d4-9275-96e74de1eead 10.118.210.11:8300}: write tcp 10.118.210.10:49065->10.118.210.11:8300: use of closed network connection
2021/10/18 16:39:09 [ERROR] raft: Failed to heartbeat to 10.118.210.11:8300: read tcp 10.118.210.10:39904->10.118.210.11:8300: i/o timeout
2021/10/18 16:39:11 [ERR] yamux: keepalive failed: i/o deadline reached
2021/10/20 15:18:53 [ERROR] raft: Failed to pipeline AppendEntries to {Voter d5832cb3-9950-6a8e-2f08-b985a146ff58 10.118.210.13:8300}: write tcp 10.118.210.12:50095->10.118.210.13:8300: use of closed network connection
2021/10/20 15:18:53 [ERROR] raft: Failed to heartbeat to 10.118.210.13:8300: read tcp 10.118.210.12:58070->10.118.210.13:8300: i/o timeout
2021/10/20 15:19:03 [ERROR] raft: Failed to AppendEntries to {Voter d5832cb3-9950-6a8e-2f08-b985a146ff58 10.118.210.13:8300}: dial tcp 10.118.210.12:0->10.118.210.13:8300: i/o timeout
2021/10/20 15:19:04 [ERROR] raft: Failed to heartbeat to 10.118.210.13:8300: dial tcp 10.118.210.12:0->10.118.210.13:8300: i/o timeout
2021/10/20 15:19:05 [ERR] consul.rpc: RPC error: rpc: can't find method Catalog.NodeServiceList from=10.121.97.44:50019
2021/10/20 15:19:06 [ERR] consul.rpc: RPC error: rpc: can't find method Catalog.NodeServiceList from=10.121.66.89:49003
2021/10/20 15:19:09 [ERR] consul.rpc: RPC error: rpc: can't find method Catalog.NodeServiceList from=10.121.66.185:55411
2021/10/20 15:19:10 [ERROR] raft: peer {Voter d5832cb3-9950-6a8e-2f08-b985a146ff58 10.118.210.13:8300} has newer term, stopping replication
2021/10/20 15:19:16 [ERR] consul.rpc: RPC error: rpc: can't find method Catalog.NodeServiceList from=10.121.97.101:52562
2021/10/20 15:19:17 [ERR] http: Request GET /v1/kv/local/it-messaging/adp-messaging-forwarder/v2/defaultProfile/unix?index=837458760&wait=1800000ms, error: No cluster leader from=10.118.181.208:47428
2021/10/20 15:19:17 [ERR] http: Request GET /v1/kv/local/it-messaging/adp-messaging-forwarder/v2/defaultProfile/unix?index=837458760&wait=1800000ms, error: No cluster leader from=10.118.182.122:49700
At first we thought it was an overloaded network, but strangely we didn't see any traffic increase compared to before the issue started.
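In case it helps with diagnosis, this is a rough way to cross-check inter-server latency from Consul's network coordinates (a sketch using the Go API client; the server node names are placeholders):

```go
package main

import (
	"fmt"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		panic(err)
	}

	// LAN network coordinates for every node in the local datacenter.
	entries, _, err := client.Coordinate().Nodes(nil)
	if err != nil {
		panic(err)
	}

	// Placeholder names for the five server nodes in the affected DC.
	servers := map[string]bool{
		"consul-server-1": true, "consul-server-2": true, "consul-server-3": true,
		"consul-server-4": true, "consul-server-5": true,
	}
	coords := map[string]*api.CoordinateEntry{}
	for _, e := range entries {
		if servers[e.Node] {
			coords[e.Node] = e
		}
	}

	// Estimated round-trip time between each pair of servers.
	for a, ca := range coords {
		for b, cb := range coords {
			if a < b {
				fmt.Printf("%s <-> %s: %v\n", a, b, ca.Coord.DistanceTo(cb.Coord))
			}
		}
	}
}
```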
What could cause these i/o timeouts?
Thanks,
Supakorn Wongsawang