We have 5 DCs running Consul 1.6.1 (5 servers per DC).
Last week we got an alert that Consul was down for about a minute, and after checking the dashboard we found that the leader was constantly being re-elected (it happens only in the one DC with the heaviest usage).
(the color indicates which node is the current Consul leader)
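For context, something like this is roughly how the leader flapping can be watched from the /v1/status/leader endpoint (a minimal sketch using the official Go API client; the server address is a placeholder):

```go
package main

import (
	"fmt"
	"time"

	"github.com/hashicorp/consul/api"
)

func main() {
	cfg := api.DefaultConfig()
	cfg.Address = "10.118.210.10:8500" // placeholder: HTTP address of one server
	client, err := api.NewClient(cfg)
	if err != nil {
		panic(err)
	}

	var last string
	for {
		// /v1/status/leader returns the raft address of the current leader,
		// or an empty string while there is no leader.
		leader, err := client.Status().Leader()
		if err != nil {
			fmt.Printf("%s error: %v\n", time.Now().Format(time.RFC3339), err)
		} else if leader != last {
			fmt.Printf("%s leader changed: %q -> %q\n",
				time.Now().Format(time.RFC3339), last, leader)
			last = leader
		}
		time.Sleep(5 * time.Second)
	}
}
```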
After checking the logs on the leader server, we found errors like these within less than a minute before each re-election:
2021/10/15 08:15:45 [ERROR] raft: Failed to make RequestVote RPC to {Voter d5832cb3-9950-6a8e-2f08-b985a146ff58 10.118.210.13:8300}: read tcp 10.118.210.10:34315->10.118.210.13:8300: i/o timeout
2021/10/15 14:54:26 [ERROR] raft: Failed to heartbeat to 10.118.210.11:8300: read tcp 10.118.210.12:48706->10.118.210.11:8300: i/o timeout
2021/10/15 14:54:28 [ERR] yamux: Failed to write body: write tcp 10.118.210.12:8300->10.118.210.11:41106: use of closed network connection
2021/10/15 14:54:35 [ERROR] raft: Failed to AppendEntries to {Voter 2cdc1d9e-64bc-02d4-9275-96e74de1eead 10.118.210.11:8300}: dial tcp 10.118.210.12:0->10.118.210.11:8300: i/o timeout
2021/10/15 14:54:37 [ERROR] raft: Failed to heartbeat to 10.118.210.11:8300: dial tcp 10.118.210.12:0->10.118.210.11:8300: i/o timeout
2021/10/15 14:54:42 [ERROR] raft: peer {Voter 2cdc1d9e-64bc-02d4-9275-96e74de1eead 10.118.210.11:8300} has newer term, stopping replication
2021/10/18 16:39:09 [ERROR] raft: Failed to pipeline AppendEntries to {Voter 2cdc1d9e-64bc-02d4-9275-96e74de1eead 10.118.210.11:8300}: write tcp 10.118.210.10:49065->10.118.210.11:8300: use of closed network connection
2021/10/18 16:39:09 [ERROR] raft: Failed to heartbeat to 10.118.210.11:8300: read tcp 10.118.210.10:39904->10.118.210.11:8300: i/o timeout
2021/10/18 16:39:11 [ERR] yamux: keepalive failed: i/o deadline reached
2021/10/20 15:18:53 [ERROR] raft: Failed to pipeline AppendEntries to {Voter d5832cb3-9950-6a8e-2f08-b985a146ff58 10.118.210.13:8300}: write tcp 10.118.210.12:50095->10.118.210.13:8300: use of closed network connection
2021/10/20 15:18:53 [ERROR] raft: Failed to heartbeat to 10.118.210.13:8300: read tcp 10.118.210.12:58070->10.118.210.13:8300: i/o timeout
2021/10/20 15:19:03 [ERROR] raft: Failed to AppendEntries to {Voter d5832cb3-9950-6a8e-2f08-b985a146ff58 10.118.210.13:8300}: dial tcp 10.118.210.12:0->10.118.210.13:8300: i/o timeout
2021/10/20 15:19:04 [ERROR] raft: Failed to heartbeat to 10.118.210.13:8300: dial tcp 10.118.210.12:0->10.118.210.13:8300: i/o timeout
2021/10/20 15:19:05 [ERR] consul.rpc: RPC error: rpc: can't find method Catalog.NodeServiceList from=10.121.97.44:50019
2021/10/20 15:19:06 [ERR] consul.rpc: RPC error: rpc: can't find method Catalog.NodeServiceList from=10.121.66.89:49003
2021/10/20 15:19:09 [ERR] consul.rpc: RPC error: rpc: can't find method Catalog.NodeServiceList from=10.121.66.185:55411
2021/10/20 15:19:10 [ERROR] raft: peer {Voter d5832cb3-9950-6a8e-2f08-b985a146ff58 10.118.210.13:8300} has newer term, stopping replication
2021/10/20 15:19:16 [ERR] consul.rpc: RPC error: rpc: can't find method Catalog.NodeServiceList from=10.121.97.101:52562
2021/10/20 15:19:17 [ERR] http: Request GET /v1/kv/local/it-messaging/adp-messaging-forwarder/v2/defaultProfile/unix?index=837458760&wait=1800000ms, error: No cluster leader from=10.118.181.208:47428
2021/10/20 15:19:17 [ERR] http: Request GET /v1/kv/local/it-messaging/adp-messaging-forwarder/v2/defaultProfile/unix?index=837458760&wait=1800000ms, error: No cluster leader from=10.118.182.122:49700
At first we thought it was an overloaded network, but strangely we didn't see any traffic increase compared to before the issue started.
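In case it helps with diagnosis, this is a rough way to cross-check inter-server latency from Consul's network coordinates (a sketch using the Go API client; the server node names are placeholders):

```go
package main

import (
	"fmt"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		panic(err)
	}

	// LAN network coordinates for every node in the local datacenter.
	entries, _, err := client.Coordinate().Nodes(nil)
	if err != nil {
		panic(err)
	}

	// Placeholder names for the five server nodes in the affected DC.
	servers := map[string]bool{
		"consul-server-1": true, "consul-server-2": true, "consul-server-3": true,
		"consul-server-4": true, "consul-server-5": true,
	}
	coords := map[string]*api.CoordinateEntry{}
	for _, e := range entries {
		if servers[e.Node] {
			coords[e.Node] = e
		}
	}

	// Estimated round-trip time between each pair of servers.
	for a, ca := range coords {
		for b, cb := range coords {
			if a < b {
				fmt.Printf("%s <-> %s: %v\n", a, b, ca.Coord.DistanceTo(cb.Coord))
			}
		}
	}
}
```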
What could cause these i/o timeouts?
Thanks,
Supakorn Wongsawang