I have a 3-node Vault cluster running in a Kubernetes environment.
Node       Address              State       Voter
----       -------              -----       -----
vault_0    100.64.4.49:8201     leader      true
vault_2    100.64.2.121:8201    follower    false
vault_1    100.64.1.159:8201    follower    false
When a node, vault_1, restarts, its IP changes from 100.64.1.159 to 100.64.0.101. On the leader node, vault_0, I can see heartbeat failures for many minutes, which is expected since it is still trying to contact the old IP.
2023-08-03T18:38:26.854Z [DEBUG] storage.raft: failed to contact: server-id=vault_1 time=4m38.444568426s
2023-08-03T18:38:29.335Z [DEBUG] storage.raft: failed to contact: server-id=vault_1 time=4m40.926139138s
2023-08-03T18:38:31.799Z [DEBUG] storage.raft: failed to contact: server-id=vault_1 time=4m43.389510693s
2023-08-03T18:38:33.868Z [ERROR] storage.raft: failed to appendEntries to: peer="{Nonvoter vault_1 100.64.1.159:8201}" error="dial tcp 100.64.1.159:8201: i/o timeout"
2023-08-03T18:38:34.249Z [DEBUG] storage.raft: failed to contact: server-id=vault_1 time=4m45.840082447s
2023-08-03T18:38:35.578Z [ERROR] storage.raft: failed to heartbeat to: peer=100.64.1.159:8201 backoff time=2.5s error="dial tcp 100.64.1.159:8201: i/o timeout"
2023-08-03T18:38:36.704Z [DEBUG] storage.raft: failed to contact: server-id=vault_1 time=4m48.294882758s
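So far these log lines are the only signal I have found. As a stopgap, I could imagine scraping the leader's log for them. Below is a rough sketch, not a real fix: the function names and the 60-second threshold are my own placeholders, and it only assumes the "failed to contact" line format shown above.

```python
import re

# Matches the leader's DEBUG lines, e.g.
# "... storage.raft: failed to contact: server-id=vault_1 time=4m38.444568426s"
FAILED_CONTACT = re.compile(r"failed to contact: server-id=(\S+) time=(\S+)")

# Pieces of a Go-style duration such as "4m", "38.44s" or "250ms".
# "ms" is listed before "m" so that milliseconds are not read as minutes.
DURATION_PART = re.compile(r"(\d+(?:\.\d+)?)(h|ms|m|s)")
UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}


def parse_duration(text):
    """Convert a duration like '4m38.444568426s' to seconds."""
    return sum(float(value) * UNIT_SECONDS[unit]
               for value, unit in DURATION_PART.findall(text))


def unreachable_peers(log_lines, threshold_seconds=60.0):
    """Return {server_id: longest_failure_seconds} for peers the leader
    has failed to contact for at least threshold_seconds."""
    worst = {}
    for line in log_lines:
        match = FAILED_CONTACT.search(line)
        if match:
            server_id = match.group(1)
            elapsed = parse_duration(match.group(2))
            worst[server_id] = max(worst.get(server_id, 0.0), elapsed)
    return {sid: secs for sid, secs in worst.items()
            if secs >= threshold_seconds}
```

Feeding the log excerpt above into unreachable_peers would flag vault_1, but this depends on DEBUG logging being enabled on the leader, which is why I am hoping there is a better way.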
However, vault operator raft autopilot state continues to show vault_1 as healthy with a recent last_contact, so I cannot tell from its output that this is a bad node.
$ vault operator raft autopilot state
Healthy: true
Failure Tolerance: 1
Leader: vault_0
Voters:
   vault_0
   vault_2
   vault_1
Servers:
   vault_0
      Name: vault_0
      Address: 100.64.4.49:8201
      Status: leader
      Node Status: alive
      Healthy: true
      Last Contact: 0s
      Last Term: 4
      Last Index: 54
      Version: 1.14.1
      Node Type: voter
   vault_1
      Name: vault_1
      Address: 100.64.1.159:8201
      Status: voter
      Node Status: alive
      Healthy: true
      Last Contact: 3.591844703s
      Last Term: 4
      Last Index: 52
      Version: 1.14.1
      Node Type: voter
   vault_2
      Name: vault_2
      Address: 100.64.2.121:8201
      Status: voter
      Node Status: alive
      Healthy: true
      Last Contact: 2.911340108s
      Last Term: 4
      Last Index: 54
      Version: 1.14.1
      Node Type: voter
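Since the Address field still shows the old IP, one check I could script myself is to compare the addresses in this output against the IPs the pods currently have. The sketch below is only an illustration: both function names are hypothetical, and current_ips is a mapping I would have to fill in separately (for example from the pod list); nothing here comes from Vault itself.

```python
def parse_autopilot_addresses(output):
    """Extract {server_name: 'ip:port'} from the text output of
    'vault operator raft autopilot state'."""
    addresses = {}
    name = None
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith("Name:"):
            name = line.split(":", 1)[1].strip()
        elif line.startswith("Address:") and name is not None:
            addresses[name] = line.split(":", 1)[1].strip()
    return addresses


def stale_peers(addresses, current_ips):
    """Return {server_name: (raft_ip, actual_ip)} for every server whose
    Raft address no longer matches its current IP.

    current_ips is an assumed, externally supplied {server_name: ip} map.
    """
    stale = {}
    for name, address in addresses.items():
        raft_ip = address.rsplit(":", 1)[0]  # drop the ':8201' port
        actual_ip = current_ips.get(name)
        if actual_ip is not None and actual_ip != raft_ip:
            stale[name] = (raft_ip, actual_ip)
    return stale
```

With the output above and vault_1 now at 100.64.0.101, stale_peers would report vault_1 as stale. This only detects the mismatch, though; it does nothing to recover from it.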
My question is: how can we detect a node whose IP has changed, and how do we recover from this situation?
I can see that several other questions about node IP changes have gone unanswered, so I am hoping this one does better. Thank you for your help.