The Consul outage recovery documentation, in the scenario “Failure of Multiple Servers in a Multi-Server Cluster”, mentions that both data loss and committing of previously uncommitted data are possible.
I would like more details about these scenarios. In particular, assuming that at least one server that was a member of the quorum survived, what kind of data loss or committing of uncommitted data can occur? Is there any sequence of steps that can prevent these issues from happening?
My assumption is that if the surviving servers include at least one server that was part of the quorum, that server’s log should be perfectly aligned with the log of the Consul instance that was the leader before the outage.
Regards,
Bernardo
Hi @randomswdev,
Here are some examples for both data loss and committing of uncommitted data.
Protocol Background:
- If you have a cluster size of 3, the quorum size is 2.
- Update operations go to the leader.
- Before updates are committed they must be sent to a quorum of followers and the replication must be acknowledged by those followers.
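As a rough illustration of the quorum arithmetic above, here is a minimal sketch. The function names are hypothetical and are not part of any Consul API; the only assumption is that a quorum is a strict majority of the servers.

```python
def quorum_size(servers: int) -> int:
    """Smallest strict majority of a cluster with `servers` members."""
    return servers // 2 + 1


def failure_tolerance(servers: int) -> int:
    """How many servers can be lost while a quorum can still be formed."""
    return servers - quorum_size(servers)


if __name__ == "__main__":
    for n in (1, 3, 5, 7):
        print(f"servers={n} quorum={quorum_size(n)} tolerates={failure_tolerance(n)}")
```

For a cluster of 3 this gives a quorum of 2 and a tolerance of 1 failed server, which is why losing two servers at once is the problematic case.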
Scenario:
- 3 consul servers: Node A (leader), Nodes B and C (followers)
Suppose that one of the followers (Node C) becomes briefly disconnected from the cluster. Node A can still commit data, since Nodes A and B form a quorum. If some commits are made and then both Node A and Node B crash and permanently lose their persisted data, Consul could be recovered from Node C’s state, which will be missing those commits. The result is data loss.
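The data loss case can be sketched with a toy model. This is only an illustration of the scenario described above, not Consul's or Raft's actual implementation; the `write` helper and the log layout are made up for the example.

```python
QUORUM = 2  # quorum size for a 3-server cluster

# Per-node persisted logs; Node A is the leader.
logs = {"A": [], "B": [], "C": []}
committed = []  # entries acknowledged to clients as committed


def write(entry: str, reachable_followers: list[str]) -> bool:
    """Leader appends the entry, replicates it, and commits on quorum."""
    logs["A"].append(entry)
    acks = 1  # the leader counts toward the quorum
    for follower in reachable_followers:
        logs[follower].append(entry)
        acks += 1
    if acks >= QUORUM:
        committed.append(entry)
        return True
    return False


write("x=1", ["B", "C"])  # all nodes healthy
write("x=2", ["B"])       # Node C is disconnected, A and B still form a quorum
write("x=3", ["B"])

# Nodes A and B crash and lose their disks; only Node C's log survives.
recovered = list(logs["C"])
lost = [entry for entry in committed if entry not in recovered]

print("recovered:", recovered)
print("lost committed entries:", lost)
```

In this model Node C was never part of the quorum for the last two writes, so recovering from it drops entries that clients had already seen as committed.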
The more straightforward case is that all servers crash and lose their data on disk, which also results in data loss. The amount of data lost in that case depends on whether snapshots have been taken and on the time since the last snapshot.
Now suppose that instead of Nodes A and B going down, all nodes crash but only the followers lose their data on disk. Additionally, the leader had some updates that were being replicated to the followers but hadn’t been committed yet. If Consul is restored with the leader’s data, the restore operation will take those updates and commit them to a snapshot even though they were previously uncommitted. This is because the index of the last committed log entry is stored only in memory, so that information is lost in the crash.
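The same kind of toy model shows how uncommitted entries can end up committed. Again, this is a simplified sketch of the description above with made-up variable names, not the real recovery code.

```python
# The leader's on-disk log: the last entry was still being replicated.
leader_log = ["x=1", "x=2", "x=3"]

# Held only in memory: the first two entries were acknowledged by a quorum.
commit_index = 2

acknowledged_to_clients = leader_log[:commit_index]

# All nodes crash. The in-memory commit index is gone, and the followers
# have lost their disks, so the leader's log is the only surviving state.
commit_index = None

# Recovery has no way to tell committed from in-flight entries, so the
# whole surviving log is applied and captured in a snapshot.
restored_state = list(leader_log)
newly_committed = restored_state[len(acknowledged_to_clients):]

print("restored:", restored_state)
print("committed only by the restore:", newly_committed)
```

Here `x=3` was never acknowledged to a client before the crash, yet it is part of the restored state afterwards.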
The main way to prevent these issues is to avoid losing multiple servers at a time. Some of the most important things to do are:
- Run Consul servers on well-provisioned machines, as per our reference architecture. If using Consul as a backend for Vault, please reference Vault’s reference architecture for additional guidance.
- If a server fails it should promptly rejoin the cluster.
- Regular snapshots should be taken and stored in a secure location outside of the Consul server hosts.
- Snapshotting can also be improved by running a cluster of Consul snapshot agents that send regular snapshots to S3-compatible endpoints or Azure Blob Storage (Enterprise feature).
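For the regular snapshots mentioned above, a scheduled job could build a timestamped `consul snapshot save` command along these lines. This is a minimal sketch: the helper name, the destination directory, and the file naming scheme are assumptions for illustration, and the destination should be storage outside of the Consul server hosts.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def snapshot_command(dest_dir: str, now: datetime) -> list[str]:
    """Build a `consul snapshot save` command with a timestamped file name."""
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    target = PurePosixPath(dest_dir) / f"consul-{stamp}.snap"
    return ["consul", "snapshot", "save", str(target)]


if __name__ == "__main__":
    # "/mnt/backups" is a hypothetical mount point for off-host storage.
    cmd = snapshot_command("/mnt/backups", datetime.now(timezone.utc))
    print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine that has the `consul` binary and access to the cluster.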
Best,
Freddy