Hi @ikonia,
The details you are looking for relate to the internals of the Raft protocol. Here are high-level answers to your questions.
- Where is the log file?
In this case, the log file is the raft.db file inside each server agent's Consul data directory. Every write request (state change) that reaches the leader is first written to the leader's raft.db and then replicated to the followers over RPC. Each follower writes the entry to its own raft.db, and once the entry is committed it is applied to an in-memory DB.
This is what the documentation explains below; the "durable storage" it refers to is raft.db:
> Once a cluster has a leader, it is able to accept new log entries. A client can request that a leader append a new log entry (from Raft's perspective, a log entry is an opaque binary blob). The leader then writes the entry to durable storage and attempts to replicate to a quorum of followers. Once the log entry is considered *committed*, it can be applied to a finite state machine. The finite state machine is application specific; in Consul's case, we use MemDB to maintain cluster state. Consul's writes block until it is both *committed* and *applied*. This achieves read after write semantics when used with the consistent mode for queries.
ref: Consensus Protocol | Raft | Consul | HashiCorp Developer
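To make the mapping between those pieces and what you see on disk concrete, here is a minimal Go sketch of how the underlying hashicorp/raft library is typically wired together (an illustration, not Consul's actual code; the function and variable names are my own). The BoltDB store is the durable raft.db log, the file snapshot store becomes the snapshots/ directory, and the FSM argument plays the role Consul's MemDB plays:

```go
// A minimal sketch (not Consul's actual code) of how the underlying
// hashicorp/raft library is typically wired together.
package example

import (
	"net"
	"os"
	"path/filepath"
	"time"

	"github.com/hashicorp/raft"
	raftboltdb "github.com/hashicorp/raft-boltdb"
)

func newRaftNode(dataDir, bindAddr string, fsm raft.FSM) (*raft.Raft, error) {
	// Durable log + stable store: this is the raft.db file discussed above.
	logStore, err := raftboltdb.NewBoltStore(filepath.Join(dataDir, "raft", "raft.db"))
	if err != nil {
		return nil, err
	}

	// Snapshot store: this becomes the <data-dir>/raft/snapshots/ directory.
	snaps, err := raft.NewFileSnapshotStore(filepath.Join(dataDir, "raft"), 2, os.Stderr)
	if err != nil {
		return nil, err
	}

	// Replication between servers happens over this transport.
	addr, err := net.ResolveTCPAddr("tcp", bindAddr)
	if err != nil {
		return nil, err
	}
	transport, err := raft.NewTCPTransport(bindAddr, addr, 3, 10*time.Second, os.Stderr)
	if err != nil {
		return nil, err
	}

	config := raft.DefaultConfig()
	config.LocalID = raft.ServerID(bindAddr)

	// Writes submitted through raft.Apply(...) are first appended to the
	// durable log (raft.db), replicated to a quorum, and only once committed
	// are they applied to the FSM (the role MemDB plays in Consul).
	return raft.NewRaft(config, fsm, logStore, logStore, snaps, transport)
}
```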
- The configuration that sets the path
The path to raft.db is hardcoded to <data-dir>/raft/raft.db; it can't be modified. Only the data directory itself (set via -data-dir or data_dir in the agent configuration) is configurable.
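If you ever want to check which index range a particular server's raft.db still holds (which becomes relevant for the compaction scenario further down), a small sketch like the one below, reading the file directly with the raft-boltdb store, is one way to do it. This is a diagnostic idea of mine rather than an official Consul tool, and it assumes the agent is stopped, since BoltDB takes an exclusive lock on the file:

```go
// Hedged sketch: inspect a stopped server's raft.db offline and print the
// first and last log indexes it still contains. Not an official Consul tool.
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"

	raftboltdb "github.com/hashicorp/raft-boltdb"
)

func main() {
	if len(os.Args) < 2 {
		log.Fatal("usage: inspect <data-dir>")
	}
	dataDir := os.Args[1] // the agent's -data-dir

	store, err := raftboltdb.NewBoltStore(filepath.Join(dataDir, "raft", "raft.db"))
	if err != nil {
		log.Fatal(err)
	}
	defer store.Close()

	first, err := store.FirstIndex()
	if err != nil {
		log.Fatal(err)
	}
	last, err := store.LastIndex()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("raft.db holds log entries %d through %d\n", first, last)
}
```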
- Why would a log entry go missing?
The short answer is that raft.db is periodically compacted by removing older entries. Compaction on its own won't cause the missing-log issue, though.
> Obviously, it would be undesirable to allow a replicated log to grow in an unbounded fashion. Raft provides a mechanism by which the current state is snapshotted and the log is compacted. Because of the FSM abstraction, restoring the state of the FSM must result in the same state as a replay of old logs. This allows Raft to capture the FSM state at a point in time and then remove all the logs that were used to reach that state. This is performed automatically without user intervention and prevents unbounded disk usage while also minimizing time spent replaying logs. One of the advantages of using MemDB is that it allows Consul to continue accepting new transactions even while old state is being snapshotted, preventing any availability issues.
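For reference, the knobs that govern snapshotting and compaction in the underlying hashicorp/raft library look roughly like the sketch below. The values shown are the library defaults; Consul tunes these internally, so treat the exact numbers as illustrative:

```go
// Hedged sketch of the hashicorp/raft configuration fields that control
// snapshotting and log compaction (values are the library defaults).
package example

import (
	"time"

	"github.com/hashicorp/raft"
)

func compactionConfig() *raft.Config {
	config := raft.DefaultConfig()

	// Every SnapshotInterval, check whether at least SnapshotThreshold new
	// entries have been committed since the last snapshot; if so, snapshot
	// the FSM and truncate older entries from raft.db.
	config.SnapshotInterval = 120 * time.Second
	config.SnapshotThreshold = 8192

	// Keep this many trailing entries after compaction so followers that are
	// only slightly behind can still catch up by log replay rather than by a
	// full snapshot install.
	config.TrailingLogs = 10240

	return config
}
```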
One scenario where you would see the "log not found" error is described below.
Let's say you have three server agents, with data constantly replicated between them, and for some reason one node dies. The cluster continues to function, but data is no longer replicated to the failed follower.
Assume that before the node died, all the agents had replicated up to index 100. Writes keep coming in and the index reaches 200. Assume that this is when compaction happens on the leader, so the leader's raft.db now holds indexes 150 to 200.
Now assume the failed node comes back. It rejoins the cluster, and the leader sends it index 201. But this node's last index is 100, so it asks the leader for index 101. The leader then tries to replay logs 101 to 201 from its raft.db, but because raft.db has already been compacted, the leader throws an error saying the log was not found.
But this is still OK: the leader then sends its latest snapshot of the MemDB state (stored in <data-dir>/raft/snapshots/). The recovered node restores this snapshot and is then ready to receive new entries starting from the last index covered by the snapshot.
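Putting the whole scenario together, here is a deliberately simplified sketch (my own illustration, not the library's actual replication code) of the decision the leader makes, using the same index numbers as above:

```go
// Simplified illustration of how a leader catches up a lagging follower:
// replay from raft.db if the needed entries are still there, otherwise fall
// back to installing the latest snapshot.
package main

import "fmt"

func catchUpFollower(followerLastIndex, leaderFirstIndex, leaderLastIndex uint64) string {
	nextNeeded := followerLastIndex + 1 // the follower asks for this index

	if nextNeeded < leaderFirstIndex {
		// The entries the follower needs were compacted away, so log replay
		// fails ("log not found") and the leader ships the latest snapshot
		// from <data-dir>/raft/snapshots/ instead, then resumes replication
		// from the index the snapshot covers.
		return "install snapshot, then replicate from the snapshot's last index"
	}
	// The needed entries are still in raft.db, so plain log replay is enough.
	return fmt.Sprintf("replay log entries %d..%d", nextNeeded, leaderLastIndex)
}

func main() {
	fmt.Println(catchUpFollower(100, 150, 200)) // entries 101..149 compacted: snapshot install
	fmt.Println(catchUpFollower(180, 150, 200)) // entries still present: replay 181..200
}
```

The key point is that compaction only removes entries that are already captured in a snapshot, so a lagging follower can always be brought back using the snapshot plus whatever trailing log entries remain.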
I hope this gives you some direction to explore. There could be more such scenarios, but in general Raft is smart enough to recover. I am not sure what happened with your cluster that made the servers refuse to start.
The following content might help you to understand this better.