Use Case
I need to run a stateful application in active-passive mode across 2 geographic sites for disaster recovery. At any given time, only one instance should be active to avoid data corruption or conflicting operations.
Setup:
- Site 1 (Primary): 3-5 Consul servers + Application instance
- Site 2 (DR): 3-5 Consul servers + Application instance
- Site 3 (Tiebreaker): 1-3 Consul servers (no app, just for quorum)
Requirements:
- Automatic failover when primary site fails completely
- Critical: Absolute guarantee against split-brain (both sites active simultaneously)
- The application cannot tolerate even brief periods of dual-active state
Approach
Using Consul sessions + KV lock:
- App acquires lock on `service/myapp/leader` with session
- Session TTL = 30s with continuous renewals
- On failure, session expires → lock released → failover
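For concreteness, here is a minimal sketch of that flow against Consul's HTTP API (`/v1/session/create`, `/v1/kv/<key>?acquire=<session>`, `/v1/session/renew/<id>`). The agent address, the session name `myapp-leader`, the 10s renew interval, and the helper names are assumptions made for illustration, not my actual code.

```python
import json
import time
import urllib.request

CONSUL = "http://127.0.0.1:8500"   # assumed: local Consul agent address
LOCK_KEY = "service/myapp/leader"


def session_create_request(ttl="30s", base=CONSUL):
    """Build PUT /v1/session/create; 'release' frees the lock when the session is invalidated."""
    body = json.dumps({"Name": "myapp-leader", "TTL": ttl, "Behavior": "release"}).encode()
    return urllib.request.Request(f"{base}/v1/session/create", data=body, method="PUT")


def acquire_request(session_id, holder, key=LOCK_KEY, base=CONSUL):
    """Build PUT /v1/kv/<key>?acquire=<session>; the response body is true or false."""
    url = f"{base}/v1/kv/{key}?acquire={session_id}"
    return urllib.request.Request(url, data=holder.encode(), method="PUT")


def renew_request(session_id, base=CONSUL):
    """Build PUT /v1/session/renew/<id>, which resets the session TTL."""
    return urllib.request.Request(f"{base}/v1/session/renew/{session_id}", method="PUT")


def send(req):
    """Send a request and decode the JSON response; raises on HTTP or network errors."""
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.loads(resp.read())


def hold_lock(holder, renew_every=10):
    """Acquire the lock, then renew until a renewal fails (needs a reachable agent)."""
    session_id = send(session_create_request())["ID"]
    if not send(acquire_request(session_id, holder)):
        return False  # another session currently holds the lock
    while True:
        time.sleep(renew_every)
        # An exception here means the renewal did not go through:
        # the caller has to treat that as "stop being active".
        send(renew_request(session_id))
```

Calling `hold_lock("site1")` on the active instance would block while the lock is held and raise once a renewal fails, which is the point where the app is expected to stop doing work.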
The Problem: Split-Brain During Network Partition?
Since WAN federation uses async gossip between datacenters:
Network partition: Site 1 isolated from Site 2 & 3
Site 1:
- Holds lock, renews session locally (Raft within Site 1)
- Cannot replicate to Site 2/3
Site 2:
- Stops receiving renewals
- Session expires after 30s
- Acquires lock
Result: Both sites think they hold the lock
Questions
- Can split-brain occur with this setup? My understanding is yes, because:
  - Strong consistency within each DC (Raft)
  - Eventual consistency between DCs (async gossip)
- Is Consul WAN federation designed for this use case? Or is it better suited for service discovery where eventual consistency is acceptable?
- Recommended approach? Should I:
  - Use etcd instead (single Raft across all sites)
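To show what I mean by the single-Raft option, here is a rough sketch of the majority arithmetic. It assumes one Raft group with 3 + 3 + 1 voters (one pick from the ranges in my setup) and the usual majority rule of floor(n/2) + 1; the function names are made up for illustration.

```python
def quorum(voters: int) -> int:
    """Majority needed to commit a write in a Raft group of `voters` members."""
    return voters // 2 + 1


def can_commit(reachable: int, voters: int) -> bool:
    """True if a partition that can reach `reachable` voters still has a majority."""
    return reachable >= quorum(voters)


# Assumed voter counts per site (within the 3-5 / 3-5 / 1-3 ranges above).
sites = {"site1": 3, "site2": 3, "site3": 1}
total = sum(sites.values())

# Partition from the scenario: Site 1 isolated from Site 2 & 3.
isolated = sites["site1"]
rest = total - isolated

print(quorum(total))               # 4
print(can_commit(isolated, total)) # False: Site 1 alone cannot commit lock renewals
print(can_commit(rest, total))     # True: Site 2 + Site 3 still form a majority
```

Under these assumptions the isolated site could not keep renewing through Raft, which is the property I am after. Whether a stale holder could still act on the lock for a short window before noticing is a separate question I would also like input on.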
Environment
- K8s clusters in 3 geographic regions