Is Consul WAN Federation Safe for Geo-Redundant Leader Election?

Use Case

I need to run a stateful application in active-passive mode across 2 geographic sites for disaster recovery. At any given time, only one instance should be active to avoid data corruption or conflicting operations.

Setup:

  • Site 1 (Primary): 3-5 Consul servers + Application instance

  • Site 2 (DR): 3-5 Consul servers + Application instance

  • Site 3 (Tiebreaker): 1-3 Consul servers (no app, just for quorum)

Requirements:

  • Automatic failover when primary site fails completely

  • Critical: Absolute guarantee against split-brain (both sites active simultaneously)

  • The application cannot tolerate even brief periods of dual-active state

Approach

Using Consul sessions + KV lock:

  • App acquires lock on service/myapp/leader with session

  • Session TTL = 30s, renewed periodically well before it expires

  • On failure, session expires → lock released → failover
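For reference, the approach above maps roughly onto these calls in Consul's HTTP API: `PUT /v1/session/create`, `PUT /v1/session/renew/<id>` and `PUT /v1/kv/<key>?acquire=<id>`. This is a minimal standard-library sketch. It assumes an agent at `http://127.0.0.1:8500`, the helper names are hypothetical, and it leaves out lock-delay, blocking queries and retries. Each of these endpoints also accepts an optional `dc` query parameter, which selects the datacenter whose servers handle the request.

```python
import json
import time
import urllib.error
import urllib.parse
import urllib.request
from typing import Optional

CONSUL_ADDR = "http://127.0.0.1:8500"  # assumption: local Consul agent address
LOCK_KEY = "service/myapp/leader"


def consul_put(path: str, params: Optional[dict] = None,
               body: Optional[bytes] = None) -> urllib.request.Request:
    """Build (but do not send) a PUT request for the Consul HTTP API."""
    url = CONSUL_ADDR + path
    if params:
        url += "?" + urllib.parse.urlencode(params)
    return urllib.request.Request(url, data=body, method="PUT")


def create_session_req(ttl: str = "30s", dc: Optional[str] = None):
    # Behavior "release": keys locked by this session are unlocked on invalidation
    body = {"Name": "myapp-leader", "TTL": ttl, "Behavior": "release"}
    return consul_put("/v1/session/create", {"dc": dc} if dc else None,
                      json.dumps(body).encode())


def renew_session_req(session_id: str, dc: Optional[str] = None):
    return consul_put(f"/v1/session/renew/{session_id}", {"dc": dc} if dc else None)


def acquire_lock_req(session_id: str, holder: str, dc: Optional[str] = None):
    params = {"acquire": session_id}
    if dc:
        params["dc"] = dc
    # The request body becomes the key's value; here, the holder's name
    return consul_put(f"/v1/kv/{LOCK_KEY}", params, holder.encode())


def send(req: urllib.request.Request):
    """Send a request and decode its JSON response (needs a running agent)."""
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.loads(resp.read() or b"null")


def leader_loop(holder: str, ttl_s: int = 30) -> None:
    """Hypothetical active/passive loop; it is not called in this sketch."""
    session_id = send(create_session_req(f"{ttl_s}s"))["ID"]
    while True:
        # ?acquire= answers true only if this session now holds the lock
        active = send(acquire_lock_req(session_id, holder)) is True
        print(f"{holder}: {'active' if active else 'standby'}")
        time.sleep(ttl_s / 3)  # renew well inside the TTL
        try:
            send(renew_session_req(session_id))
        except urllib.error.URLError:
            # Renewal failed (agent unreachable or session gone): stop acting as leader
            print(f"{holder}: renewal failed, stepping down")
            return
```

You can inspect the request builders without a running agent. Only `send()` and `leader_loop("site1")` actually contact Consul.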

The Problem: Split-Brain During Network Partition?

Since WAN federation connects the datacenters with asynchronous gossip rather than a shared Raft log, I expect the following during a network partition:

Network partition: Site 1 isolated from Site 2 & 3

Site 1:
- Holds lock, renews session locally (Raft within Site 1)
- Cannot replicate to Site 2/3

Site 2:
- Stops receiving renewals
- Session expires at least 30s (one TTL) after the last renewal it received
- Acquires lock

Result: Both sites think they hold the lock
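The timeline above can be replayed with a toy model. This is a hypothetical simulation, not Consul code. Each site keeps its own view of the lock and drops the holder once a full TTL passes without a renewal, which is the behavior the scenario assumes. It ignores Consul's lock-delay and the fact that a TTL is only a lower bound on expiry. `SiteView` and `first_dual_active` are illustrative names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SiteView:
    """One site's (hypothetical) view of the lock key service/myapp/leader."""
    ttl: int
    holder: Optional[str] = None
    last_renewal: int = 0

    def renew(self, owner: str, now: int) -> None:
        if self.holder == owner:
            self.last_renewal = now

    def try_acquire(self, owner: str, now: int) -> bool:
        # Treat the holder's session as expired after a full TTL without renewals
        if self.holder is not None and now - self.last_renewal >= self.ttl:
            self.holder = None
        if self.holder is None:
            self.holder, self.last_renewal = owner, now
        return self.holder == owner


def first_dual_active(ttl=30, renew_every=10, partition_at=20, horizon=120):
    """Return the first second at which both apps believe they are active."""
    site1, site2 = SiteView(ttl), SiteView(ttl)
    site1.try_acquire("app1", 0)
    site2.try_acquire("app1", 0)  # before the partition both sites agree
    for now in range(0, horizon + 1, 5):
        if now % renew_every == 0:
            site1.renew("app1", now)      # app1 always reaches its local servers
            if now < partition_at:
                site2.renew("app1", now)  # renewals reach Site 2 only pre-partition
        app1_active = site1.holder == "app1"
        app2_active = site2.try_acquire("app2", now)
        if app1_active and app2_active:
            return now
    return None


print(first_dual_active())  # 40: last renewal Site 2 saw (t=10) + 30s TTL
```

Without a partition (for example `partition_at=10_000`), the function returns `None`. Under this model, dual-active therefore starts one TTL after the last renewal that crossed the partition.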

Questions

  1. Can split-brain occur with this setup? My understanding is yes, because:

    • Strong consistency within each DC (Raft)

    • Eventual consistency between DCs (async gossip)

  2. Is Consul WAN federation designed for this use case? Or is it better suited for service discovery where eventual consistency is acceptable?

  3. Recommended approach? Should I:

    • Use etcd instead (a single Raft cluster spanning all three sites)
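A single Raft group spanning all sites depends on majority quorum, whichever tool runs it: at most one side of any partition can hold a strict majority of voters, so at most one side can elect a leader and commit writes. This is a small sketch that assumes a hypothetical layout of 2 + 2 + 1 voting members across Site 1, Site 2 and Site 3. The helper names are illustrative.

```python
from itertools import combinations


def quorum(total_voters: int) -> int:
    """Smallest strict majority of a Raft voter set."""
    return total_voters // 2 + 1


def sides_with_quorum(voters: dict, isolated: set) -> list:
    """Partition the sites into `isolated` vs. the rest; list the sides that keep quorum."""
    total = sum(voters.values())
    isolated_votes = sum(n for site, n in voters.items() if site in isolated)
    rest_votes = total - isolated_votes
    q = quorum(total)
    result = []
    if isolated_votes >= q:
        result.append("isolated")
    if rest_votes >= q:
        result.append("rest")
    return result


layout = {"site1": 2, "site2": 2, "site3": 1}  # hypothetical voter placement
for n in range(1, len(layout)):
    for group in combinations(layout, n):
        print(sorted(group), "->", sides_with_quorum(layout, set(group)))
```

With this layout, an isolated Site 1 (2 of 5 votes) loses quorum while Site 2 plus Site 3 keep it. With only two equal sites (for example 2 + 2), an even split leaves neither side with quorum, which is why the third tiebreaker location matters.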

Environment

  • K8s clusters in 3 geographic regions