Intermittent “No path to datacenter” issues with Consul WAN Federation via Mesh Gateways

Hello,

I am experiencing stability and connectivity issues with Consul WAN federation using mesh gateways between two datacenters (dc1 and dc2). While the setup initially appears functional, cross-datacenter service connectivity over the mesh is unreliable: it oscillates between working and failing states, frequently surfacing "No path to datacenter" errors.

At a control-plane level, federation appears healthy:

$ consul members -wan
Node          Address            Status  Type    Build   Protocol  DC   Partition  Segment
master-1.dc1  <wan-dc-1>:8302  alive   server  1.22.0  2         dc1  default    <all>
master-1.dc2  <wan-dc-2>:8302  alive   server  1.22.1  2         dc2  default    <all>
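
If latency figures would be useful, I can also collect Serf's own WAN round-trip estimate between the two servers (command sketch only; node names as shown in the members output above):

$ consul rtt -wan master-1.dc1 master-1.dc2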

Service catalogs are visible across datacenters:

$ consul catalog services -datacenter=dc2
consul
mesh-gateway-dc2
web
web-sidecar-proxy

$ consul catalog services -datacenter=dc1
consul
mesh-gateway-dc1
socat
socat-sidecar-proxy

However, runtime behavior contradicts this apparent health. Server logs in dc2 repeatedly report federation, ACL, and config replication failures due to missing WAN paths, alongside Connect CA initialization failures indicating the primary datacenter is intermittently unreachable:

RPC request for DC is currently failing as no path was found: datacenter=dc1
...
Failed to initialize Connect CA: primary datacenter is unreachable
...
handling error in Manager.Notify: CA is uninitialized and unable to sign certificates yet
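
If it would help, I can also capture what each side reports for the mesh gateways themselves (sketch, assuming the standard health API endpoint, with a valid token substituted for the placeholder):

$ curl -s -H "X-Consul-Token: <token>" \
    "http://localhost:8500/v1/health/service/mesh-gateway-dc1?passing"
$ curl -s -H "X-Consul-Token: <token>" \
    "http://localhost:8500/v1/health/service/mesh-gateway-dc2?passing"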

Envoy logs in the primary datacenter (dc1) intermittently show gRPC stream closures due to unauthenticated ACL access, despite using the bootstrap token and a fully bootstrapped ACL system:

unauthenticated: ACL system must be bootstrapped before making any requests that require authorization
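
To help rule out a token problem, my plan is to confirm that the same token resolves locally on each server (sketch; <dc1-bootstrap-token> is the token passed to Envoy below):

$ consul acl token read -self -token="<dc1-bootstrap-token>"

On the dc2 server this should only succeed once ACL replication has copied the token over, so a failure there would line up with the replication errors above.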

In summary, WAN membership and catalog visibility suggest a correct configuration, but federation stability, ACL replication, Connect CA initialization, and dataplane traffic are all intermittently failing. I am currently unable to identify the underlying misconfiguration or systemic issue and would appreciate guidance on where to focus further troubleshooting.

Thank you in advance.

DC1 Node Config:

datacenter = "dc1"
data_dir = "/opt/consul/data"
node_name = "master-1"
client_addr = "0.0.0.0"
advertise_addr = "10.0.0.4"
advertise_addr_wan = "<dc1-wan-addr>"
server = true
log_level = "DEBUG"
bootstrap_expect = 1
tls {
  defaults {
    ca_file   = "/usr/local/share/ca-certificates/stack.crt"
    cert_file = "/etc/consul.d/tls/consul.crt"
    key_file  = "/etc/consul.d/tls/consul.key"
  }
  internal_rpc {
    verify_incoming = true
    verify_outgoing = true
    verify_server_hostname = true
  }
}

ports {
  http = 8500
  https = 8501
  grpc = 8502
  grpc_tls = 8503
  dns = 8600
  serf_lan = 8301
  serf_wan = 8302
}

ui_config {
  enabled = true
}

acl {
  enabled = true
  default_policy = "deny"
  enable_token_persistence = true
  enable_token_replication = true
}

connect {
  enabled = true
  enable_mesh_gateway_wan_federation = true
}

DC1 Envoy Command:

consul connect envoy -gateway=mesh -register -expose-servers \
  -service mesh-gateway-dc1 \
  -address 10.0.0.4:8443 \
  -wan-address <dc1-node-wan-addr>:8443 \
  -ca-file /usr/local/share/ca-certificates/stack.crt \
  -token <dc1-bootstrap-token>

DC2 Node Config:

datacenter = "dc2"
primary_datacenter = "dc1"
data_dir = "/opt/consul/data"
node_name = "master-1"
client_addr = "0.0.0.0"
advertise_addr = "10.1.0.5"
advertise_addr_wan = "<wan-addr-dc2>"
server = true
bootstrap_expect = 1
log_level = "DEBUG"
retry_join = []

tls {
  defaults {
    ca_file   = "/usr/local/share/ca-certificates/stack.crt"
    cert_file = "/etc/consul.d/tls/consul.crt"
    key_file  = "/etc/consul.d/tls/consul.key"
  }
  internal_rpc {
    verify_incoming = true
    verify_outgoing = true
    verify_server_hostname = true
  }
}

ports {
  http = 8500
  https = 8501
  grpc = 8502
  grpc_tls = 8503
  dns = 8600
  serf_lan = 8301
  serf_wan = 8302
}

ui_config {
  enabled = true
}

acl {
  enabled = true
  default_policy = "deny"
  enable_token_persistence = true
  enable_token_replication = true
  down_policy = "extend-cache"
  tokens {
    replication = "<dc1-bootstrap-token>"
  }
}

primary_gateways = ["<wan-addr-dc-1>:8443"]

connect {
  enabled = true
  enable_mesh_gateway_wan_federation = true
}

DC2 Envoy Command:

consul connect envoy -gateway=mesh -register -expose-servers \
  -service mesh-gateway-dc2 \
  -address 10.1.0.5:8443 \
  -wan-address <dc2-node-wan-addr>:8443 \
  -ca-file /usr/local/share/ca-certificates/stack.crt \
  -token <dc1-bootstrap-token>

Reply:

It looks like there is some network hiccup or delay between the two datacenters, given that, as you said, the errors happen intermittently. I would suggest adding the bind_addr setting on the Consul servers in both the primary and secondary datacenters so that they use a WAN-routable IP address, and seeing whether that improves stability.
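
As a rough sketch of that change (the address is a placeholder, following the same <...> convention as the rest of the post), each server's config would gain:

bind_addr = "<wan-routable-addr>"

One more thing worth double-checking, based on my reading of the WAN-federation-via-mesh-gateways docs: primary_datacenter should be set explicitly on the primary's servers as well, and the dc1 config above does not set it:

primary_datacenter = "dc1"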