Bootstrapping Consul Connect WAN federation

Hello, the documentation on WAN federation states that ACLs are a prerequisite. I have set them up in the primary datacenter, but ACL replication to the secondary datacenter cannot happen until there is WAN gossip, and I cannot get WAN gossip because I don’t have ACLs in the secondary DC. I get the following error when trying to start Envoy: Error registering service “gateway-secondary”: Unexpected response code: 403 (ACL not found)

How do we get around this chicken-egg problem?

Thanks

Hi @kkbe,

Welcome to the forums!

Before you start the mesh gateway on the secondary, you have to apply the replication token to the secondary DC. Did you miss this step?

Consul in the secondary DC will use this replication token to kick off the initial sync by talking directly to the primary gateway (defined using the primary_gateways config option). Once this succeeds, you will be able to register and launch the secondary gateway, after which communication will go through the secondary gateway instance.

Hello, thanks for your help, but I’m still really struggling with this. Below are the steps I took:

1 - I create 6 servers in 2 datacenters

  • The primary is do-nyc2, the secondary is do-lon1.
  • Both have SSL set up.

2 - In the primary datacenter (do-nyc2) I bootstrap the ACL system and get the secret. “consul members” in this primary DC now shows the three servers

3 - I create two policies in consul:

  • mesh-gateway:
service_prefix "gateway" {
  policy = "write"
}
service_prefix "" {
  policy = "read"
}
node_prefix "" {
  policy = "read"
}
agent_prefix "" {
  policy = "read"
}
  • replication:
acl = "write"
operator = "write"
service_prefix "" {
  policy = "read"
  intentions = "read"
}

4 - I create three tokens, storing the secrets:

  • 1 for mesh gateway in primary dc
  • 1 for mesh gateway in secondary dc
  • 1 for replication

5 - I run the mesh gateway in the primary dc, node 0

consul connect envoy -gateway=mesh -register \
                     -service "gateway-primary" \
                     -address "$(ip a show eth1 |awk '$1 == "inet" {sub("/.*","",$2); print $2; exit}'):9999" \
                     -wan-address "$(ip a show eth0 |awk '$1 == "inet" {sub("/.*","",$2); print $2; exit}'):9999" \
                     -expose-servers \
                     -token=4337cd61-a3e9-1769-9700-ece230e426d0

At this point, running “consul members -wan” in the primary DC shows me the servers in the secondary DC! hurrah!

6 - in the secondary DC, on each server, I run:

export CONSUL_HTTP_TOKEN=<bootstrap token>
consul acl set-agent-token replication <replication token from step 4>

and I get “ACL token “replication” set successfully”

And this is where I’m stuck. The next step should be running the gateway on the secondary DC:

consul connect envoy -gateway=mesh -register \
                     -service "gateway-secondary" \
                     -address "$(ip a show eth1 |awk '$1 == "inet" {sub("/.*","",$2); print $2; exit}'):9999" \
                     -wan-address "$(ip a show eth0 |awk '$1 == "inet" {sub("/.*","",$2); print $2; exit}'):9999" \
                     -expose-servers \
                     -token=0ca62fd3-6e64-c7e2-9846-f7f95fc3268f

but this gives me an error:

Error registering service "gateway-secondary": Unexpected response code: 403 (could not retrieve initial service_defaults config for service "gateway-secondary": ACL not found)

I check the replication status and it doesn’t seem to be replicating:

$ curl http://localhost:8500/v1/acl/replication?pretty
{
    "Enabled": true,
    "Running": true,
    "SourceDatacenter": "do-nyc2",
    "ReplicationType": "tokens",
    "ReplicatedIndex": 0,
    "ReplicatedRoleIndex": 0,
    "ReplicatedTokenIndex": 0,
    "LastSuccess": "0001-01-01T00:00:00Z",
    "LastError": "2021-08-11T14:56:46Z"
}

I can post configs and logs if necessary, but only on request since this post is already very long

Thank you very much for any help you can provide.

The steps you followed seem to be correct. One thing to note, though: replication should be working before you launch the mesh gateway in the secondary.

Could you please share the configuration of the secondary DC and also the logs from the secondary DC?

Your secondary should start attempting replication as long as primary_datacenter, acl.enable_token_replication = true, and primary_gateways are set in your secondary configuration.

Replication has to be sorted out before you attempt to launch the mesh gateway in the secondary.
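
As a quick sanity check of those three settings, here is a minimal Python sketch. It assumes the agent configuration has already been loaded into a dict (for example from a JSON config file); missing_replication_settings is a hypothetical helper name, and primary_datacenter is the full name of the option referred to above. It only checks the settings named in this reply, nothing else.

```python
def missing_replication_settings(cfg: dict) -> list:
    """Return the names of the replication-related settings that are absent or unset."""
    missing = []
    if not cfg.get("primary_datacenter"):
        missing.append("primary_datacenter")
    if not cfg.get("primary_gateways"):
        missing.append("primary_gateways")
    # Must be explicitly true, not just present.
    if cfg.get("acl", {}).get("enable_token_replication") is not True:
        missing.append("acl.enable_token_replication")
    return missing


# Values mirroring the secondary DC described in this thread.
secondary_cfg = {
    "datacenter": "do-lon1",
    "primary_datacenter": "do-nyc2",
    "primary_gateways": ["<redacted>:9999"],
    "acl": {"enabled": True, "enable_token_replication": True},
}

print(missing_replication_settings(secondary_cfg))  # prints []
```

An empty list only means the settings are present; it says nothing about whether the primary gateway is actually reachable.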

Consul config for server 0 in the secondary DC (do-lon1):

autopilot = {
  cleanup_dead_servers = true
}

bind_addr = "{{GetInterfaceIP \"eth1\"}}"
data_dir = "/var/consul"
datacenter = "do-lon1"
encrypt = "<redacted>"
node_name = "kktest-nomad-server-lon1-0"
primary_datacenter = "do-nyc2" 
ports {
  grpc = 8502
}
retry_join = ["provider=digitalocean region=lon1 tag_name=nomad_server api_token=<redacted>"]
acl = {
  enabled = true
  default_policy = "deny"
  enable_token_persistence = true
  enable_token_replication = true
}
connect {
  enabled = true
  enable_mesh_gateway_wan_federation = true
}
ui = true
server = true
bootstrap_expect = 3

verify_incoming = true,
verify_outgoing = true,
verify_server_hostname = true,
ca_file = "/etc/consul.d/consul-agent-ca.pem",
cert_file = "/etc/consul.d/do-lon1-server-consul-0.pem",
key_file = "/etc/consul.d/do-lon1-server-consul-0-key.pem",
auto_encrypt {
  allow_tls = true
}

primary_gateways = ["<redacted>:9999"]

Here are some relevant log lines:

Aug 11 16:28:28 [ERROR] agent.server.rpc: RPC failed to server in DC: server=10.128.0.4:8300 datacenter=do-nyc2 method=ConfigEntry.ListAll error="rpc error getting client: failed to get conn: dial tcp 10.131.0.141:0->192.241.240.197:9999: i/o timeout"
Aug 11 16:28:28 [WARN]  agent.server.replication.config_entry: replication error (will retry if still leader): error="failed to retrieve remote config entries: rpc error getting client: failed to get conn: dial tcp 10.131.0.141:0->192.241.240.197:9999: i/o timeout"
Aug 11 16:28:28 [ERROR] agent.server.rpc: RPC failed to server in DC: server=10.128.0.4:8300 datacenter=do-nyc2 method=FederationState.List error="rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection"
Aug 11 16:28:28 [WARN]  agent.server.replication.federation_state: replication error (will retry if still leader): error="failed to retrieve federation states: rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection"
Aug 11 16:28:37 [ERROR] agent: Coordinate update error: error="ACL not found"
Aug 11 16:28:44 [ERROR] agent.server.rpc: RPC failed to server in DC: server=10.128.0.5:8300 datacenter=do-nyc2 method=Intention.List error="rpc error getting client: failed to get conn: dial tcp 10.131.0.141:0->192.241.240.197:9999: i/o timeout"
Aug 11 16:28:44 [ERROR] agent.server.connect: error performing intention migration in secondary datacenter, will retry: routine="intention config entry migration" error="rpc error getting client: failed to get conn: dial tcp 10.131.0.141:0->192.241.240.197:9999: i/o timeout"
Aug 11 16:28:44 [ERROR] agent.server.rpc: RPC failed to server in DC: server=10.128.0.5:8300 datacenter=do-nyc2 method=ConnectCA.Roots error="rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection"
Aug 11 16:28:44 [ERROR] agent.server.connect: CA root replication failed, will retry: routine="secondary CA roots watch" error="Error retrieving the primary datacenter's roots: rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection"
Aug 11 16:28:47 [ERROR] agent.server.rpc: RPC failed to server in DC: server=10.128.0.2:8300 datacenter=do-nyc2 method=ACL.TokenRead error="rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection"
Aug 11 16:28:47 [ERROR] agent.anti_entropy: failed to sync remote state: error="ACL not found"
Aug 11 16:28:47 [ERROR] agent.http: Request error: method=GET url=/v1/agent/self from=127.0.0.1:60062 error="ACL not found"
Aug 11 16:28:47 [ERROR] agent.http: Request error: method=GET url=/v1/catalog/service/nomad?dc=do-lon1&near=_agent&stale=&tag=serf&wait=2000ms from=127.0.0.1:60108 error="ACL not found"
Aug 11 16:28:47 [ERROR] agent.http: Request error: method=GET url=/v1/agent/services from=127.0.0.1:60112 error="ACL not found"
Aug 11 16:28:57 [ERROR] agent.server.rpc: RPC failed to server in DC: server=10.128.0.5:8300 datacenter=do-nyc2 method=Catalog.ServiceNodes error="rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection"
Aug 11 16:28:57 [ERROR] agent.http: Request error: method=GET url=/v1/catalog/service/nomad?dc=do-nyc2&near=_agent&stale=&tag=serf&wait=2000ms from=127.0.0.1:60108 error="rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection"
Aug 11 16:29:03 [ERROR] agent: Coordinate update error: error="ACL not found"
Aug 11 16:29:06 [INFO]  agent.server.memberlist.wan: memberlist: Suspect kktest-nomad-server-nyc2-2.do-nyc2 has failed, no acks received
Aug 11 16:29:07 [ERROR] agent.server.rpc: RPC failed to server in DC: server=10.128.0.4:8300 datacenter=do-nyc2 method=ACL.TokenList error="rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection"
Aug 11 16:29:07 [WARN]  agent.server.replication.acl.token: ACL replication error (will retry if still leader): error="failed to retrieve remote ACL tokens: rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection"
Aug 11 16:29:07 [ERROR] agent.server.rpc: RPC failed to server in DC: server=10.128.0.4:8300 datacenter=do-nyc2 method=ACL.PolicyList error="rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection"
Aug 11 16:29:07 [WARN]  agent.server.replication.acl.policy: ACL replication error (will retry if still leader): error="failed to retrieve remote ACL policies: rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection"
Aug 11 16:29:07 [ERROR] agent.server.rpc: RPC failed to server in DC: server=10.128.0.4:8300 datacenter=do-nyc2 method=ACL.RoleList error="rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection"
Aug 11 16:29:07 [WARN]  agent.server.replication.acl.role: ACL replication error (will retry if still leader): error="failed to retrieve remote ACL roles: rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection"
Aug 11 16:29:12 [ERROR] agent.anti_entropy: failed to sync remote state: error="ACL not found"
Aug 11 16:29:17 [WARN]  agent.server.rpc: RPC request to DC is currently failing as no server can be reached: datacenter=do-nyc2
Aug 11 16:29:17 [ERROR] agent.http: Request error: method=GET url=/v1/agent/self from=127.0.0.1:60112 error="ACL not found"
Aug 11 16:29:17 [ERROR] agent.http: Request error: method=GET url=/v1/agent/services from=127.0.0.1:60112 error="ACL not found"
Aug 11 16:29:21 [WARN]  agent.server.rpc: RPC request to DC is currently failing as no server can be reached: datacenter=do-nyc2
Aug 11 16:29:21 [ERROR] agent.server: error performing anti-entropy sync of federation state: error="error performing federation state anti-entropy sync: Remote DC has no server currently reachable"
Aug 11 16:29:23 [ERROR] agent: Coordinate update error: error="ACL not found"
Aug 11 16:29:38 [ERROR] agent.anti_entropy: failed to sync remote state: error="ACL not found"
Aug 11 16:29:41 [ERROR] agent: Coordinate update error: error="ACL not found"
Aug 11 16:29:46 [INFO]  agent.server.memberlist.wan: memberlist: Suspect kktest-nomad-server-nyc2-1.do-nyc2 has failed, no acks received
Aug 11 16:29:47 [WARN]  agent.server.rpc: RPC request to DC is currently failing as no server can be reached: datacenter=do-nyc2
Aug 11 16:29:47 [ERROR] agent.http: Request error: method=GET url=/v1/agent/self from=127.0.0.1:60112 error="ACL not found"
Aug 11 16:29:47 [ERROR] agent.http: Request error: method=GET url=/v1/agent/services from=127.0.0.1:60112 error="ACL not found"
Aug 11 16:29:49 [ERROR] agent.http: Request error: method=GET url=/v1/catalog/service/nomad?dc=do-lon1&near=_agent&stale=&tag=serf&wait=2000ms from=127.0.0.1:60112 error="ACL not found"
Aug 11 16:29:49 [WARN]  agent.server.rpc: RPC request to DC is currently failing as no server can be reached: datacenter=do-nyc2
Aug 11 16:29:49 [ERROR] agent.http: Request error: method=GET url=/v1/catalog/service/nomad?dc=do-nyc2&near=_agent&stale=&tag=serf&wait=2000ms from=127.0.0.1:60112 error="Remote DC has no server currently reachable"
Aug 11 16:29:57 [ERROR] agent.anti_entropy: failed to sync remote state: error="ACL not found"
Aug 11 16:30:03 [ERROR] agent: Coordinate update error: error="ACL not found"
Aug 11 16:30:09 [INFO]  agent.server.memberlist.wan: memberlist: Marking kktest-nomad-server-nyc2-2.do-nyc2 as failed, suspect timeout reached (2 peer confirmations)
Aug 11 16:30:09 [INFO]  agent.server.serf.wan: serf: EventMemberFailed: kktest-nomad-server-nyc2-2.do-nyc2 10.128.0.4
Aug 11 16:30:09 [INFO]  agent.server: Handled event for server in area: event=member-failed server=kktest-nomad-server-nyc2-2.do-nyc2 area=wan
Aug 11 16:30:11 [WARN]  agent.server.rpc: RPC request to DC is currently failing as no server can be reached: datacenter=do-nyc2
Aug 11 16:30:11 [WARN]  agent.server.replication.acl.role: ACL replication error (will retry if still leader): error="failed to retrieve remote ACL roles: Remote DC has no server currently reachable"
Aug 11 16:30:11 [WARN]  agent.server.rpc: RPC request to DC is currently failing as no server can be reached: datacenter=do-nyc2
Aug 11 16:30:11 [WARN]  agent.server.replication.acl.token: ACL replication error (will retry if still leader): error="failed to retrieve remote ACL tokens: Remote DC has no server currently reachable"
Aug 11 16:30:11 [WARN]  agent.server.rpc: RPC request to DC is currently failing as no server can be reached: datacenter=do-nyc2
Aug 11 16:30:11 [WARN]  agent.server.replication.acl.policy: ACL replication error (will retry if still leader): error="failed to retrieve remote ACL policies: Remote DC has no server currently reachable"
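
To see at a glance which connection attempts are timing out, the log lines above can be tallied by source and destination address. This is a minimal Python sketch; dial_timeouts and the regular expression are illustrative assumptions based only on the "dial tcp A->B: i/o timeout" shape seen in these logs, not on any documented log format.

```python
import re
from collections import Counter

# Matches e.g. "dial tcp 10.131.0.141:0->192.241.240.197:9999: i/o timeout"
DIAL_RE = re.compile(r"dial tcp (\S+?)->(\S+?): i/o timeout")


def dial_timeouts(lines):
    """Count timed-out dials, keyed by (source, destination)."""
    counts = Counter()
    for line in lines:
        match = DIAL_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# Three lines copied from the log excerpt above.
sample = [
    'Aug 11 16:28:28 [ERROR] agent.server.rpc: RPC failed to server in DC: server=10.128.0.4:8300 datacenter=do-nyc2 method=ConfigEntry.ListAll error="rpc error getting client: failed to get conn: dial tcp 10.131.0.141:0->192.241.240.197:9999: i/o timeout"',
    'Aug 11 16:28:28 [WARN]  agent.server.replication.config_entry: replication error (will retry if still leader): error="failed to retrieve remote config entries: rpc error getting client: failed to get conn: dial tcp 10.131.0.141:0->192.241.240.197:9999: i/o timeout"',
    'Aug 11 16:28:37 [ERROR] agent: Coordinate update error: error="ACL not found"',
]

for (source, dest), count in dial_timeouts(sample).items():
    print(f"{count} timeout(s) dialing {dest} from {source}")
```

In this excerpt every timed-out dial has the same source (10.131.0.141, the internal interface) and the same destination (the primary gateway on port 9999).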

It looks like there’s a connectivity issue. The nodes have two interfaces. Consul is running on the internal interface (10.131.0.141 on this machine, in the secondary DC) and is failing to connect to 192.241.240.197:9999, which is the public IP of node 0 in the primary DC, where Envoy is running.
However, the instance itself can certainly connect (there’s a default route and I can netcat to the port), so I’m not sure what’s going on. There is obviously some connectivity, because “consul members -wan” in the primary DC shows the nodes in the secondary DC. So… I’m lost.
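
One detail in the log lines may explain the netcat result: each failing dial is shown as 10.131.0.141:0->192.241.240.197:9999, which suggests the connection is made from the internal interface's address, whereas a plain netcat lets the kernel pick the source address. A minimal Python sketch for testing both cases follows. can_connect is a hypothetical helper, and the idea that the source address is what matters here is an assumption drawn from the log format, not something confirmed in this thread.

```python
import socket


def can_connect(dest_host, dest_port, source_ip=None, timeout=3.0):
    """Try a TCP connection, optionally binding the local end to source_ip first."""
    # Port 0 lets the OS choose an ephemeral local port, as in "10.131.0.141:0".
    source = (source_ip, 0) if source_ip else None
    try:
        with socket.create_connection(
            (dest_host, dest_port), timeout=timeout, source_address=source
        ):
            return True
    except OSError:
        # Covers timeouts, refused connections, and failure to bind source_ip.
        return False
```

On the secondary server, can_connect("192.241.240.197", 9999) would mimic the netcat test, while can_connect("192.241.240.197", 9999, source_ip="10.131.0.141") would mimic a dial from the internal address. If the first succeeds and the second fails, the traffic from the internal address is probably not being routed or translated on its way out.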

Thanks again for your help

OK, I think I’ve fixed it. I’ve set up packet forwarding and NAT on the external interface, and now I get:

curl http://localhost:8500/v1/acl/replication?pretty
{
    "Enabled": true,
    "Running": true,
    "SourceDatacenter": "do-nyc2",
    "ReplicationType": "tokens",
    "ReplicatedIndex": 20,
    "ReplicatedRoleIndex": 1,
    "ReplicatedTokenIndex": 24,
    "LastSuccess": "2021-08-11T16:48:22Z",
    "LastError": "2021-08-11T16:47:49Z"
}

I will do some tests and let you know. Thanks
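
Since replication needs to be working before the secondary mesh gateway is launched, it can help to wait on that status rather than check it by hand. The Python sketch below polls the same URL as the curl command in this thread. The helper names are hypothetical, and the rule it applies (a success has been recorded and it is newer than the last error) is only a heuristic based on the two status outputs shown here. It also assumes second-precision UTC timestamps and an endpoint readable without a token, as in the curl call.

```python
import json
import time
import urllib.request
from datetime import datetime

ZERO_TIME = "0001-01-01T00:00:00Z"  # value shown when no success or error has been recorded


def parse_ts(ts):
    """Parse timestamps of the form 2021-08-11T16:48:22Z."""
    if ts == ZERO_TIME:
        return datetime.min
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


def replication_healthy(status):
    """True once a success exists and is newer than the last error."""
    success = parse_ts(status["LastSuccess"])
    error = parse_ts(status["LastError"])
    return success > datetime.min and success > error


def fetch_status(url="http://localhost:8500/v1/acl/replication"):
    """GET the replication status from the local agent."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


def wait_for_replication(fetch=fetch_status, attempts=30, delay=2.0):
    """Poll until replication looks healthy; return False if it never does."""
    for _ in range(attempts):
        if replication_healthy(fetch()):
            return True
        time.sleep(delay)
    return False
```

With the first status posted above (LastSuccess at the zero time), replication_healthy returns False. With the status after the NAT fix (LastSuccess 16:48:22, LastError 16:47:49), it returns True.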