Error while setting up secondary datacenter with ACL configured

Hello,

I’m trying to deploy multiple architectures with Consul open source. This is what I have been able to do so far (in different deployments):

  1. a single datacenter setup with ACL enabled :white_check_mark:
  2. a multi datacenter setup with ACL disabled :white_check_mark:
  3. a multi datacenter setup with ACL enabled :x:

In deployment 3, the error happens when I try to join the secondary cluster to the primary datacenter. This is the CLI I’m using from the node of the secondary dc:

consul join -token="XXXXX" -wan <public_ip_of_node_in_primary_dc>

The token I’m using in the above CLI is the initial_management token I generated when bootstrapping ACLs in the primary datacenter.

The error I’m getting is the following:

Error joining address '<public_ip>': Unexpected response code: 403 (Permission denied: token with AccessorID 'primary-dc-down' lacks permission 'agent:write' on "test-dlo")
Failed to join any nodes.

Strangely enough, when tailing the logs from the primary dc node, I cannot see any request incoming, so I believe the 403 error is coming from the secondary datacenter itself. :thinking:

The config files I’m using are the following:

Primary datacenter:

{
  "datacenter": "minus3-europe",
  "data_dir": "/consuldata",
  "node_name": "ConsulServer-10-0-2-142",
  "server": true,
  "bootstrap_expect": 3,
  "advertise_addr": "10.0.2.142",
  "advertise_addr_wan": "<public_ip>",
  "leave_on_terminate": true,
  "reconnect_timeout": "8h",
  "reconnect_timeout_wan": "8h",
  "retry_join": ["provider=aws tag_key=ConsulAutoJoinSecret tag_value=17d651ad-dfb2-abeb-30c3-81621dd65a17"],
  "log_file": "/consuldata/",
  "log_level": "DEBUG",
  "log_rotate_duration": "24h",
  "log_rotate_max_files": 7,
  "ui_config": {
    "enabled": true
  },
  "bind_addr": "0.0.0.0",
  "addresses": {
    "http": "0.0.0.0"
  },
  "acl": {
    "enabled": true,
    "default_policy": "deny",
    "enable_token_persistence": true,
    "enable_token_replication": true,
    "tokens": {
      "initial_management": "XXXXX"
    }
  },
  "primary_datacenter": "minus3-europe"
}

Secondary DC:

{
  "datacenter": "minus3-us",
  "primary_datacenter": "minus3-europe",
  "data_dir": "/consuldata",
  "node_name": "test-dlo",
  "server": true,
  "bootstrap_expect": 1,
  "advertise_addr": "172.31.7.62",
  "advertise_addr_wan": "<public_ip>",
  "leave_on_terminate": true,
  "reconnect_timeout": "8h",
  "reconnect_timeout_wan": "8h",
  "retry_join": ["provider=aws tag_key=Name tag_value=TEstDLO"],
  "log_file": "/consuldata/",
  "log_level": "DEBUG",
  "log_rotate_duration": "24h",
  "log_rotate_max_files": 7,
  "ui_config": {
    "enabled": true
  },
  "bind_addr": "0.0.0.0",
  "addresses": {
    "http": "0.0.0.0"
  },
  "acl": {
    "enabled": true,
    "default_policy": "deny",
    "down_policy": "deny",
    "enable_token_persistence": true,
    "enable_token_replication": true
  }
}

Besides the error I get when trying to join the WAN, the log file of the node in the secondary DC is polluted with the following:

2022-09-06T11:01:08.461Z [WARN] agent.server.rpc: RPC request for DC is currently failing as no path was found: datacenter=minus3-europe method=ACL.TokenRead
2022-09-06T11:01:08.462Z [ERROR] agent.acl: Error resolving token: error="Error communicating with the ACL Datacenter: No path to datacenter"
2022-09-06T11:01:08.462Z [WARN] agent.server.rpc: RPC request for DC is currently failing as no path was found: datacenter=minus3-europe method=ACL.TokenRead
2022-09-06T11:01:08.462Z [ERROR] agent.acl: Error resolving token: error="Error communicating with the ACL Datacenter: No path to datacenter"
2022-09-06T11:01:08.462Z [WARN] agent: Coordinate update blocked by ACLs: accessorID=primary-dc-down
2022-09-06T11:01:10.815Z [DEBUG] agent.server: federation states are not enabled in the primary dc
2022-09-06T11:01:15.815Z [DEBUG] agent.server: federation states are not enabled in the primary dc
2022-09-06T11:01:20.815Z [DEBUG] agent.server: federation states are not enabled in the primary dc
2022-09-06T11:01:25.815Z [DEBUG] agent.server: federation states are not enabled in the primary dc
2022-09-06T11:01:27.613Z [WARN] agent.server.rpc: RPC request for DC is currently failing as no path was found: datacenter=minus3-europe method=ACL.TokenRead
2022-09-06T11:01:27.614Z [ERROR] agent.acl: Error resolving token: error="Error communicating with the ACL Datacenter: No path to datacenter"
2022-09-06T11:01:27.614Z [WARN] agent.server.rpc: RPC request for DC is currently failing as no path was found: datacenter=minus3-europe method=ACL.TokenRead
2022-09-06T11:01:27.614Z [ERROR] agent.acl: Error resolving token: error="Error communicating with the ACL Datacenter: No path to datacenter"
2022-09-06T11:01:27.614Z [WARN] agent: Coordinate update blocked by ACLs: accessorID=primary-dc-down

I understand that after joining the WAN I still need to configure the replication tokens in the secondary datacenter, but I think that is the next step, right? First and foremost, I need the secondary datacenter to properly join the WAN, and I cannot make progress from here.

Any ideas on what I am missing?

Thanks,
David

In other words, I guess what I’m asking is: do I need to bootstrap ACLs in each DC separately before joining the secondary DC via WAN to the primary one?

The steps that I did (with the outcome above) were basically:

  1. setup primary DC
  2. bootstrap ACLs
  3. setup secondary DC
  4. try to join the secondary DC to the primary one using the tokens from the primary DC

Hi @david.lopes,

Welcome to the HashiCorp Forums!

You should be able to get this working if you add the following into your secondary DC config and restart Consul:

  1. acl { tokens { replication = <replication_token_created_in_primary_dc> }}
  2. retry_join_wan = ["<public_ip_of_node_in_primary_dc>"]

In step 1 above, for testing only (important), you can use the initial-management token from primary to keep things simple.
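
For anything beyond testing, a dedicated replication token is the usual alternative to the initial-management token. The sketch below is an assumption based on the rules described in HashiCorp's ACL replication tutorial, not something taken from this thread: the policy name `replication`, the file name `replication-policy.hcl`, and the exact rules should be checked against the documentation for your Consul version. It expects `CONSUL_HTTP_TOKEN` to hold a management token for the primary DC and skips the Consul calls otherwise.

```shell
#!/bin/sh
# Hypothetical file name for the policy rules; any path works.
POLICY_FILE="${POLICY_FILE:-./replication-policy.hcl}"

# Rules assumed from HashiCorp's ACL replication tutorial; verify for your version.
cat > "$POLICY_FILE" <<'EOF'
acl = "write"
operator = "write"
service_prefix "" {
  policy = "read"
  intentions = "read"
}
EOF

# Run against a server in the primary DC. CONSUL_HTTP_TOKEN must hold a
# management token there; without it (or without the CLI) nothing is created.
if command -v consul >/dev/null 2>&1 && [ -n "${CONSUL_HTTP_TOKEN:-}" ]; then
  consul acl policy create -name replication -rules @"$POLICY_FILE"
  consul acl token create -description "replication token" -policy-name replication
else
  echo "consul CLI or CONSUL_HTTP_TOKEN missing; wrote $POLICY_FILE only"
fi
```

The SecretID printed by the token command is the value that would go into the replication token setting on the secondary.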

What is happening right now is that ACL replication is enabled in the secondary config, but you are not bootstrapping ACLs in the secondary. This leaves the Consul servers in the secondary in an endless loop, trying to reach the primary DC to bring up the ACL subsystem while having no path to that DC.

Now, as the ACL subsystem is not available in secondary, you won’t be able to interact with Consul, which is why your consul join -wan command is failing.
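
Since the configs in this thread are JSON, the two additions can be sketched in that form. The file name `secondary-additions.json` is hypothetical and the placeholders are the ones used above; the keys are meant to be merged into the existing secondary DC config, not to replace it.

```shell
#!/bin/sh
# Hypothetical scratch file holding only the keys to add to the secondary config.
OUT_FILE="${OUT_FILE:-./secondary-additions.json}"

cat > "$OUT_FILE" <<'EOF'
{
  "retry_join_wan": ["<public_ip_of_node_in_primary_dc>"],
  "acl": {
    "tokens": {
      "replication": "<replication_token_created_in_primary_dc>"
    }
  }
}
EOF

echo "wrote $OUT_FILE; merge these keys into the secondary config and restart Consul"
```

The existing `acl` keys (`enabled`, `default_policy`, `enable_token_replication`, and so on) stay as they are; only `tokens.replication` and the top-level `retry_join_wan` are new.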

Please try the above and let us know if it worked.


Yep, that worked flawlessly! Thanks @Ranjandas .

However, this seems a bit counterintuitive. I would have assumed that joining manually after startup and using the retry_join_wan config were equivalent (except for the retry pattern built into the config).

@david.lopes, that assumption is valid as long as the clusters run without ACLs.

It is also valid if you bootstrap the Secondary DC ACLs and then enable ACL replication, and use the privileges of a local token to join the cluster to the WAN manually.

The problem you were facing was that you pretty much locked the secondary cluster by enabling ACLs without either bootstrapping it fully or configuring it in such a way that it could work its way towards bringing up the ACL stack (via retry_join_wan).

I hope this explains your scenario.

It is also valid if you bootstrap the Secondary DC ACLs and then enable ACL replication, and use the privileges of a local token to join the cluster to the WAN manually.

How does this work? If I bootstrap the ACLs in both datacenters and then use the secondary master token to join the WAN, how does the primary datacenter validate the “join request” from the secondary? I mean, the primary DC does not know anything about the tokens generated in the secondary datacenter.

I’m guessing that somewhere, some token that lives in primary DC must be provided when executing the join -wan command.

It is an excellent question. I will try to explain below. One important thing to understand here is that Consul’s gossip layer (Serf) is not controlled by ACLs. Consul ACLs protect only the RPC (agent-to-agent) and HTTP interfaces (UI, API and CLI).

ACLs authenticate requests and authorize access to resources. They also control access to the Consul UI, API, and CLI, as well as secure service-to-service and agent-to-agent communication.
ref: Access Control List (ACL) Overview | Consul by HashiCorp

So when you WAN join a cluster with another, essentially, the server agents in both clusters get added to the Serf WAN Pool of each cluster using gossip. ACLs don’t have any role to play in this.

You can do the following setup to see and understand this in action (on a Linux box). If you are trying the commands on macOS, make sure you add an additional IP to loopback using sudo ifconfig lo0 alias 127.0.0.2.

  • Run a single node consul datacenter (dc1) with full ACLs enabled

    $ consul agent -dev -bind 127.0.0.1 -client 127.0.0.1 -hcl 'acl { enabled = true default_policy="deny" tokens { master="root" agent="root" } }'
    
  • Run another single node consul datacenter (dc2) with no ACLs at all, and WAN join it to dc1.

    $ consul agent -dev -bind 127.0.0.2 -client 127.0.0.2 -datacenter dc2 -retry-join-wan 127.0.0.1
    
  • Check the WAN members against both DCs to find that they have successfully WAN joined.

    # Against DC1, we need to pass token
    $ consul members -wan -token root
    Node                   Address         Status  Type    Build   Protocol  DC   Partition  Segment
    MacBook-Pro.local.dc1  127.0.0.1:8302  alive   server  1.13.1  2         dc1  default    <all>
    MacBook-Pro.local.dc2  127.0.0.2:8302  alive   server  1.13.1  2         dc2  default    <all>
    
    # Against DC2, we don't need tokens as ACLs aren't enabled.
    $ consul members -wan -http-addr 127.0.0.2:8500
    Node                   Address         Status  Type    Build   Protocol  DC   Partition  Segment
    MacBook-Pro.local.dc1  127.0.0.1:8302  alive   server  1.13.1  2         dc1  default    <all>
    MacBook-Pro.local.dc2  127.0.0.2:8302  alive   server  1.13.1  2         dc2  default    <all>
    

    Now that the clusters are WAN joined, you will be able to interact with one cluster from the other (RPCs will get forwarded) under the following conditions:

    • DC1 talking to DC2 doesn’t need any tokens
    • DC2 talking to DC1 needs a valid token from DC1 in its requests

    # Write a KV entry to dc2 from dc1 (no auth required)
    $ consul kv put -datacenter dc2 cluster dc2
    Success! Data written to: cluster
    
    # Writing a KV entry to dc1 from dc2 without token fails
    $ consul kv put -http-addr 127.0.0.2:8500 -datacenter dc1 cluster dc1
    Error! Failed writing data: Unexpected response code: 403 (rpc error making call: Permission denied: 
    token with AccessorID '00000000-0000-0000-0000-000000000002' lacks permission 'key:write' on 
    "cluster")
    
    # The same command with a token from dc1 works
    $ consul kv put -http-addr 127.0.0.2:8500 -datacenter dc1 -token root cluster dc1
    Success! Data written to: cluster
    

I hope the above details help you understand:

  1. Why your consul join -wan command failed.
    When you ran the command, your secondary cluster didn’t have a fully functional ACL system to validate your token because:

    • the ACL system was not bootstrapped
    • the cluster couldn’t replicate the ACLs and finish bootstrapping, as it had no path to the primary DC to replicate them from.
  2. Why it was required to add retry_join_wan for your setup to work.
    When you added retry_join_wan in the config, the consul agent startup sequence successfully populated the WAN Pool. This enabled the leader in DC2 to discover servers in DC1 and later use the replication token to start replicating ACLs by forwarding the replication RPC to DC1.
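
One way to confirm both points on a running secondary is sketched below. `consul members -wan` is the command used earlier in this thread, and `/v1/acl/replication` is Consul's ACL replication status endpoint. The address `127.0.0.1:8500` and the use of `CONSUL_HTTP_TOKEN` are assumptions about your setup, and the live checks are skipped when the CLI, curl, or a token is missing.

```shell
#!/bin/sh
# Assumed host:port of the secondary server's HTTP API; adjust to your setup.
SECONDARY_ADDR="${SECONDARY_ADDR:-127.0.0.1:8500}"

# Build the URL of the ACL replication status endpoint.
replication_url() {
  printf 'http://%s/v1/acl/replication?pretty\n' "$SECONDARY_ADDR"
}

check_federation() {
  # 1. The WAN pool should list servers from both datacenters.
  consul members -wan -http-addr "$SECONDARY_ADDR" -token "$CONSUL_HTTP_TOKEN"
  # 2. Replication status as reported by the secondary.
  curl -s -H "X-Consul-Token: $CONSUL_HTTP_TOKEN" "$(replication_url)"
}

# Only query a live agent when the tools and a token are available.
if command -v consul >/dev/null 2>&1 && command -v curl >/dev/null 2>&1 \
   && [ -n "${CONSUL_HTTP_TOKEN:-}" ]; then
  check_federation
else
  echo "skipping live checks; would query $(replication_url)"
fi
```

In the JSON returned by the endpoint, `Running` and `SourceDatacenter` indicate whether replication is active and where it pulls from, and `LastError` is worth checking if replication does not appear to progress.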

I hope this answers your question.
