Removing permissions from default (anonymous) policy causes multi-DC Connect connections to fail

My configuration

I set up Consul on two Kubernetes clusters (let’s call them internal and app-non-prod) with Terraform, the Helm provider, and this Helm chart. My values.yaml on the internal cluster looks like this:

values.yaml
global:
  name: consul
  enablePodSecurityPolicies: true
  image: consul:1.8.0-beta2
  imageK8S: hashicorp/consul-k8s:0.15.0
  tls:
    enabled: true
    enableAutoEncrypt: true
  acls:
    manageSystemACLs: true
    createReplicationToken: true
  gossipEncryption:
    secretName: consul-gossip
    secretKey: key
  federation:
    enabled: true
    createFederationSecret: true
  datadogAnnotations: &datadogAnnotations |
    ad.datadoghq.com/consul.logs: '[{ "source":"consul", "service":"consul" }]'
    ad.datadoghq.com/consul.init_configs: '[{}]'
    ad.datadoghq.com/consul.check_names: '["consul"]'
    ad.datadoghq.com/consul.instances: |
      [{
        "url": "https://%%host%%:8501",
        "acl_token": "ENC[consul_acl_token]",
        "tls_verify": false,
        "tls_ignore_warning": true
      }]

server:
  enabled: true
  extraConfig: |
    {
      "telemetry": {
        "dogstatsd_addr": "127.0.0.1:8125"
      }
    }
  annotations: *datadogAnnotations
  affinity: |
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: {{ template "consul.name" . }}
            release: "{{ .Release.Name }}"
            component: server
        topologyKey: kubernetes.io/hostname
  bootstrapExpect: 3
  connect: true
  replicas: 3
  resources: |
    requests:
      cpu: 10m
      memory: 200Mi
    limits:
      cpu: 100m
      memory: 600Mi
  storage: 10Gi

client:
  annotations: *datadogAnnotations
  enabled: true
  extraConfig: |
    {
      "telemetry": {
        "dogstatsd_addr": "127.0.0.1:8125"
      }
    }
  resources: |
    requests:
      cpu: 10m
      memory: 200Mi
    limits:
      cpu: 100m
      memory: 200Mi
  dns:
    enabled: true

ui:
  enabled: true

connectInject:
  enabled: true
  centralConfig:
    enabled: true

meshGateway:
  enabled: true
  globalMode: remote
  resources: |
    limits:
      cpu: 100m
      memory: 256Mi
    requests:
      cpu: 10m
      memory: 128Mi

syncCatalog:
  enabled: true
  toConsul: false
  toK8S: false

I also pass additional values via Terraform:

Terraform code
module "consul" {
  source = "../helm-release"

  helm_release_name = "consul"
  helm_chart        = "consul"
  helm_version      = "0.21.0"
  helm_repository   = "https://helm.releases.hashicorp.com"
  namespace         = var.namespace
  create_namespace  = true
  wait              = true

  values = file("${path.module}/values.yaml")

  set_values = {
    "global.datacenter"                  = var.datacenter
    "global.gossipEncryption.secretName" = kubernetes_secret.consul_gossip.metadata[0].name
    "meshGateway.service.annotations"    = "external-dns.alpha.kubernetes.io/hostname: ${var.datacenter_mesh_gateway_hostname}"
  }
}

I followed these instructions and created:

  • static-client on the internal cluster with the "consul.hashicorp.com/connect-service-upstreams": "static-server:1234:app-non-prod,static-server:1235" annotation
  • static-server on the app-non-prod cluster
  • static-server on the internal cluster

I removed the default (anonymous-token-policy) permissions by changing its rules to an empty string.
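For reference, this is roughly how I emptied it (a sketch: executed from a server pod with the bootstrap token in CONSUL_HTTP_TOKEN; the policy name is the one created by manageSystemACLs):

# clear all rules from the policy attached to the anonymous token
consul acl policy update -name anonymous-token-policy -rules ''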
Service discovery between the clusters works: the consul-federation secret is picked up and I’m able to list all services and nodes in the WAN federation.
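For completeness, this is roughly how I verify federation (again from a server pod; assumes CONSUL_HTTP_TOKEN is set to the bootstrap token):

# servers from both datacenters should show up as alive
consul members -wan

# both datacenters should be listed
consul catalog datacenters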

Problem with default ACLs

I found out that local (internal to internal) Consul Connect connections work perfectly fine; however, remote (internal to app-non-prod) Connect connections result in an error:

root@static-client:/# curl localhost:1235 # This is server from local DC
"hello world"
root@static-client:/# curl localhost:1234 # This is server from another DC
curl: (56) Recv failure: Connection reset by peer

I managed to connect successfully from internal to app-non-prod after changing the anonymous-token-policy to read-only:

node_prefix "" {
    policy = "read"
}
service_prefix "" {
    policy = "read"
}
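I applied it roughly like this (a sketch; assumes the rules above are saved to anonymous-policy.hcl and that CONSUL_HTTP_TOKEN holds the bootstrap token):

# update the policy attached to the anonymous token with the read-only rules above
consul acl policy update -name anonymous-token-policy -rules @anonymous-policy.hcl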

Results are good:

root@static-client:/# curl localhost:1235 # This is server from local DC
"hello world"
root@static-client:/# curl localhost:1234 # This is server from another DC
"hello world"

Consul Connect began to work immediately (<3s) after I changed ACLs to allow anonymous read node/service access.

It makes me think that the containers injected by the connect-injector use the anonymous token to obtain information about services running in other clusters. Is this intended?
I only started working with Consul last week, so maybe I missed something in the docs. I assumed that the manageSystemACLs: true flag sets up all ACLs needed by the injector, mesh gateways, and the default clients & servers, and I know that the logic hidden behind that flag definitely does the job for servers, clients and mesh gateways.

DNS

I also noticed that setting the default ACL policy to an empty value (as described above) also results in no DNS entries being resolved. I followed the Consul DNS - Kubernetes guide and there’s no mention of ACLs/tokens needed for DNS. I found this section in the Production ACLs guide, however I’m pretty sure this should be handled “automagically” by the manageSystemACLs: true flag.
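For reference, a quick check I use (a sketch; assumes the .consul stub domain from the Consul DNS guide is configured in kube-dns/CoreDNS):

# from an app pod, resolve a service registered in Consul
dig +short static-server.service.consul

# or, from a Consul client pod, query the agent's DNS port directly
dig +short -p 8600 @127.0.0.1 static-server.service.consul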

Consul Connect ACL Tokens

The last question/problem I’m experiencing is a long list of leftover Consul Connect login tokens:

Shouldn’t they be deregistered? When I run consul monitor, I see:

2020-05-31T09:23:28.587Z [WARN]  agent: Service deregistration blocked by ACLs: service=static-client-static-client accessorID=00000000-0000-0000-0000-000000000002
2020-05-31T09:23:28.589Z [ERROR] agent.client: RPC failed to server: method=Catalog.Deregister server=10.15.5.196:8300 error="rpc error making call: rpc error making call: Permission denied"
2020-05-31T09:23:28.589Z [WARN]  agent: Service deregistration blocked by ACLs: service=static-client-static-client-sidecar-proxy accessorID=00000000-0000-0000-0000-000000000002
2020-05-31T09:23:28.590Z [ERROR] agent.client: RPC failed to server: method=Catalog.Deregister server=10.15.6.98:8300 error="rpc error making call: Permission denied"

I bet that as soon as I add write permissions, the tokens will disappear. However, it does not seem right to me - I believe that service deregistration should happen with the accessorID of the service’s own token (as on my screenshot) and not anonymously.
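That accessorID in the logs looks like the built-in anonymous token; a quick way to confirm (a sketch, assuming the bootstrap token in CONSUL_HTTP_TOKEN):

# 00000000-0000-0000-0000-000000000002 is the well-known accessor ID of the anonymous token
consul acl token read -id 00000000-0000-0000-0000-000000000002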

Hi krzysztof, thanks for the detailed issue.

For remote connections, right now we require the anonymous token in the remote cluster to have those permissions. This is because the tokens used by the services in each cluster are local tokens that are only valid in that one cluster. However, to make requests to other clusters, they need read permissions in the other cluster. The relevant Consul issue is https://github.com/hashicorp/consul/issues/7381. In the future we want to remove this requirement, but this is the current workaround.

tl;dr It is intended.

For DNS, the anonymous policy is required to have read permissions (just like for the Connect services) because a DNS query cannot have an ACL token attached, so it is treated like an API call with no token. If you set dns.enabled to true in the Helm chart, we automatically configure the anonymous token. Setting its policy to an empty string removes that DNS behaviour.
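One way to see what the chart configured, for example (assumes a token with ACL read permissions, e.g. the bootstrap token):

# show the rules currently attached to the anonymous token's policy
consul acl policy read -name anonymous-token-policy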

For the Consul ACL tokens hanging around, this is a bug (https://github.com/hashicorp/consul-k8s/issues/265). The fix is in and it will be in the next release.


Thank you a lot for such a detailed answer! Since I expose the Consul UI to other teammates, I decided to allow access only from a specific CIDR (our VPN); the rest of the world should see a 403 now.

Regarding the hanging tokens: recently one of my containers ended up in a boot loop and I got about ~1000 tokens instead of the ~30 that are really in use. I tried to clean them up manually like this:

# Delete every ACL token in a datacenter whose first service identity matches a service name.
function delete_consul_tokens {
  dc="$1"
  servicename="$2"
  # Collect the accessor IDs of matching tokens into a bash array.
  tokens=($(consul acl token list -datacenter="$dc" -format=json | jq -r '.[] | select(.ServiceIdentities[0].ServiceName == "'"${servicename}"'") | .AccessorID'))
  # Iterate over the whole array (a bare $tokens would only expand to the first element in bash).
  for acl in "${tokens[@]}"; do
    consul acl token delete -datacenter="$dc" -id="$acl"
  done
}

delete_consul_tokens app-non-prod rabbitmq

As a result, I saw Consul reporting that each token had been removed successfully (the response is code 200 with true in the body). However, it looks like the tokens are still there. I then tried to delete them using the Consul UI (which sends the same HTTP request as the consul acl token delete command) and it also reports success, yet nothing really changes - the token is still hanging there.
I haven’t seen anything surprising in the logs (like a 403 because of missing ACLs on the underlying clients/servers) and I’m using the bootstrap token myself. Do you have any hints about what’s preventing the deletion of these tokens?
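For completeness, this is roughly how I check whether a given token still exists in each datacenter after the delete (the accessor ID below is a placeholder):

# should report that the token cannot be found once the delete has propagated
consul acl token read -datacenter=internal -id <accessor-id>
consul acl token read -datacenter=app-non-prod -id <accessor-id>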