Strange Agent Errors with agent.anti_entropy

I have a federated Consul cluster running in Kubernetes across two regions. I'm seeing the following errors over a couple-minute span in the secondary data center. The cluster otherwise looks healthy while this is happening, and there are no relevant leave or join events at the same time as these errors.

[ERROR] agent.anti_entropy: failed to sync remote state: error="ACL not found"
[ERROR] agent.anti_entropy: failed to sync remote state: error="ACL not found"
[ERROR] agent: Coordinate update error: error="rpc error making call: ACL not found"
[ERROR] agent.anti_entropy: failed to sync remote state: error="ACL not found"
[ERROR] agent: Coordinate update error: error="rpc error making call: ACL not found"
[ERROR] agent.anti_entropy: failed to sync remote state: error="ACL not found"

consul keyring -list shows the same keys in both DCs. I've run debug sessions and captured these errors, but they don't include any supporting information about what might be happening.

Consul version: 1.8.8

Hey Duncan!

First off, welcome to the Consul community! Could you provide us with your config files for the primary and secondary data centers?

Also, how did you set the ACL tokens on your initial setup?

Hi @Amier thanks for getting back to me.

Primary helm config:

  chart:
    repository: https://helm.releases.hashicorp.com
    version: v0.21.0
    name: consul
  values:
    global:
      datacenter: primary
      image: "consul:1.8.8"
      imageK8S: "hashicorp/consul-k8s:0.15.0"
      imageEnvoy: "envoyproxy/envoy:v1.13.0"
      enablePodSecurityPolicies: true

      gossipEncryption:
        secretName: consul-gossip-encryption-key
        secretKey: key

      tls:
        enabled: true
        verify: true
        httpsOnly: true
        caCert:
          secretName: consul-ca-cert
          secretKey: 'tls.crt'
        caKey:
          secretName: consul-ca-key
          secretKey: 'tls.key'

      acls:
        manageSystemACLs: true
        createReplicationToken: true

      federation:
        enabled: true
        createFederationSecret: true

    connectInject:
      enabled: true

    meshGateway:
      enabled: true
      imageEnvoy: "envoyproxy/envoy:v1.13.0"
      wanAddress:
        source: Static
        static: 'consul-mesh'
      service:
        enabled: true
        type: LoadBalancer
        annotations: |-
          external-dns-enabled: "true"
          external-dns.alpha.kubernetes.io/hostname: "consul-mesh"
          service.beta.kubernetes.io/aws-load-balancer-internal: "true"
          service.beta.kubernetes.io/aws-load-balancer-security-groups: "sg-aaa"
        additionalSpec: |-
          loadBalancerSourceRanges:
          - "10.0.0.0/13"

    server:
      replicas: 5
      updatePartition: 5
      storage: 50Gi
      annotations: |
        "prometheus.io/scrape": "true"
        "prometheus.io/port": "8501"
        "prometheus.io/scheme": "https"
      extraConfig: |
        {
          "telemetry": {
            "disable_hostname": true,
            "prometheus_retention_time": "6h"
          },
          "auto_encrypt": {
            "allow_tls": true
          }
        }

    client:
      annotations: |
        "prometheus.io/scrape": "true"
        "prometheus.io/port": "8501"
        "prometheus.io/scheme": "https"
      extraConfig: |
        {
          "telemetry": {
            "disable_hostname": true,
            "prometheus_retention_time": "6h"
          }
        }
      tolerations: |
        - operator: Exists

    syncCatalog:
      enabled: true
      toConsul: true
      toK8S: false
      k8sAllowNamespaces: ['default']
      addK8SNamespaceSuffix: false
      consulPrefix: 'kubernetes-'

Secondary helm config:

  chart:
    repository: https://helm.releases.hashicorp.com
    version: v0.21.0
    name: consul
  values:
    global:
      datacenter: secondary
      image: "consul:1.8.8"
      imageK8S: "hashicorp/consul-k8s:0.15.0"
      imageEnvoy: "envoyproxy/envoy:v1.13.0"
      enablePodSecurityPolicies: true

      # enable gossip encryption
      gossipEncryption:
        secretName: hashicorp-consul-federation
        secretKey: gossipEncryptionKey

      tls:
        enabled: true
        verify: true
        httpsOnly: true
        caCert:
          secretName: hashicorp-consul-federation
          secretKey: caCert
        caKey:
          secretName: hashicorp-consul-federation
          secretKey: caKey

      acls:
        manageSystemACLs: true
        replicationToken:
          secretName: hashicorp-consul-federation
          secretKey: replicationToken

      federation:
        enabled: true

    connectInject:
      enabled: true

    meshGateway:
      enabled: true
      imageEnvoy: "envoyproxy/envoy:v1.13.0"
      wanAddress:
        source: Static
        static: 'consul-mesh.b'
      service:
        enabled: true
        type: LoadBalancer
        annotations: |-
          external-dns-enabled: "true"
          external-dns.alpha.kubernetes.io/hostname: "consul-mesh.b"
          service.beta.kubernetes.io/aws-load-balancer-internal: "true"
          service.beta.kubernetes.io/aws-load-balancer-security-groups: "sg-bbbb"
        additionalSpec: |-
          loadBalancerSourceRanges:
          - "10.0.0.0/13"

    ui:
      enabled: true
      service:
        enabled: true
        type: null

    server:
      replicas: 5
      updatePartition: 5
      storage: 50Gi
      annotations: |
        "prometheus.io/scrape": "true"
        "prometheus.io/port": "8501"
        "prometheus.io/scheme": "https"
      extraConfig: |
        {
          "telemetry": {
            "disable_hostname": true,
            "prometheus_retention_time": "6h"
          },
          "auto_encrypt": {
            "allow_tls": true
          }
        }
      extraVolumes:
      - type: secret
        name: hashicorp-consul-federation
        items:
        - key: serverConfigJSON
          path: config.json
        load: true

    client:
      annotations: |
        "prometheus.io/scrape": "true"
        "prometheus.io/port": "8501"
        "prometheus.io/scheme": "https"
      extraConfig: |
        {
          "telemetry": {
            "disable_hostname": true,
            "prometheus_retention_time": "6h"
          }
        }
      tolerations: |
        - operator: Exists

    syncCatalog:
      enabled: true
      toConsul: true
      toK8S: false
      k8sAllowNamespaces: ['default']
      addK8SNamespaceSuffix: false
      consulPrefix: 'kubernetes-'

Our documentation says the following was run when setting up federation:

consul keyring -install <federated-key> -token <bootstrap-token>
consul keyring -use <federated-key> -token <bootstrap-token>
consul keyring -remove <old-key> -token <bootstrap-token>

Hi, I'm not sure what the issue is, but the logs point to ACLs, not the Consul keyring, which is only used for gossip encryption.

I think this error is saying that the secondary DC servers don’t have a valid ACL token set.
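One thing worth checking on the secondary DC servers is Consul's ACL replication status endpoint; if token replication from the primary is failing, it should show up there. The address and token below are placeholders for your environment:

```shell
# Query the ACL replication status from a secondary-DC server.
# -k is used here only because the example assumes a private CA;
# in practice, pass --cacert with your CA bundle instead.
curl -sk \
  --header "X-Consul-Token: <secondary-dc-token>" \
  https://127.0.0.1:8501/v1/acl/replication
```

If "Running" is false or "LastError" is set in the response, the replication token stored in your hashicorp-consul-federation secret is a good place to start looking.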

@duncaan

I can't find the specific PR, but I think this issue was fixed somewhere between 1.8.8 and the latest version. I would recommend upgrading Consul (upgrade docs linked here). If that doesn't solve the issue, I can try my hand at replicating it so we can figure out whether this is a bug that needs to be fixed.
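Before upgrading, it may also be worth confirming whether the agent's token actually resolves in the secondary DC. A rough sketch (the pod name and token are just examples, and the env vars are needed because your config is HTTPS-only):

```shell
# From a secondary-DC server pod, ask Consul to look up the token the
# CLI itself is using. An "ACL not found" response here mirrors the
# anti-entropy errors and would confirm the token is invalid in this DC.
kubectl exec consul-server-0 -- sh -c \
  'CONSUL_HTTP_ADDR=https://127.0.0.1:8501 CONSUL_HTTP_SSL_VERIFY=false \
   consul acl token read -self -token "<token-to-test>"'
```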

Thank you for your patience!