Consul agent failed to connect to cluster in Kubernetes

I have a Consul cluster running in Kubernetes with the following configuration:

Manifest:
apiVersion: v1
kind: ConfigMap
metadata:
  name: consul
  namespace: kube-public
data:
  config.json: |
    {
      "log_level": "INFO",
      "bind_addr": "0.0.0.0",
      "client_addr": "0.0.0.0",
      "disable_host_node_id": true,
      "data_dir": "/consul/data",
      "datacenter": "dev",
      "domain": "cluster.local",
      "ports": {
        "https": 8443
      },
      "server": true,
      "bootstrap_expect": 3,
      "retry_interval": "30s",
      "telemetry": {
        "prometheus_retention_time": "5m"
      },
      "ui": true
    }

---

kind: StatefulSet
apiVersion: apps/v1
metadata:
  name: consul
  namespace: kube-public
spec:
  serviceName: consul
  replicas: 3
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app: consul
  template:
    metadata:
      labels:
        app: consul
    spec:
      securityContext:
        fsGroup: 1000
      containers:
        - name: consul
          image: consul:1.9
          imagePullPolicy: Always
          env:
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            - name: GOSSIP_ENCRYPTION_KEY
              valueFrom:
                secretKeyRef:
                  name: consul
                  key: gossip-encryption-key
            - name: NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
          args:
            - "agent"
            - "-advertise=$(POD_IP)"
            - "-retry-join=consul.$(NAMESPACE).svc.cluster.local"
            # - "-retry-join=consul-0.consul.$(NAMESPACE).svc.cluster.local"
            # - "-retry-join=consul-1.consul.$(NAMESPACE).svc.cluster.local"
            # - "-retry-join=consul-2.consul.$(NAMESPACE).svc.cluster.local"
            - "-config-file=/etc/consul/config/config.json"
            - "-encrypt=$(GOSSIP_ENCRYPTION_KEY)"
          volumeMounts:
            - name: data
              mountPath: /consul/data
            - name: config
              mountPath: /etc/consul/config
          lifecycle:
            preStop:
              exec:
                command:
                - /bin/sh
                - -c
                - consul leave
          ports:
            - containerPort: 8500
              name: ui
            - containerPort: 8400
              name: alt
            - containerPort: 53
              name: udp
            - containerPort: 8443
              name: https
            - containerPort: 8080
              name: http
            - containerPort: 8301
              name: serflan
            - containerPort: 8302
              name: serfwan
            - containerPort: 8600
              name: consuldns
            - containerPort: 8300
              name: server
      volumes:
        - name: config
          configMap:
            name: consul
  volumeClaimTemplates:
    - metadata:
        name: data
        labels:
          app: consul
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: aws-gp2
        resources:
          requests:
            storage: 1Gi

---

apiVersion: v1
kind: Service
metadata:
  name: consul
  namespace: kube-public
  labels:
    name: consul
spec:
  clusterIP: None
  ports:
    - name: http
      port: 8500
      targetPort: 8500
    - name: https
      port: 8443
      targetPort: 8443
    - name: rpc
      port: 8400
      targetPort: 8400
    - name: serflan-tcp
      protocol: "TCP"
      port: 8301
      targetPort: 8301
    - name: serflan-udp
      protocol: "UDP"
      port: 8301
      targetPort: 8301
    - name: serfwan-tcp
      protocol: "TCP"
      port: 8302
      targetPort: 8302
    - name: serfwan-udp
      protocol: "UDP"
      port: 8302
      targetPort: 8302
    - name: server
      port: 8300
      targetPort: 8300
    - name: consuldns
      port: 8600
      targetPort: 8600
  selector:
    app: consul

---

Vault HA is running a consul-agent sidecar with the following configuration:

Manifest:
kind: ConfigMap
apiVersion: v1
metadata:
  name: vault-config
  namespace: dev-backend
  labels:
    app: vault
data:
  config.json: |
    {
      "listener": {
        "tcp":{
          "address": "0.0.0.0:8200",
          "tls_disable": "true"
        }
      },
      "storage": {
        "consul": {
          "address": "consul.kube-public.svc.cluster.local:8500",
          "path": "dev-vault/",
          "disable_registration": "true",
          "ha_enabled": "true"
        }
      },
      "max_lease_ttl": "720h",
      "default_lease_ttl": "336h",
      "ui": true
    }

---

kind: Deployment
apiVersion: apps/v1
metadata:
  name: vault
  namespace: dev-backend
  labels:
    app: vault
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vault
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: vault
    spec:
      containers:
      - name: vault
        image: vault:1.6.1
        imagePullPolicy: Always
        command: ["vault", "server", "-config", "/vault/config/config.json"]
        securityContext:
          capabilities:
            add:
              - IPC_LOCK
        env:
          - name: VAULT_ADDR
            value: 'http://127.0.0.1:8200'
        volumeMounts:
          - name: vault-config
            mountPath: /vault/config/config.json
            subPath: config.json
        ports:
          - name: vault
            containerPort: 8200
      - name: consul-agent
        image: consul:1.9
        imagePullPolicy: Always
        env:
          - name: NAMESPACE
            valueFrom:
              fieldRef:
                fieldPath: metadata.namespace
          - name: GOSSIP_ENCRYPTION_KEY       # Required to connect to consul cluster
            valueFrom:
              secretKeyRef:
                name: consul
                key: gossip-encryption-key
        args:
          - "agent"
          - "-retry-join=consul.kube-public.svc.cluster.local"
          - "-encrypt=$(GOSSIP_ENCRYPTION_KEY)"
          - "-domain=cluster.local"
          - "-datacenter=dev"
          - "-disable-host-node-id"
          - "-node=vault-1"
      volumes:
      - name: vault-config
        configMap:
          name: vault-config
          items:
            - key: config.json
              path: config.json

---

apiVersion: v1
kind: Service
metadata:
  name: vault
  namespace: dev-backend
  labels:
    app: vault
spec:
  ports:
    - name: vault
      port: 8200
      targetPort: 8200
  selector:
    app: vault

--- 

When I performed an upgrade on Vault, the pod restarted. When it then tried to connect to the Consul cluster, I noticed the following error messages:

[ERROR] agent.client.memberlist.lan: memberlist: Conflicting address for vault-1. Mine: 10.2.17.93:8301 Theirs: 10.2.18.5:8301 Old state: 0
[ERROR] agent.client.serf.lan: serf: Node name conflicts with another node at 10.2.18.5:8301. Names must be unique! (Resolution enabled: true)
...
[ERROR] agent.client: RPC failed to server: method=Catalog.Register server=10.2.18.46:8300 error="rpc error making call: rpc error making call: failed inserting node: Error while renaming Node ID: "18afec1a-ae83-1dde-1271-a77ebd26dbd5": Node name vault-1 is reserved by node d1eeb692-ea90-9eb3-a7b5-34082a297dfc with name vault-1 (10.2.18.5)"
[WARN]  agent: Syncing node info failed.: error="rpc error making call: rpc error making call: failed inserting node: Error while renaming Node ID: "18afec1a-ae83-1dde-1271-a77ebd26dbd5": Node name vault-1 is reserved by node d1eeb692-ea90-9eb3-a7b5-34082a297dfc with name vault-1 (10.2.18.5)"
[ERROR] agent.anti_entropy: failed to sync remote state: error="rpc error making call: rpc error making call: failed inserting node: Error while renaming Node ID: "18afec1a-ae83-1dde-1271-a77ebd26dbd5": Node name vault-1 is reserved by node d1eeb692-ea90-9eb3-a7b5-34082a297dfc with name vault-1 (10.2.18.5)"

and this consul-agent (vault-1) failed to show up under Nodes in the Consul UI.

I have set -disable-host-node-id for the consul-agent. What else am I missing here?

Can anyone help with this?

Hi @roy,

Maybe you have already fixed this issue, but I am sharing this here for future reference.

In Consul, an agent can only be registered with a single IP address at any given time. If you want to change the IP of a Consul agent, this should happen only after the existing node has left the cluster.
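
For example, if a stale registration is stuck after an ungraceful exit, you can check which IP currently owns the name and force the old entry out from one of the server pods. This is only a sketch; the consul-0 pod name and the kube-public namespace are taken from your manifests, adjust as needed.

    # see which address currently claims the node name
    kubectl exec -n kube-public consul-0 -- consul members
    # remove the stale registration so the new agent can claim the name
    kubectl exec -n kube-public consul-0 -- consul force-leave vault-1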

In your case, what is happening is this: you run the Consul client in a Kubernetes Deployment, and the update strategy for the Deployment is RollingUpdate. So when you make any change to the Deployment, Kubernetes creates a new pod, and only once it passes the necessary health checks does the old pod get terminated. On top of this, you are passing a fixed -node name to the Consul agent.

So when a rolling update starts, there is a new pod with a Consul client trying to register to the server with the same node name (vault-1 in your case), and this registration fails, as there is already an agent registered with the name vault-1.
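
There are two common ways to avoid the clash. One is to make sure the old and the new pod never run at the same time, for example with a Deployment strategy like this (only a sketch; it assumes a brief outage of your single replica is acceptable):

    strategy:
      type: RollingUpdate
      rollingUpdate:
        maxSurge: 0        # do not start the new pod until the old one is gone
        maxUnavailable: 1  # required, maxSurge and maxUnavailable cannot both be 0

The other is to stop hard-coding -node and derive the node name from the pod name, which is unique for every pod of a Deployment. Again just a sketch; POD_NAME is an assumed environment variable fed from the Downward API, and the remaining args stay as in your manifest:

    env:
      - name: POD_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
    args:
      - "agent"
      - "-retry-join=consul.kube-public.svc.cluster.local"
      - "-node=$(POD_NAME)"   # unique per pod, so rollouts do not collide on the name

Note that with unique node names every replaced pod leaves a stale node entry behind unless the agent leaves the cluster gracefully on shutdown, so a preStop hook running consul leave is still a good idea.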

The following is the sequence of events in this scenario; the logs explain it better.

  1. The rolling update starts, and a new pod tries to register with Consul. This is when the Consul server sees two IP addresses claiming the same node name.

    # server log
    [ERROR] agent.server.memberlist.lan: memberlist: Conflicting address for vault-1. Mine: 10.44.0.4:8301 Theirs: 10.44.0.5:8301 Old state: 0
    [WARN]  agent.server.serf.lan: serf: Name conflict for 'vault-1' both 10.44.0.4:8301 and 10.44.0.5:8301 are claiming
    [ERROR] agent.server.memberlist.lan: memberlist: Conflicting address for vault-1. Mine: 10.44.0.4:8301 Theirs: 10.44.0.5:8301 Old state: 0
    [WARN]  agent.server.serf.lan: serf: Name conflict for 'vault-1' both 10.44.0.4:8301 and 10.44.0.5:8301 are claiming
    [ERROR] agent.server.memberlist.lan: memberlist: Conflicting address for vault-1. Mine: 10.44.0.4:8301 Theirs: 10.44.0.5:8301 Old state: 0
    [WARN]  agent.server.serf.lan: serf: Name conflict for 'vault-1' both 10.44.0.4:8301 and 10.44.0.5:8301 are claiming
    
  2. The rolling update completes and the old pod gets deleted. This is when Consul marks the client as failed. Kubernetes sends a SIGTERM signal to the pod, which makes Consul do an ungraceful exit instead of leaving the cluster properly (you can handle this using the preStop hook; see the sketch further down).

    # server log
    [INFO]  agent.server.serf.lan: serf: EventMemberFailed: vault-1 10.44.0.4
    [INFO]  agent.server: member failed, marking health critical: member=vault-1
    
  3. Now there is only one IP claiming the name vault-1, so the Consul server decides to update the address of the agent.

    # server log
    [INFO]  agent.server.memberlist.lan: memberlist: Updating address for left or failed node vault-1 from 10.44.0.4:8301 to 10.44.0.5:8301
    [INFO]  agent.server.serf.lan: serf: EventMemberJoin: vault-1 10.44.0.5
    [INFO]  agent.server: member joined, marking health alive: member=vault-1
    
  4. Interestingly, the Consul server then marks the agent as failed again. The reason for this can be found in the client logs: the Serf subsystem is shut down because name conflict resolution failed.

     # client log
     [WARN]  agent.client.serf.lan: serf: minority in name conflict resolution, quiting [0 / 1]
     [WARN]  agent.client.serf.lan: serf: Shutdown without a Leave
    

    And as a result, the server is not able to talk to port 8301, which is the Serf LAN port.

    [INFO]  agent.server.serf.lan: serf: attempting reconnect to vault-1 10.44.0.5:8301
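
To make the shutdown graceful, you can run consul leave from a preStop hook and give the pod enough time to finish it. A sketch for the consul-agent sidecar; the 30-second value is an assumption (it is also the Kubernetes default):

    spec:
      terminationGracePeriodSeconds: 30   # must outlast the consul leave in the preStop hook
      containers:
        - name: consul-agent
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "consul leave"]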
    

@Ranjandas

Thanks for explaining this scenario.

I was able to fix this by updating the Deployment strategy and adding a preStop hook to the consul-agent container:

 strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1

...

        lifecycle:
          preStop:
            exec:
              command:
              - /bin/sh
              - -c
              - consul leave