Restarting/Deleting one Consul pod causes complete outage

Hi there,

I’ve inherited a Consul setup (v1.4.4) running on a Kubernetes cluster.
In its current state, deleting one of the pods causes the whole Consul cluster to become unreachable.

Firstly, while the deleted pod is still being terminated, requests are presumably still being routed to it, so some requests return 502 errors while others go through. There is no readiness probe defined in the StatefulSet, which would explain this situation.

Secondly, shortly after the deletion of one pod, the other pods start throwing the error
“Waiting for LAN peer consul-2.consul.production.svc…
ping: consul-2.consul.production.svc: Name does not resolve”
And the cluster becomes completely unready.

Adding a readiness probe to prevent the first situation causes the cluster not to come up in the first place.
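For reference, the readiness probe I tried is the commented-out block in the StatefulSet below; stripped down, it is essentially a check against the local agent’s leader endpoint:

readinessProbe:
  exec:
    command:
    - /bin/sh
    - -ec
    - |
      # ready only once the local agent reports a raft leader
      curl http://127.0.0.1:8500/v1/status/leader 2>/dev/null | grep -E '".+"'
  failureThreshold: 2
  initialDelaySeconds: 20
  periodSeconds: 3
  successThreshold: 1
  timeoutSeconds: 5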

Any help or suggestions are appreciated.

The StatefulSet looks like the following:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    chart: consul-3.9.2
    component: consul-consul
    heritage: Tiller
    release: consul
  name: consul
  namespace: production
spec:
  podManagementPolicy: OrderedReady
  replicas: 3
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      component: consul-consul
      release: consul
  serviceName: consul
  template:
    metadata:
      creationTimestamp: null
      labels:
        chart: consul-3.9.2
        component: consul-consul
        heritage: Tiller
        release: consul
      name: consul
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: component
                  operator: In
                  values:
                  - consul-consul
              topologyKey: kubernetes.io/hostname
            weight: 1
      containers:
      - command:
        - /bin/sh
        - -ec
        - |
          set -o pipefail

          if [ -z "$POD_IP"  ]; then
            POD_IP=$(hostname -i)
          fi
          FQDN_SUFFIX="${STATEFULSET_NAME}.${STATEFULSET_NAMESPACE}.svc"
          NODE_NAME="$(hostname -s).${FQDN_SUFFIX}"
          if [ -e /etc/consul/secrets/gossip-key ]; then
            echo "{\"encrypt\": \"$(base64 /etc/consul/secrets/gossip-key)\"}" > /etc/consul/encrypt.json
            GOSSIP_KEY="-config-file /etc/consul/encrypt.json"
          fi

          JOIN_PEERS=""
          JOIN_PEERS=""
          for i in $( seq 0 $((${INITIAL_CLUSTER_SIZE} - 1)) ); do
            JOIN_PEERS="${JOIN_PEERS}${JOIN_PEERS:+ }${STATEFULSET_NAME}-${i}.${FQDN_SUFFIX}"
          done

          JOIN_PEERS=$( printf "%s\n" $JOIN_PEERS | sort | uniq )

          SUCCESS_LOOPS=5
          while [ "$SUCCESS_LOOPS" -gt 0 ]; do
            ALL_READY=true
            JOIN_LAN=""
            for THIS_PEER in $JOIN_PEERS; do
                if PEER_IP="$(ping -c 1 $THIS_PEER | awk -F'[()]' '/PING/{print $2}')" && [ "$PEER_IP" != "" ]; then
                  if [ "${PEER_IP}" != "${POD_IP}" ]; then
                    JOIN_LAN="${JOIN_LAN}${JOIN_LAN:+ } -retry-join=$THIS_PEER"
                  fi
                else
                  ALL_READY=false
                  break
                fi
            done
            if $ALL_READY; then
              SUCCESS_LOOPS=$(( SUCCESS_LOOPS - 1 ))
              echo "LAN peers appear ready, $SUCCESS_LOOPS verifications left"
            else
              echo "Waiting for LAN peer $THIS_PEER..."
            fi
            sleep 1s
          done


          WAN_PEERS=""

          JOIN_WAN=""
          SUCCESS_LOOPS=5
          while [ "$WAN_PEERS" != "" ] && [ "$SUCCESS_LOOPS" -gt 0 ]; do
            ALL_READY=true
            JOIN_WAN=""
            for THIS_PEER in $WAN_PEERS; do
                if PEER_IP="$( ( ping -c 1 $THIS_PEER || true ) | awk -F'[()]' '/PING/{print $2}')" && [ "$PEER_IP" != "" ]; then
                  if [ "${PEER_IP}" != "${POD_IP}" ]; then
                    JOIN_WAN="${JOIN_WAN}${JOIN_WAN:+ } -retry-join-wan=$THIS_PEER"
                  fi
                else
                  ALL_READY=false
                  break
                fi
            done
            if $ALL_READY; then
              SUCCESS_LOOPS=$(( SUCCESS_LOOPS - 1 ))
              echo "WAN peers appear ready, $SUCCESS_LOOPS verifications left"
            else
              echo "Waiting for WAN peer $THIS_PEER..."
            fi
            sleep 1s
          done

          exec /bin/consul agent \
            -config-dir /etc/consul/userconfig/consul-acl-tokens \
            -ui \
            -domain=consul \
            -data-dir=/var/lib/consul \
            -server \
            -bootstrap-expect=$( echo "$JOIN_PEERS" | wc -w ) \
            -disable-keyring-file \
            -bind=0.0.0.0 \
            -advertise=${POD_IP} \
            ${JOIN_LAN} \
            ${JOIN_WAN} \
            ${GOSSIP_KEY} \
            -client=0.0.0.0 \
            -dns-port=${DNSPORT} \
            -http-port=8500
        env:
        - name: INITIAL_CLUSTER_SIZE
          value: "3"
        - name: STATEFULSET_NAME
          value: consul
        - name: POD_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        - name: STATEFULSET_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: DNSPORT
          value: "8600"
        image: consul:1.4.4
        imagePullPolicy: Always
        livenessProbe:
          exec:
            command:
            - consul
            - members
            - -http-addr=http://127.0.0.1:8500
          failureThreshold: 3
          initialDelaySeconds: 300
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        readinessProbe:
          exec:
            command:
            - echo 
            - "1"
          failureThreshold: 3
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
#        readinessProbe:
#          exec:
#            command:
#            - /bin/sh
#            - -ec
#            - |
#              curl http://127.0.0.1:8500/v1/status/leader \
#              2>/dev/null | grep -E '".+"'
#          failureThreshold: 2
#          initialDelaySeconds: 20
#          periodSeconds: 3
#          successThreshold: 1
#          timeoutSeconds: 5
        name: consul
        ports:
        - containerPort: 8500
          name: http
          protocol: TCP
        - containerPort: 8400
          name: rpc
          protocol: TCP
        - containerPort: 8301
          name: serflan-tcp
          protocol: TCP
        - containerPort: 8301
          name: serflan-udp
          protocol: UDP
        - containerPort: 8302
          name: serfwan-tcp
          protocol: TCP
        - containerPort: 8302
          name: serfwan-udp
          protocol: UDP
        - containerPort: 8300
          name: server
          protocol: TCP
        - containerPort: 8600
          name: consuldns-tcp
          protocol: TCP
        - containerPort: 8600
          name: consuldns-udp
          protocol: UDP
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/consul
          name: datadir
        - mountPath: /etc/consul/secrets
          name: gossip-key
          readOnly: true
        - mountPath: /etc/consul/userconfig/consul-acl-tokens
          name: userconfig-consul-acl-tokens
          readOnly: true
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 1000
      terminationGracePeriodSeconds: 30
      volumes:
      - name: gossip-key
        secret:
          defaultMode: 420
          secretName: consul-gossip-key
      - name: userconfig-consul-acl-tokens
        secret:
          defaultMode: 420
          secretName: consul-acl-tokens
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - metadata:
      creationTimestamp: null
      name: datadir
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 1Gi
      storageClassName: managed-nfs-storage
      volumeMode: Filesystem

Hi, it looks like you’re not using our official Helm chart (consul 0.41.1 · hashicorp/hashicorp)? We’d need you to be using that to support you.

Thanks. I would upgrade the stack to the official chart right away if I weren’t blocked. :smile:

Let’s say I’m not looking for “support”, just some insights or ideas on how to resolve the issue. :slight_smile:

Is there a configuration that would at least prevent the cluster from going down when a single pod is deleted?

I’m not sure.

ping: consul-2.consul.production.svc: Name does not resolve

Looks like a KubeDNS issue. Maybe you could deploy the official chart onto a separate cluster and then see what the differences are?
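If it helps for comparison, installing the official chart onto a scratch cluster is roughly the following (the release name and namespace here are just examples; the repo URL is HashiCorp’s standard Helm repository):

helm repo add hashicorp https://helm.releases.hashicorp.com
helm repo update
# install into a throwaway namespace purely to diff the rendered StatefulSet
helm install consul hashicorp/consul --namespace consul-test --create-namespace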

Ok, I did a little more digging.

The name resolution error above appears after the remaining pods get restarted.

The name resolution error comes from the line

                if PEER_IP="$(ping -c 1 $THIS_PEER | awk -F'[()]' '/PING/{print $2}')" && [ "$PEER_IP" != "" ]; then

where it iterates over the peers whose names are built from FQDN_SUFFIX="${STATEFULSET_NAME}.${STATEFULSET_NAMESPACE}.svc"

So, the script appends the pod name to the service name to build the peer addresses, in a loop over the cluster size (in my case 3), and then pings each address. Only after the ping succeeds 5 times for all pods is the agent started.
Until the pods are running, the names consul-*.consul.production.svc do not resolve because there is no endpoint behind them. This is why we get “Name does not resolve”. As soon as the pods are up, name resolution works again.
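To make the mechanism explicit, the relevant part of the startup script boils down to something like this (a simplified restatement of the loop in the manifest above, not the exact code; INITIAL_CLUSTER_SIZE, STATEFULSET_NAME and STATEFULSET_NAMESPACE come from the pod environment):

#!/bin/sh
# build consul-0.consul.production.svc ... consul-2.consul.production.svc
FQDN_SUFFIX="${STATEFULSET_NAME}.${STATEFULSET_NAMESPACE}.svc"
JOIN_PEERS=""
for i in $(seq 0 $((INITIAL_CLUSTER_SIZE - 1))); do
  JOIN_PEERS="${JOIN_PEERS}${JOIN_PEERS:+ }${STATEFULSET_NAME}-${i}.${FQDN_SUFFIX}"
done

# block until every peer name resolves; ping prints "PING <name> (<ip>) ...",
# so awk pulls the IP out of the parentheses. A pod that is not running has
# no DNS record, which is exactly the "Name does not resolve" case.
for THIS_PEER in $JOIN_PEERS; do
  until PEER_IP="$(ping -c 1 "$THIS_PEER" | awk -F'[()]' '/PING/{print $2}')" && [ -n "$PEER_IP" ]; do
    echo "Waiting for LAN peer $THIS_PEER..."
    sleep 1
  done
done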

So, I guess it is not a KubeDNS issue :wink:

The issue is that when I delete one pod, for some reason the other pods die as well. When they restart, they throw the above error, because the peer pods are not running.
Otherwise it would never come to this error in a running pod, because that part of the script runs before the agent is started.

I will try and find the reason why the remaining pods fail and give an update here.