Consul StatefulSet failing with error="No servers to join"

I am trying to spin up a Consul StatefulSet with gossip encryption enabled but no TLS.

Manifest:
apiVersion: v1
kind: ConfigMap
metadata:
  name: consul-config
  namespace: dev-ethernet
data:
  server.json: |
    {
      "bind_addr": "0.0.0.0",
      "client_addr": "0.0.0.0",
      "disable_host_node_id": true,
      "data_dir": "/consul/data",
      "log_level": "INFO",
      "datacenter": "dc1",
      "domain": "cluster.local",
      "ports": {
        "http": 8500
      },
      "retry_join": [
        "provider=k8s label_selector=\"app=consul,component=server\""
      ],
      "server": true,
      "telemetry": {
        "prometheus_retention_time": "5m"
      },
      "ui": true
    }

---

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: consul
  namespace: dev-ethernet
spec:
  selector:
    matchLabels:
      app: consul
      component: server
  serviceName: consul
  podManagementPolicy: Parallel
  replicas: 3
  updateStrategy:
    rollingUpdate:
      partition: 0
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: consul
        component: server
      annotations:
        consul.hashicorp.com/connect-inject: "false"
    spec:
      serviceAccountName: consul
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: consul
                component: server
                release: consul
            topologyKey: kubernetes.io/hostname
      terminationGracePeriodSeconds: 10
      securityContext:
        fsGroup: 1000
      containers:
        - name: consul
          image: "consul:1.8"
          args:
            - "agent"
            - "-advertise=$(POD_IP)"
            - "-bootstrap-expect=3"
            - "-config-file=/etc/consul/config/server.json"
            - "-encrypt=$(GOSSIP_ENCRYPTION_KEY)"
          env:
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            - name: GOSSIP_ENCRYPTION_KEY
              valueFrom:
                secretKeyRef:
                  name: consul-secret
                  key: consul-gossip-encryption-key
          volumeMounts:
            - name: data
              mountPath: /consul/data
            - name: config
              mountPath: /etc/consul/config
          lifecycle:
            preStop:
              exec:
                command:
                - /bin/sh
                - -c
                - consul leave
          ports:
            - containerPort: 8500
              name: ui-port
            - containerPort: 8400
              name: alt-port
            - containerPort: 53
              name: udp-port
            - containerPort: 8080
              name: http-port
            - containerPort: 8301
              name: serflan
            - containerPort: 8302
              name: serfwan
            - containerPort: 8600
              name: consuldns
            - containerPort: 8300
              name: server
      volumes:
        - name: config
          configMap:
            name: consul-config
  volumeClaimTemplates:
  - metadata:
      name: data
      labels:
        app: consul
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: aws-gp2
      resources:
        requests:
          storage: 3Gi

But I get the following error when the pod starts:

==> Starting Consul agent...
           Version: 'v1.8.0'
           Node ID: '3b8399fb-f360-e280-2d2c-9b73cd5cc022'
         Node name: 'consul-0'
        Datacenter: 'dc1' (Segment: '<all>')
            Server: true (Bootstrap: false)
       Client Addr: [0.0.0.0] (HTTP: 8500, HTTPS: -1, gRPC: -1, DNS: 8600)
      Cluster Addr: 10.2.18.108 (LAN: 8301, WAN: 8302)
           Encrypt: Gossip: true, TLS-Outgoing: false, TLS-Incoming: false, Auto-Encrypt-TLS: false

==> Log data will now stream in as it occurs:

    2020-07-31T14:26:38.124Z [INFO]  agent.server.raft: initial configuration: index=0 servers=[]
    2020-07-31T14:26:38.125Z [INFO]  agent.server.serf.wan: serf: EventMemberJoin: consul-0.dc1 10.2.18.108
    2020-07-31T14:26:38.125Z [INFO]  agent.server.raft: entering follower state: follower="Node at 10.2.18.108:8300 [Follower]" leader=
    2020-07-31T14:26:38.127Z [INFO]  agent.server.serf.lan: serf: EventMemberJoin: consul-0 10.2.18.108
    2020-07-31T14:26:38.127Z [INFO]  agent.server: Adding LAN server: server="consul-0 (Addr: tcp/10.2.18.108:8300) (DC: dc1)"
    2020-07-31T14:26:38.127Z [INFO]  agent.server: Handled event for server in area: event=member-join server=consul-0.dc1 area=wan
    2020-07-31T14:26:38.127Z [INFO]  agent: Started DNS server: address=0.0.0.0:8600 network=udp
    2020-07-31T14:26:38.127Z [INFO]  agent: Started DNS server: address=0.0.0.0:8600 network=tcp
    2020-07-31T14:26:38.129Z [INFO]  agent: Started HTTP server: address=[::]:8500 network=tcp
    2020-07-31T14:26:38.129Z [INFO]  agent: started state syncer
==> Consul agent running!
    2020-07-31T14:26:38.129Z [INFO]  agent: Retry join is supported for the following discovery methods: cluster=LAN discovery_methods="aliyun aws azure digitalocean gce k8s linode mdns os packet scaleway softlayer tencentcloud triton vsphere"
    2020-07-31T14:26:38.129Z [INFO]  agent: Joining cluster...: cluster=LAN
    2020-07-31T14:26:38.178Z [INFO]  agent: Discovered servers: cluster=LAN cluster=LAN servers=
    2020-07-31T14:26:38.178Z [WARN]  agent: Join cluster failed, will retry: cluster=LAN retry_interval=30s error="No servers to join"
    2020-07-31T14:26:45.136Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No cluster leader"
    2020-07-31T14:26:47.804Z [WARN]  agent.server.raft: no known peers, aborting election
    2020-07-31T14:27:08.190Z [INFO]  agent: Discovered servers: cluster=LAN cluster=LAN servers=

I would appreciate it if anyone could point out what is wrong here.

Hey @roy,

It's OK to see this error while the Consul servers start up, because they begin with no leader elected and only elect one once they have found each other. The cluster should eventually become healthy. Are you seeing this error persist?
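
If the warning does not clear after a few retry intervals, one thing worth checking is the namespace. As far as I know, the k8s provider used by retry_join searches the "default" namespace unless a namespace is given, while this StatefulSet runs in dev-ethernet. That would be consistent with the empty "Discovered servers" list in the log. A minimal sketch of the ConfigMap with namespace=dev-ethernet added to the provider string follows; only the retry_join entry differs from the original, and it is untested against this cluster:

```yaml
# Sketch only: same ConfigMap as in the question, with an explicit
# namespace in the retry_join provider string (assumed fix, not verified).
apiVersion: v1
kind: ConfigMap
metadata:
  name: consul-config
  namespace: dev-ethernet
data:
  server.json: |
    {
      "bind_addr": "0.0.0.0",
      "client_addr": "0.0.0.0",
      "disable_host_node_id": true,
      "data_dir": "/consul/data",
      "log_level": "INFO",
      "datacenter": "dc1",
      "domain": "cluster.local",
      "ports": {
        "http": 8500
      },
      "retry_join": [
        "provider=k8s namespace=dev-ethernet label_selector=\"app=consul,component=server\""
      ],
      "server": true,
      "telemetry": {
        "prometheus_retention_time": "5m"
      },
      "ui": true
    }
```

The pods have to be restarted to pick up the changed config. After that, running "kubectl exec -n dev-ethernet consul-0 -- consul members" should list all three servers if discovery is working.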
