Consul on swarm cluster

Hi, please forgive my bad English. I’m trying to set up a 3-node Consul server cluster on Docker Swarm. I found a lot of examples online and built my config from them. The nodes communicate over the stack's internal overlay network.

When I start the stack for the first time there is no problem, and the cluster is OK:

/ # consul members
Node          Address         Status  Type    Build  Protocol  DC    Segment
4cd0d1775e85  10.0.10.5:8301  alive   server  1.9.0  2         dc  <all>
b7bcd93a5168  10.0.10.3:8301  alive   server  1.9.0  2         dc  <all>
c6ea01c01f4d  10.0.10.4:8301  alive   server  1.9.0  2         dc  <all>

I’ve tried several scenarios to see whether the cluster recovers after a crash, reboot, drain, etc.

  • Crash test (power off the VM): the node becomes “failed” and, after the reboot, a new node joins the cluster. OK!

  • Stop the Docker service: the node becomes “left” and, when Docker comes back, a new node joins the cluster. OK!

  • Drain one Swarm node: when the node comes back, the Consul nodes can no longer join the cluster.

    ==> Starting Consul agent...
             Version: '1.9.0'
             Node ID: 'b30c8602-bfd2-d8f2-da0d-4298468a9fdd'
           Node name: '4f834c9b8017'
          Datacenter: 'dc' (Segment: '<all>')
              Server: true (Bootstrap: false)
         Client Addr: [0.0.0.0] (HTTP: 8500, HTTPS: -1, gRPC: -1, DNS: 8600)
        Cluster Addr: 10.0.10.9 (LAN: 8301, WAN: 8302)
             Encrypt: Gossip: true, TLS-Outgoing: false, TLS-Incoming: false, Auto-Encrypt-TLS: false
    
    ==> Log data will now stream in as it occurs:
    
      2020-12-03T12:53:00.638Z [WARN]  agent: bootstrap_expect > 0: expecting 3 servers
      2020-12-03T12:53:00.647Z [WARN]  agent.auto_config: bootstrap_expect > 0: expecting 3 servers
    ==> Consul agent running!
      2020-12-03T12:53:07.807Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No cluster leader"
      2020-12-03T12:53:08.054Z [WARN]  agent.server.raft: heartbeat timeout reached, starting election: last-leader=
      2020-12-03T12:53:08.055Z [WARN]  agent.server.raft: unable to get address for server, using fallback address: id=ef9e3ddb-fbf9-7da5-5c17-1e4701e7501c fallback=10.0.10.4:8300 error="Could not find address for server id ef9e3ddb-fbf9-7da5-5c17-1e4701e7501c"
      2020-12-03T12:53:08.055Z [WARN]  agent.server.raft: unable to get address for server, using fallback address: id=d61192f3-5895-5ada-6d60-df7fb00bfe43 fallback=10.0.10.5:8300 error="Could not find address for server id d61192f3-5895-5ada-6d60-df7fb00bfe43"
      2020-12-03T12:53:13.671Z [WARN]  agent.server.raft: Election timeout reached, restarting election
      2020-12-03T12:53:13.672Z [WARN]  agent.server.raft: unable to get address for server, using fallback address: id=ef9e3ddb-fbf9-7da5-5c17-1e4701e7501c fallback=10.0.10.4:8300 error="Could not find address for server id ef9e3ddb-fbf9-7da5-5c17-1e4701e7501c"
      2020-12-03T12:53:13.672Z [WARN]  agent.server.raft: unable to get address for server, using fallback address: id=d61192f3-5895-5ada-6d60-df7fb00bfe43 fallback=10.0.10.5:8300 error="Could not find address for server id d61192f3-5895-5ada-6d60-df7fb00bfe43"
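
To make the repeated warnings easier to read, here is a minimal Python sketch that pulls the server IDs and fallback addresses raft could not resolve out of log lines like the ones above. It is not part of my setup: the function name `unresolved_peers` and the regular expression are only illustrative assumptions, and the sample text is copied from the log.

```python
import re

# Matches the raft warning logged when a server ID from the stored peer set
# cannot be mapped to a currently known address.
PATTERN = re.compile(
    r"unable to get address for server, using fallback address: "
    r"id=(?P<id>[0-9a-f-]+) fallback=(?P<addr>[\d.]+:\d+)"
)


def unresolved_peers(log_text):
    """Return {server_id: fallback_address} for every unresolved raft peer."""
    peers = {}
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match:
            # Repeated warnings for the same ID simply overwrite the entry.
            peers[match.group("id")] = match.group("addr")
    return peers


# Lines copied from the agent log above.
SAMPLE_LOG = """\
2020-12-03T12:53:08.055Z [WARN]  agent.server.raft: unable to get address for server, using fallback address: id=ef9e3ddb-fbf9-7da5-5c17-1e4701e7501c fallback=10.0.10.4:8300 error="Could not find address for server id ef9e3ddb-fbf9-7da5-5c17-1e4701e7501c"
2020-12-03T12:53:08.055Z [WARN]  agent.server.raft: unable to get address for server, using fallback address: id=d61192f3-5895-5ada-6d60-df7fb00bfe43 fallback=10.0.10.5:8300 error="Could not find address for server id d61192f3-5895-5ada-6d60-df7fb00bfe43"
2020-12-03T12:53:13.671Z [WARN]  agent.server.raft: Election timeout reached, restarting election
"""

if __name__ == "__main__":
    for server_id, addr in unresolved_peers(SAMPLE_LOG).items():
        print(server_id, addr)
```

On this log it reports two peers, with fallback addresses 10.0.10.4:8300 and 10.0.10.5:8300. Those are addresses from the first start, while the restarted agent itself now advertises 10.0.10.9.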
    

My config on all nodes:

{
  "advertise_addr" : "{{ GetInterfaceIP \"eth0\" }}",
  "bind_addr": "{{ GetInterfaceIP \"eth0\" }}",
  "client_addr": "0.0.0.0",
  "data_dir": "/consul/data",
  "datacenter": "dc",
  "leave_on_terminate" : true,
  "disable_host_node_id" : true,
  "disable_remote_exec": true,
  "http_config": {
    "response_headers": {
      "Access-Control-Allow-Origin": "*"
    }
  },
  "retry_interval" : "10s",
  "retry_join" : [
    "consul.server"
  ],
  "ports" : {
    "http" : 8500
  },
  "skip_leave_on_interrupt" : true,
  "server_name" : "server.dc.consul",
  "bootstrap_expect": 3,
  "server" : true,
  "ui_config": {
    "enabled": true
  },
  "autopilot": {
    "cleanup_dead_servers": true
  },
  "disable_update_check": true,
  "telemetry": {
    "disable_compat_1.9": true
  },
  "log_level": "warn",
  "encrypt": "xxxxxxx"
}

And my compose file:

version: '3.8'

services:
  server:
    image: consul:1.9.0
    networks:
      consulnet:
        aliases:
          - consul.server
    command: "consul agent -config-file /consul/config/config.json"
    ports:
      - "8500:8500"
    volumes:
      - /data/prod/todo/consul/config:/consul/config
      - /data/prod/todo/consul/data:/consul/data
    deploy:
      mode: global
      update_config:
        parallelism: 1
        failure_action: rollback
        delay: 30s
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
        window: 120s
      placement:
        constraints:
          - node.role == manager

networks:
  consulnet:

Any ideas, please?