Hi, please excuse my English. I'm trying to set up a 3-node server cluster on Docker Swarm. I found a lot of examples online and built my config from them. The nodes communicate over the internal overlay network.
When I start the stack for the first time, there is no problem and the cluster is OK:
/ # consul members
Node          Address          Status  Type    Build  Protocol  DC  Segment
4cd0d1775e85  10.0.10.5:8301   alive   server  1.9.0  2         dc  <all>
b7bcd93a5168  10.0.10.3:8301   alive   server  1.9.0  2         dc  <all>
c6ea01c01f4d  10.0.10.4:8301   alive   server  1.9.0  2         dc  <all>
I've tried several scenarios to see whether the cluster recovers after a crash test, a reboot, a drain, etc.:
- Crash test (power off the VM): the node becomes "failed", and after the reboot a new node joins the cluster. OK!
- Stop the Docker service: the node becomes "left", and when Docker comes back, a new node joins the cluster. OK!
- Drain one Swarm node: when the node comes back, the Consul nodes can no longer join the cluster:
==> Starting Consul agent...
     Version: '1.9.0'
     Node ID: 'b30c8602-bfd2-d8f2-da0d-4298468a9fdd'
     Node name: '4f834c9b8017'
     Datacenter: 'dc' (Segment: '<all>')
     Server: true (Bootstrap: false)
     Client Addr: [0.0.0.0] (HTTP: 8500, HTTPS: -1, gRPC: -1, DNS: 8600)
     Cluster Addr: 10.0.10.9 (LAN: 8301, WAN: 8302)
     Encrypt: Gossip: true, TLS-Outgoing: false, TLS-Incoming: false, Auto-Encrypt-TLS: false
==> Log data will now stream in as it occurs:
2020-12-03T12:53:00.638Z [WARN] agent: bootstrap_expect > 0: expecting 3 servers
2020-12-03T12:53:00.647Z [WARN] agent.auto_config: bootstrap_expect > 0: expecting 3 servers
==> Consul agent running!
2020-12-03T12:53:07.807Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No cluster leader"
2020-12-03T12:53:08.054Z [WARN] agent.server.raft: heartbeat timeout reached, starting election: last-leader=
2020-12-03T12:53:08.055Z [WARN] agent.server.raft: unable to get address for server, using fallback address: id=ef9e3ddb-fbf9-7da5-5c17-1e4701e7501c fallback=10.0.10.4:8300 error="Could not find address for server id ef9e3ddb-fbf9-7da5-5c17-1e4701e7501c"
2020-12-03T12:53:08.055Z [WARN] agent.server.raft: unable to get address for server, using fallback address: id=d61192f3-5895-5ada-6d60-df7fb00bfe43 fallback=10.0.10.5:8300 error="Could not find address for server id d61192f3-5895-5ada-6d60-df7fb00bfe43"
2020-12-03T12:53:13.671Z [WARN] agent.server.raft: Election timeout reached, restarting election
2020-12-03T12:53:13.672Z [WARN] agent.server.raft: unable to get address for server, using fallback address: id=ef9e3ddb-fbf9-7da5-5c17-1e4701e7501c fallback=10.0.10.4:8300 error="Could not find address for server id ef9e3ddb-fbf9-7da5-5c17-1e4701e7501c"
2020-12-03T12:53:13.672Z [WARN] agent.server.raft: unable to get address for server, using fallback address: id=d61192f3-5895-5ada-6d60-df7fb00bfe43 fallback=10.0.10.5:8300 error="Could not find address for server id d61192f3-5895-5ada-6d60-df7fb00bfe43"
My config (the same on all nodes):
{
  "advertise_addr": "{{ GetInterfaceIP \"eth0\" }}",
  "bind_addr": "{{ GetInterfaceIP \"eth0\" }}",
  "client_addr": "0.0.0.0",
  "data_dir": "/consul/data",
  "datacenter": "dc",
  "leave_on_terminate": true,
  "disable_host_node_id": true,
  "disable_remote_exec": true,
  "http_config": {
    "response_headers": {
      "Access-Control-Allow-Origin": "*"
    }
  },
  "retry_interval": "10s",
  "retry_join": [
    "consul.server"
  ],
  "ports": {
    "http": 8500
  },
  "skip_leave_on_interrupt": true,
  "server_name": "server.dc.consul",
  "bootstrap_expect": 3,
  "server": true,
  "ui_config": {
    "enabled": true
  },
  "autopilot": {
    "cleanup_dead_servers": true
  },
  "disable_update_check": true,
  "telemetry": {
    "disable_compat_1.9": true
  },
  "log_level": "warn",
  "encrypt": "xxxxxxx"
}
And my compose file:
version: '3.8'
services:
  server:
    image: consul:1.9.0
    networks:
      consulnet:
        aliases:
          - consul.server
    command: "consul agent -config-file /consul/config/config.json"
    ports:
      - "8500:8500"
    volumes:
      - /data/prod/todo/consul/config:/consul/config
      - /data/prod/todo/consul/data:/consul/data
    deploy:
      mode: global
      update_config:
        parallelism: 1
        failure_action: rollback
        delay: 30s
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
        window: 120s
      placement:
        constraints:
          - node.role == manager
networks:
  consulnet:
Any ideas, please?