Hello,
A new Consul server cannot join the Consul cluster.
Here is my scenario:
We have a 4-node Consul server cluster running in production. One of the instances has scheduled maintenance (AWS degraded hardware), and we need to stop/start that instance.
As this is a production cluster, I want to add another node first, so the cluster can tolerate two node failures (with 5 nodes) instead of one (with 4 nodes) while we stop/start the affected node. The Consul version is 0.9.3 on all server nodes.
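For reference, the failure tolerance above follows from Raft quorum being a simple majority of the server count; a quick shell calculation (just arithmetic, no Consul involved) shows why 5 nodes tolerate two failures but 4 nodes only one:

```shell
# Raft quorum = floor(n/2) + 1; tolerated failures = n - quorum
for n in 3 4 5; do
  q=$(( n / 2 + 1 ))
  echo "servers=$n quorum=$q tolerated_failures=$(( n - q ))"
done
# servers=3 quorum=2 tolerated_failures=1
# servers=4 quorum=3 tolerated_failures=1
# servers=5 quorum=3 tolerated_failures=2
```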
I have created a new Consul server instance, and this is the command used to start the new server (the same command the other nodes use, except for the bootstrap node):
consul agent -server -advertise=172.25.1.6 -retry-join=172.25.1.4
(the exact command is the one below:)
docker run -d -v /etc/localtime:/etc/localtime:ro -v $(pwd)/consul-data:/consul/data --restart=unless-stopped --net=host consul:${version} agent -server -advertise=${advertise} -retry-join=${retry-join} -datacenter=${datacenter} -log-level=${log-level} -data-dir=/consul/data
Cluster IP addresses are:
172.25.1.4 (this is the bootstrap server and the one specified in -retry-join)
172.25.1.5
172.25.2.4
172.25.2.5
And the new node is 172.25.1.6
After creation, the new consul server cannot join the cluster.
Here is part of the log on 172.25.1.6 (the new Consul server):
- Failed to join 172.25.1.4: dial tcp 172.25.1.4:8301: i/o timeout
2020/09/16 18:16:54 [WARN] agent: Join LAN failed: , retrying in 30s
2020/09/16 18:16:56 [ERR] agent: failed to sync remote state: No cluster leader
2020/09/16 18:17:02 [ERR] agent: Coordinate update error: No cluster leader
2020/09/16 18:17:20 [ERR] agent: failed to sync remote state: No cluster leader
2020/09/16 18:17:24 [INFO] agent: (LAN) joining: [172.25.1.4]
2020/09/16 18:17:34 [INFO] agent: (LAN) joined: 0 Err: 1 error(s) occurred:
The new server has IP address 172.25.1.6 and its -retry-join points at 172.25.1.4, and as the logs show, 172.25.1.6 cannot reach 172.25.1.4.
From 172.25.1.6, I can connect to 172.25.1.5, but not to 172.25.1.4.
(connection to 172.25.1.5 works:)
$ telnet 172.25.1.5 8301
Trying 172.25.1.5...
Connected to 172.25.1.5.
Escape character is '^]'.
(connection to 172.25.1.4, does not work:)
$ telnet 172.25.1.4 8301
Trying 172.25.1.4...
(the connection just hangs; both nodes are on the same subnet, and since the new node can connect to .1.5, it should also be able to connect to .1.4…)
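To rule out other unreachable members, this loop checks TCP reachability of the Serf LAN port on every cluster node from the new server (a sketch, assuming bash and coreutils `timeout` are available; the 2-second timeout is arbitrary):

```shell
# Check TCP connectivity to port 8301 (Serf LAN) on each cluster member,
# using bash's built-in /dev/tcp so no extra tools are needed
for ip in 172.25.1.4 172.25.1.5 172.25.2.4 172.25.2.5; do
  if timeout 2 bash -c "exec 3<>/dev/tcp/$ip/8301" 2>/dev/null; then
    echo "$ip:8301 reachable"
  else
    echo "$ip:8301 UNREACHABLE"
  fi
done
```

In my case this would confirm that only 172.25.1.4 is the odd one out.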
These 4 nodes have the same security group, which opens TCP ports 8300-8302, 8400, 8500, and 8600, plus UDP ports 8301-8302 and 8600, to the members of that security group. Since the new node has the same security group as the other nodes in the cluster, I don't think the problem is a blocked port.
I also checked the NACLs for both instances (the new node and the bootstrap node).
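For anyone wanting to double-check the same thing, the effective security-group rules can be dumped with the AWS CLI (a sketch; sg-xxxxxxxx is a placeholder for the actual group id, and the command needs AWS credentials):

```shell
# Show the inbound rules of the cluster's security group
aws ec2 describe-security-groups --group-ids sg-xxxxxxxx \
  --query 'SecurityGroups[0].IpPermissions'
```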
I also ran a test in a staging environment with a similar configuration, and there a new node joins the cluster without any problem (and can telnet to the node specified in -retry-join).
Any idea why the new node can't connect to the node specified in -retry-join and, as a consequence, cannot join the cluster?
(the other server nodes are already connected to 172.25.1.4; for example, 172.25.1.5, which is in the same subnet and security group as 172.25.1.6)
I thought about pointing -retry-join at another address instead of 172.25.1.4; since I can telnet to the other nodes on port 8301, I assume the new node could join through them. What concerns me is that the new node cannot connect to 172.25.1.4, and I don't know whether that could cause a cluster misconfiguration.
I suppose it is safe to stop/start the instance that has scheduled maintenance and run a three-node cluster in the meantime, but I would prefer to have an extra node so that if another node fails, the cluster doesn't lose quorum.
Is it safe to point -retry-join at another node in the cluster instead of 172.25.1.4, even though the new node cannot connect to the bootstrap node?
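In case it helps to discuss it: -retry-join can be passed multiple times, so the new node would not depend on a single address for its initial join. A sketch of how my docker command could look with two join targets (addresses are from our cluster, other flags unchanged; I have not verified this fixes my case):

```shell
docker run -d -v /etc/localtime:/etc/localtime:ro -v $(pwd)/consul-data:/consul/data \
  --restart=unless-stopped --net=host consul:${version} agent -server \
  -advertise=172.25.1.6 -retry-join=172.25.1.5 -retry-join=172.25.2.4 \
  -datacenter=${datacenter} -log-level=${log-level} -data-dir=/consul/data
```

My understanding is that once the agent joins any live member, gossip propagates the full member list, so -retry-join only matters for the initial discovery, but I would appreciate confirmation.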
Thanks a lot!