New consul server cannot join consul cluster

Hello,

New consul server cannot join consul cluster.

Here is my scenario:

We have a 4 node consul server cluster working in production. One of the instances has scheduled maintenance (AWS degraded hardware) and we need to stop/start that instance.
As this is a production cluster, I want to add another node first, so that the cluster can tolerate two node failures (with 5 nodes) instead of one (with 4 nodes) while the affected node is stopped and started. Consul version is 0.9.3 on all server nodes.
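
As a rough illustration of the quorum arithmetic behind this, here is a small sketch. It assumes the usual Raft majority rule (quorum = floor(N/2) + 1); the function names are made up for this example and are not part of Consul.

```shell
# Quorum size for a cluster of N servers (majority rule).
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Number of servers that can fail while a majority is still available.
tolerated_failures() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 3 4 5; do
  echo "servers=$n quorum=$(quorum "$n") tolerated_failures=$(tolerated_failures "$n")"
done
```

Under that rule, 4 servers and 5 servers both need 3 votes for quorum, but only the 5 server cluster can lose two members and still elect a leader.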

I have created a new consul server instance. This is the command used to start it (the other nodes use the same command, except the bootstrap node):

consul agent -server -advertise=172.25.1.6 -retry-join=172.25.1.4

(the exact command is the one below:)

docker run -d -v /etc/localtime:/etc/localtime:ro -v $(pwd)/consul-data:/consul/data --restart=unless-stopped --net=host consul:${version} agent -server -advertise=${advertise} -retry-join=${retry-join} -datacenter=${datacenter} -log-level=${log-level} -data-dir=/consul/data

Cluster IP addresses are:

172.25.1.4 (this is the bootstrap server and the one specified in -retry-join)
172.25.1.5
172.25.2.4
172.25.2.5

And the new node is 172.25.1.6

After creation, the new consul server cannot join the cluster.
Here are part of the logs in 172.25.1.6 (new consul server):

  • Failed to join 172.25.1.4: dial tcp 172.25.1.4:8301: i/o timeout
    2020/09/16 18:16:54 [WARN] agent: Join LAN failed: , retrying in 30s
    2020/09/16 18:16:56 [ERR] agent: failed to sync remote state: No cluster leader
    2020/09/16 18:17:02 [ERR] agent: Coordinate update error: No cluster leader
    2020/09/16 18:17:20 [ERR] agent: failed to sync remote state: No cluster leader
    2020/09/16 18:17:24 [INFO] agent: (LAN) joining: [172.25.1.4]
    2020/09/16 18:17:34 [INFO] agent: (LAN) joined: 0 Err: 1 error(s) occurred:

This new server has IP address 172.25.1.6 and the retry-join is to 172.25.1.4, so as you can see 172.25.1.6 cannot reach 172.25.1.4.

From 172.25.1.6, I can connect to 172.25.1.5, but not to 172.25.1.4

(connection to 172.25.1.5 works:)

$ telnet 172.25.1.5 8301
Trying 172.25.1.5...
Connected to 172.25.1.5.
Escape character is '^]'.

(connection to 172.25.1.4, does not work:)

$ telnet 172.25.1.4 8301
Trying 172.25.1.4...

(they are on the same subnet, it can connect to 1.5, should be able to connect to 1.4…)

These 4 nodes have the same security group and have ports TCP 8400, 8500, 8300-8302, and 8600 open to the members of that security group.
UDP ports: 8301-8302 and 8600. (as the new node has the same security group as the other nodes in the cluster, I don’t think there is a problem with a port being blocked)
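
The telnet checks above can be repeated for several ports at once with a small helper. This is only a sketch: `check_port` and `report_ports` are hypothetical names, it assumes `bash` and the coreutils `timeout` command are installed, and it tests TCP only (the UDP ports 8301-8302 and 8600 are not covered).

```shell
# Return 0 if a TCP connection to host:port succeeds within 3 seconds.
# Uses bash's /dev/tcp pseudo-device, so telnet or nc is not required.
check_port() {
  timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Print one status line per port for the given host.
report_ports() {
  host=$1
  shift
  for port in "$@"; do
    if check_port "$host" "$port"; then
      echo "$host:$port open"
    else
      echo "$host:$port unreachable"
    fi
  done
}
```

For example, running `report_ports 172.25.1.4 8300 8301 8302` from the new node and from an existing server would show whether the problem is specific to one source host.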

I also checked the NACLs for both instances (new node and bootstrap node).

I also made a test in a staging environment with a similar configuration and a new node joins the cluster without a problem (also can telnet to the retry-join node specified)

Any idea why the new node can’t connect to the node specified in the retry-join and in consequence cannot join the cluster?
(other servers nodes are already connected to 172.25.1.4…, for example, 172.25.1.5 same subnet as 172.25.1.6, same security group…)

I thought about trying another address in the retry-join instead of 172.25.1.4. As I can telnet to the other nodes on port 8301, I suppose the new node could join through those server nodes. What concerns me is that the new node cannot connect to 172.25.1.4, and I don’t know whether this could leave the cluster misconfigured.

I suppose it is safe to stop/start the instance that has scheduled maintenance and run a three node cluster in the meantime, but I would prefer to have another node so that if another node fails, the cluster doesn’t lose quorum.

Is it safe to try the retry-join to another node in the cluster instead of the 172.25.1.4, even if the new node cannot connect to the bootstrap node?

Thanks a lot!

Thanks for using Consul, and also for the very detailed message. Like you, I am concerned that, even if using a different server in the retry-join works, you’ll still have issues if you can’t talk to the leader.

Out of curiosity, can you telnet to 172.25.1.4 from some other server? I realize you said they are able to connect in general, but I am thinking testing telnet specifically could rule out any red herrings. It would be nice to make sure this new node is the only one experiencing this very specific behavior.

Given that your staging environment works as expected, and that I can’t share a real time debugging session with you, my instinct has me leaning toward an environmental variance, which, sadly, is tough to spot.

What about adding yet another server in the production environment following your setup steps and seeing if it can connect?

Hello Derek,

Thanks for your reply.

172.25.1.4 is not the leader, the leader right now is leader_addr = 172.25.2.4:8300, 172.25.1.4 is the bootstrap server.

Yes, all the nodes in the cluster (4) can telnet to each other on ports 8300, 8301, and 8302.

The newly created node can connect to the other 3 nodes, but not to 172.25.1.4,
and node 172.25.1.4 can’t connect to 172.25.1.7 (the new node).

[centos@consul-server-172-25-1-4 log]$ telnet 172.25.1.7 8301
Trying 172.25.1.7...
telnet: connect to address 172.25.1.7: No route to host

[centos@consul-server-172-25-1-4 log]$ telnet 172.25.1.5 8301
Trying 172.25.1.5...
Connected to 172.25.1.5.
Escape character is '^]'.

[centos@consul-server-172-25-2-5 ~]$ telnet 172.25.1.7 8301
Trying 172.25.1.7...
Connected to 172.25.1.7.
Escape character is '^]'.
^CConnection closed by foreign host.

[centos@consul-server-172-25-2-5 ~]$ telnet 172.25.2.4 8301
Trying 172.25.2.4...
Connected to 172.25.2.4.
Escape character is '^]'.

All servers in segment 172.25.1.0/24 have the same routing table.

These are the logs from consul on the new consul server:

2020/09/17 03:05:11 [INFO] serf: EventMemberJoin: consul-server-172-25-1-7 172.25.1.7

2020/09/17 03:05:11 [INFO] agent: Started HTTP server on 127.0.0.1:8500
2020/09/17 03:05:11 [INFO] agent: Retry join LAN is supported for: aws azure gce softlayer
2020/09/17 03:05:11 [INFO] agent: Joining LAN cluster...
2020/09/17 03:05:11 [INFO] agent: (LAN) joining: [172.25.1.4]
2020/09/17 03:05:18 [ERR] agent: failed to sync remote state: No cluster leader
2020/09/17 03:05:18 [WARN] raft: no known peers, aborting election

I don’t know why it cannot connect to 172.25.1.4 …

I created another instance as you suggested, but the result is the same: it cannot connect to 172.25.1.4.

2020/09/17 20:47:26 [INFO] agent: Retry join LAN is supported for: aws azure gce softlayer
2020/09/17 20:47:26 [INFO] agent: Joining LAN cluster…
2020/09/17 20:47:26 [INFO] agent: (LAN) joining: [172.25.1.4]
2020/09/17 20:47:33 [ERR] agent: failed to sync remote state: No cluster leader
2020/09/17 20:47:35 [WARN] raft: no known peers, aborting election
2020/09/17 20:47:36 [INFO] agent: (LAN) joined: 0 Err: 1 error(s) occurred:

  • Failed to join 172.25.1.4: dial tcp 172.25.1.4:8301: i/o timeout
    2020/09/17 20:47:36 [WARN] agent: Join LAN failed: , retrying in 30s

I even created the instance on another subnet, but the result is the same: it cannot connect to 172.25.1.4.

172.25.1.4 is the instance with the scheduled maintenance.

From the other nodes I can reach 172.25.1.4.
From the new node I can reach all of the 4 nodes except 172.25.1.4.

Thank you Derek!

Hello Derek,

I was wondering what will happen if I stop the bootstrap node instance in the cluster.
Will it join again on start?

I think the cluster will be up as there are 3 nodes up.

But I am not sure what will happen when the bootstrap node is up again.

I have stopped server nodes before, and when the instance was started again, it rejoined the cluster.

Thank you!

You can specify retry-join multiple times, and the joining agent will keep trying until it succeeds with one of the addresses. See this link for an example.
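
As a sketch of what that could look like for the addresses mentioned in this thread: the helper below only builds and prints the command (a dry run), it does not start an agent. `retry_join_flags` is a made-up helper name, and the choice of which servers to list is an assumption.

```shell
# Build one -retry-join flag per server address (POSIX sh, no arrays).
retry_join_flags() {
  flags=""
  for addr in "$@"; do
    flags="$flags -retry-join=$addr"
  done
  # Drop the leading space before printing.
  printf '%s\n' "${flags# }"
}

# Dry run: print the agent command instead of executing it.
echo "consul agent -server -advertise=172.25.1.6 $(retry_join_flags 172.25.1.5 172.25.2.4 172.25.2.5)"
```

Listing several servers means the join does not depend on any single address being reachable from the new node.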

Manual bootstrapping is currently discouraged. Have you read this document? It looks like automatic bootstrapping is a feature of your version of Consul. Do I understand correctly that you are trying to manually bootstrap, or are you just concerned about what will happen if you remove your current leader? If a leader leaves, this should force a new election. When you bring the former leader back online, you will have to tell it which server(s) to join, and it will join as a follower.
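
One way to confirm that a leader exists after such a change is the `/v1/status/leader` HTTP endpoint, which returns the leader address as a quoted JSON string (an empty string when there is no leader). The following is a minimal sketch: `parse_leader` and `current_leader` are hypothetical helper names, it assumes `curl` is installed and that the agent's HTTP API listens on 127.0.0.1:8500, and the sample body is illustrative rather than captured output.

```shell
# Strip the JSON quotes from a /v1/status/leader response body.
# An empty result means the agent sees no leader.
parse_leader() {
  printf '%s' "$1" | tr -d '"\n'
}

# Query the local agent (assumes the HTTP API on 127.0.0.1:8500).
# Defined here for reference; it is not called by this sketch.
current_leader() {
  parse_leader "$(curl -s http://127.0.0.1:8500/v1/status/leader)"
}

# Demonstrate the parsing with a sample response body.
leader=$(parse_leader '"172.25.2.4:8300"')
if [ -n "$leader" ]; then
  echo "leader: $leader"
else
  echo "no leader"
fi
```

Calling `current_leader` on each server before and after the stop/start would show whether they all agree on the same leader.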

Hi Derek,

Thank you for replying and for the links.

My concern is what will happen to the consul server node that we need to stop/start.

This node is the one that we used to initially automatically bootstrap the consul server. Right now this node is not the leader. We are running consul in docker containers, and the configuration to run this bootstrap consul server is:

consul agent -server -advertise=172.25.1.4 -datacenter=dc1 -bootstrap-expect=4 -log-level=info -data-dir=/consul/data

This is the only server that starts with the bootstrap-expect option. And does not have a retry-join option.

nodes in our cluster:

172.25.1.4 (node used to bootstrap the consul cluster) (will have to stop / start this instance)
172.25.1.5
172.25.2.4 (current leader)
172.25.2.5

As this node does not have a retry-join, I don’t know if it will rejoin the cluster. It also has the bootstrap-expect option “baked” in: cloud-init starts the consul container with this option (I was thinking of specifying only retry-join).

Will serf on the other consul server nodes attempt to reconnect to the bootstrap server node?
(This is what I see when a consul server node loses its connection to the cluster, for example if the instance is stopped.)

If 172.25.1.4 does not join the cluster, I am thinking of doing one of these:

  • connect to the docker container in 172.25.1.4 and do:

consul join 172.25.1.5 172.25.2.4 172.25.2.5

or

  • stop consul containers in nodes:
    172.25.1.5
    172.25.2.4
    172.25.2.5

Once consul in 172.25.1.4 is ready, start the consul containers (172.25.1.5, 172.25.2.4, 172.25.2.5).

I would like to avoid this option, as I don’t want the cluster to end up broken (inconsistent or with data corruption).

Thanks a lot Derek!