Docker pause is causing Consul remote site failure

Hi Team,

We are using Consul version 1.14.1. Ours is a 3-node cluster, and we have two sites (DC1 and DC2).

We hit an issue while testing our Consul WAN connectivity scenarios. After several validation steps we found one easily reproducible problem; the steps to reproduce it are below.

1. Ungracefully stop 2 Consul containers on DC1 (using the docker pause command / sending a SIGSTOP signal to the Consul process).
2. It took some time for DC1 to get a leader, but one was elected after a few seconds. The DC1 cluster is fine with one leader node.
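As an aside, the key property of both mechanisms in step 1 is that the process is frozen but not terminated, so any TCP sockets it holds stay open and remote peers see hangs rather than connection refusals. A minimal sketch of that behaviour against a plain `sleep` process (not Consul itself; the comparison of `docker pause` to SIGSTOP via the cgroup freezer is my own framing, not something from this thread):

```shell
# Illustration: SIGSTOP freezes a process without terminating it, which is
# roughly what `docker pause` does via the cgroup freezer. A frozen process
# keeps its listening sockets open, so peers see timeouts, not refusals.
sleep 300 &
pid=$!
kill -STOP "$pid"                  # freeze, like `docker pause`
sleep 1                            # give the kernel a moment to mark it stopped
ps -o stat= -p "$pid" | cut -c1    # first letter 'T' means stopped
kill -CONT "$pid"                  # thaw, like `docker unpause`
kill "$pid"                        # clean up
```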

[root@*** bin]# consul operator raft list-peers
Node        ID    Address    State   Voter  RaftProtocol
ConsulPri3  ****  ****:8300  leader  true   3

*** The DC2 cluster is running fine with all 3 nodes ***

During this scenario, any request from the geo site to the primary fails; it times out while reaching out to the paused DC1 Consul container.

Request from DC2 to DC1:
curl http://127.0.0.1:8500/v1/operator/autopilot/health?dc=1 --max-time 20
curl: (28) Operation timed out after 20001 milliseconds with 0 out of -1 bytes received

The same query works fine locally on DC1. During this scenario, requests fail only in the DC2 to DC1 direction.

Could you please let us know whether this issue can be fixed, or whether a resolution can be provided from your side?

I’m noticing something very unusual in the early part of your message: you say that in your “DC1” you have a 3-node Consul cluster, that you stopped 2 of the nodes, and that after some time the remaining single node became leader.

That is not expected cluster behaviour; it indicates a malfunction.

In a 3-node cluster, you need 2 nodes to be up and able to communicate in order to elect a leader. If that is being violated, you have more fundamental issues in this Consul deployment that need to be addressed.
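The quorum arithmetic behind this is simple enough to sketch; Raft needs floor(N/2)+1 votes to elect a leader:

```shell
# Raft quorum arithmetic: electing a leader needs floor(N/2)+1 votes,
# so a 3-server cluster cannot elect a leader with only 1 server left.
for n in 1 3 5; do
  echo "servers=$n quorum=$(( n / 2 + 1 ))"
done
```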

According to your raft list-peers output, it seems your ConsulPri3 now believes it is the only member of a cluster of size 1.

Hi,

We are validating after the leader is elected in DC1. When ConsulPri3 is the only member of the DC1 cluster, we expect the cluster to be fine and ConsulPri3 to respond to all requests. As expected, it is responding to all local (DC1) requests, and requests from DC1 to DC2 go through fine. But from DC2 to DC1, all requests are timing out.

Please correct me if I am wrong somewhere.

There is a correction to the steps above:

1. Ungracefully stop 2 Consul containers on DC1 (using the docker pause command / sending a SIGSTOP signal to the Consul process).
2. After a few minutes, we forcefully make the remaining node the leader. The DC1 cluster is then fine with one leader node.

Ah, well that makes a lot more sense.

What exactly are you doing to forcefully remove the other nodes from the DC1 cluster?

As for the communication between DCs, this is supposed to be managed automatically by Consul’s memberlist/gossip protocol, called “Serf”.

It might be informative to check the state of this, in both DCs, using the command:

consul members -wan -detailed

You should also make sure the log level is fairly verbose (debug, or even trace) and look at what the servers are logging about their communication with the other DC.
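For reference, verbosity can be raised in the agent configuration (or with the agent's `-log-level` flag); a minimal fragment, assuming file-based HCL config, would be:

```hcl
# Raise server log verbosity so serf/WAN gossip activity is visible.
log_level = "debug"   # or "trace" for maximum detail
```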

Below is the info I collected after reproducing the issue.

DC1 WAN details:

[***@ DC1 ]# consul members -wan -detailed
Node Address Status Tags
DC1server1 (DC1server1IP) :8302 alive acls=1,ap=default,bootstrap=1,build=1.12.0:09a8cdb4,dc=1,ft_fs=1,ft_si=1,id=52019630-880a-6f66-827f-c3afabeb5b13,port=8300,raft_vsn=3,role=consul,segment=<all>,use_tls=1,vsn=2,vsn_max=3,vsn_min=2
DC2server1 (DC2server1IP):8302 alive acls=1,ap=default,bootstrap=1,build=1.12.0:09a8cdb4,dc=2,ft_fs=1,ft_si=1,id=b7d8ee37-c0c4-098e-7531-d23da4a6b704,port=8300,raft_vsn=3,role=consul,segment=<all>,use_tls=1,vsn=2,vsn_max=3,vsn_min=2
DC2server2 (DC2server2IP) :8302 alive acls=1,ap=default,build=1.12.0:09a8cdb4,dc=2,ft_fs=1,ft_si=1,id=f44ed1f2-2347-868d-599d-240e3a26f6f8,port=8300,raft_vsn=3,role=consul,segment=<all>,use_tls=1,vsn=2,vsn_max=3,vsn_min=2
DC2server3 (DC2server3IP):8302 alive acls=1,ap=default,build=1.12.0:09a8cdb4,dc=2,ft_fs=1,ft_si=1,id=6666389a-3a9e-c4fc-3c9f-207f69182b04,port=8300,raft_vsn=3,role=consul,segment=<all>,use_tls=1,vsn=2,vsn_max=3,vsn_min=2

DC2 WAN details (it seems the stopped DC1 nodes have not been cleaned up from DC2’s view):

[ **@ DC2 ]# consul members -wan -detailed
Node Address Status Tags
DC1server1 (DC1server1IP) :8302 alive acls=1,ap=default,bootstrap=1,build=1.12.0:09a8cdb4,dc=1,ft_fs=1,ft_si=1,id=52019630-880a-6f66-827f-c3afabeb5b13,port=8300,raft_vsn=3,role=consul,segment=<all>,use_tls=1,vsn=2,vsn_max=3,vsn_min=2
DC1server2 (DC1server2IP) :8302 failed acls=1,ap=default,build=1.12.0:09a8cdb4,dc=1,ft_fs=1,ft_si=1,id=33033aae-ff2a-cd67-0e47-e50d3704c6bd,port=8300,raft_vsn=3,role=consul,segment=<all>,use_tls=1,vsn=2,vsn_max=3,vsn_min=2
DC1server3 (DC1server3IP):8302 failed acls=1,ap=default,build=1.12.0:09a8cdb4,dc=1,ft_fs=1,ft_si=1,id=6cf087c7-af10-b1e0-0d3b-f825222d1d2d,port=8300,raft_vsn=3,role=consul,segment=<all>,use_tls=1,vsn=2,vsn_max=3,vsn_min=2
DC2server1 (DC2server1IP) :8302 alive acls=1,ap=default,bootstrap=1,build=1.12.0:09a8cdb4,dc=2,ft_fs=1,ft_si=1,id=b7d8ee37-c0c4-098e-7531-d23da4a6b704,port=8300,raft_vsn=3,role=consul,segment=<all>,use_tls=1,vsn=2,vsn_max=3,vsn_min=2
DC2server2 (DC2server2IP) :8302 alive acls=1,ap=default,build=1.12.0:09a8cdb4,dc=2,ft_fs=1,ft_si=1,id=f44ed1f2-2347-868d-599d-240e3a26f6f8,port=8300,raft_vsn=3,role=consul,segment=<all>,use_tls=1,vsn=2,vsn_max=3,vsn_min=2
DC2server3 (DC2server3IP) :8302 alive acls=1,ap=default,build=1.12.0:09a8cdb4,dc=2,ft_fs=1,ft_si=1,id=6666389a-3a9e-c4fc-3c9f-207f69182b04,port=8300,raft_vsn=3,role=consul,segment=<all>,use_tls=1,vsn=2,vsn_max=3,vsn_min=2

Logs from DC2:

2023-05-18T15:47:50.731+0530 [INFO] agent.server.serf.wan: serf: attempting reconnect to DC1server2 DC1server2IP :8302
2023-05-18T15:48:50.734+0530 [INFO] agent.server.serf.wan: serf: attempting reconnect to DC1server2 DC1server2IP :8302
2023-05-18T15:50:50.738+0530 [INFO] agent.server.serf.wan: serf: attempting reconnect to DC1server3 DC1server3IP :8302
2023-05-18T15:51:50.741+0530 [INFO] agent.server.serf.wan: serf: attempting reconnect to DC1server2 DC1server2IP :8302
2023-05-18T15:52:50.744+0530 [INFO] agent.server.serf.wan: serf: attempting reconnect to DC1server3 DC1server3IP :8302
2023-05-18T15:54:20.748+0530 [INFO] agent.server.serf.wan: serf: attempting reconnect to DC1server3 DC1server3IP :8302
2023-05-18T15:55:20.750+0530 [INFO] agent.server.serf.wan: serf: attempting reconnect to DC1server3 DC1server3IP :8302
2023-05-18T15:56:50.752+0530 [INFO] agent.server.serf.wan: serf: attempting reconnect to DC1server3 DC1server3IP :8302
2023-05-18T15:57:50.758+0530 [INFO] agent.server.serf.wan: serf: attempting reconnect to DC1server2 DC1server2IP :8302
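Tallying those reconnect attempts per node shows serf on DC2 is still probing both stopped servers without ever succeeding; for example, over the excerpt above (lines trimmed to the node name for brevity):

```shell
# Tally serf WAN reconnect attempts per target node from the DC2 log
# excerpt above (one trimmed line per log entry, in the same order).
logs='attempting reconnect to DC1server2
attempting reconnect to DC1server2
attempting reconnect to DC1server3
attempting reconnect to DC1server2
attempting reconnect to DC1server3
attempting reconnect to DC1server3
attempting reconnect to DC1server3
attempting reconnect to DC1server3
attempting reconnect to DC1server2'
echo "$logs" | awk '{ attempts[$NF]++ } END { for (node in attempts) print node, attempts[node] }' | sort
# DC1server2 4
# DC1server3 5
```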

There are no logs on DC1 related to the WAN join.

At this point, I think you have enough information pointing to Consul not behaving quite correctly that it would be worth opening a bug report on GitHub.

Specifically, your latest message shows that your stopped servers have been removed from the serf membership set in DC1, yet remain in it in DC2.

I would have hoped that, since they are showing status failed there, Consul would know not to try to use them; but by demonstrating that cross-DC requests time out, you have shown that this isn’t happening.

Thank you for your response; I have raised a bug on GitHub. Let’s follow up there:

https://github.com/hashicorp/consul/issues/17403