Hello,
For a few days now we have been seeing a strange issue on our non-prod Consul cluster.
All of our server nodes appear twice, but I don't fully understand why.
If we list the members we can see:
-bash-4.2$ $HOME/consul/bin/consul members -http-addr=https://127.0.0.1:8543
csldevparmut01 192.160.76.22:8301 alive server 1.3.1 3 eu-fr-paris
csldevparmut01.eu-fr-paris 192.160.76.22:8302 alive server 1.3.1 3 eu-fr-paris
csldevparmut02 192.160.76.26:8301 alive server 1.3.1 3 eu-fr-paris
csldevparmut03 192.162.192.87:8301 alive server 1.3.1 3 eu-fr-paris
csldevparmut03.eu-fr-paris 192.162.192.87:8302 alive server 1.3.1 3 eu-fr-paris
csldevparmut04 192.162.192.88:8301 alive server 1.3.1 3 eu-fr-paris
csldevparmut04.eu-fr-paris 192.162.192.88:8302 alive server 1.3.1 3 eu-fr-paris
csldevparmut05 184.7.121.73:8301 alive server 1.3.1 3 eu-fr-paris
csldevparmut05.eu-fr-paris 184.7.121.73:8302 alive server 1.3.1 3 eu-fr-paris
When getting the detailed member list, we can see that the two entries are essentially identical except for the wan_join_port tag:
csldevparmut05 184.7.121.73:8301 failed build=1.3.1:f2b13f30,dc=eu-fr-paris,expect=5,id=7503ec82-d145-c0be-71dd-2eb133d982d2,port=8300,raft_vsn=3,role=consul,segment=,use_tls=1,vsn=3,vsn_max=3,vsn_min=2,wan_join_port=8302
csldevparmut05.eu-fr-paris 184.7.121.73:8302 failed build=1.3.1:f2b13f30,dc=eu-fr-paris,expect=5,id=7503ec82-d145-c0be-71dd-2eb133d982d2,port=8300,raft_vsn=3,role=consul,segment=,use_tls=1,vsn=3,vsn_max=3,vsn_min=2
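(For reference, the detailed view above comes from roughly the following command, against the same HTTPS endpoint as before:)
-bash-4.2$ $HOME/consul/bin/consul members -detailed -http-addr=https://127.0.0.1:8543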
At first I simply tried restarting one of the servers (not the leader), but it didn't change anything. I then stopped it again and ran a force-leave, with the same behavior and no improvement.
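(Roughly the sequence I used, taking csldevparmut05 as the example; the stop/start steps are specific to our setup so I've only sketched them as comments, and force-leave is issued against one of the still-running servers:)
# stop the Consul agent on the affected server, then, from another server:
-bash-4.2$ $HOME/consul/bin/consul force-leave csldevparmut05 -http-addr=https://127.0.0.1:8543
-bash-4.2$ $HOME/consul/bin/consul force-leave csldevparmut05.eu-fr-paris -http-addr=https://127.0.0.1:8543
# then start the Consul agent again on the affected server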
Right after that, both entries appeared as left, and once I started the Consul process again both were alive.
Next I tried stopping Consul on this server again, running force-leave, and changing the WAN serf port from 8302 to 8322 in the config.json file.
At first I believed that fixed it (only the "good" node name, csldevparmut05, appeared as alive), but after a few seconds both csldevparmut05 and csldevparmut05.eu-fr-paris showed up as alive again, the latter with the new WAN port (8322).
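(For context, the "wan_port" I'm talking about is the ports.serf_wan setting in config.json; the path and exact layout below are from memory, so treat them as approximate:)
-bash-4.2$ grep -A 3 '"ports"' $HOME/consul/config.json
  "ports": {
    "serf_lan": 8301,
    "serf_wan": 8322
  },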
I finally stopped Consul again, ran force-leave, disabled the WAN serf port on this server, then started it again. This time it was better:
csldevparmut05 184.7.121.73:8301 alive server 1.3.1 3 eu-fr-paris
csldevparmut05.eu-fr-paris 184.7.121.73:8322 left server 1.3.1 3 eu-fr-paris
The bad entry stayed in the "left" state.
This morning (about 2 days later) the bad entry finally seems to be gone:
csldevparmut05 184.7.121.73:8301 alive build=1.3.1:f2b13f30,dc=eu-fr-paris,expect=5,id=7503ec82-d145-c0be-71dd-2eb133d982d2,port=8300,raft_vsn=3,role=consul,segment=,use_tls=1,vsn=3,vsn_max=3,vsn_min=2
I suspect that if I re-enable the WAN port on this node, the bad entry will probably be re-created.
I don't know whether I should stop all of our server nodes and see what happens, or maybe stop all the servers, empty the "node-id" file, and then restart them?
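(If it matters, this is roughly how I'm comparing the persisted node-id with what the cluster reports, taking csldevparmut05 as the example; the data-dir path below is just an illustration of ours:)
-bash-4.2$ cat $HOME/consul/data/node-id
7503ec82-d145-c0be-71dd-2eb133d982d2
-bash-4.2$ $HOME/consul/bin/consul members -detailed -http-addr=https://127.0.0.1:8543 | grep csldevparmut05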
Has anyone already had this kind of issue? We migrated from 0.9.3 to 1.3.1 a few months ago without any problem on the server side until we saw this. The issue seems to have existed since last week (around 2019/08/13).
Another strange behavior: if I run the consul members command on a given server, that server itself does not appear twice.
For example, right now from csldevparmut02:
csldevparmut01 192.160.76.22:8301 alive server 1.3.1 3 eu-fr-paris
csldevparmut01.eu-fr-paris 192.160.76.22:8302 alive server 1.3.1 3 eu-fr-paris
csldevparmut02 192.160.76.26:8301 alive server 1.3.1 3 eu-fr-paris
csldevparmut03 192.162.192.87:8301 alive server 1.3.1 3 eu-fr-paris
csldevparmut03.eu-fr-paris 192.162.192.87:8302 alive server 1.3.1 3 eu-fr-paris
csldevparmut04 192.162.192.88:8301 alive server 1.3.1 3 eu-fr-paris
csldevparmut04.eu-fr-paris 192.162.192.88:8302 alive server 1.3.1 3 eu-fr-paris
csldevparmut05 184.7.121.73:8301 alive server 1.3.1 3 eu-fr-paris
And from another server, csldevparmut03:
csldevparmut01 192.160.76.22:8301 alive server 1.3.1 3 eu-fr-paris
csldevparmut01.eu-fr-paris 192.160.76.22:8302 alive server 1.3.1 3 eu-fr-paris
csldevparmut02 192.160.76.26:8301 alive server 1.3.1 3 eu-fr-paris
csldevparmut02.eu-fr-paris 192.160.76.26:8302 alive server 1.3.1 3 eu-fr-paris
csldevparmut03 192.162.192.87:8301 alive server 1.3.1 3 eu-fr-paris
csldevparmut04 192.162.192.88:8301 alive server 1.3.1 3 eu-fr-paris
csldevparmut04.eu-fr-paris 192.162.192.88:8302 alive server 1.3.1 3 eu-fr-paris
csldevparmut05 184.7.121.73:8301 alive server 1.3.1 3 eu-fr-paris
This doesn't seem to impact our end users, but the logs are full of:
2019/08/21 00:02:12 [ERR] memberlist: Failed push/pull merge: Member 'csldevparmut03.eu-fr-paris' has conflicting node ID '63ee5bf0-d59f-880c-09a2-bf1a964f045b' with this agent's ID from=192.163.223.201:42842
The first occurrences from csldevparmut03:
2019/08/13 10:14:08 [WARN] memberlist: ignoring alive message for 'csldevparmut03.eu-fr-paris': Member 'csldevparmut03.eu-fr-paris' has conflicting node ID '63ee5bf0-d59f-880c-09a2-bf1a964f045b' with this agent's ID
2019/08/13 10:14:08 [INFO] serf: EventMemberJoin: csldevparmut02.eu-fr-paris 192.160.76.26
2019/08/13 10:14:08 [INFO] consul: Adding LAN server csldevparmut02.eu-fr-paris (Addr: tcp/192.160.76.26:8300) (DC: eu-fr-paris)
2019/08/13 10:14:08 [WARN] memberlist: ignoring alive message for 'csldevparmut03.eu-fr-paris': Member 'csldevparmut03.eu-fr-paris' has conflicting node ID '63ee5bf0-d59f-880c-09a2-bf1a964f045b' with this agent's ID
2019/08/13 10:14:08 [INFO] serf: EventMemberJoin: csldevparmut01.eu-fr-paris 192.160.76.22
2019/08/13 10:14:08 [INFO] serf: EventMemberJoin: csldevparmut05.eu-fr-paris 184.7.121.73
2019/08/13 10:14:08 [INFO] consul: Adding LAN server csldevparmut01.eu-fr-paris (Addr: tcp/192.160.76.22:8300) (DC: eu-fr-paris)
2019/08/13 10:14:08 [INFO] consul: Adding LAN server csldevparmut05.eu-fr-paris (Addr: tcp/184.7.121.73:8300) (DC: eu-fr-paris)
2019/08/13 10:14:08 [WARN] memberlist: ignoring alive message for xxxxxxx: Member xxxxxxx is not a server
2019/08/13 10:14:08 [WARN] memberlist: ignoring alive message for 'csldevparmut03.eu-fr-paris': Member 'csldevparmut03.eu-fr-paris' has conflicting node ID '63ee5bf0-d59f-880c-09a2-bf1a964f045b' with this agent's ID
The same on another server (csldevparmut04):
2019/08/13 10:12:53 [INFO] memberlist: Marking xxxxxx as failed, suspect timeout reached (2 peer confirmations)
2019/08/13 10:12:54 [ERR] memberlist: failed to receive: Encryption is configured but remote state is not encrypted from=192.163.218.127:33830
2019/08/13 10:13:03 [ERR] memberlist: failed to receive: Encryption is configured but remote state is not encrypted from=192.163.201.200:38184
2019/08/13 10:13:11 [INFO] memberlist: Marking xxxxxxxx as failed, suspect timeout reached (2 peer confirmations)
….
2019/08/13 10:13:17 [ERR] memberlist: failed to receive: Encryption is configured but remote state is not encrypted from=192.163.201.199:38720
2019/08/13 10:13:18 [ERR] memberlist: failed to receive: Encryption is configured but remote state is not encrypted from=192.167.1.16:53988
2019/08/13 10:13:24 [ERR] memberlist: failed to receive: Encryption is configured but remote state is not encrypted from=192.163.218.127:33972
2019/08/13 10:13:33 [ERR] memberlist: failed to receive: Encryption is configured but remote state is not encrypted from=192.163.201.200:38284
2019/08/13 10:13:47 [ERR] memberlist: failed to receive: Encryption is configured but remote state is not encrypted from=192.163.201.199:38780
2019/08/13 10:13:48 [ERR] memberlist: failed to receive: Encryption is configured but remote state is not encrypted from=192.167.1.16:54014
2019/08/13 10:13:54 [ERR] memberlist: failed to receive: Encryption is configured but remote state is not encrypted from=192.163.218.127:34114
2019/08/13 10:14:03 [ERR] memberlist: failed to receive: Encryption is configured but remote state is not encrypted from=192.163.201.200:38376
2019/08/13 10:14:07 [ERR] memberlist: Failed push/pull merge: Member 'xxxxxxx' is not a server from=192.163.165.233:59484
2019/08/13 10:14:08 [WARN] memberlist: ignoring alive message for xxxxxxx: Member xxxxxxx is not a server
2019/08/13 10:14:08 [WARN] memberlist: ignoring alive message for xxxxxxx: Member xxxxxxx is not a server
2019/08/13 10:14:08 [WARN] memberlist: ignoring alive message for 'csldevparmut04.eu-fr-paris': Member 'csldevparmut04.eu-fr-paris' has conflicting node ID 'a7b79e73-666f-f2b3-feb7-77cc4dbacc28' with this agent's ID
……
2019/08/13 10:14:08 [WARN] memberlist: ignoring alive message for 'csldevparmut04.eu-fr-paris': Member 'csldevparmut04.eu-fr-paris' has conflicting node ID 'a7b79e73-666f-f2b3-feb7-77cc4dbacc28' with this agent's ID
2019/08/13 10:14:08 [WARN] memberlist: ignoring alive message for 'csldevparmut04.eu-fr-paris': Member 'csldevparmut04.eu-fr-paris' has conflicting node ID 'a7b79e73-666f-f2b3-feb7-77cc4dbacc28' with this agent's ID
2019/08/13 10:14:08 [INFO] serf: EventMemberJoin: csldevparmut05.eu-fr-paris 184.7.121.73
2019/08/13 10:14:08 [INFO] serf: EventMemberJoin: csldevparmut03.eu-fr-paris 192.162.192.87
2019/08/13 10:14:08 [INFO] serf: EventMemberJoin: csldevparmut02.eu-fr-paris 192.160.76.26
2019/08/13 10:14:08 [INFO] serf: EventMemberJoin: csldevparmut01.eu-fr-paris 192.160.76.22
2019/08/13 10:14:08 [INFO] consul: Adding LAN server csldevparmut05.eu-fr-paris (Addr: tcp/184.7.121.73:8300) (DC: eu-fr-paris)
2019/08/13 10:14:08 [INFO] consul: Adding LAN server csldevparmut03.eu-fr-paris (Addr: tcp/192.162.192.87:8300) (DC: eu-fr-paris)
2019/08/13 10:14:08 [INFO] consul: Adding LAN server csldevparmut02.eu-fr-paris (Addr: tcp/192.160.76.26:8300) (DC: eu-fr-paris)
2019/08/13 10:14:08 [INFO] consul: Adding LAN server csldevparmut01.eu-fr-paris (Addr: tcp/192.160.76.22:8300) (DC: eu-fr-paris)
2019/08/13 10:14:08 [WARN] memberlist: ignoring alive message for 'csldevparmut04.eu-fr-paris': Member 'csldevparmut04.eu-fr-paris' has conflicting node ID 'a7b79e73-666f-f2b3-feb7-77cc4dbacc28' with this agent's ID
2019/08/13 10:14:08 [WARN] memberlist: ignoring alive message for xxxxxxx: Member xxxxxxx is not a server
2019/08/13 10:14:08 [WARN] memberlist: ignoring alive message for 'csldevparmut04.eu-fr-paris': Member 'csldevparmut04.eu-fr-paris' has conflicting node ID 'a7b79e73-666f-f2b3-feb7-77cc4dbacc28' with this agent's ID
…….
2019/08/13 10:14:08 [WARN] memberlist: ignoring alive message for 'csldevparmut04.eu-fr-paris': Member 'csldevparmut04.eu-fr-paris' has conflicting node ID 'a7b79e73-666f-f2b3-feb7-77cc4dbacc28' with this agent's ID
2019/08/13 10:14:08 [WARN] memberlist: ignoring alive message for 'csldevparmut04.eu-fr-paris': Member 'csldevparmut04.eu-fr-paris' has conflicting node ID 'a7b79e73-666f-f2b3-feb7-77cc4dbacc28' with this agent's ID
……
If I look for the first occurrence of the *.eu-fr-paris names, it seems to have been happening since 2019/08/08 (logs from csldevparmut04):
2019/08/08 01:08:43 [INFO] memberlist: Suspect csldevparmut05.eu-fr-paris has failed, no acks received
2019/08/08 01:08:43 [INFO] serf: EventMemberJoin: xxxxxxx 172.17.42.1
2019/08/08 01:08:47 [INFO] memberlist: Suspect xxxxxxx has failed, no acks received
2019/08/08 01:08:47 [INFO] serf: EventMemberJoin: astuattt028 172.17.42.1
2019/08/08 01:08:56 [INFO] serf: EventMemberFailed: csldevparmut05.eu-fr-paris 184.7.121.73
2019/08/08 01:08:56 [INFO] consul: Handled member-failed event for server "csldevparmut05.eu-fr-paris" in area "wan"
2019/08/08 01:08:58 [INFO] serf: EventMemberFailed: xxxxxxx 172.17.42.1
2019/08/08 01:08:59 [INFO] serf: EventMemberJoin: xxxxxxx 172.17.42.1
2019/08/08 01:09:03 [INFO] memberlist: Suspect csldevparmut05.eu-fr-paris has failed, no acks received
2019/08/08 01:09:03 [ERR] memberlist: failed to receive: Encryption is configured but remote state is not encrypted from=192.163.218.127:51674
2019/08/08 01:09:04 [INFO] memberlist: Marking xxxxxxx as failed, suspect timeout reached (2 peer confirmations)
2019/08/08 01:09:04 [INFO] serf: EventMemberFailed: xxxxxxx 172.17.42.1
2019/08/08 01:09:05 [INFO] serf: EventMemberJoin: xxxxxxx 172.17.42.1
2019/08/08 01:09:14 [ERR] memberlist: failed to receive: Encryption is configured but remote state is not encrypted from=192.163.201.200:55648
2019/08/08 01:09:16 [ERR] memberlist: failed to receive: Encryption is configured but remote state is not encrypted from=192.167.1.16:49408
2019/08/08 01:09:19 [INFO] serf: attempting reconnect to csldevparmut05.eu-fr-paris 184.7.121.73:8302
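(For the record, I found these first occurrences with a simple grep over the agent logs; the log path below is just a placeholder for ours:)
-bash-4.2$ grep -m 1 '\.eu-fr-paris' /path/to/consul-agent.log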
If I go onto server csldevparmut05, I can see a Consul restart in its logs during this timeslot:
2019/08/08 01:08:10 [INFO] serf: EventMemberFailed: xxxxxxx 172.17.42.1
2019/08/08 01:08:10 [INFO] serf: EventMemberJoin: xxxxxxx 172.17.42.1
08/08/19 01:08:12 Stopping Consul Agent
2019/08/08 01:08:14 [INFO] agent: Caught signal: terminated
2019/08/08 01:08:14 [INFO] agent: Graceful shutdown disabled. Exiting
2019/08/08 01:08:14 [INFO] agent: Requesting shutdown
2019/08/08 01:08:14 [INFO] consul: shutting down server
2019/08/08 01:08:14 [WARN] serf: Shutdown without a Leave
2019/08/08 01:08:14 [WARN] serf: Shutdown without a Leave
2019/08/08 01:08:14 [INFO] manager: shutting down
2019/08/08 01:08:14 [ERR] raft-net: Failed to decode incoming command: transport shutdown
2019/08/08 01:08:14 [INFO] agent: consul server down
2019/08/08 01:08:14 [INFO] agent: shutdown complete
2019/08/08 01:08:14 [INFO] agent: Stopping DNS server 127.0.0.1:8600 (tcp)
2019/08/08 01:08:14 [INFO] agent: Stopping DNS server 127.0.0.1:8600 (udp)
2019/08/08 01:08:14 [INFO] agent: Stopping HTTPS server 127.0.0.1:8543 (tcp)
2019/08/08 01:08:14 [INFO] agent: Waiting for endpoints to shut down
2019/08/08 01:08:14 [INFO] agent: Endpoints down
2019/08/08 01:08:14 [INFO] agent: Exit code: 1
WARNING: LAN keyring exists but -encrypt given, using keyring
WARNING: WAN keyring exists but -encrypt given, using keyring
bootstrap_expect > 0: expecting 5 servers
==> Starting Consul agent...
==> Consul agent running!
Version: 'v1.3.1'
Node ID: '7503ec82-d145-c0be-71dd-2eb133d982d2'
Node name: 'csldevparmut05'
Datacenter: 'eu-fr-paris' (Segment: '')
Server: true (Bootstrap: false)
Client Addr: [127.0.0.1] (HTTP: -1, HTTPS: 8543, gRPC: -1, DNS: 8600)
Cluster Addr: 184.7.121.73 (LAN: 8301, WAN: 8302)
Encrypt: Gossip: true, TLS-Outgoing: true, TLS-Incoming: false
==> Log data will now stream in as it occurs:
2019/08/08 01:09:51 [INFO] raft: Restored from snapshot 1989-522345770-1565218964562
2019/08/08 01:09:53 [INFO] raft: Initial configuration (index=436606669): [{Suffrage:Voter ID:976bf3f9-e9de-2db0-6d78-c8b7fc4245ef Address:192.160.76.22:8300} {Suffrage:Voter ID:9907f5f3-8bc8-830b-ad5a-54210a481635 Address:192.160.76.26:8300} {Suffrage:Voter ID:7503ec82-d145-c0be-71dd-2eb133d982d2 Address:184.7.121.73:8300} {Suffrage:Voter ID:63ee5bf0-d59f-880c-09a2-bf1a964f045b Address:192.162.192.87:8300} {Suffrage:Voter ID:a7b79e73-666f-f2b3-feb7-77cc4dbacc28 Address:192.162.192.88:8300}]
2019/08/08 01:09:53 [INFO] raft: Node at 184.7.121.73:8300 [Follower] entering Follower state (Leader: "")
2019/08/08 01:09:53 [INFO] serf: EventMemberJoin: csldevparmut05.eu-fr-paris 184.7.121.73
2019/08/08 01:09:53 [INFO] serf: Attempting re-join to previously known node: csldevparmut04.eu-fr-paris: 192.162.192.88:8302
2019/08/08 01:09:53 [INFO] serf: EventMemberJoin: csldevparmut04.eu-fr-paris 192.162.192.88
2019/08/08 01:09:53 [INFO] serf: EventMemberJoin: csldevparmut03.eu-fr-paris 192.162.192.87
2019/08/08 01:09:53 [WARN] memberlist: Refuting a suspect message (from: csldevparmut05.eu-fr-paris)
2019/08/08 01:09:53 [INFO] serf: EventMemberJoin: csldevparmut02.eu-fr-paris 192.160.76.26
2019/08/08 01:09:53 [INFO] serf: EventMemberJoin: csldevparmut01.eu-fr-paris 192.160.76.22
2019/08/08 01:09:53 [INFO] serf: Re-joined to previously known node: csldevparmut04.eu-fr-paris: 192.162.192.88:8302
2019/08/08 01:09:53 [INFO] serf: EventMemberJoin: csldevparmut05 184.7.121.73
2019/08/08 01:09:53 [INFO] consul: Adding LAN server csldevparmut05 (Addr: tcp/184.7.121.73:8300) (DC: eu-fr-paris)
2019/08/08 01:09:53 [INFO] consul: Raft data found, disabling bootstrap mode
2019/08/08 01:09:53 [INFO] consul: Handled member-join event for server "csldevparmut05.eu-fr-paris" in area "wan"
2019/08/08 01:09:53 [INFO] consul: Handled member-join event for server "csldevparmut04.eu-fr-paris" in area "wan"
2019/08/08 01:09:53 [INFO] consul: Handled member-join event for server "csldevparmut03.eu-fr-paris" in area "wan"
2019/08/08 01:09:53 [INFO] serf: EventMemberJoin: xxxxxxx 184.14.66.14
2019/08/08 01:09:53 [INFO] consul: Handled member-join event for server "csldevparmut02.eu-fr-paris" in area "wan"
2019/08/08 01:09:53 [INFO] consul: Handled member-join event for server "csldevparmut01.eu-fr-paris" in area "wan"
It seems the server itself was shut down during this timeslot:
shutdown system down 3.10.0-693.11.6. Thu Aug 8 01:08 - 03:08 (02:00)
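(That entry comes from the host's login/shutdown history; I pulled it with something like the following:)
-bash-4.2$ last -x shutdown reboot | head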
And from the system logs during the same window:
Aug 8 01:07:02 barneyross yum[56162]: Updated: zabbix-agent-4.0.4-1.el7.x86_64
Aug 8 01:07:07 barneyross subscription-manager: Added subscription for 'Content Access' contract 'None'
Aug 8 01:07:07 barneyross subscription-manager: Added subscription for product ' Content Access'
Aug 8 01:08:53 barneyross rsyslogd: [origin software="rsyslogd" swVersion="8.24.0-34.el7" x-pid="9013" x-info="http://www.rsyslog.com"] start
Aug 8 01:08:53 barneyross rsyslogd: action '*' treated as ':omusrmsg:*' - please use ':omusrmsg:*' syntax instead, '*' will not be supported in the future [v8.24.0-34.el7 try http://www.rsyslog.com/e/2184 ]
Aug 8 01:08:53 barneyross kernel: [ 0.000000] Initializing cgroup subsys cpuset
Aug 8 01:08:53 barneyross rsyslogd: error during parsing file /etc/rsyslog.conf, on or before line 27: warnings occured in file '/etc/rsyslog.conf' around line 27 [v8.24.0-34.el7 try http://www.rsyslog.com/e/2207 ]
Aug 8 01:08:53 barneyross kernel: [ 0.000000] Initializing cgroup subsys cpu
Aug 8 01:08:53 barneyross kernel: [ 0.000000] Initializing cgroup subsys cpuacct
Aug 8 01:08:53 barneyross kernel: [ 0.000000] Linux version 3.10.0-957.10.1.el7.x86_64 (mockbuild@x86-040.build.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-36) (GCC) ) #1 SMP Thu Feb 7 07:12:53 UTC 2019
I guess it's related to this reboot/crash; everything seems to have started during this timeslot.
In the Consul logs, the '... in area "wan"' messages also seem to start during this timeslot:
2019/08/08 01:09:53 [INFO] consul: Handled member-join event for server "csldevparmut05.eu-fr-paris" in area "wan"
I didn't see any similar message before that, after checking on all servers. Some of the Consul processes have been running for many weeks (since May on some nodes).
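(I checked the agent start times with something like the following; it assumes a single consul process per host:)
-bash-4.2$ ps -o lstart,etime,cmd -p $(pidof consul)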
Thanks for your help,
Regards
Laurent