Consul client not joining server after outage

Hi All,

Recently we had a complete outage of our Consul servers. We created a new cluster and restored it from backup. However, the DB services that had been registered through Consul clients were no longer visible in the Consul UI, and we ended up restarting the Consul client on each DB instance to re-register the services. Our client configuration is below. We expected the clients to retry the join and re-register their services on their own, but that did not happen. Are we missing any configuration here? This is blocking our go-live activity, so any help is appreciated.

{
  "server": false,
  "data_dir": "/opt/consul",
  "log_level": "INFO",
  "enable_syslog": true,
  "datacenter": "us-east4",
  "enable_debug": true,
  "retry_join": ["provider=gce project_name= tag_value=consul-cluster-tag"]
}

Do you have any log output?

Hi,

Sorry for the delayed response! Below are the implementation details.
a. The consul-template service runs to fetch configs from the Consul server.
b. The Consul client is up and running; however, it requires the consul-template service to be running as a prerequisite.
c. Kafka and the other DBs register with Consul via the Consul client.

During the issue:
a. The Consul client service was up and running, but it could not register services or join the Consul servers after the outage was recovered.

Please find the logs below.

Jan 22 06:58:53 sskafkacontrolcenterdev-dev531-v000-r416 consul-template[21034]: 2020/01/22 06:58:53.494266 [WARN] (view) kv.block(kafka/ss-kafka-control-center-dev/BROKERS): Unexpected res
Jan 22 06:58:53 sskafkacontrolcenterdev-dev531-v000-r416 consul-template[21034]: 2020/01/22 06:58:53.983868 [WARN] (view) health.service(ss-zk-carbon-dev|passing): Unexpected response code:
Jan 22 06:59:47 sskafkacontrolcenterdev-dev531-v000-r416 consul-template[21034]: 2020/01/22 06:59:47.215667 [ERR] (view) health.service(ss-kafka-npe-dev|passing): Unexpected response code:
Jan 22 06:59:47 sskafkacontrolcenterdev-dev531-v000-r416 consul-template[21034]: 2020/01/22 06:59:47.215693 [ERR] (runner) watcher reported error: health.service(ss-kafka-npe-dev|passing):
Jan 22 06:59:47 sskafkacontrolcenterdev-dev531-v000-r416 consul-template[21034]: 2020/01/22 06:59:47.215747 [ERR] (cli) health.service(ss-kafka-npe-dev|passing): Unexpected response code: 5
Jan 22 06:59:47 sskafkacontrolcenterdev-dev531-v000-r416 systemd[1]: consul-template.service: main process exited, code=exited, status=14/n/a
Jan 22 06:59:47 sskafkacontrolcenterdev-dev531-v000-r416 systemd[1]: Unit consul-template.service entered failed state.
Jan 22 06:59:47 sskafkacontrolcenterdev-dev531-v000-r416 systemd[1]: consul-template.service failed.
Jan 22 09:06:48 sskafkacontrolcenterdev-dev531-v000-r416 systemd[1]: Starting Consul-Template Daemon…

2020/01/22 06:55:26 [INFO] serf: attempting reconnect to consulserver-npe-v003-01hh 10.148.1.225:8301
Jan 22 06:55:28 sskafkacontrolcenterdev-dev531-v000-r416 consul[16777]: 2020/01/22 06:55:28 [WARN] manager: No servers available
Jan 22 06:55:28 sskafkacontrolcenterdev-dev531-v000-r416 consul[16777]: manager: No servers available
Jan 22 06:55:28 sskafkacontrolcenterdev-dev531-v000-r416 consul[16777]: 2020/01/22 06:55:28 [ERR] agent: Coordinate update error: No known Consul servers
Jan 22 06:55:28 sskafkacontrolcenterdev-dev531-v000-r416 consul[16777]: agent: Coordinate update error: No known Consul servers
Jan 22 06:55:42 sskafkacontrolcenterdev-dev531-v000-r416 consul[16777]: 2020/01/22 06:55:42 [WARN] manager: No servers available
Jan 22 06:55:42 sskafkacontrolcenterdev-dev531-v000-r416 consul[16777]: manager: No servers available
Jan 22 06:55:42 sskafkacontrolcenterdev-dev531-v000-r416 consul[16777]: 2020/01/22 06:55:42 [ERR] dns: rpc error: No known Consul servers
Jan 22 06:55:42 sskafkacontrolcenterdev-dev531-v000-r416 consul[16777]: 2020/01/22 06:55:42 [WARN] manager: No servers available
Jan 22 06:55:42 sskafkacontrolcenterdev-dev531-v000-r416 consul[16777]: 2020/01/22 06:55:42 [ERR] dns: rpc error: No known Consul servers
Jan 22 06:55:42 sskafkacontrolcenterdev-dev531-v000-r416 consul[16777]: dns: rpc error: No known Consul servers
Jan 22 06:55:42 sskafkacontrolcenterdev-dev531-v000-r416 consul[16777]: manager: No servers available
Jan 22 06:55:42 sskafkacontrolcenterdev-dev531-v000-r416 consul[16777]: dns: rpc error: No known Consul servers

Hi,

We see that Consul keeps a local list of the server IPs, and retry_join appears to retry only those locally known IPs when trying to rejoin the cluster. So when we build a completely new cluster with the same network tag, the Consul clients are unable to discover the new server IPs.

Our understanding was that retry_join would discover new IPs, regardless of the IPs it already has locally, and join those servers. Can you let us know whether we are missing anything here?

Thanks,
Srinidhi

Hi @KRISHS68,

Sounds like you are running into the same issue reported in hashicorp/consul#6672. If all of the servers are simultaneously shut down – or as in your case become unavailable due to an outage – clients will not automatically re-join the cluster when the servers become available again. See this comment (https://github.com/hashicorp/consul/issues/6672#issuecomment-571259675) on that issue for a bit more info on the actual behavior of -retry-join.

PR #7078, which was merged a couple of days ago, modified the -retry-join docs so that they reflect the actual behavior of this config option.
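
Until the agent handles this case on its own, one possible stopgap is a small watchdog on each client that notices when the agent has lost all servers and explicitly asks it to join again, instead of restarting the agent by hand. The following is only a minimal sketch under several assumptions: the local agent's HTTP API listens on the default 127.0.0.1:8500, no ACL token is required, and you can supply the new server addresses yourself (the list passed to check_once is hypothetical; how you look the addresses up is up to you). It uses the agent's /v1/status/leader and /v1/agent/join/:address endpoints; the helper names are made up for illustration.

```python
import json
import urllib.error
import urllib.request

AGENT = "http://127.0.0.1:8500"  # assumed: default local agent HTTP address


def http(method, path, base=AGENT, timeout=5):
    """Call the local agent's HTTP API and return (status_code, body_text)."""
    req = urllib.request.Request(base + path, method=method)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status, resp.read().decode()
    except urllib.error.HTTPError as err:
        # Non-2xx answer from the agent, e.g. 500 "No known Consul servers".
        return err.code, err.read().decode()
    except OSError as err:
        # Agent not reachable at all (URLError is a subclass of OSError).
        return 0, str(err)


def has_leader(call=http):
    """True when the agent can reach a server that reports a Raft leader."""
    status, body = call("GET", "/v1/status/leader")
    if status != 200:
        return False
    try:
        # The endpoint returns a JSON string; it is empty when there is no leader.
        return bool(json.loads(body))
    except ValueError:
        return False


def rejoin(server_addrs, call=http):
    """Ask the local agent to join each address in turn; return the first that works."""
    for addr in server_addrs:
        status, _ = call("PUT", "/v1/agent/join/" + addr)
        if status == 200:
            return addr
    return None


def check_once(server_addrs, call=http):
    """Run one watchdog pass and return a short status string."""
    if has_leader(call):
        return "ok"
    joined = rejoin(server_addrs, call)
    return "rejoined via " + joined if joined else "no server reachable"
```

Something like check_once(["10.0.0.10", "10.0.0.11"]) could then be run periodically, for example from cron or a systemd timer. The addresses here are placeholders for the new servers. Running `consul join <address>` on the client by hand has the same effect as the join call in the sketch.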