Consul service unable to start

Hi,
I am new to HashiCorp and am having an issue where my Consul agent just stops working, and when I try to restart the service it fails. Below is what I get when I restart the service:

# systemctl status consul.service
● consul.service - Consul server agent
   Loaded: loaded (/etc/systemd/system/consul.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Thu 2019-06-20 17:02:48 EDT; 10s ago
  Process: 20054 ExecStart=/opt/forgerock/consul/consul agent -config-file=/opt/forgerock/consul/consul.json -pid-file=/opt/forgerock/consul/consul.pid (code=exited, status=1/FAILURE)
 Main PID: 20054 (code=exited, status=1/FAILURE)

Jun 20 17:02:48 server1.xxxxxx.ontario.ca systemd[1]: Unit consul.service entered failed state.
Jun 20 17:02:48 server1.xxxxxx.ontario.ca systemd[1]: consul.service failed.

My systemd unit file is as follows:

### BEGIN INIT INFO
# Provides:          consul
# Required-Start:    $local_fs $remote_fs
# Required-Stop:     $local_fs $remote_fs
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Short-Description: Consul agent
# Description:       Consul service discovery framework
### END INIT INFO

[Unit]
Description=Consul server agent
Requires=network-online.target
After=network-online.target

[Service]
User=fruser
Group=fruser
PIDFile=/opt/forgerock/consul/consul.pid
PermissionsStartOnly=true
ExecStart=/opt/forgerock/consul/consul agent -config-file=/opt/forgerock/consul/consul.json -pid-file=/opt/forgerock/consul/consul.pid
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
KillSignal=SIGTERM
Restart=on-failure
RestartSec=42s

[Install]
WantedBy=multi-user.target

I was given the following response from the GitHub group:

One thing you may want to try would be starting Consul as the fruser yourself, with the same arguments that systemd would pass to it. The terminal output is probably telling you what's happening. Also, it's odd that the systemctl command isn't showing anything that Consul would have output. You could try running something like journalctl -xe -u consul to look for the Consul-specific logs (this assumes you are also using journald beside systemd).
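Following that advice, a manual foreground start as the service user might look like this (a sketch; the binary and config paths are taken from the unit file above, and this assumes you have sudo access):

```shell
# Run Consul in the foreground as the service user, with the same
# arguments systemd would pass, so startup errors print to the terminal.
sudo -u fruser /opt/forgerock/consul/consul agent \
    -config-file=/opt/forgerock/consul/consul.json \
    -pid-file=/opt/forgerock/consul/consul.pid

# In another terminal, follow the unit's journal for Consul-specific logs.
journalctl -xe -u consul -f
```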

Output from journalctl -xe -u consul:

-- Unit consul.service has finished starting up.

-- The start-up result is done.
Jun 21 13:06:14 server1.ontario.ca consul[29603]: bootstrap_expect > 0: expecting 3 servers
Jun 21 13:06:14 server1.ontario.ca consul[29603]: agent: Node name "consul_s1_server1" will not be discovera
Jun 21 13:06:14 server1.ontario.ca consul[29603]: ==> Starting Consul agent...
Jun 21 13:06:14 server1.ontario.ca consul[29603]: raft: Restored from snapshot 24-934063-1554813935074
Jun 21 13:06:14 server1.ontario.ca consul[29603]: raft: Initial configuration (index=884074): [{Suffrage:Vot
Jun 21 13:06:14 server1.ontario.ca consul[29603]: raft: Node at 10.x.x.x:8300 [Follower] entering Follower
Jun 21 13:06:14 server1.ontario.ca consul[29603]: serf: EventMemberJoin: consul_s1_server1.toronto 10.x.x.x
Jun 21 13:06:14 server1.ontario.ca consul[29603]: serf: Failed to re-join any previously known node
Jun 21 13:06:14 server1.ontario.ca consul[29603]: serf: EventMemberJoin: consul_s1_server1 10.x.x.x
Jun 21 13:06:14 server1.ontario.ca consul[29603]: consul: Adding LAN server consul_s1_server1 (Addr: tcp/10.
Jun 21 13:06:14 server1.ontario.ca consul[29603]: consul: Raft data found, disabling bootstrap mode
Jun 21 13:06:14 server1.ontario.ca consul[29603]: serf: Failed to re-join any previously known node
Jun 21 13:06:14 server1.ontario.ca systemd[1]: consul.service: main process exited, code=exited, status=1/FA
Jun 21 13:06:14 server1.ontario.ca systemd[1]: Unit consul.service entered failed state.
Jun 21 13:06:14 server1.ontario.ca systemd[1]: consul.service failed.
Jun 21 13:06:56 server1.ontario.ca systemd[1]: consul.service holdoff time over, scheduling restart.
Jun 21 13:06:56 server1.ontario.ca systemd[1]: Stopped Consul server agent.
-- Subject: Unit consul.service has finished shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel

-- Unit consul.service has finished shutting down.
Jun 21 13:06:56 server1.ontario.ca systemd[1]: Started Consul server agent.
-- Subject: Unit consul.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel

-- Unit consul.service has finished starting up.

-- The start-up result is done.
Jun 21 13:06:56 server1.ontario.ca consul[29664]: bootstrap_expect > 0: expecting 3 servers
Jun 21 13:06:56 server1.ontario.ca consul[29664]: ==> Starting Consul agent...
Jun 21 13:06:56 server1.ontario.ca consul[29664]: agent: Node name "consul_s1_server1" will not be discovera
Jun 21 13:06:56 server1.ontario.ca consul[29664]: raft: Restored from snapshot 24-934063-1554813935074
Jun 21 13:06:56 server1.ontario.ca consul[29664]: raft: Initial configuration (index=884074): [{Suffrage:Vot
Jun 21 13:06:56 server1.ontario.ca consul[29664]: raft: Node at 10.x.x.x:8300 [Follower] entering Follower
Jun 21 13:06:56 server1.ontario.ca consul[29664]: serf: EventMemberJoin: consul_s1_server1.toronto 10.x.x.x
Jun 21 13:06:56 server1.ontario.ca consul[29664]: serf: Failed to re-join any previously known node
Jun 21 13:06:56 server1.ontario.ca consul[29664]: serf: EventMemberJoin: consul_s1_server1 10.x.x.x
Jun 21 13:06:56 server1.ontario.ca consul[29664]: consul: Adding LAN server consul_s1_server1 (Addr: tcp/10.
Jun 21 13:06:56 server1.ontario.ca consul[29664]: consul: Raft data found, disabling bootstrap mode
Jun 21 13:06:56 server1.ontario.ca consul[29664]: serf: Failed to re-join any previously known node
Jun 21 13:06:56 server1.ontario.ca systemd[1]: consul.service: main process exited, code=exited, status=1/FA
Jun 21 13:06:56 server1.ontario.ca systemd[1]: Unit consul.service entered failed state.
Jun 21 13:06:56 server1.ontario.ca systemd[1]: consul.service failed.

Looking forward to hearing from you soon.

Your help in this regard will be highly appreciated.

Thanks,

It would be good to see the config file, with all the secret info removed.

There seems to be a small clue in the messages you have posted, around the bootstrap_expect parameter.

It would also help to explain what configuration you are trying to set up: how many servers, and so on.

Hi @shantanugadgil
I am trying to set up an HA Vault environment, using the following article.

I have 4 Consul servers and 2 Vault servers with Consul agents installed on them.

My Consul config files are as below:

> Server 1
{
  "server": true,
  "node_name": "consul_s1_server1",
  "datacenter": "Toronto",
  "data_dir": "/opt/xxxxxxx/consul/data",
  "bind_addr": "0.0.0.0",
  "client_addr": "0.0.0.0",
  "advertise_addr": "server1 IP",
  "bootstrap_expect": 3,
  "retry_join": ["Server1 IP", "Server2 IP", "Server3 IP", "Server4 IP"],
  "ui": true,
  "log_level": "DEBUG",
  "enable_syslog": true,
  "acl_enforce_version_8": false
}

Consul agent config on the vault1 server

{
  "server": false,
  "node_name": "consul_c1_vaultserver1",
  "datacenter": "Toronto",
  "data_dir": "/opt/xxxxxxx/consul/data",
  "bind_addr": "vault server 1 IP",
  "client_addr": "127.0.0.1",
  "retry_join": ["Server1 IP", "Server2 IP", "Server3 IP", "Server4 IP"],
  "log_level": "DEBUG",
  "enable_syslog": true,
  "acl_enforce_version_8": false
}

Another thing I have experienced is that the Consul service on this server stops working by itself and then starts working again automatically, without me changing anything.

Also, when I start Consul in dev mode I get this:

[user@server1 consul]$ ./consul agent -dev

==> Starting Consul agent...

==> Error starting agent: 1 error(s) occurred:

@JawadKM Two things I have noticed from your configuration:

1 - You are using 4 servers. Normally either 3 or 5 would be used, due to how the Raft protocol works. With 3 servers there is still consensus if 2 are alive and logs are replicated to both. With 5 servers you can lose 2 and still have a majority of the nodes agree on the replicated state. With 4 servers, your cluster is going to act similarly to a 5-node cluster that has lost 1 node: you can still only lose 1 server before writes and reads start failing. This will not cause the other issues you are seeing, but it's something you may want to think about.
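The quorum arithmetic in point 1 can be sketched in a couple of lines of shell (quorum is the smallest majority, i.e. floor(n/2) + 1 with integer division):

```shell
# Raft quorum size and fault tolerance for common cluster sizes.
for n in 3 4 5; do
  quorum=$(( n / 2 + 1 ))       # smallest majority of n servers
  tolerance=$(( n - quorum ))   # servers that can fail with quorum intact
  echo "$n servers: quorum $quorum, can lose $tolerance"
done
```

This is why a 4-server cluster tolerates only 1 failure, the same as a 3-server cluster, while adding one more server to reach 5 raises the tolerance to 2.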

2 - You have the retry_join configuration set up to include the Server1 IP in that server's own configuration. I don't know for certain that this will cause problems, but in general you should not join/retry-join a node to itself. I wouldn't be surprised if this were the root problem.

Also, the "Failed to re-join" messages happen when the cluster starts up but nothing it was previously connected to can be reached now. Looking through the code a bit, it looks like consul/serf/memberlist will allow you to join yourself on the initial join, but when restarting you will not be able to re-join yourself, as there is a bit of code there to prevent it.

So I think a good first step is to not include the address of the node in the retry_join configuration for that node. Note that you may need to remove the <consul data dir>/serf/local.snapshot file to get rid of the re-join failure messages after you fix up the retry_join arguments.
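Put together, the cleanup might look like this (a sketch; $CONSUL_DATA_DIR stands in for whatever your config's data_dir points at, and this assumes you have already edited retry_join to drop the node's own address):

```shell
# After removing this node's own address from retry_join, clear the
# cached serf snapshot so the agent stops trying to re-join itself.
sudo systemctl stop consul
sudo rm "$CONSUL_DATA_DIR/serf/local.snapshot"
sudo systemctl start consul
```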

@mkeeler I appreciate your insight. I will make the necessary changes today and let you know.

Just so you know, I have the same configuration on other servers and they do not give me this issue. I will try your suggestion and update you soon.

Thanks

@mkeeler I did what was advised; the "Failed to re-join" message is gone, as shown below, but I am still unable to start the Consul service.

-- The start-up result is done.
Jun 27 11:52:25 server1.ontario.ca systemd[1]: consul.service: main process exited, code=exited, status=1/FAILURE
Jun 27 11:52:25 server1.ontario.ca systemd[1]: Unit consul.service entered failed state.
Jun 27 11:52:25 server1.ontario.ca systemd[1]: consul.service failed.
Jun 27 11:53:07 server1.ontario.ca systemd[1]: consul.service holdoff time over, scheduling restart.
Jun 27 11:53:07 server1.ontario.ca systemd[1]: Stopped Consul server agent.
-- Subject: Unit consul.service has finished shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel

-- Unit consul.service has finished shutting down.
Jun 27 11:53:07 server1.ontario.ca systemd[1]: Started Consul server agent.
-- Subject: Unit consul.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel

Because I am getting this:

./consul agent -dev

==> Starting Consul agent...
==> Error starting agent: 1 error(s) occurred:

  * listen tcp 127.0.0.1:8600: bind: address already in use

To my understanding, Consul is trying to use port 8600 to start. Is there a way I can configure Consul to use some other port, e.g. 8601?
If there is, can you guide me on where and how to change/define the port?

Latest update from journalctl -xe -u consul:

-- The start-up result is done.
Jun 27 12:14:30 server1.ontario.ca consul[31509]: bootstrap_expect > 0: expecting 3 servers
Jun 27 12:14:30 server1.ontario.ca consul[31509]: ==> Starting Consul agent...
Jun 27 12:14:30 server1.ontario.ca consul[31509]: agent: Node name "consul_s1_server1" will not be discoverable via DNS due to invalid characters. Valid characters include all a
Jun 27 12:14:30 server1.ontario.ca consul[31509]: raft: Restored from snapshot 24-934063-1554813935074
Jun 27 12:14:31 server1.ontario.ca consul[31509]: raft: Initial configuration (index=884074): [{Suffrage:Voter ID:50dd5200-19ef-12e8-7081-a0a21749c850 Address:Server1 IP:8300} {S
Jun 27 12:14:31 server1.ontario.ca consul[31509]: raft: Node at Server 1 IP:8300 [Follower] entering Follower state (Leader: "")
Jun 27 12:14:31 server1.ontario.ca consul[31509]: serf: EventMemberJoin: consul_s1_server1.toronto Server1 IP
Jun 27 12:14:31 server1.ontario.ca consul[31509]: serf: Failed to re-join any previously known node
Jun 27 12:14:31 server1.ontario.ca consul[31509]: serf: EventMemberJoin: consul_s1_server1 Server 1 IP
Jun 27 12:14:31 server1.ontario.ca consul[31509]: ==> Error starting agent: 1 error(s) occurred:
Jun 27 12:14:31 server1.ontario.ca consul[31509]: * listen tcp 0.0.0.0:8600: bind: address already in use
Jun 27 12:14:31 server1.ontario.ca systemd[1]: consul.service: main process exited, code=exited, status=1/FAILURE
Jun 27 12:14:31 server1.ontario.ca systemd[1]: Unit consul.service entered failed state.
Jun 27 12:14:31 server1.ontario.ca systemd[1]: consul.service failed.
Jun 27 12:15:13 server1.ontario.ca systemd[1]: consul.service holdoff time over, scheduling restart.
Jun 27 12:15:13 server1.ontario.ca systemd[1]: Stopped Consul server agent.
-- Subject: Unit consul.service has finished shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel

-- Unit consul.service has finished shutting down.
Jun 27 12:15:13 server1.ontario.ca systemd[1]: Started Consul server agent.
-- Subject: Unit consul.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel

-- Unit consul.service has finished starting up.

You can change the DNS port (which defaults to 8600) with a ports.dns configuration entry.

Although, you should probably make sure another Consul agent isn't running on the system for some reason, as it's unusual for port 8600 to already be in use. Assuming it's something else, all the ports Consul uses are configurable, so you can prevent the conflicts.
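To see what is already holding the port, something like this should work on most Linux systems (a sketch; either tool may need installing, and the Consul DNS port listens on both TCP and UDP):

```shell
# List the processes listening on TCP or UDP port 8600.
ss -lntup | grep 8600   # ss is from iproute2; -p needs root to show owners
# or, equivalently:
lsof -i :8600           # lsof shows the PID and command name per socket
```

If the output shows another consul process, kill or reconfigure it rather than moving the port; if it is something unrelated (e.g. a local DNS resolver), changing ports.dns as below is the right fix.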

@mkeeler can you give an example of how to define a different DNS port, keeping my Consul config file (below) in view?
I think I am making a mistake while defining the new port for DNS.

{
  "server": true,
  "node_name": "consul_s1_server1",
  "datacenter": "Toronto",
  "data_dir": "/opt/xxxxxxx/consul/data",
  "bind_addr": "0.0.0.0",
  "client_addr": "0.0.0.0",
  "advertise_addr": "server1 IP",
  "bootstrap_expect": 3,
  "retry_join": ["Server1 IP", "Server2 IP", "Server3 IP", "Server4 IP"],
  "ui": true,
  "log_level": "DEBUG",
  "enable_syslog": true,
  "acl_enforce_version_8": false,
  "ports": {
     "dns": 9600
  }
}

It's just that last little bit at the end that controls the DNS server port.

@mkeeler you are the man... it finally worked!
I appreciate all your help and support. Best support ever.