Unable to pass Nomad Server HTTP Check in Consul

Greetings earthlings,

I am struggling to terraform a Nomad cluster in my AWS environment as the Nomad Server HTTP Check in Consul doesn’t seem to pass no matter what I do.

Here are the cornerstones of it all:

  • relying on the local Consul agent for cluster formation and service discovery (see the sketch after this list)
  • terraformed via the official Gruntwork-maintained modules (somewhat adapted)
  • custom Packer-built AMI based on Amazon Linux 2
  • Nomad 0.12.4
  • Consul 1.8.3
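
There is no explicit consul stanza in the Nomad config below, so Nomad should be running with its defaults for the Consul integration; as far as I understand, that is roughly equivalent to:

consul {
  # Assumed defaults, not set explicitly anywhere in my config:
  # talk to the local Consul agent and let Nomad register itself
  # (this registration is what creates the "Nomad Server HTTP Check").
  address          = "127.0.0.1:8500"
  auto_advertise   = true
  server_auto_join = true
  client_auto_join = true
}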

As for the Nomad config:

region               = "europe"
datacenter           = "aws15-${stage}"
data_dir             = "/opt/nomad/data/"
disable_update_check = true
# bind_addr = "$PRIVATE_IP"

addresses {
  http = "0.0.0.0"
}

advertise {
  http = "{{ GetInterfaceIP \"eth0\" }}"
  rpc  = "{{ GetInterfaceIP \"eth0\" }}"
  serf = "{{ GetInterfaceIP \"eth0\" }}"
}

server {
  enabled          = true
  bootstrap_expect = 3
  encrypt          = "REDACTED"
}

plugin "raw_exec" {
  config {
    enabled = true
  }
}

leave_on_terminate = true
leave_on_interrupt = true

And the Consul client config:

datacenter     = "aws15-mgmt"

ports {
  grpc = 8502
}

data_dir             = "/opt/consul/data"
disable_update_check = true
leave_on_terminate   = true

### Cloud autodiscovery section ###
retry_join = [
  "provider=aws tag_key=consul-servers tag_value=consul-mgmt addr_type=private_v4 region=eu-central-1"
]

### Encryption ###
encrypt = "REDACTED"
encrypt_verify_incoming = false
encrypt_verify_outgoing = true
verify_incoming = false
verify_outgoing = true
verify_server_hostname = true
ca_file = "consul-agent-ca.pem"
auto_encrypt {
  tls = true
}

### Disable server mode ###
server        = false
raft_protocol = 3

### Enable central configuration ###
enable_central_service_config = true

This is how it looks in the Consul server GUI:

[screenshot: the Nomad Server HTTP Check showing as failing in the Consul UI]
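
For reference, the same status (including the check output) can be pulled from the local Consul agent's HTTP API; a quick way to do that, assuming Consul's default HTTP port and with jq only used for readability:

curl -s http://127.0.0.1:8500/v1/agent/checks | jq .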

Here’s some logging output from one node:

nomad monitor
2020-09-10T16:09:04.556+0200 [INFO]  nomad: successfully contacted Nomad servers: num_servers=2
2020-09-10T16:09:45.936+0200 [ERROR] worker: failed to dequeue evaluation: error="rpc error: eval broker disabled"
2020-09-10T16:09:45.936+0200 [ERROR] worker: failed to dequeue evaluation: error="rpc error: eval broker disabled"
2020-09-10T16:09:47.043+0200 [WARN]  nomad.raft: rejecting vote request since we have a leader: from=172.18.131.221:4647 leader=172.18.131.201:4647
2020-09-10T16:09:47.223+0200 [INFO]  nomad: serf: EventMemberLeave: mgmt-nomad-t2c.europe 172.18.131.201
2020-09-10T16:09:47.223+0200 [INFO]  nomad: removing server: server="mgmt-nomad-t2c.europe (Addr: 172.18.131.201:4647) (DC: aws15-mgmt)"
2020-09-10T16:09:48.417+0200 [WARN]  nomad.raft: rejecting vote request since we have a leader: from=172.18.131.221:4647 leader=172.18.131.201:4647
2020-09-10T16:09:48.719+0200 [WARN]  nomad.raft: heartbeat timeout reached, starting election: last-leader=172.18.131.201:4647
2020-09-10T16:09:48.719+0200 [INFO]  nomad.raft: entering candidate state: node="Node at 172.18.131.96:4647 [Candidate]" term=30
2020-09-10T16:09:48.723+0200 [ERROR] worker: failed to dequeue evaluation: error="rpc error: No cluster leader"
2020-09-10T16:09:48.723+0200 [ERROR] worker: failed to dequeue evaluation: error="rpc error: No cluster leader"
2020-09-10T16:09:48.724+0200 [INFO]  nomad.raft: entering follower state: follower="Node at 172.18.131.96:4647 [Follower]" leader=
2020-09-10T16:09:50.757+0200 [ERROR] worker: failed to dequeue evaluation: error="rpc error: eval broker disabled"
2020-09-10T16:09:50.758+0200 [ERROR] worker: failed to dequeue evaluation: error="rpc error: eval broker disabled"
2020-09-10T16:09:51.001+0200 [INFO]  nomad: serf: EventMemberJoin: mgmt-nomad-t2c.europe 172.18.131.201
2020-09-10T16:09:51.001+0200 [INFO]  nomad: adding server: server="mgmt-nomad-t2c.europe (Addr: 172.18.131.201:4647) (DC: aws15-mgmt)"
2020-09-10T16:09:51.975+0200 [INFO]  nomad: serf: EventMemberFailed: mgmt-nomad-7tn.europe 172.18.131.221
2020-09-10T16:09:51.975+0200 [INFO]  nomad: removing server: server="mgmt-nomad-7tn.europe (Addr: 172.18.131.221:4647) (DC: aws15-mgmt)"
2020-09-10T16:09:52.667+0200 [WARN]  nomad.raft: rejecting vote request since we have a leader: from=172.18.131.201:4647 leader=172.18.131.221:4647
2020-09-10T16:09:53.541+0200 [ERROR] worker: failed to dequeue evaluation: error="rpc error: No cluster leader"
2020-09-10T16:09:53.541+0200 [ERROR] worker: failed to dequeue evaluation: error="rpc error: No cluster leader"
2020-09-10T16:09:53.580+0200 [WARN]  nomad.raft: heartbeat timeout reached, starting election: last-leader=172.18.131.221:4647
2020-09-10T16:09:53.580+0200 [INFO]  nomad.raft: entering candidate state: node="Node at 172.18.131.96:4647 [Candidate]" term=33
2020-09-10T16:09:53.584+0200 [INFO]  nomad.raft: election won: tally=1
2020-09-10T16:09:53.584+0200 [INFO]  nomad.raft: entering leader state: leader="Node at 172.18.131.96:4647 [Leader]"
2020-09-10T16:09:53.584+0200 [INFO]  nomad: cluster leadership acquired
2020-09-10T16:09:53.587+0200 [INFO]  nomad.raft: updating configuration: command=AddStaging server-id=172.18.131.201:4647 server-addr=172.18.131.201:4647 servers="[{Suffrage:Voter ID:172.18.131.96:4647 Address:172.18.131.96:4647} {Suffrage:Voter ID:172.18.131.201:4647 Address:172.18.131.201:4647}]"
2020-09-10T16:09:53.588+0200 [INFO]  nomad.raft: added peer, starting replication: peer=172.18.131.201:4647
2020-09-10T16:09:53.592+0200 [WARN]  nomad.raft: appendEntries rejected, sending older logs: peer="{Voter 172.18.131.201:4647 172.18.131.201:4647}" next=83
2020-09-10T16:09:53.595+0200 [INFO]  nomad.raft: pipelining replication: peer="{Voter 172.18.131.201:4647 172.18.131.201:4647}"
2020-09-10T16:09:55.823+0200 [INFO]  nomad: serf: EventMemberJoin: mgmt-nomad-7tn.europe 172.18.131.221
2020-09-10T16:09:55.823+0200 [INFO]  nomad: adding server: server="mgmt-nomad-7tn.europe (Addr: 172.18.131.221:4647) (DC: aws15-mgmt)"
2020-09-10T16:09:55.823+0200 [INFO]  nomad.raft: updating configuration: command=AddStaging server-id=172.18.131.221:4647 server-addr=172.18.131.221:4647 servers="[{Suffrage:Voter ID:172.18.131.96:4647 Address:172.18.131.96:4647} {Suffrage:Voter ID:172.18.131.201:4647 Address:172.18.131.201:4647} {Suffrage:Voter ID:172.18.131.221:4647 Address:172.18.131.221:4647}]"
2020-09-10T16:09:55.825+0200 [INFO]  nomad.raft: added peer, starting replication: peer=172.18.131.221:4647
2020-09-10T16:09:55.825+0200 [ERROR] nomad.raft: failed to appendEntries to: peer="{Voter 172.18.131.221:4647 172.18.131.221:4647}" error=EOF
2020-09-10T16:09:56.296+0200 [WARN]  nomad.raft: appendEntries rejected, sending older logs: peer="{Voter 172.18.131.221:4647 172.18.131.221:4647}" next=86
2020-09-10T16:09:56.299+0200 [INFO]  nomad.raft: pipelining replication: peer="{Voter 172.18.131.221:4647 172.18.131.221:4647}"
2020-09-10T16:10:01.669+0200 [INFO]  nomad: server starting leave
2020-09-10T16:10:01.669+0200 [INFO]  nomad.raft: updating configuration: command=RemoveServer server-id=172.18.131.96:4647 server-addr= servers="[{Suffrage:Voter ID:172.18.131.201:4647 Address:172.18.131.201:4647} {Suffrage:Voter ID:172.18.131.221:4647 Address:172.18.131.221:4647}]"
2020-09-10T16:10:01.673+0200 [INFO]  nomad.raft: removed ourself, transitioning to follower
2020-09-10T16:10:01.673+0200 [INFO]  nomad.raft: entering follower state: follower="Node at 172.18.131.96:4647 [Follower]" leader=
2020-09-10T16:10:01.674+0200 [INFO]  nomad.raft: aborting pipeline replication: peer="{Voter 172.18.131.201:4647 172.18.131.201:4647}"
2020-09-10T16:10:01.675+0200 [INFO]  nomad: cluster leadership lost
2020-09-10T16:10:01.675+0200 [ERROR] worker: failed to dequeue evaluation: error="eval broker disabled"
2020-09-10T16:10:01.675+0200 [INFO]  nomad.raft: aborting pipeline replication: peer="{Voter 172.18.131.221:4647 172.18.131.221:4647}"
2020-09-10T16:10:02.466+0200 [INFO]  nomad: serf: EventMemberLeave: mgmt-nomad-7jt.europe 172.18.131.96
2020-09-10T16:10:02.466+0200 [INFO]  nomad: removing server: server="mgmt-nomad-7jt.europe (Addr: 172.18.131.96:4647) (DC: aws15-mgmt)"
2020-09-10T16:10:03.261+0200 [WARN]  nomad.raft: not part of stable configuration, aborting election

Any help is appreciated!

Cheers
Ralph

Hi Ralph,

did you ever solve your issue?
The same thing is happening in my setup with a simple Nomad config that uses a non-localhost advertise address.

Hi everyone :wave:

The checks are failing because the cluster can’t establish leadership. Are all servers able to reach each other? You could try using netcat or some other tool to check this.
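
A rough sketch of what that could look like, run from each server, with the IPs taken from your logs and Nomad's default ports (4646 HTTP, 4647 RPC, 4648 Serf):

for ip in 172.18.131.96 172.18.131.201 172.18.131.221; do
  for port in 4646 4647 4648; do
    # -z: just check the port, -v: report the result, -w 2: 2s timeout
    nc -z -v -w 2 "$ip" "$port"
  done
done

If 4647 (RPC) and 4648 (Serf) aren't reachable between all three servers, that would explain the leadership flapping in your logs, and I'd expect the HTTP check to stay red until a leader sticks.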