Nomad v1.2.3
Consul v1.11.1
On-premises deployment, operating system RHEL 7.9
Nomad Server - deployed on a dedicated virtual machine
Nomad Clients - deployed on dedicated virtual machines
Consul Server - deployed on a dedicated virtual machine
Consul Agent - deployed as a Nomad system job
Consul Connect Enabled
A Nomad job with the Connect sidecar enabled works perfectly under normal circumstances.
Restarting firewalld on the Nomad client causes the Consul agent to lose communication with the sidecar (a stream of connection refused errors), and all health checks start failing.
Things only return to normal after restarting the Nomad client service or restarting the Nomad job.
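Roughly how it looks on a Nomad client (the health-check endpoint is the one from the consul-agent job spec further down; the commands are just the obvious ones for this setup):

$ curl -s http://127.0.0.1:8500/v1/agent/checks    # sidecar health checks all passing
$ sudo systemctl restart firewalld
$ curl -s http://127.0.0.1:8500/v1/agent/checks    # checks flip to critical with "connection refused" in their output
$ sudo systemctl restart nomad                     # or stop and re-run the job; only then do the checks recover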
I followed the guidelines from the official docs on how to install everything (one Consul server, one Nomad server, multiple Nomad clients). Nomad is installed as a systemd service:
$ systemctl status nomad
● nomad.service - Nomad
   Loaded: loaded (/etc/systemd/system/nomad.service; disabled; vendor preset: disabled)
   Active: active (running) since Wed 2022-01-05 13:52:06 CET; 22h ago
     Docs: https://www.nomadproject.io/docs/
 Main PID: 25884 (nomad)
    Tasks: 243
   Memory: 263.1M
consul-agent.nomad system job spec
job "consul-agent" {
datacenters = ["dev"]
type = "system"
update {
max_parallel = 1
stagger = "30s"
}
priority = 100
group "consul-agents" {
network {
port "http" {
static = 8500
to = 8500
}
}
service {
port = "http"
name = "consul-agent"
tags = ["consul", "consul-agent", "http"]
check {
type = "http"
path = "/v1/agent/checks"
interval = "10s"
timeout = "5s"
}
}
task "consul-agent" {
driver = "exec"
# Due to the limitations of Consul Connect, where consul binary must be present on Nomad's $PATH, the consul agent
# will not be installed via the artifact stanza, it must already be installed on the host. This task will just
# schedule it and make sure it is running with the proper config.
#artifact {
# source = "https://releases.hashicorp.com/consul/1.10.3/consul_1.10.3_linux_amd64.zip"
#}
config {
command = "consul"
args = [
"agent",
"-bind={{ GetInterfaceIP \"ens192\" }}",
"-datacenter=dev",
"-data-dir=/opt/consul",
"-join=ip_consul_server",
"-grpc-port=8502",
# the addresses to which consul will bind network interfaces, including http and dns, so allowing access on
# both loopback and external connections; this is needed to scrape metrics in Prometheus, which will try to
# access the agent via its external IP.
"-client={{ GetInterfaceIP \"ens192\" }} 127.0.0.1",
# enables built-in web UI server
"-ui"
]
}
resources {
cpu = 300
memory = 256
}
}
}
}
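For reference, the workloads that use the Connect sidecar are registered roughly like this (job name, service name, port, and command are placeholders, not my real spec):

job "example-api" {
  datacenters = ["dev"]

  group "api" {
    network {
      mode = "bridge"
    }

    service {
      name = "example-api"
      port = "9090"

      connect {
        sidecar_service {}
      }
    }

    task "api" {
      driver = "exec"

      config {
        command = "/usr/local/bin/example-api"
      }
    }
  }
}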
Hi @cvasii. I don’t know why I didn’t think of this yesterday, so apologies. Nomad networking works by manipulating iptables entries. I don’t know firewalld specifically, but my suspicion is that when you restart it, it drops all of the iptables rules.
Can you test this theory by listing your iptables config before and after the firewalld restart? I suspect you’ll see that entries are getting dropped, and then after you restart the Nomad client, the allocations get rescheduled, and new entries show up.
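Something like this should be enough to compare; on a client using bridge networking the relevant entries normally live in chains created by Nomad and the CNI plugins, so the grep below is just a rough filter:

$ sudo iptables-save | grep -iE 'nomad|cni' > /tmp/iptables-before.txt
$ sudo systemctl restart firewalld
$ sudo iptables-save | grep -iE 'nomad|cni' > /tmp/iptables-after.txt
$ diff /tmp/iptables-before.txt /tmp/iptables-after.txt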
If this is the case, I think the solution for you will be to save and reload the iptables entries on firewalld restart at the host layer, which is outside of Nomad’s scope.
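As a sketch of what that could look like at the host layer (the path and the exact hook are up to you; this just captures the ruleset while things are healthy and puts it back after firewalld wipes it):

$ sudo iptables-save > /root/iptables-healthy.rules
# ... firewalld restart wipes the Nomad/CNI rules here ...
$ sudo iptables-restore < /root/iptables-healthy.rules

Note that restoring a full snapshot is a blunt instrument, since it also reinstates firewalld’s own rules as they were at capture time; restarting the affected allocations is the more targeted alternative.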