Consul agent unable to talk to sidecar after firewalld restart

Nomad v1.2.3
Consul v1.11.1
Deployment is on-premise, operating system RHEL 7.9

Nomad Server - deployed on a dedicated virtual machine
Nomad Clients - deployed on dedicated virtual machines
Consul Server - deployed on a dedicated virtual machine
Consul Agent - deployed as a Nomad system job
Consul Connect Enabled

A Nomad job with the sidecar enabled works perfectly under normal circumstances.

Restarting firewalld on the Nomad Client causes the Consul Agent to lose communication with the sidecar (we get a bunch of connection refused errors) and all health checks fail.

Things only get back to normal after restarting the Nomad client service or restarting the Nomad job.

Any suggestions?

Hi @cvasii

Thanks for using Nomad!

Do you mind sharing a few config files so I can make sure I am replicating your environment accurately? Specifically, I’d like

  • Nomad Client config
  • Consul Agent system job jobspec
  • Consul Agent config, if it’s external to the jobspec - if it’s an inline template, the jobspec will already have it :grinning:

As always, here’s a friendly reminder to remove any secrets before posting config files. Thanks! :grinning:

Derek

  • /etc/nomad.d/nomad.hcl
# Full configuration options can be found at https://www.nomadproject.io/docs/configuration

datacenter = "dev"
data_dir = "/opt/nomad"
bind_addr = "0.0.0.0"
#enable_debug = true

# Configure logging with log files rotation every 10MB
log_file = "/var/log/nomad/nomad.log"
log_rotate_bytes = 10000000
log_rotate_duration = "24h"
log_rotate_max_files = 5

telemetry {
        collection_interval = "60s"
        disable_hostname = true
        prometheus_metrics = true
        publish_allocation_metrics = true
        publish_node_metrics = true
}
  • /etc/nomad.d/client.hcl
client {
  enabled = true
  servers = ["server_ip_address"]
  options = {
     "user.denylist" = ""
     "docker.volumes.enabled" = "true"
     "docker.privileged.enabled" = "true"
  }
  network_interface = "ens192"
}

vault {
        enabled = true
        address = "https://vault.my_internal_dns"
        tls_skip_verify = true
}
  • I followed the guidelines from the official docs on how to install everything (there is 1 Consul server, 1 Nomad Server, multiple Nomad Clients). Nomad is installed as a service.
$ systemctl status nomad
● nomad.service - Nomad
   Loaded: loaded (/etc/systemd/system/nomad.service; disabled; vendor preset: disabled)
   Active: active (running) since Wed 2022-01-05 13:52:06 CET; 22h ago
     Docs: https://www.nomadproject.io/docs/
 Main PID: 25884 (nomad)
    Tasks: 243
   Memory: 263.1M
  • consul-agent.nomad system job spec
job "consul-agent" {

  datacenters = ["dev"]

  type = "system"

  update {
    max_parallel = 1
    stagger      = "30s"
  }

  priority = 100

  group "consul-agents" {

    network {
      port "http" {
        static = 8500
        to     = 8500
      }
    }

    service {
      port = "http"
      name = "consul-agent"
      tags = ["consul", "consul-agent", "http"]

      check {
        type     = "http"
        path     = "/v1/agent/checks"
        interval = "10s"
        timeout  = "5s"
      }

    }

    task "consul-agent" {

      driver = "exec"

      # Due to a limitation of Consul Connect (the consul binary must be present on Nomad's $PATH), the consul
      # agent is not installed via the artifact stanza; it must already be installed on the host. This task just
      # schedules it and makes sure it is running with the proper config.
      #artifact {
      #  source = "https://releases.hashicorp.com/consul/1.10.3/consul_1.10.3_linux_amd64.zip"
      #}

      config {
        command = "consul"
        args    = [
          "agent",
          "-bind={{ GetInterfaceIP \"ens192\" }}",
          "-datacenter=dev",
          "-data-dir=/opt/consul",
          "-join=ip_consul_server",
          "-grpc-port=8502",
          # the client addresses Consul binds its HTTP and DNS interfaces to, allowing access over both the
          # loopback and the external interface; this is needed so Prometheus can scrape metrics via the
          # agent's external IP.
          "-client={{ GetInterfaceIP \"ens192\" }} 127.0.0.1",
          # enables built-in web UI server
          "-ui"
        ]
      }

      resources {
        cpu    = 300
        memory = 256
      }
    }
  }
}

Hi @cvasii. I don’t know why I didn’t think of this yesterday, so apologies. Nomad networking works by manipulating iptables entries. I don’t know firewalld specifically, but my suspicion is that when you restart it, it drops all of the iptables rules.

Can you test this theory by listing your iptables config before and after the firewalld restart? I suspect you’ll see that entries are getting dropped, and then after you restart the Nomad client, the allocations get rescheduled, and new entries show up.
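
Something like the following should make the comparison easy (just a sketch; chain names such as NOMAD-ADMIN are what I’d expect Nomad’s bridge networking to create, but go by whatever you actually see on your hosts):

$ sudo iptables-save > /tmp/iptables-before.txt   # snapshot all current rules
$ sudo iptables -L -n -v | grep -i nomad          # quick look at any Nomad-related chains

$ sudo systemctl restart firewalld

$ sudo iptables-save > /tmp/iptables-after.txt
$ diff /tmp/iptables-before.txt /tmp/iptables-after.txt   # anything Nomad/CNI added should show up as removed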

If this is the case, I think the solution for you will be to save and reload the iptables entries on firewalld restart at the host layer, which is outside of Nomad’s scope.
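
For example, something along these lines (a rough sketch of the idea, not something I’ve tested on RHEL 7.9):

$ sudo iptables-save > /root/nomad-cni-rules.v4      # snapshot the full ruleset while everything is healthy
$ sudo systemctl restart firewalld                   # firewalld flushes the tables at this point
$ sudo iptables-restore < /root/nomad-cni-rules.v4   # replay the snapshot so the running allocations keep working

That would avoid having to restart the Nomad client or the job just to get the rules recreated.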