Consul agent unable to talk to sidecar after firewalld restart

Nomad v1.2.3
Consul v1.11.1
Deployment is on-premise, operating system RHEL 7.9

Nomad Server - deployed on a dedicated virtual machine
Nomad Clients - deployed on dedicated virtual machines
Consul Server - deployed on a dedicated virtual machine
Consul Agent - deployed as a Nomad system job
Consul Connect Enabled

A Nomad job with the sidecar enabled works perfectly under normal circumstances.

Restarting firewalld on the Nomad Client causes the Consul Agent to lose communication with the sidecar (we get a bunch of connection refused errors) and all health checks fail.

Things only get back to normal after restarting the Nomad client service or restarting the Nomad job.

Any suggestions?

Hi @cvasii

Thanks for using Nomad!

Do you mind sharing a few config files so I can make sure I am replicating your environment accurately? Specifically, I’d like

  • Nomad Client config
  • Consul Agent system job jobspec
  • Consul Agent config, if it’s external to the jobspec - if it’s an inline template, the jobspec will already have it :grinning:

As always, here’s a friendly reminder to remove any secrets before posting config files. Thanks! :grinning:

Derek

  • /etc/nomad.d/nomad.hcl
# Full configuration options can be found at https://www.nomadproject.io/docs/configuration

datacenter = "dev"
data_dir = "/opt/nomad"
bind_addr = "0.0.0.0"
#enable_debug = true

# Configure logging with log files rotation every 10MB
log_file = "/var/log/nomad/nomad.log"
log_rotate_bytes = 10000000
log_rotate_duration = "24h"
log_rotate_max_files = 5

telemetry {
        collection_interval = "60s"
        disable_hostname = true
        prometheus_metrics = true
        publish_allocation_metrics = true
        publish_node_metrics = true
}
  • /etc/nomad.d/client.hcl
client {
  enabled = true
  servers = ["server_ip_address"]
  options = {
     "user.denylist" = ""
     "docker.volumes.enabled" = "true"
     "docker.privileged.enabled" = "true"
  }
  network_interface = "ens192"
}

vault {
        enabled = true
        address = "https://vault.my_internal_dns"
        tls_skip_verify = true
}
  • I followed the guidelines from the official docs on how to install everything (there is 1 Consul server, 1 Nomad Server, multiple Nomad Clients). Nomad is installed as a service.
$ systemctl status nomad
● nomad.service - Nomad
   Loaded: loaded (/etc/systemd/system/nomad.service; disabled; vendor preset: disabled)
   Active: active (running) since Wed 2022-01-05 13:52:06 CET; 22h ago
     Docs: https://www.nomadproject.io/docs/
 Main PID: 25884 (nomad)
    Tasks: 243
   Memory: 263.1M
  • consul-agent.nomad system job spec
job "consul-agent" {

  datacenters = ["dev"]

  type = "system"

  update {
    max_parallel = 1
    stagger      = "30s"
  }

  priority = 100

  group "consul-agents" {

    network {
      port "http" {
        static = 8500
        to     = 8500
      }
    }

    service {
      port = "http"
      name = "consul-agent"
      tags = ["consul", "consul-agent", "http"]

      check {
        type     = "http"
        path     = "/v1/agent/checks"
        interval = "10s"
        timeout  = "5s"
      }

    }

    task "consul-agent" {

      driver = "exec"

      # Due to a limitation of Consul Connect (the consul binary must be present on Nomad's $PATH), the consul
      # agent is not installed via the artifact stanza; it must already be installed on the host. This task just
      # schedules it and makes sure it is running with the proper config.
      #artifact {
      #  source = "https://releases.hashicorp.com/consul/1.10.3/consul_1.10.3_linux_amd64.zip"
      #}

      config {
        command = "consul"
        args    = [
          "agent",
          "-bind={{ GetInterfaceIP \"ens192\" }}",
          "-datacenter=dev",
          "-data-dir=/opt/consul",
          "-join=ip_consul_server",
          "-grpc-port=8502",
          # the client addresses Consul binds its HTTP and DNS interfaces to, allowing access over both the
          # loopback and the external interface; this is needed so Prometheus can scrape metrics via the
          # agent's external IP.
          "-client={{ GetInterfaceIP \"ens192\" }} 127.0.0.1",
          # enables built-in web UI server
          "-ui"
        ]
      }

      resources {
        cpu    = 300
        memory = 256
      }
    }
  }
}

Hi @cvasii. I don’t know why I didn’t think of this yesterday, so apologies. Nomad networking works by manipulating iptables entries. I don’t know firewalld specifically, but my suspicion is that when you restart it, it drops all of the iptables rules.

Can you test this theory by listing your iptables config before and after the firewalld restart? I suspect you’ll see that entries are getting dropped, and then after you restart the Nomad client, the allocations get rescheduled, and new entries show up.
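
Something like the following should make the comparison easy (just a sketch; chain names such as NOMAD-ADMIN are what I’d expect Nomad’s bridge networking to create, but go by whatever you actually see on your hosts):

$ sudo iptables-save > /tmp/iptables-before.txt   # snapshot all current rules
$ sudo iptables -L -n -v | grep -i nomad          # quick look at any Nomad-related chains

$ sudo systemctl restart firewalld

$ sudo iptables-save > /tmp/iptables-after.txt
$ diff /tmp/iptables-before.txt /tmp/iptables-after.txt   # anything Nomad/CNI added should show up as removed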

If this is the case, I think the solution for you will be to save and reload the iptables entries on firewalld restart at the host layer, which is outside of Nomad’s scope.
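
For example, something along these lines (a rough sketch of the idea, not something I’ve tested on RHEL 7.9):

$ sudo iptables-save > /root/nomad-cni-rules.v4      # snapshot the full ruleset while everything is healthy
$ sudo systemctl restart firewalld                   # firewalld flushes the tables at this point
$ sudo iptables-restore < /root/nomad-cni-rules.v4   # replay the snapshot so the running allocations keep working

That would avoid having to restart the Nomad client or the job just to get the rules recreated.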