"No healthy host for TCP connection pool" - Nomad and Consul Connect

I’m having a hard time figuring out what’s wrong with this minimal example of running Nomad and Consul Connect.

I’m following along with Consul Connect | Nomad by HashiCorp, but with slight modifications (using netcat instead of socat, and running Nomad not in dev mode but as a 3-server/3-client cluster in Vagrant).

My downstream service cannot talk to my upstream service. Looking at the logs of the connect task, I see a lot of no healthy host for TCP connection pool:

[2021-02-02 04:14:06.645][14][debug][filter] [source/common/tcp_proxy/tcp_proxy.cc:389] [C134] Creating connection to cluster exec-upstream-service.default.dc1.internal.1be84599-6568-253c-820b-7b161b4193f3.consul
[2021-02-02 04:14:06.645][14][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1417] no healthy host for TCP connection pool

Here’s the job that I’m running:

job "exec-services" {
  datacenters = ["dc1"]
  type = "service"
  group "group" {
    network {
      mode = "bridge"
      port "upstream" { to = "8181" }
    }

    service {
      name = "exec-upstream-service"
      port = "upstream"
      connect {
        sidecar_service {}
      }
    }

    task "exec-upstream-service" {
      driver = "exec"
      config {
        command = "/bin/sh"
        args = [
          "-c",
          "while true; do printf 'HTTP/1.1 200 OK\nContent-Type: text/plain; charset=UTF-8\nServer: netcat\n\nHello, world.\n'  | nc -w 10 -p 8181 -l; sleep 1; done"
        ]
      }
    }

    service {
      name = "exec-downstream-service"
      connect {
        sidecar_service {
          proxy {
            upstreams {
              destination_name = "exec-upstream-service"
              local_bind_port = 9191
            }
          }
        }
      }
    }

    task "exec-downstream-service" {
      driver = "exec"
      config {
        command = "/bin/sh"
        args = [
          "-c",
          "echo \"starting\"; while true; do nc -w 1 localhost 9191; sleep 1; done"
        ]
      }
    }
  }
}

Here are some results of further exploration.

I tried using the exact job config from Consul Service Mesh | Nomad | HashiCorp Developer.

When I run that with consul agent -dev and nomad agent -dev-connect, it works just like the docs explain.

When I try to run that in my Vagrant Consul/Nomad cluster, the dashboard can’t connect to the API, and when I look at the Connect sidecar task logs I see the same “No healthy host for TCP connection pool” error message.

I checked intentions and there is an allow intention from the dashboard to the api:

vagrant@server-0:~$ consul intention match count-api
count-dashboard => count-api (allow)

ACL is disabled.

vagrant@server-0:~$ consul acl policy list
Failed to retrieve the policy list: Unexpected response code: 401 (ACL support disabled)

CNI plugins have been installed

vagrant@server-0:~$ cat $(ls -d /proc/sys/net/bridge/bridge-nf-call*)
1
1
1

If it’s relevant, here is my Vagrantfile.

Vagrant.configure("2") do |config|
  vm_name = ["server-0", "server-1", "server-2"]
  vm_name.each_with_index do |name, i|
    config.vm.define "#{name}" do |node|
      node.vm.box = "example"
      node.vm.hostname = name
      node.ssh.port = "220#{i}"
      node.vm.network "forwarded_port", id: "ssh", guest: 22, host: "220#{i}", host_ip: "127.0.0.1"
      node.vm.network "private_network", ip: "192.168.121.10#{i}"
      node.vm.provider "virtualbox" do |v|
        v.memory = 512
        v.cpus = 1
      end
    end
  end

  vm_name = ["client-0", "client-1", "client-2"]
  vm_name.each_with_index do |name, i|
    config.vm.define "#{name}" do |node|
      node.vm.box = "example"
      node.vm.hostname = name
      node.ssh.port = "221#{i}"
      node.vm.network "forwarded_port", id: "ssh", guest: 22, host: "221#{i}", host_ip: "127.0.0.1"
      node.vm.network "private_network", ip: "192.168.121.11#{i}"
      node.vm.provider "virtualbox" do |v|
        v.memory = 1536
        v.cpus = 2
      end
    end
  end
end

And the example box is built with Packer and bootstrapped with this script:

NOMAD_VERSION="1.0.1"
CONSUL_VERSION="1.9.1"

sleep 30

# Wait for DigitalOcean to finish its setup
while ps aux | grep -q '\bapt-get\b'
do
    sleep 1
done

sleep 1

sudo apt-get update
sudo apt-get install -y unzip

# Docker (for driver)
sudo apt-get install -y apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu focal stable"
sudo apt-get update
sudo apt-get install -y docker-ce

# Java (for driver)
sudo apt-get install -y openjdk-11-jdk

# Consul
curl -L "https://releases.hashicorp.com/consul/${CONSUL_VERSION}/consul_${CONSUL_VERSION}_linux_amd64.zip" \
    -o consul.zip
unzip consul.zip
sudo chown root:root consul
sudo mv consul /usr/local/bin/
rm -f consul.zip
sudo useradd --system --home /etc/consul.d --shell /bin/false consul
sudo mkdir --parents /etc/consul.d
sudo chown --recursive consul:consul /etc/consul.d/
sudo mkdir --parents /opt/consul
sudo chown --recursive consul:consul /opt/consul/

# Nomad
curl -L "https://releases.hashicorp.com/nomad/${NOMAD_VERSION}/nomad_${NOMAD_VERSION}_linux_amd64.zip"\
    -o nomad.zip
unzip nomad.zip
sudo chown root:root nomad
sudo mv nomad /usr/local/bin/
rm -f nomad.zip
sudo useradd --system --home /etc/nomad.d --shell /bin/false nomad
sudo mkdir --parents /etc/nomad.d
sudo chown --recursive nomad:nomad /etc/nomad.d
sudo mkdir --parents /opt/nomad
sudo chown --recursive nomad:nomad /opt/nomad

# Envoy
curl -L https://getenvoy.io/cli | sudo bash -s -- -b /usr/local/bin
getenvoy run standard:1.16.2 -- --version
sudo cp ~/.getenvoy/builds/standard/1.16.2/linux_glibc/bin/envoy /usr/local/bin/

# CNI plugins so Nomad can configure the network namespace for
# Consul Connect sidecar proxy.
curl -L -o cni-plugins.tgz \
    https://github.com/containernetworking/plugins/releases/download/v0.8.6/cni-plugins-linux-amd64-v0.8.6.tgz
sudo mkdir -p /opt/cni/bin
sudo tar -C /opt/cni/bin -xzf cni-plugins.tgz
printf 'net.bridge.bridge-nf-call-arptables = 1\nnet.bridge.bridge-nf-call-ip6tables = 1\nnet.bridge.bridge-nf-call-iptables = 1\n' | sudo tee /etc/sysctl.d/10-bridge-nf-call.conf

Hi @eihli,

At first glance I don’t see anything wrong with your jobfile.

You mentioned that consul agent -dev works, so I am wondering if you have the grpc port set and connect enabled in your Consul config.

I think those are the two main differences for Connect between dev and non-dev mode.
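
In a non-dev agent, that roughly means having something like this in the Consul agent config (just a sketch, using the conventional gRPC port):

# Sketch of the Connect-related additions to a non-dev Consul agent config.
connect {
  enabled = true
}

ports {
  grpc = 8502
}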

connect is enabled and the grpc port is set. The topology shows a healthy connection between the two services.

I’ll paste my configs below. (They’re templated with Ansible, so some of it isn’t raw HCL, but I didn’t want to hand-edit anything and introduce typos that would add to the confusion.)

This is my consul server config:

#-*-HCL-*-
datacenter = "{{ datacenter }}"
data_dir = "/opt/consul"
server = true
bootstrap_expect = {{ bootstrap_expect }}
node_name = "{{ ansible_hostname }}"

ui_config {
  enabled = true
}
ports {
  grpc = 8502
}

connect {
  enabled = true
}

retry_join = [
  {% for host in groups['servers'] %}
  "{{hostvars[host]['ansible_eth1']['ipv4']['address']}}",
  {% endfor %}
]

bind_addr = {{ '"{{ GetInterfaceIP \\\"eth1\\\" }}"' }}

And here are my Nomad configs, annotated by common/client/server section:

## common to both server and client
log_level = "DEBUG"
datacenter = "{{ datacenter }}"
data_dir = "/opt/nomad"

{% raw %}
advertise {
  http = "{{ GetInterfaceIP \"eth1\" }}"
  rpc = "{{ GetInterfaceIP \"eth1\" }}"
  serf = "{{ GetInterfaceIP \"eth1\" }}"
}
{% endraw %}


consul {
  address = "127.0.0.1:8500"
}
## end common

## client only
client {
  enabled = true

  server_join {
    retry_join = [
      {% for host in groups['servers'] %}
      "{{ host }}",
      {% endfor %}
    ]
    retry_max = 3
    retry_interval = "15s"
  }
}
## end client

## server only
server {
  enabled = true
  bootstrap_expect = {{ bootstrap_expect }}

  server_join {
    retry_join = [
      {% for host in groups['servers'] %}
      "{{hostvars[host]['ansible_eth1']['ipv4']['address']}}",
      {% endfor %}
    ]
    retry_max = 3
    retry_interval = "15s"
  }
}
## end server

Ah ok, thanks for the extra info.

Are the Nomad clients running as root? I see in your bootstrap script that you create a nomad user. If they’re not running as root, could you try that?

This is my nomad-client.service file. The nomad clients are running as root.

[Unit]
Description=Nomad
Documentation=https://nomadproject.io/docs/
Wants=network-online.target
After=network-online.target

[Service]
User=root
Group=root
ExecReload=/bin/kill -HUP $MAINPID
ExecStart=/usr/local/bin/nomad agent -client -config /etc/nomad.d/common.hcl -config /etc/nomad.d/client.hcl -config /etc/nomad.d/stateful-client.hcl
KillMode=process
KillSignal=SIGINT
LimitNOFILE=infinity
LimitNPROC=infinity
Restart=on-failure
RestartSec=2
StartLimitBurst=3
StartLimitIntervalSec=10
TasksMax=infinity

[Install]
WantedBy=multi-user.target

Where do the “Host Address” entries come from in the screenshot below? These are the Envoy proxy sidecar tasks that are automatically created by my jobs. I just noticed they are on a private IP, but I can’t find where to change the config. They are defaulting to eth0, but I’d like to use go-sockaddr templating to make them eth1. I think this could be the issue, yeah?

Those addresses are defined in the client configuration by the network_interface parameter. If the proxies are not able to communicate over that interface, it could be the problem.

Try setting it to the eth1 interface as you mentioned. You won’t need go-sockaddr templating, since it takes the network interface name directly.

client {
  ...
  network_interface = "eth1"
  ...
}

Give it a try and let me know if it still doesn’t work.

Thanks @lgfa29, adding that network_interface line did get the sidecars on the proper interface. But it looks like my proxies still can’t find any healthy hosts. I’m still getting this in the debug logs of the sidecar task:

[...tcp_proxy.cc:389] [C466] Creating connection to cluster exec-upstream-service.default.dc1.internal.a9b3cd58-98fb-d24c-c75d-14672dc84100.consul
[...cluster_manager_impl.cc:1417] no healthy host for TCP connection pool

I put all of my code in this GitHub repo so that it’s reproducible and every line of code is browsable.

My Consul service health checks all look good.

But I still can’t talk to the upstream service over that proxy. Sometimes I get Connection reset by peer, other times Empty reply from server; it seems to change randomly. The Nomad logs just keep spamming that same error.

vagrant@server-0:~$ nomad exec -task exec-downstream-service cdc03950 /bin/bash
nobody@client-1:/$ curl localhost:9191
curl: (56) Recv failure: Connection reset by peer
nobody@client-1:/$ curl localhost:9191
curl: (52) Empty reply from server
nobody@client-1:/$ curl localhost:8181
Hello, world.

Thanks for the repo, that was really helpful :smiley:

Sorry I missed this in your original message, but I think the problem is that you have both tasks in the same group.

In Nomad, a group defines a network namespace, so in this scenario you don’t need Consul Connect at all: exec-upstream-service and exec-downstream-service can already talk to each other over their shared localhost (which seems to happen sometimes, since you get the expected Hello, world. response sporadically).

What I think is happening is that your proxies are clashing with each other and with the tasks. Try running each task in a separate group and see if that fixes the problem.
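
For reference, here’s a rough sketch of the same job split into two groups, reusing your task configs as-is (untested, so treat it as a starting point):

# Sketch: original tasks split into one group per service (untested).
job "exec-services" {
  datacenters = ["dc1"]
  type = "service"

  group "upstream" {
    network {
      mode = "bridge"
      port "upstream" { to = 8181 }
    }

    service {
      name = "exec-upstream-service"
      port = "upstream"
      connect {
        sidecar_service {}
      }
    }

    task "exec-upstream-service" {
      driver = "exec"
      config {
        command = "/bin/sh"
        args = [
          "-c",
          "while true; do printf 'HTTP/1.1 200 OK\nContent-Type: text/plain; charset=UTF-8\nServer: netcat\n\nHello, world.\n' | nc -w 10 -p 8181 -l; sleep 1; done"
        ]
      }
    }
  }

  group "downstream" {
    # Bridge networking is still needed here for the sidecar proxy.
    network {
      mode = "bridge"
    }

    service {
      name = "exec-downstream-service"
      connect {
        sidecar_service {
          proxy {
            upstreams {
              destination_name = "exec-upstream-service"
              local_bind_port = 9191
            }
          }
        }
      }
    }

    task "exec-downstream-service" {
      driver = "exec"
      config {
        command = "/bin/sh"
        args = [
          "-c",
          "echo \"starting\"; while true; do nc -w 1 localhost 9191; sleep 1; done"
        ]
      }
    }
  }
}

That way each group gets its own network namespace and its own Envoy sidecar, and localhost:9191 inside the downstream group is only ever the proxy’s local bind port.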

Bingo! Thank you! That was it.

I created that test job with both tasks in a single group as a way to simplify things and isolate the cause of the problem. The original cause was network_interface defaulting to eth0 when it needed to be eth1. But in the process of simplifying, I moved the tasks into the same group, which introduced a different problem with the same symptom. That was the confusing part: fix one thing, break another, get the same behavior.


Nice! I’m glad it’s working now :grinning: