I’m having a hard time figuring out what’s wrong with this minimal example of running Nomad and Consul Connect.
I’m following along with the Consul Connect | Nomad guide from HashiCorp, but with slight modifications (netcat instead of socat, and Nomad running in a 3-client/3-server Vagrant cluster instead of dev mode).
My downstream service cannot talk to my upstream service. Looking at the logs of the connect task, I see a lot of no healthy host for TCP connection pool:
[2021-02-02 04:14:06.645][14][debug][filter] [source/common/tcp_proxy/tcp_proxy.cc:389] [C134] Creating connection to cluster exec-upstream-service.default.dc1.internal.1be84599-6568-253c-820b-7b161b4193f3.consul
[2021-02-02 04:14:06.645][14][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1417] no healthy host for TCP connection pool
When I run that with consul agent -dev and nomad agent -dev-connect, it works just like the docs explain.
When I try to run it in my Vagrant Consul/Nomad cluster, the dashboard can’t connect to the API, and when I look at the Connect sidecar task logs I see the same “No healthy host for TCP connection pool” error message.
I checked intentions and there is an allow intention from the dashboard to the api:
vagrant@server-0:~$ consul intention match count-api
count-dashboard => count-api (allow)
ACLs are disabled:
vagrant@server-0:~$ consul acl policy list
Failed to retrieve the policy list: Unexpected response code: 401 (ACL support disabled)
Connect is enabled and the gRPC port is set. The Consul topology view shows a healthy connection between the two services.
I’ll paste my configs below. (They are templated with Ansible, so some of it isn’t raw HCL, but I didn’t want to edit them by hand and introduce typos that would add to the confusion.)
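For reference, the Connect-relevant pieces of the rendered Consul config boil down to something like this (the values shown are the conventional defaults, not copied verbatim from my templates):

connect {
  enabled = true   # set on the Consul servers
}

ports {
  grpc = 8502      # Envoy's xDS API; needed on the agents running the sidecars
}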
Where do the “Host Address” entries come from in the screenshot below? These are the Envoy proxy sidecar tasks that are automatically created by my jobs. I just noticed they are on a private IP, but I can’t find where to change the config. They are defaulting to eth0, and I’d like to use go-sockaddr templating to move them to eth1. I think this could be the issue, yeah?
Those addresses are defined in the client configuration by the network_interface parameter. If the proxies are not able to communicate over that interface, that could be the problem.
Try setting it to the eth1 interface as you mentioned. You won’t need go-sockaddr templating, as it takes the network interface name directly.
client {
...
network_interface = "eth1"
...
}
Give it a try and let me know if it still doesn’t work.
Thanks @lgfa29, adding that network_interface line did get the sidecars on the proper interface. But it looks like my proxies still can’t find any healthy hosts. I’m still getting this in the debug logs of the sidecar task.
[...tcp_proxy.cc:389] [C466] Creating connection to cluster exec-upstream-service.default.dc1.internal.a9b3cd58-98fb-d24c-c75d-14672dc84100.consul
[...cluster_manager_impl.cc:1417] no healthy host for TCP connection pool
I put all of my code in this GitHub repo so that it’s reproducible and every line of code is browsable.
But I still can’t talk over that upstream proxy. Sometimes I get Connection reset by peer. Other times I get Empty reply from server. It seems to change randomly. The sidecar logs just keep spamming that same “no healthy host” message.
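For reference, the downstream side declares the upstream roughly like this (the local bind port here is just a placeholder; the real value is in the repo), and the downstream task is supposed to reach the upstream at 127.0.0.1 on that port:

proxy {
  upstreams {
    destination_name = "exec-upstream-service"
    # the downstream task dials 127.0.0.1:<local_bind_port>;
    # the sidecar forwards that over mTLS to the upstream's sidecar
    local_bind_port = 8080
  }
}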
Sorry I missed this in your original message, but I think the problem is that you have both tasks in the same group.
In Nomad, a group defines a network namespace, so in this scenario you don’t need Consul Connect at all: exec-upstream-service and exec-downstream-service can already reach each other over their shared localhost (which is probably why you sporadically get the expected Hello, world. response).
What I think is happening is that your proxies are clashing with each other and with the tasks. Try running each task in a separate group and see if that fixes the problem.
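As a rough sketch (the job name and ports are placeholders, adjust them to match your repo), the split would look something like this:

job "exec-services" {
  datacenters = ["dc1"]

  group "upstream" {
    network {
      mode = "bridge"
    }

    service {
      name = "exec-upstream-service"
      port = "8181"

      connect {
        sidecar_service {}
      }
    }

    task "upstream" {
      # ... your netcat server task ...
    }
  }

  group "downstream" {
    network {
      mode = "bridge"
    }

    service {
      name = "exec-downstream-service"
      port = "9002"

      connect {
        sidecar_service {
          proxy {
            upstreams {
              destination_name = "exec-upstream-service"
              local_bind_port  = 8080
            }
          }
        }
      }
    }

    task "downstream" {
      # ... your netcat client task, pointed at 127.0.0.1:8080 ...
    }
  }
}

Each group gets its own network namespace and its own sidecar, so the proxies no longer clash, and the downstream only reaches the upstream through its local bind.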
I created that test job with both tasks in a single group as a way to simplify things and isolate the cause of the problem. The original cause was network_interface defaulting to eth0 when it needed to be eth1. But in the process of simplifying, I moved the tasks into the same group, which introduced a different problem that produced the same behavior. That was the confusing part: fix one thing, break another, get the same behavior.