Nomad v1.2.3 (a79efc8422082c4790046c3f5ad92c542592a54f)
Consul v1.11.2
I have a Nomad cluster that has been using Consul Connect to connect several different services together for months. Last night a Ceph volume got stuck, which forced the whole cluster offline. Since bringing it back up, anything that uses Consul Connect only works if both services are on the same client; otherwise it does not connect. I am using the standard counter demo job to test.
I am not 100% sure where specifically to look for logs, but Nomad does show
running [/usr/sbin/iptables -t nat -X CNI-0dd845c33aab4590db0e3831 --wait]: exit status 1: iptables: Too many links.
and Consul shows
agent: grpc: addrConn.createTransport failed to connect to {dc1-10.0.31.106:8300 0 s-e45f015ee1ae <nil>}.
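My guess on the iptables error above is that a CNI chain left over from before the outage still has a rule jumping to it, which is why the delete fails with "Too many links". This is only a sketch of how I'd look for and clean up the stale chain by hand on a client host (chain name copied from the error message; whatever shows up on your hosts will differ):
# show the leftover chain plus any rules that still reference it
sudo iptables -t nat -S | grep CNI-0dd845c33aab4590db0e3831
# once nothing jumps to it any more, flush and delete it
sudo iptables -t nat -F CNI-0dd845c33aab4590db0e3831
sudo iptables -t nat -X CNI-0dd845c33aab4590db0e3831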
Before I noticed this I had reset a bunch of Consul ACL tokens, but I don't remember setting any specifically for Consul Connect.
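My working assumption (which could be wrong) is that Nomad talks to Consul with whatever token is set in the consul block of the Nomad agent config, so if that was one of the tokens I reset, Connect registrations would break. Something like this should confirm whether the Consul servers can see each other (given the 8300 error above) and whether the token Nomad uses survived the ACL reset; the token placeholder is just whatever is in the agent config:
consul members
consul operator raft list-peers
# does the token configured for the Nomad agents still resolve?
consul acl token read -self -token=<token from the Nomad agent's consul block>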
Any advice would be greatly appreciated. Thanks
If I follow this and just run the services in screen sessions on the same nodes, it does work, though.
I am fairly sure this is down to an issue with my Nomad cluster and not Consul, the job file, or the actual machines. Well, I'm less sure about the machines, but since the same type of application works when run directly on the machines but not in Nomad, I think that is accurate.
Normally I would just blow away the cluster and rebuild, but with the Ceph volumes I am afraid of how much work it would take to recreate and reattach them. Is there an easy way to do that, or any further thoughts on the Consul Connect issue being localized to Nomad?
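If I do end up rebuilding, my understanding (please correct me if this is wrong) is that the data itself lives in Ceph and the cluster only holds the volume registrations, so on a fresh cluster I should be able to re-register the existing volumes rather than recreate them, assuming I still have the original volume spec files. Roughly:
# after the ceph-csi plugin jobs are running on the new cluster
nomad volume register counter-db-volume.hcl   # file name is just an example
# confirm the volumes show up and are schedulable
nomad volume status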
Any ideas, or am I stuck rebuilding? Would this make more sense in the Consul category even though it seems to be Nomad-specific?
OK, so I now know this is an issue with the client hosts and the Docker/Nomad bridge network, but I'm not sure where to go from here. The tasks get an address like 172.26.65.13/20 and can ping 172.26.64.1 (the /20 gateway on the host), but they can't reach anything past that.
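Since the containers can reach the bridge gateway but nothing beyond it, my guess is the host is no longer forwarding or NATing the bridge subnet, which would also line up with the broken CNI chains from the earlier error. These are the checks I'm running on the client hosts (just a sketch, subnet taken from the addresses above):
# is packet forwarding enabled on the host?
sysctl net.ipv4.ip_forward
# are the per-container CNI masquerade rules still present for the bridge addresses?
sudo iptables -t nat -S POSTROUTING
# is the FORWARD chain policy or a firewall rule dropping the bridged traffic?
sudo iptables -S FORWARD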