Hi! Thanks for the reply!
I’m on Nomad 1.5.4 and Consul 1.15.2.
I’ve been digging through logs and looking for anything that might be relevant. The Service Mesh aspect that I was trying to get working seemed to be just a symptom of a larger problem, so I disabled the Service Mesh functionality entirely and am now just trying to get jobs to deploy reliably (and am using Nginx as a load balancer, so I don’t need the Service Mesh any more). Unfortunately, even that is still a challenge.
In answer to your specific question, I don’t see that line in the logs at all, but my current logs may not contain any of the Service Mesh logs as I frequently nuke and reallocate servers. I do see:
May 09 17:08:11 ip-172-31-51-67.ec2.internal nomad[639112]: 2023-05-09T17:08:11.775Z [ERROR] consul.sync: still unable to update services in Consul: failures=180 error="Unexpected response code: 400 (Invalid service address)"
May 09 17:08:11 ip-172-31-51-67.ec2.internal nomad[639112]: consul.sync: still unable to update services in Consul: failures=180 error="Unexpected response code: 400 (Invalid service address)"
That said…sometimes the services do get registered with Consul. Eventually. It’s very intermittent. I’ve also seen Consul having a hard time keeping all of its nodes alive; I’ll be looking at the dashboard and suddenly one of them will die, and then after a few seconds it will come back to life.
I found a few things and changed them since the above. Note that I’m effectively shooting in the dark here, so I don’t know which if any of these changes could help:
- The node_name for each was defaulting to something broken and I saw lots of warnings. So I set the domain names to something that Consul wouldn’t complain about.
- Similar for the name for Nomad.
- I kept seeing “Invalid service address” entries in the Consul log. Digging through Consul source I was able to determine that meant something was trying to connect to the address “0.0.0.0” (!!). So I went through Nomad and Consul settings and removed most of the “0.0.0.0” entries; I did it for both because it was unclear where the 0.0.0.0 was coming from.
- I tried adding provider = "consul" everywhere, though that didn’t even result in jobs redeploying, so I’m guessing that was the default.
- I ended up setting client_addr = "127.0.0.1" and changed the addresses for http and dns to include the local client public IP as well as “127.0.0.1”.
After all of the changes above I finally got rid of the “Invalid service address” errors. Note that those errors continued even after I set client_addr = "1.2.3.4 127.0.0.1" (where 1.2.3.4 is the local IP), so I’m not sure exactly how that would have turned into 0.0.0.0.
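For anyone following along, the Consul side of the changes above looks roughly like the sketch below. It is not a copy of the attached file: the node name is made up, 1.2.3.4 is the same stand-in for the local IP, and I still don’t know which of these settings actually made the difference.

```hcl
# consul.hcl (illustrative sketch only; values are placeholders)

# Explicit node name instead of the hostname-derived default,
# which was producing warnings. "consul-server-1" is hypothetical.
node_name = "consul-server-1"

# Bind the client interfaces to loopback rather than 0.0.0.0.
client_addr = "127.0.0.1"

# Per-interface overrides so HTTP and DNS also listen on the node's
# own IP. 1.2.3.4 stands in for the local IP.
addresses {
  http = "127.0.0.1 1.2.3.4"
  dns  = "127.0.0.1 1.2.3.4"
}
```

The matching Nomad change was just giving the agent an explicit name and removing the 0.0.0.0 entries there as well.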
Now I’m seeing this new line a lot:
May 09 17:49:58 ip-172-31-51-67.ec2.internal consul[654093]: 2023-05-09T17:49:58.023Z [WARN] agent: Check is now critical: check=_nomad-check-feb9f619cc947669502e87c062f0f260fab693fb
Every ten seconds, it looks like. This could be leftover cruft from an earlier…something…that hasn’t yet given up the ghost. Your message came in as I was actively debugging the config.
Note that the Nomad client wasn’t showing up at all in Consul before now. I was seeing Nomad servers but not clients.
That said, I literally just now was able to deploy my whole application, so something I did above must have fixed whatever was causing the problem. Consul doesn’t seem to be losing servers periodically either. So … I may have fixed it?
There are almost certainly some bugs (or missing features that sanity-check the config?) under the covers here. I really wasn’t doing anything complex, and it just didn’t work.
For posterity, here are the example config files from one of the servers (they’re generated per-server from a script):
consul.hcl (1.4 KB)
nomad.hcl (1.2 KB)
And the job that was having the most trouble deploying:
job.hcl (2.6 KB)
So now…it works? I guess?
Thanks for reaching out anyway. Hope the underlying bugs get fixed so that future users don’t end up fighting with the same frustrating issues.
Let me know if I should be worried about that “Check is now critical” warning. I’ll probably nuke the servers and re-deploy fresh to see if it goes away.