Intermittent job launch failures with the Consul/Nomad service mesh

I’m trying out the Consul/Nomad service mesh on a three-node development cluster (i.e., nearly zero load) where Nomad intermittently fails to register a task with Consul, or at least the registration is so delayed that the task launch fails.

The symptom is that I see an error message in the event stream that says:

envoy_bootstrap: error creating bootstrap configuration for Connect proxy sidecar: exit status 1; see: https://www.nomadproject.io/s/envoy-bootstrap-error

Further digging shows that the underlying error is that Consul doesn’t have the relevant service registered. The envoy_bootstrap log has a single line:

failed fetch proxy config from local agent: Unexpected response code: 404 (unknown service ID: _nomad-task-67ed5ab3-2308-23d3-e1e5-8058446e3839-group-api-count-api-9001-sidecar-proxy)

Digging into the server logs shows that, in fact, there’s never a “Synced service” log entry from Consul. A successful launch includes the following entry:

May 04 15:24:15 ip-172-31-57-132.ec2.internal consul[136576]: 2023-05-04T15:24:15.878Z [INFO] agent: Synced service: service=_nomad-task-5ecfec34-702d-d791-4766-ad1ba1ab378e-group-api-count-api-9001

Once a launch does succeed, all of the services communicate with each other without issue. Failures will sometimes resolve on their own, and sometimes fail the job completely.

I’m on the latest releases of both Consul and Nomad.

I’m sure I could increase the timeouts so that jobs eventually succeed, but that feels wrong. This looks very much like a bug somewhere in the Nomad/Consul stack. Why, on a system with effectively zero load, would registering a task ever be delayed at all?

Consul config:
consul.hcl (1.3 KB)

Nomad config:
nomad.hcl (1.1 KB)

Task config:
service-mesh.hcl (1.0 KB)
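
For reference, the Connect-relevant part of the job boils down to something like this (a minimal sketch modeled on the standard countdash demo, matching the group/service/port in the logs above, not my exact file):

    job "countdash" {
      datacenters = ["dc1"]

      group "api" {
        network {
          mode = "bridge"        # Connect sidecars require bridge networking
        }

        service {
          name = "count-api"
          port = "9001"

          connect {
            sidecar_service {}   # Nomad injects the Envoy sidecar for this service
          }
        }

        task "web" {
          driver = "docker"
          config {
            image = "hashicorpdev/counter-api:v3"
          }
        }
      }
    }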

Thoughts? My first impression is that Consul and Nomad are being flaky/buggy, which isn’t inspiring confidence.

Yikes! For sure not confidence-inspiring, Tim. What versions of Nomad and Consul are you using?

When the failure occurs, do you see this log line on the Nomad client agent?

[DEBUG] consul.sync: sync complete: registered_services=1 deregistered_services=0 registered_checks=0 deregistered_checks=0

That log line indicates when Nomad has successfully registered the service with Consul, and will at least help us narrow down whether this is an issue on the Nomad side or the Consul side.
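
Note that the line is only emitted at debug level, so if you don’t see it at all, first check that the client agent is actually logging at DEBUG; a minimal sketch of the relevant nomad.hcl setting:

    # nomad.hcl on the client agent: consul.sync messages are logged at DEBUG
    log_level = "DEBUG"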

The linked Nomad docs have some other useful debugging information, but if Consul is just not propagating the registration in a timely fashion, I doubt further Nomad-side debugging will help: Consul Service Mesh | Nomad | HashiCorp Developer

Hi! Thanks for the reply!

I’m on Nomad 1.5.4 and Consul 1.15.2.

I’ve been digging through logs and looking for anything that might be relevant. The Service Mesh behavior I was trying to get working turned out to be just a symptom of a larger problem, so I disabled the Service Mesh functionality entirely and am now just trying to get jobs to deploy reliably (I’m using Nginx as a load balancer, so I no longer need the Service Mesh). Unfortunately, even that is still a challenge.

In answer to your specific question, I don’t see that line in the logs at all, but my current logs may not contain any of the Service Mesh logs as I frequently nuke and reallocate servers. I do see:

May 09 17:08:11 ip-172-31-51-67.ec2.internal nomad[639112]:     2023-05-09T17:08:11.775Z [ERROR] consul.sync: still unable to update services in Consul: failures=180 error="Unexpected response code: 400 (Invalid service address)"

That said…sometimes the services do get registered with Consul. Eventually. It’s very intermittent. I’ve also seen Consul having a hard time keeping all of its nodes alive; I’ll be looking at the dashboard and suddenly one of them will die, and then after a few seconds it will come back to life.

I found a few things and changed them since the above. Note that I’m effectively shooting in the dark here, so I don’t know which, if any, of these changes actually helped:

  1. The node_name on each node was defaulting to something broken, and I saw lots of warnings. So I set the node names explicitly to something Consul wouldn’t complain about.
  2. Similarly for the Nomad agent’s name.
  3. I kept seeing “Invalid service address” entries in the Consul log. Digging through the Consul source, I determined it meant something was trying to register a service with the address “0.0.0.0” (!!). So I went through the Nomad and Consul settings and removed most of the “0.0.0.0” entries; I did it in both because it was unclear where the 0.0.0.0 was coming from.
  4. I tried adding provider = "consul" everywhere, though that didn’t even trigger a redeploy of the jobs, so I’m guessing it was already the default.
  5. I ended up setting client_addr = "127.0.0.1" and changed the addresses for http and dns to include the local client’s public IP as well as “127.0.0.1” (sketched below).
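
Roughly, the Consul side of those changes amounts to the following (a before/after reconstruction, not my exact files):

    # Before (roughly): node_name defaulted to the dotted EC2 hostname,
    # and everything was bound to the unspecified address
    client_addr = "0.0.0.0"

    # After: explicit name and addresses
    node_name   = "consul-node-1"          # any name without dots works
    client_addr = "127.0.0.1"
    addresses {
      http = "127.0.0.1 172.31.51.67"      # loopback plus this node's own IP
      dns  = "127.0.0.1 172.31.51.67"
    }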

After all of the changes above, I finally got rid of the “Invalid service address” errors. Note that those errors continued even after I set client_addr = "1.2.3.4 127.0.0.1" (where 1.2.3.4 is the local IP), so I’m not sure exactly how that setting would have turned into 0.0.0.0.

Now I’m seeing this new line a lot:

May 09 17:49:58 ip-172-31-51-67.ec2.internal consul[654093]: 2023-05-09T17:49:58.023Z [WARN]  agent: Check is now critical: check=_nomad-check-feb9f619cc947669502e87c062f0f260fab693fb

Every ten seconds, it looks like. This could be leftover cruft from an earlier… something… that hasn’t yet given up the ghost. Your message came in as I was actively debugging the config.

Note that the Nomad client wasn’t showing up at all in Consul before now. I was seeing Nomad servers but not clients.
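
(For context: Nomad agents register themselves in Consul via the consul block in nomad.hcl. A minimal sketch with the relevant defaults spelled out, assuming a local Consul agent:)

    # nomad.hcl: Nomad self-registration in Consul (these are the defaults)
    consul {
      address             = "127.0.0.1:8500"
      auto_advertise      = true            # advertise this agent as a Consul service
      server_service_name = "nomad"         # Nomad servers show up under this name
      client_service_name = "nomad-client"  # Nomad clients show up under this one
    }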

That said, I literally just now was able to deploy my whole application, so something I did above must have fixed whatever was causing the failures? Consul doesn’t seem to be losing servers periodically any more, either. So… I may have fixed it?

There are almost certainly some bugs (or missing config sanity checks?) under the covers here. I really wasn’t doing anything complex, and it just didn’t work.

For posterity, here are the example config files from one of the servers (they’re generated per-server from a script):

consul.hcl (1.4 KB)

nomad.hcl (1.2 KB)

And the job that was having the most trouble deploying:

job.hcl (2.6 KB)

So now…it works? I guess?

Thanks for reaching out anyway. Hope the underlying bugs get fixed so that future users don’t end up fighting with the same frustrating issues.

Let me know if I should be worried about that “Check is now critical” warning. I’ll probably nuke the servers and re-deploy fresh to see if it goes away.

It’s also worth noting that I’m testing this on EC2 instances that I’m spinning up in an auto-scaling group, based on Amazon Linux 2023.

This may or may not be relevant to the bug: the default node_name was broken because it was coming up as the AWS-default domain name, e.g., ip-172-31-51-108.ec2.internal, and Consul was complaining because the dots in the name are invalid, as far as I can tell.

Maybe when node_name is falling back to a default, it should convert dots to dashes? ip-172-31-51-108-ec2-internal could have worked as a node name, for instance. Not sure if that was the only problem, though, or why the problem was intermittent.
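
In the meantime, the manual workaround is just an explicit override in consul.hcl, something like:

    # consul.hcl: override the dotted EC2 default hostname
    node_name = "ip-172-31-51-108-ec2-internal"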

And yes, this could have been a Consul bug rather than a Nomad one; sorry if I guessed wrong and picked the wrong forum to post in. The default Nomad name was also coming up as an illegal name, though, so maybe the bug is in both products?

EDIT: One more follow-up: The above warning went away after nuking and re-deploying.