I’m having a bit of an issue with Nomad + Consul: everything appears to be working fine, but the Consul logs on my (single) client are filling up with the following errors:
I ran through the steps from Troubleshoot Consul ACL issues – HashiCorp Help Center and verified that the client’s token in /opt/consul/acl-tokens.json is valid: I set it as my own CONSUL_HTTP_TOKEN environment variable and I can run consul catalog services and get my services back, which shows the Consul server recognises it as a valid token.
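For reference, the check looked roughly like this (the "agent" key in the persisted token file is a guess on my part; inspect your own acl-tokens.json to see which keys the agent actually writes):

```sh
# Pull the persisted agent token; the "agent" key is an assumption
# about the file layout, check your acl-tokens.json for the real keys.
export CONSUL_HTTP_TOKEN=$(jq -r '.agent' /opt/consul/acl-tokens.json)

# If this returns the service list, the server accepts the token.
consul catalog services
```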
What’s interesting is that if I restart Consul via systemctl restart consul, the errors go away, until I run a Nomad job, at which point they reappear with a topic= corresponding to each service the Nomad job uses service discovery for. I’m using the JWT mechanism outlined in Consul ACL with Nomad Workload Identities | Nomad | HashiCorp Developer to authenticate my Nomad workloads against Consul.
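In case it matters, this is roughly what that setup looks like from the Consul side; the auth method name nomad-workloads is just the one the HashiCorp guide uses, yours may differ:

```sh
# The workload identity setup should have created a JWT auth method.
consul acl auth-method list

# Inspect the binding rules that map Nomad workload identities
# to Consul roles / service identities.
consul acl binding-rule list -method=nomad-workloads

# Tokens minted via the auth method show up here; short-lived
# workload tokens appearing and disappearing is expected.
consul acl token list
```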
At this point I’m pretty stuck. It doesn’t seem to be causing any actual problems, and all my services report healthy in Consul and can be discovered from other jobs, but I don’t really want to start relying on the cluster while it is spamming errors.
Does anyone know how I might identify the token that is supposedly not found and work out where the requests are coming from? My only clue as to what is sending them is the from=127.0.0.1:46172 in one of the log lines, but that port number doesn’t correspond to any running service.
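Here is roughly how I’ve been trying to trace it, in case someone can suggest a better approach (the port is just the one from my logs, and it’s ephemeral, so timing matters):

```sh
# While the error is repeating, look up the owner of the source port.
# Ephemeral ports get reused quickly, so run this immediately after
# a fresh log line appears.
sudo ss -tnp | grep 46172

# Stream the agent's logs at trace level for more context around
# each "ACL not found" line.
consul monitor -log-level=trace
```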
The error means the agent presented an ACL token that the server does not recognise.
Instead of setting an environment variable, you can configure the agent to present the token by specifying it in the agent configuration file.
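For example, something like this in the agent’s HCL config (the path and token values are placeholders):

```hcl
# /etc/consul.d/consul.hcl (path may differ on your system)
acl {
  enabled = true
  tokens {
    # Token the agent uses for its own internal operations
    # (node registration, anti-entropy, etc.).
    agent = "<agent token>"

    # Fallback token for requests that arrive without one.
    default = "<default token>"
  }
}
```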
Despite all that, I am still getting these errors in the logs whenever I deploy a service. It’s really confusing, as I have personally validated that <token> is a valid Consul token and can be used with the CLI, so ‘ACL not found’ makes no sense unless the agent is somehow using a different token.
Hey guys, I have the same issue. These ACL errors happen only when I run a Nomad job (Traefik with the Consul Catalog integration). My setup was working fine until it suddenly stopped. All my services are reported as healthy in Consul, but since these errors started, Traefik returns a bad gateway for those services from one specific client. The issue is also inconsistent: after many restarts and redeployments of the same job, the errors disappear and everything works for a while, then the cycle repeats. What is going on?
@rincler, do you have any updates? I know it’s been a long time.
I don’t know if it helps, but at least in our case I think the issue was related to the Traefik Consul Catalog integration being repeatedly and abruptly stopped when its job was redeployed. The Traefik Consul Catalog provider has a watch option that we had enabled:
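Something like this in Traefik’s static configuration (shown here as YAML; the exact file and format will depend on how you deploy Traefik):

```yaml
# traefik.yml (static configuration)
providers:
  consulCatalog:
    # Watch Consul events for catalog changes instead of polling;
    # this keeps long-lived connections open against the local agent.
    watch: true
```

With watch enabled, an abrupt stop tears those connections down without any cleanup, which lined up with when our errors appeared.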