I’ve migrated a test cluster to use workload identities (for both Vault and Consul), and things seem to be working pretty well.
But when trying to replicate the same setup on a second cluster, it fails very early for Consul.
When starting an allocation, I get the same error message:
```
failed to setup alloc: pre-run hook "consul" failed: 1 error occurred: * failed to derive Consul token for task iscsi-controller: Unexpected response code: 403 (rpc error making call: rpc error making call: Permission denied)
```
I can’t find any difference from my working cluster. As soon as I remove the service_identity and task_identity sections from my Nomad server’s config, everything works again.
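For reference, here’s what those sections look like in the consul block of my Nomad server config (a minimal sketch; the aud and ttl values shown are just the documented defaults):

```hcl
consul {
  # Identities Nomad signs for services it registers in Consul
  service_identity {
    aud = ["consul.io"]
    ttl = "1h"
  }

  # Identities Nomad signs for tasks, exchanged for Consul ACL tokens
  task_identity {
    aud = ["consul.io"]
    ttl = "1h"
  }
}
```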
I used the nomad setup consul command to set up JWKS auth on Consul, pointing it at the right location, with a working CA cert. Does anyone know how I can debug this?
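If I remember the flags right, the invocation was along these lines (the URL and CA path are placeholders for my actual values):

```shell
# Creates the JWT auth method, policy, role and binding rules on Consul.
# -jwks-ca-file is needed because my Nomad API uses a private CA.
nomad setup consul \
  -jwks-url https://nomad.example.com:4646/.well-known/jwks.json \
  -jwks-ca-file /etc/nomad.d/ca.pem
```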
In my logs (Nomad and Consul agents), I have a bit more info.
What version of Nomad are you running? If it is < 1.8.2, I recommend upgrading to >= 1.8.2, as some significant improvements around ACL management have been introduced in that version.
I’ve captured traffic on localhost:8500 during a failed allocation start (the Nomad agent and Consul agent are on the same host, and Nomad talks to its local Consul agent unencrypted), but I struggle to understand the whole workflow.
I would recommend collecting the TRACE-level logs from the Nomad Agent, Consul Agent on the worker node where the alloc is running, and Consul Leader during the failure to see if they give any hint as to what is wrong.
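For context on what that capture should contain: during alloc setup, the Nomad client exchanges each workload identity JWT for a Consul ACL token through Consul’s login endpoint, so the 403 is most likely Consul rejecting that login (auth method config, audience, or binding rule not matching). Approximated with curl, the exchange looks roughly like this (the auth method name and JWT are illustrative):

```shell
# Present the workload identity JWT to Consul's ACL login endpoint,
# the same call the Nomad client makes when deriving a task token.
curl -s -X POST http://localhost:8500/v1/acl/login \
  -d '{"AuthMethod": "nomad-workloads", "BearerToken": "<workload identity JWT>"}'
# Success returns a newly created ACL token; a mismatch returns 403 Permission denied.
```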
I’ll try to capture clean traces of what’s happening. I think I followed the doc, as I got it working on another test cluster (the main difference being that it runs a single Consul server, while the one I’m having problems with is a 3-node cluster).
So, here are some logs from starting a job (in this example, a simple job running squid, a squid Prometheus exporter, and an nginx proxy to expose the metrics, but that’s irrelevant as it’s the same for any job).
These are the logs of the Consul leader, with TRACE log_level: consul_leader.txt (11.8 KB)
These are the logs of the Nomad and Consul agents on the node trying to run the job: nomad_consul_agent.txt (46.4 KB)
The current Consul leader is ct-poc-s-1 (10.117.7.16), and there are two other Consul servers (followers): ct-poc-s-2 and ct-poc-s-3 (10.117.7.17 and 10.117.7.18 respectively).
I fail to see anything obviously wrong in the logs (well, except the 403 of course)
The JWT auth method has been created using the nomad setup consul command, and is configured like this:
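(As far as I can tell this matches the defaults nomad setup consul generates; the JWKS URL and CA cert below are placeholders for my actual values.)

```json
{
  "Name": "nomad-workloads",
  "Type": "jwt",
  "Config": {
    "BoundAudiences": ["consul.io"],
    "ClaimMappings": {
      "nomad_namespace": "nomad_namespace",
      "nomad_job_id": "nomad_job_id",
      "nomad_service": "nomad_service",
      "nomad_task": "nomad_task"
    },
    "JWKSCACert": "<PEM of my internal CA>",
    "JWKSURL": "https://nomad.example.com:4646/.well-known/jwks.json",
    "JWTSupportedAlgs": ["RS256"]
  }
}
```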
The policy (policy-nomad-tasks), the role (nomad-default-tasks) and the binding rules have been created successfully.
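For reference, the binding rules that nomad setup consul creates should look roughly like this (reconstructed from the docs, IDs omitted): one that registers services under their own name, and one that attaches the nomad-default-tasks role to task tokens.

```json
[
  {
    "AuthMethod": "nomad-workloads",
    "BindType": "service",
    "BindName": "${value.nomad_service}",
    "Selector": "\"nomad_service\" in value"
  },
  {
    "AuthMethod": "nomad-workloads",
    "BindType": "role",
    "BindName": "nomad-default-tasks",
    "Selector": "\"nomad_service\" not in value"
  }
]
```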
The JWT endpoint itself is working, as I’m already using it for Vault workload identities with success.
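(Quick sanity check, with a placeholder address standing in for my Nomad servers:)

```shell
# The endpoint the auth method's JWKSURL points at; it serves the
# public keys Nomad uses to sign workload identity JWTs.
curl -s https://nomad.example.com:4646/.well-known/jwks.json
```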
Only the Consul workload identities are broken, since I can get everything working again simply by commenting out the service_identity and task_identity sections in my Nomad server config.