Workload identities for Consul tokens

I’ve migrated a test cluster to use workload identities (for both Vault and Consul), and things seem to be working pretty well.
But when trying to replicate the same setup on a second cluster, it fails very early for Consul.

Whenever an allocation starts, I get the same error message:

failed to setup alloc: pre-run hook "consul" failed: 1 error occurred: * failed to derive Consul token for task iscsi-controller: Unexpected response code: 403 (rpc error making call: rpc error making call: Permission denied)

I can’t find any difference from my working cluster. As soon as I remove the service_identity and task_identity blocks from my Nomad servers’ config, everything works again.

I used the nomad setup consul command to set up JWKS auth on Consul, pointing it at the right location with a working CA cert. Does anyone know how I can debug this?
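
For reference, this is essentially what I ran against the new cluster (the management token is a placeholder; as far as I know the command honours the usual CONSUL_HTTP_ADDR/CONSUL_HTTP_TOKEN environment variables):

export CONSUL_HTTP_TOKEN=<consul management token>
nomad setup consul -y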

In my logs (Nomad and Consul agents), I have a bit more info:

oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:     2024-10-01T16:47:58.126+0200 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=24267f9f-b3fc-d1b8-1193-0f1cf6624279 task=iscsi-node type=Received msg="Task received by client" fail>
oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:     2024-10-01T16:47:58.127+0200 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=a3e01227-f579-22a9-0714-f2b03a03d500 task=nfs-node type=Received msg="Task received by client" failed>
oct. 01 16:47:58 ct-poc-w-1 consul[1677]: 2024-10-01T16:47:58.143+0200 [ERROR] agent.client: RPC failed to server: method=ACL.Login server=10.117.7.17:8300 error="rpc error making call: Permission denied"
oct. 01 16:47:58 ct-poc-w-1 consul[1677]: 2024-10-01T16:47:58.144+0200 [ERROR] agent.http: Request error: method=POST url=/v1/acl/login from=127.0.0.1:38512 error="rpc error making call: Permission denied"
oct. 01 16:47:58 ct-poc-w-1 consul[1677]: 2024-10-01T16:47:58.144+0200 [ERROR] agent.client: RPC failed to server: method=ACL.Login server=10.117.7.17:8300 error="rpc error making call: Permission denied"
oct. 01 16:47:58 ct-poc-w-1 consul[1677]: 2024-10-01T16:47:58.144+0200 [ERROR] agent.http: Request error: method=POST url=/v1/acl/login from=127.0.0.1:38518 error="rpc error making call: Permission denied"
oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:     2024-10-01T16:47:58.153+0200 [ERROR] client.alloc_runner: prerun failed: alloc_id=a3e01227-f579-22a9-0714-f2b03a03d500
oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:   error=
oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:   | pre-run hook "consul" failed: 1 error occurred:
oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:   | \t* failed to derive Consul token for task nfs-node: Unexpected response code: 403 (rpc error making call: Permission denied)
oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:   |
oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:   
oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:     2024-10-01T16:47:58.153+0200 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=a3e01227-f579-22a9-0714-f2b03a03d500 task=nfs-node type="Setup Failure"
oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:   msg=
oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:   | failed to setup alloc: pre-run hook "consul" failed: 1 error occurred:
oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:   | \t* failed to derive Consul token for task nfs-node: Unexpected response code: 403 (rpc error making call: Permission denied)
oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:   |
oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:    failed=true
oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:     2024-10-01T16:47:58.153+0200 [ERROR] client.alloc_runner: prerun failed: alloc_id=24267f9f-b3fc-d1b8-1193-0f1cf6624279
oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:   error=
oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:   | pre-run hook "consul" failed: 1 error occurred:
oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:   | \t* failed to derive Consul token for task iscsi-node: Unexpected response code: 403 (rpc error making call: Permission denied)
oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:   |
oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:   
oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:     2024-10-01T16:47:58.153+0200 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=24267f9f-b3fc-d1b8-1193-0f1cf6624279 task=iscsi-node type="Setup Failure"
oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:   msg=
oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:   | failed to setup alloc: pre-run hook "consul" failed: 1 error occurred:
oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:   | \t* failed to derive Consul token for task iscsi-node: Unexpected response code: 403 (rpc error making call: Permission denied)
oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:   |
oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:    failed=true

But I fail to understand why the login phase replies with a permission denied.
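
If it helps, the login call Nomad makes can be replayed by hand against the local agent. This is just a sketch, assuming $JWT holds a signed workload identity JWT (obtainable, for instance, by adding an identity block with file = true to a test job and reading the token from the task secrets directory):

# Same request the Consul agent receives from Nomad on /v1/acl/login
curl -s -X POST http://127.0.0.1:8500/v1/acl/login \
  -d '{"AuthMethod": "nomad-workloads", "BearerToken": "'"$JWT"'"}'

That might show whether the failure happens during JWT validation or only on the forwarded RPC.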

Hi @dbd,

What version of Nomad are you running? If it is < 1.8.2, I recommend upgrading to >= 1.8.2, as some significant improvements around ACL management have been introduced in that version.

ref: Consul: add preflight checks for Envoy bootstrap by tgross · Pull Request #23381 · hashicorp/nomad · GitHub

I’m running Nomad 1.8.4 and Consul 1.19.2

I’ve captured traffic on localhost:8500 during a failed allocation start (the Nomad agent and Consul agent are on the same host, and Nomad talks to its local Consul agent unencrypted), but I struggle to understand the whole workflow.

It does indeed look like the issue described here, but I get this for any workload (even without a connect block, so no Envoy to bootstrap).

Can you try deleting and re-creating the Consul auth method created for the workload identity, to see if the issue persists?
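
For example (deleting an auth method also deletes the binding rules linked to it, and nomad setup consul should recreate the policy, role, and binding rules along with the method):

consul acl auth-method delete -name nomad-workloads
nomad setup consul -y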

Already tried a few times :wink:

I would recommend collecting TRACE-level logs from the Nomad agent and Consul agent on the worker node where the alloc is running, and from the Consul leader, during the failure to see if they give any hint as to what is wrong.
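
Both can stream logs at that level without a restart, for example:

nomad monitor -log-level=TRACE    # on the worker node
consul monitor -log-level=trace   # on the worker node, and again on the Consul leader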

This documentation has some details on the workflow: Consul ACL | Nomad | HashiCorp Developer

I’ll try to capture clean traces of what’s happening. I think I followed the doc, as I got it working on another test cluster (whose main difference is that it runs a single Consul server, while the one I’m having problems with is a 3-node cluster).

So, here are some logs captured while starting a job (in this example, a simple job running Squid, a Squid Prometheus exporter, and an nginx proxy to expose metrics, but that’s irrelevant as it’s the same for any job).

  • These are the logs of the Consul leader, with TRACE log_level:
    consul_leader.txt (11.8 KB)
  • These are the logs of the Nomad and Consul agents on the node trying to run the job:
    nomad_consul_agent.txt (46.4 KB)

The current Consul leader is ct-poc-s-1 (10.117.7.16), and there are two other Consul servers (followers): ct-poc-s-2 and ct-poc-s-3 (10.117.7.17 and 10.117.7.18 respectively).

I fail to see anything obviously wrong in the logs (well, except the 403 of course)

The JWT auth method was created using the nomad setup consul command and is configured like this:

Name:          nomad-workloads
Type:          jwt
DisplayName:   nomad-workloads
Description:   Login method for Nomad workloads using workload identities
TokenLocality: local
Config:
{
  "BoundAudiences": [
    "consul.io"
  ],
  "ClaimMappings": {
    "consul_namespace": "consul_namespace",
    "nomad_job_id": "nomad_job_id",
    "nomad_namespace": "nomad_namespace",
    "nomad_service": "nomad_service",
    "nomad_task": "nomad_task"
  },
  "JWKSCACert": "-----BEGIN CERTIFICATE-----\nMIIFXzCCA0egAwIBAgIUdpcKkZghWOTIjCXNIEE/NvucZFwwDQYJKoZIhvcNAQEL\nBQAwNzEQMA4GA1UEChMHRWh0cmFjZTERMA8GA1UECxMIQ1QgU3RhY2sxEDAOBgNV\nBAMTB1Jvb3QgQ0EwHhcNMjIwOTA3MTYwMDA2WhcNNDIwOTAyMTYwMDMzWjA3MRAw\nDgYDVQQKEwdFaHRyYWNlMREwDwYDVQQLEwhDVCBTdGFjazEQMA4GA1UEAxMHUm9v\ndCBDQTCCAiIwDQYJKoZIhvcNAQEBBQADggIPADCCAgoCggIBAPM1FMILPUt+cLaQ\n5/+dwNdEwRWEW+E5XksMyp7jkGwM1KecQCFrEyLX32xnRrYvciP64H+cp8ed8oqg\nxWxnEu4OKWaJYwSykqhZFU/FHYNpCfTu+obk7D21CEWJVGKN+mxD05TvP7aWSeyY\nIHOTA32t6qGOjMs2HsF16s8Apn+dW6l8iM/yTalgwIDFkZmF3dGQmy7huNIc4rVA\nPiNiVCMx6F5N4D8Jvs6keLv8nZqgcc+m8m5nsR3p8P1f+MI9ywfb/opXHTpkGQD2\nbyb00RfhB+9eIvCmm3SYreC0p+YwjNaGYprVp7IRgeVP6HCIQE/9uBvAgfBru9nS\nUrw+UBxWOdO1LHEjKN8IBH9XKXS2cV9geJInUZQTxFVbqwyp1iPuLu48KFgFNaRV\nVDu97xHOHVImgdbADmg13ti4RJqZzoSSgySq6m3rYT01FyDV3/WsG4HMAPt9kFPu\nTyEbh5H0Iq/Ukus+OQa0I+pUeyMLRG20jGB2iIDxDRPpuonhsl5t5kEDj2l4Yn4k\nqiJaOunfpfOu662zB2LpdAH6SAB26OmPParFJwHsa21CHMUhxnBTp9N22xurlW9K\nquPuT69Sie7X6pZSeXyHbzXHIdLhCP63fXC1z/b6EnMT/5jPeHHSZyMJEyhAb0cJ\nnm9f5yXfyNrfSJIaOJzwgDBuh4yDAgMBAAGjYzBhMA4GA1UdDwEB/wQEAwIBBjAP\nBgNVHRMBAf8EBTADAQH/MB0GA1UdDgQWBBRrEG976XowRmiXWPxAGlllE8r7sDAf\nBgNVHSMEGDAWgBRrEG976XowRmiXWPxAGlllE8r7sDANBgkqhkiG9w0BAQsFAAOC\nAgEAvGGqiEIGzippTD8aP3kUaJLQYdD7kRpctLVu9q4BwDqh8FJdzGcaomWT0yxo\nuPTk/OkZSE+e+3AbqAh2wdzvjlxfD0+6FeCCWw2DeZI+l3d/xrBLfWd05bJ/w+A5\ny9J+HoxkepkXab0ABEBxQWwK4SHQrAN/wzF/3OG/HDN8hKa6Yeq1iRxdSJQIdRU3\nYpzI+aq9lxb7/fsVf94K0MIb60LIy30CW8LlM9nzqguGvpgRtHJYjP2xnbfDXVCR\n+gN/Y5M/NqTUwUhs7aykzCxuP0bQQtF0tFwheMizoQo5ujMPhch88rcyaZRgjG1y\nxPfICkbwU5NiJQZ0n0r6s/fwXCS0VXdPzNoXYCDqm1tJrZ6WukNrzGwA9JlZbrN1\nnGdo2z5QR3rfavi8g5PnAV/PR9meGJWd3S1f9ppbMmZ7FG1ZsFFbcJ69Q1E3M3Uk\nnCO0TEyiivxD+Om9LZGwi1cPnWt4twR3HRQHsSOsPzUYgs4KSqF4c0m294si0fKB\nSpD4pC2Wcq5xQGewSKSNKdx1t2k5fP5EvMhKqH6u9lDZUwRgvizGDFpPnOf76uNp\nxHoQU70JbtVm+IkwMqf1R7xpuqrCc/jDXTtoj9eqpg5B8rfJvL5UWKvebdALRRMA\n7CFHuilx6aVZ/gqcpkR21QhlTR15NhTvxQwfCgOVt6R3MeQ=\n-----END CERTIFICATE-----\n",
  "JWKSURL": "https://nomad.service.ct-poc.ehtrace.com:4545/",
  "JWTSupportedAlgs": [
    "RS256"
  ]
}
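
To compare it with the working cluster, something like this could help (dump the method as JSON on both clusters and diff the two files):

consul acl auth-method read -name nomad-workloads -format json > broken.json
# same command on the working cluster, then:
diff broken.json working.json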

The policy (policy-nomad-tasks), the role (nomad-default-tasks), and the binding rules have been created successfully.
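
They can be inspected with the following (the names are those created by nomad setup consul):

consul acl policy read -name policy-nomad-tasks
consul acl role read -name nomad-default-tasks
consul acl binding-rule list -method nomad-workloads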

The JWKS endpoint is working, as I’m successfully using it for Vault workload identities.
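
It can be checked with something like this (Nomad publishes its signing keys under /.well-known/jwks.json; ca.pem being the CA from the JWKSCACert field above):

curl -s --cacert ca.pem https://nomad.service.ct-poc.ehtrace.com:4545/.well-known/jwks.json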

Only Consul workload identities are broken, as I can get everything working by simply commenting out the service_identity and task_identity blocks in the consul block of my Nomad server config.
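
For completeness, the blocks I’m commenting out look roughly like this (aud and ttl are the values the documentation suggests):

consul {
  service_identity {
    aud = ["consul.io"]
    ttl = "1h"
  }

  task_identity {
    aud = ["consul.io"]
    ttl = "1h"
  }
}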