Workload identities for consul tokens

dbd · October 1, 2024, 3:08pm

I’ve migrated a test cluster to use workload identities (for both vault and consul), and it seems things are working pretty well.
But when trying to replicate the same on a second cluster, it fails very early for consul.

When starting an allocation, I have the same error message

failed to setup alloc: pre-run hook "consul" failed: 1 error occurred: * failed to derive Consul token for task iscsi-controller: Unexpected response code: 403 (rpc error making call: rpc error making call: Permission denied)

I can’t find the difference with my working cluster. As soon as I remove the service_identity and task_identity section from my nomad server’s config, everything works again.

I used nomad setup consul command to setup jwks auth on Consul, pointing it at the right location, with a working CA cert. Anyone knows how can I debug this ?

In my logs (nomad and consul agent), I have a bit more info

oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:     2024-10-01T16:47:58.126+0200 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=24267f9f-b3fc-d1b8-1193-0f1cf6624279 task=iscsi-node type=Received msg="Task received by client" fail>
oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:     2024-10-01T16:47:58.127+0200 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=a3e01227-f579-22a9-0714-f2b03a03d500 task=nfs-node type=Received msg="Task received by client" failed>
oct. 01 16:47:58 ct-poc-w-1 consul[1677]: 2024-10-01T16:47:58.143+0200 [ERROR] agent.client: RPC failed to server: method=ACL.Login server=10.117.7.17:8300 error="rpc error making call: Permission denied"
oct. 01 16:47:58 ct-poc-w-1 consul[1677]: 2024-10-01T16:47:58.144+0200 [ERROR] agent.http: Request error: method=POST url=/v1/acl/login from=127.0.0.1:38512 error="rpc error making call: Permission denied"
oct. 01 16:47:58 ct-poc-w-1 consul[1677]: 2024-10-01T16:47:58.144+0200 [ERROR] agent.client: RPC failed to server: method=ACL.Login server=10.117.7.17:8300 error="rpc error making call: Permission denied"
oct. 01 16:47:58 ct-poc-w-1 consul[1677]: 2024-10-01T16:47:58.144+0200 [ERROR] agent.http: Request error: method=POST url=/v1/acl/login from=127.0.0.1:38518 error="rpc error making call: Permission denied"
oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:     2024-10-01T16:47:58.153+0200 [ERROR] client.alloc_runner: prerun failed: alloc_id=a3e01227-f579-22a9-0714-f2b03a03d500
oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:   error=
oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:   | pre-run hook "consul" failed: 1 error occurred:
oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:   | \t* failed to derive Consul token for task nfs-node: Unexpected response code: 403 (rpc error making call: Permission denied)
oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:   |
oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:   
oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:     2024-10-01T16:47:58.153+0200 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=a3e01227-f579-22a9-0714-f2b03a03d500 task=nfs-node type="Setup Failure"
oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:   msg=
oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:   | failed to setup alloc: pre-run hook "consul" failed: 1 error occurred:
oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:   | \t* failed to derive Consul token for task nfs-node: Unexpected response code: 403 (rpc error making call: Permission denied)
oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:   |
oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:    failed=true
oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:     2024-10-01T16:47:58.153+0200 [ERROR] client.alloc_runner: prerun failed: alloc_id=24267f9f-b3fc-d1b8-1193-0f1cf6624279
oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:   error=
oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:   | pre-run hook "consul" failed: 1 error occurred:
oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:   | \t* failed to derive Consul token for task iscsi-node: Unexpected response code: 403 (rpc error making call: Permission denied)
oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:   |
oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:   
oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:     2024-10-01T16:47:58.153+0200 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=24267f9f-b3fc-d1b8-1193-0f1cf6624279 task=iscsi-node type="Setup Failure"
oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:   msg=
oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:   | failed to setup alloc: pre-run hook "consul" failed: 1 error occurred:
oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:   | \t* failed to derive Consul token for task iscsi-node: Unexpected response code: 403 (rpc error making call: Permission denied)
oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:   |
oct. 01 16:47:58 ct-poc-w-1 nomad[1704]:    failed=true

But I fail to understand why the login phase reply with a permission denied

Ranjandas · October 2, 2024, 11:50am

Hi @dbd,

What version of Nomad are you running? If it is < 1.8.2, I recommend upgrading to >= 1.8.2, as some significant improvements around ACL management have been introduced in that version.

ref: Consul: add preflight checks for Envoy bootstrap by tgross · Pull Request #23381 · hashicorp/nomad · GitHub

dbd · October 2, 2024, 11:51am

I’m running Nomad 1.8.4 and Consul 1.19.2

dbd · October 2, 2024, 11:54am

I’ve captured trafic on localhost:8500 during a failed allocation start (nomad agent and consul agent on the same host, nomad talks to its local consul agent unencrypted), but I strungle understanding all the workflow.

dbd · October 2, 2024, 12:00pm

It does indeed look like the issue described here, but I get this for any workload (even without a connect section, so no envoy to bootstrap)

Ranjandas · October 2, 2024, 12:02pm

Can you try deleting and re-creating the Consul Auth-Method created for the Workload Identity to see if the issue persists?

dbd · October 2, 2024, 12:03pm

Already tried a few times

Ranjandas · October 2, 2024, 12:42pm

I would recommend collecting the TRACE-level logs from the Nomad Agent, Consul Agent on the worker node where the alloc is running, and Consul Leader during the failure to see if they give any hint as to what is wrong.

This documentation has some details on the workflow: Consul ACL | Nomad | HashiCorp Developer

dbd · October 3, 2024, 7:27am

I’ll try to capture clean traces of what’s happening. I think I followed the doc, as I get it working on another test cluster (which main difference is that it’s running a single consul server, while the one I’m having problem with is a 3 nodes cluster)

dbd · October 3, 2024, 3:02pm

So, here’s some logs while starting a job (in this example, a simple job running squid, a squid prometheus exporter, and an nginx proxy to expose metrics, but it’s irrelevant as it’s the same for any job).

This is the logs of the consul leader, with TRACE log_level :
consul_leader.txt (11.8 KB)
This is the logs of the Nomad and Consul agent on the node trying to run the job
nomad_consul_agent.txt (46.4 KB)

The current Consul leader is ct-poc-s-1 (10.117.7.16), and there’re 2 other Consul servers (followers) : ct-poc-s-2 and ct-poc-s-3 (respectively 10.117.7.17 and 10.117.7.18)

I fail to see anything obviously wrong in the logs (well, except the 403 of course)

The jwt auth method has been created using nomad setup consul command, and is configured like this

Name:          nomad-workloads
Type:          jwt
DisplayName:   nomad-workloads
Description:   Login method for Nomad workloads using workload identities
TokenLocality: local
Config:
{
  "BoundAudiences": [
    "consul.io"
  ],
  "ClaimMappings": {
    "consul_namespace": "consul_namespace",
    "nomad_job_id": "nomad_job_id",
    "nomad_namespace": "nomad_namespace",
    "nomad_service": "nomad_service",
    "nomad_task": "nomad_task"
  },
  "JWKSCACert": "-----BEGIN CERTIFICATE-----\nMIIFXzCCA0egAwIBAgIUdpcKkZghWOTIjCXNIEE/NvucZFwwDQYJKoZIhvcNAQEL\nBQAwNzEQMA4GA1UEChMHRWh0cmFjZTERMA8GA1UECxMIQ1QgU3RhY2sxEDAOBgNV\nBAMTB1Jvb3QgQ0EwHhcNMjIwOTA3MTYwMDA2WhcNNDIwOTAyMTYwMDMzWjA3MRAw\nDgYDVQQKEwdFaHRyYWNlMREwDwYDVQQLEwhDVCBTdGFjazEQMA4GA1UEAxMHUm9v\ndCBDQTCCAiIwDQYJKoZIhvcNAQEBBQADggIPADCCAgoCggIBAPM1FMILPUt+cLaQ\n5/+dwNdEwRWEW+E5XksMyp7jkGwM1KecQCFrEyLX32xnRrYvciP64H+cp8ed8oqg\nxWxnEu4OKWaJYwSykqhZFU/FHYNpCfTu+obk7D21CEWJVGKN+mxD05TvP7aWSeyY\nIHOTA32t6qGOjMs2HsF16s8Apn+dW6l8iM/yTalgwIDFkZmF3dGQmy7huNIc4rVA\nPiNiVCMx6F5N4D8Jvs6keLv8nZqgcc+m8m5nsR3p8P1f+MI9ywfb/opXHTpkGQD2\nbyb00RfhB+9eIvCmm3SYreC0p+YwjNaGYprVp7IRgeVP6HCIQE/9uBvAgfBru9nS\nUrw+UBxWOdO1LHEjKN8IBH9XKXS2cV9geJInUZQTxFVbqwyp1iPuLu48KFgFNaRV\nVDu97xHOHVImgdbADmg13ti4RJqZzoSSgySq6m3rYT01FyDV3/WsG4HMAPt9kFPu\nTyEbh5H0Iq/Ukus+OQa0I+pUeyMLRG20jGB2iIDxDRPpuonhsl5t5kEDj2l4Yn4k\nqiJaOunfpfOu662zB2LpdAH6SAB26OmPParFJwHsa21CHMUhxnBTp9N22xurlW9K\nquPuT69Sie7X6pZSeXyHbzXHIdLhCP63fXC1z/b6EnMT/5jPeHHSZyMJEyhAb0cJ\nnm9f5yXfyNrfSJIaOJzwgDBuh4yDAgMBAAGjYzBhMA4GA1UdDwEB/wQEAwIBBjAP\nBgNVHRMBAf8EBTADAQH/MB0GA1UdDgQWBBRrEG976XowRmiXWPxAGlllE8r7sDAf\nBgNVHSMEGDAWgBRrEG976XowRmiXWPxAGlllE8r7sDANBgkqhkiG9w0BAQsFAAOC\nAgEAvGGqiEIGzippTD8aP3kUaJLQYdD7kRpctLVu9q4BwDqh8FJdzGcaomWT0yxo\nuPTk/OkZSE+e+3AbqAh2wdzvjlxfD0+6FeCCWw2DeZI+l3d/xrBLfWd05bJ/w+A5\ny9J+HoxkepkXab0ABEBxQWwK4SHQrAN/wzF/3OG/HDN8hKa6Yeq1iRxdSJQIdRU3\nYpzI+aq9lxb7/fsVf94K0MIb60LIy30CW8LlM9nzqguGvpgRtHJYjP2xnbfDXVCR\n+gN/Y5M/NqTUwUhs7aykzCxuP0bQQtF0tFwheMizoQo5ujMPhch88rcyaZRgjG1y\nxPfICkbwU5NiJQZ0n0r6s/fwXCS0VXdPzNoXYCDqm1tJrZ6WukNrzGwA9JlZbrN1\nnGdo2z5QR3rfavi8g5PnAV/PR9meGJWd3S1f9ppbMmZ7FG1ZsFFbcJ69Q1E3M3Uk\nnCO0TEyiivxD+Om9LZGwi1cPnWt4twR3HRQHsSOsPzUYgs4KSqF4c0m294si0fKB\nSpD4pC2Wcq5xQGewSKSNKdx1t2k5fP5EvMhKqH6u9lDZUwRgvizGDFpPnOf76uNp\nxHoQU70JbtVm+IkwMqf1R7xpuqrCc/jDXTtoj9eqpg5B8rfJvL5UWKvebdALRRMA\n7CFHuilx6aVZ/gqcpkR21QhlTR15NhTvxQwfCgOVt6R3MeQ=\n-----END CERTIFICATE-----\n",
  "JWKSURL": "https://nomad.service.ct-poc.ehtrace.com:4545/",
  "JWTSupportedAlgs": [
    "RS256"
  ]
}

The policy (policy-nomad-tasks), the role (nomad-default-tasks) and the binding rules has been created successfuly

The JWT endpoint is working, as I’m using it for vault workload identities with success.

Only Consul workload identities are broken as I can get everything working by simply commenting the service_identity and task_identity sections in my consul server config

dbd · October 8, 2024, 7:41am

Still struggling with this. I’ve opened this bug (cause I think there’s a bug somewhere). But one interesting thing I’ve noticed : the issue only appears if I submit a job in a nomad namespace. When submiting in the default namespace, it’s working, but in another (in my case, I work in a namespace named “poc”), I get the error.
As I’m using community versions, there’s no namespace in Consul (nor in vault BTW), only in Nomad

dbd · October 8, 2024, 11:12am

Finally found the issue, which was on my side.
The problem comes from the default binding-rules created by nomad setup consul

c0c8a4d6-c0dd-58da-ac1c-832362ca3225:
   AuthMethod:   nomad-workloads
   Description:  Binding rule for Nomad tasks authenticated using a workload identity
   BindType:     role
   BindName:     nomad-${value.nomad_namespace}-tasks
   Selector:     "nomad_service" not in value

This binding-rule uses the nomad namespace to associate a role. But only the nomad-default-tasks role is created by default, so only the default nomad namespace can be used. Changing this binding-rule to hardcode nomad-default-tasks (instead of using ${value.nomad_namespace}) makes everything working

system · November 7, 2024, 11:12am

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Migration to Workload Identity Nomad consul , nomad	0	63	November 28, 2024
Envoy proxy "Permission denied: anonymous token lacks permission" with workload identity Nomad connect , consul	0	358	January 8, 2024
Terminating gateway broken with workload identities Nomad consul	3	31	October 1, 2024
Nomad with Vault Workload Identity permission denied on 1 client Nomad	0	68	September 5, 2024
How to get VAULT_TOKEN in Nomad job? Nomad vault , consul-vault , consul	0	180	April 24, 2024

Workload identities for consul tokens

Related topics