Workload identity and custom vault policy

Hi.
Just reading about the new workload identity features in Nomad 1.7. While it seems interesting, it also seems far more complicated to manage than the old way. So it’s a bit sad to see the old way deprecated and planned for removal in Nomad 1.9 :frowning:

The doc about how workload identity works is still a bit unclear. We can define a templated Vault policy that all workloads will inherit, like

path "secret/data/{{identity.entity.aliases.auth_jwt_3a9350fe.metadata.nomad_namespace}}/{{identity.entity.aliases.auth_jwt_3a9350fe.metadata.nomad_job_id}}/*" {
  capabilities = ["read"]
}

(BTW, this is already a pain to manage automatically, as we have to look up the accessor ID of the JWT auth method before writing the policy template). But how do I grant an additional, custom Vault policy to a specific task in a job?
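
On the accessor lookup: it can probably be automated rather than done by hand. A minimal sketch, assuming the Vault side is managed with the Terraform Vault provider. The mount path `jwt-nomad`, the resource names and the policy name `nomad-workloads` are placeholders, not something taken from the docs or this thread:

```hcl
# Look up the existing JWT auth mount (assumed to live at "jwt-nomad").
data "vault_auth_backend" "nomad" {
  path = "jwt-nomad"
}

# The templated policy interpolates the mount accessor, so the ID
# never has to be copied into the policy by hand.
resource "vault_policy" "nomad_workloads" {
  name   = "nomad-workloads" # assumed policy name
  policy = <<-EOT
    path "secret/data/{{identity.entity.aliases.${data.vault_auth_backend.nomad.accessor}.metadata.nomad_namespace}}/{{identity.entity.aliases.${data.vault_auth_backend.nomad.accessor}.metadata.nomad_job_id}}/*" {
      capabilities = ["read"]
    }
  EOT
}
```

Terraform only expands the `${...}` parts; the `{{...}}` template markers are passed through to Vault unchanged.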

Hi @dbd! The way to solve this is to use a different vault.role for the jobs that need to use a different set of policies. Otherwise it’ll use the default_role from the auth method.
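
Roughly, in the jobspec that would look like the sketch below. The job, image and role names are made up for illustration; the role has to exist in the JWT auth method on the Vault side:

```hcl
job "example" {
  group "app" {
    task "server" {
      driver = "docker"

      config {
        image = "example/app:1.0" # placeholder image
      }

      vault {
        # Hypothetical role name. Without this, the task falls back
        # to the default role instead of this job-specific one.
        role = "example-custom"
      }
    }
  }
}
```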

So, for every job, I’d now need to create my custom Vault policy (as before), but also a new token role to map this policy? This looks like a huge usability regression (or maybe I misunderstood).

@dbd thanks for the quick feedback.

We’re broadly committed to this direction to reduce the overhead and security risk around manual/static token management, but we do want to ensure that by the time people have to adopt this, it’s a better UX.

Two things that we could use some feedback on (from anybody reading this, not just @dbd):

  • How far can you get with Templated Policies? If you can have a single role attached to one of these (or a few pre-created roles), then ideally you don’t even have to go and create new policies for new jobs. I realize this probably forces you to impose a more rigid structure on secrets paths. Is that a deal breaker?
  • How painful is role creation when you can’t use an existing templated policy/role? It sounds like the answer is “pretty painful” in your case?

We don’t have this functionality yet (and it is a speculative idea at this point), but if we provided a way to pass through identity data to roles, and then the roles could “bind” to specific policies based on that, could that work? So some role grants access to any policy named “nomad--” or something like that.

Templated policies could work for, I’d say, ~50% of my workloads (I’m already using a structured hierarchy for my KV secrets; I’ll probably have to move a few things around, but that’s still doable). But for the other ~50%, I need specific Vault policies (issue certs on a specific PKI, get a Consul or Nomad token, read secrets from another workload, etc.). Those are too specific to rely on a few templated policies. I currently automate the creation of all those custom policies (well, I write them manually, only the vault write is automated), and can easily add one to a task with

vault {
  policies = ["mycustompolicy"]
}

As Nomad servers are able to issue tokens with any policy attached (except a few highly privileged ones, which I have blacklisted), there’s nothing more to do.

I can probably work on automating the role creation, but it’s one more step, one more indirection (harder to understand and debug), without bringing any advantage (as I will have to create one role for each of those custom policies). It just seems less practical than the old way.

Indeed, if there were a way to somehow pass the policy I want as metadata attached to the identity, so that the role could map specific policies to the final token, it’d be helpful (in fact, that’d emulate the old way of managing policies :wink: )

I’m very late to this party, but we’re facing the same issues. For us, templated policies will work for maybe 10% of our workloads (and I may be generous there) due to the way we have set up vault way back when. We can use a templated policy for each workload to access their own KV store data (with a bit of renaming here and there - also maybe a fun thing for Vault, be able to rename keys instead of having to recreate/delete) but other than that we’re stuck basically converting our existing policy lists per workload into a separate role.

For instance, all our db credential endpoints (roles, if you will) have their own policy. This allows us to compose a policy list along the lines of ["db:a", "db:b", "db:c_ro"] to easily allow an application access to those particular ones. But that list changes per workload. The same goes for PKI endpoints and other credential-generating endpoints (RabbitMQ comes to mind).

So now I need to create a role that encapsulates those policies, on Vault, so I can pass it to the job in Nomad. This means that for us the benefit of this mechanism is pretty much 0 because we’re still doing what we’re doing now except now there’s 1 more layer of indirection as @dbd mentioned.
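
To make the extra layer concrete, this is roughly what we’d end up maintaining. It’s a sketch only, assuming the Terraform Vault provider; the mount path `jwt-nomad`, the audience `vault.io` and the workload names are assumptions for illustration:

```hcl
locals {
  # Hypothetical workloads, each mapped to the policy list it
  # previously got directly through the job's vault block.
  workload_policies = {
    billing = ["db:a", "db:b", "db:c_ro"]
    reports = ["db:c_ro"]
  }
}

# One role per workload, wrapping that workload's policy list.
resource "vault_jwt_auth_backend_role" "workload" {
  for_each = local.workload_policies

  backend   = "jwt-nomad" # assumed mount path
  role_name = each.key    # the name the job passes as its Vault role
  role_type = "jwt"

  bound_audiences         = ["vault.io"] # assumed; must match the identity's audience
  user_claim              = "/nomad_job_id"
  user_claim_json_pointer = true

  # Only the job with the matching ID may log in with this role.
  bound_claims = {
    nomad_job_id = each.key
  }

  token_policies = each.value
  token_type     = "service"
}
```

So the policy list just moves from the jobspec into a role, and the jobspec then names the role instead.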

The only benefit from workload identities for us is that it can replace the sort-of-similar-but-PKI-based app we built for it a while back.

There are also still many unclear items in the documentation. For example, setting up the JWT auth endpoint on Vault requires giving the URL of the JWKS store on the Nomad server, but if we have 4 federated clusters, does each cluster individually need an endpoint, or can I use a single endpoint and “it will just work”? If it’s the former, then I can’t move workloads between clusters if need be, because workload identity will be pulling from the wrong auth endpoint. Our clusters are regionally based, under the assumption that if need be we can migrate an entire region’s allocations to another region if the proverbial kaka hits the fan, ideally without having to create roles in 4 different endpoints, because then this setup would be a complete and total regression.
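
For reference, this is the setting I mean, sketched with the Terraform Vault provider. The mount path, hostname and role name are placeholders. It takes a single JWKS URL per mount, and whether one mount can serve several federated clusters is exactly the part I can’t find an answer to:

```hcl
resource "vault_jwt_auth_backend" "nomad" {
  path = "jwt-nomad" # assumed mount path
  type = "jwt"

  # One Nomad JWKS endpoint per mount; the hostname is a placeholder.
  jwks_url           = "https://nomad.region-a.example.com:4646/.well-known/jwks.json"
  jwt_supported_algs = ["RS256", "EdDSA"]

  # Role used when a job does not name one; hypothetical.
  default_role = "nomad-workloads"
}
```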

I really hope HashiCorp change their mind and do not deprecate the old way of managing tokens, at least for now. At least I’m glad I’m not alone in finding workload identity a lot more complex to manage for nearly no benefit.

Also, the discussion on this bug kinda worries me: is it true that Vault JWT auth does not support mTLS, and so it implies disabling mTLS on the Nomad API to be able to use workload identities?

I mean, I get the appeal: if you’re starting from scratch it’s a very nice system, and as I mentioned, we could very well use it in our own infra to authenticate apps to each other (we have a PKI-based thing for it now that is … well… it works but can’t say it’s pretty). But for an existing cluster (especially the part where we have 4 federated clusters and nobody so far has seemingly been able to explain how that’d work with that JWKS endpoint) it seems like it’s just a massive pain in the dingaling. I mean, I’m all for upgrading, but at the rate things are going now I don’t think we’ll be going past 1.8 if the documentation/feature implementation stays the way it is.

That one worries me because we do have mTLS on the API, and we don’t have ACLs enabled due to the somewhat organic way we ended up adopting Nomad. Enabling ACLs now is going to be… exceedingly tricky to pull off without potential downtime.

Generally speaking, at this rate it’s probably easier for us to just start embedding Vault AppRole credentials into our apps and letting them sort it out on their own than to have Nomad do it.

I use both mTLS and ACLs, and I’d prefer to keep using both. I could still set up some nginx reverse proxy, private to my Vault nodes, just to handle the “unTLS” to the Nomad API, but that’s a new potential point of failure.