ACLs in Production, Done the Right Way

Hello,

While I have spent some time trying to understand ACLs and have some ideas on how to do this, I am curious what other people are doing for ACLs in production.

In general, I’m used to my clusters coming online mostly on their own. I give them some initial bootstrap parameters, maybe generate some TLS certs, and then let the tools do the work. I’m interested in getting Nomad/Vault using Consul ACLs correctly, which means bootstrapping Consul’s ACLs needs to be part of this process too.

I’m seeing 3 main problems:

  1. The need to start/stop/restart consul on multiple hosts, one at a time, after configs are updated, and in the right order / not proceeding until the last node is back online.
  2. Capturing tokens from the API to feed into node init on other nodes.
  3. Giving each node a unique token.
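
For the first problem, a rolling restart that refuses to move on until the previous node is back could be sketched roughly as below. This is only a sketch: `restart_node` (SSH plus systemd) and the retry counts are made-up placeholders, and `node_is_back` assumes the Consul node name matches the host name and that `consul members` is readable with whatever token the local agent has.

```shell
# Retry a command until it succeeds: wait_until ATTEMPTS DELAY_SECONDS CMD [ARGS...]
wait_until() {
  local attempts="$1" delay="$2" i=0
  shift 2
  while [ "${i}" -lt "${attempts}" ]; do
    if "$@"; then
      return 0
    fi
    sleep "${delay}"
    i=$((i + 1))
  done
  return 1
}

# Hypothetical: restart Consul on a remote host over SSH.
restart_node() {
  ssh "$1" sudo systemctl restart consul
}

# Succeeds if "consul members" lists the node with status "alive".
# Assumes the Consul node name equals the host name.
node_is_back() {
  consul members | awk -v node="$1" '$1 == node && $3 == "alive" { ok = 1 } END { exit !ok }'
}

# Restart hosts in the order given; stop at the first one that fails to return.
rolling_restart() {
  local host
  for host in "$@"; do
    restart_node "${host}" || return 1
    wait_until 30 2 node_is_back "${host}" || return 1
  done
}
```

Something like `rolling_restart server1 server2 server3` would then restart the nodes one at a time. A real version would probably also want to check for a Raft leader before moving on, since a node can still show as alive for a moment after the restart is issued.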

In my clusters, Vault and Nomad depend on Consul, so Consul is first to bootstrap and first in the dependency chain. However, I do wonder: can we bootstrap Consul agents using tokens from Vault, which has the Consul secret backend configured?

It also seems the bootstrap process is complex enough that I would guess most deployments use simple and open policies, with mostly manual deployment/configuration, and/or long-lived tokens that aren’t ever reset in the cluster. If anyone has “done it right”, they probably wrote a suite of scripts/tools to do the hard work for them. Maybe even a bot that understands the Vault/Nomad/Consul agents/configs and APIs to orchestrate ACLs across all of them…

Another question I’ve been wondering about is: to what extent can we tell Consul what to use for the initial token bootstrapping? Would it make sense to start the Consul cluster with a known token in a way that reduces the amount of work, and makes it easy to “rotate out” that token used for init?
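
On starting with a known token: Consul servers accept a pre-seeded management token in their agent config, under `acl.tokens.initial_management` (older releases, before 1.11, call the same setting `master`). A minimal sketch of rendering that fragment from a script follows; the function name and the output path in the usage line are made up for illustration.

```shell
# Print an HCL fragment that enables ACLs and pre-seeds the management token.
# Intended for server agents only; the token value should be a UUID.
render_acl_config() {
  cat <<EOF
acl {
  enabled        = true
  default_policy = "deny"
  tokens {
    initial_management = "$1"
  }
}
EOF
}
```

Usage might look like `render_acl_config "$(uuidgen | tr 'A-Z' 'a-z')" > /etc/consul.d/acl.hcl`, with the same value handed to every server. Rotating that token out later would still be a separate step.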

And yet another question: if Vault has been configured with the Consul secret backend, and can give out tokens to consul, can the Nomad agent use that token, or do we need to give the Nomad agent a more long-lived token?
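
Mechanically, handing a Vault-issued Consul token to Nomad could look like the sketch below. It assumes the Consul secrets engine is mounted at the default `consul/` path and that a role exists for Nomad (`nomad-server` in the usage line is a made-up name). One caveat: Vault revokes dynamic secrets when their lease expires, so this only helps if the lease outlives the agent or something re-renders the file and reloads Nomad.

```shell
# Ask Vault's Consul secrets engine for a token under the given role.
# Assumes the engine is mounted at "consul/" (the default path).
vault_consul_token() {
  vault read -field=token "consul/creds/$1"
}

# Print a Nomad agent config fragment that carries the Consul token.
render_nomad_consul_block() {
  cat <<EOF
consul {
  token = "$1"
}
EOF
}
```

For example: `render_nomad_consul_block "$(vault_consul_token nomad-server)" > /etc/nomad.d/consul-token.hcl`, where the output path is again only a placeholder.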

Maybe you have thoughts, insights, or ideas you would like to share?

I’m hitting the same apparent wall. I am writing all of this terraform/cloud-init and then there is this pause in the process to manually shuffle some bits around.

I feel like there is something I am missing or just don’t understand about the process.

Not really tested but I’m going to try this approach:

  1. Add an ExecStartPost script that will run consul acl bootstrap (which will error if it has already bootstrapped).
  2. Parse out the token and shuffle it off somewhere (Azure KeyVault right now) where other code can pull it out and give it to agents.
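
The "shuffle it off somewhere" step could be sketched with the Azure CLI roughly like this. The function names and the vault/secret names in the usage line are placeholders, and it assumes the VM is already logged in to `az` (for example via a managed identity) with permission to set and read secrets.

```shell
# Store a token as a Key Vault secret: store_token VAULT_NAME SECRET_NAME VALUE
# Note: --value puts the secret on the command line of the az process.
store_token() {
  az keyvault secret set --vault-name "$1" --name "$2" --value "$3" --output none
}

# Print a stored token: fetch_token VAULT_NAME SECRET_NAME
fetch_token() {
  az keyvault secret show --vault-name "$1" --name "$2" --query value --output tsv
}
```

For example, `store_token my-vault consul-bootstrap-token "${CONSUL_HTTP_TOKEN}"` on the bootstrapping server, and `fetch_token my-vault consul-bootstrap-token` wherever the token is needed.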

I don’t know what would happen if two servers tried to bootstrap at the same time. I could set it up so only the first VM gets the flag to bootstrap, but then I’d have to disable over-provisioning to make sure some VM actually attempts the bootstrap.
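
As far as I understand it, the bootstrap is a one-time operation for the whole cluster, so if several servers call it, only one should get a token back and the rest should get an error. If that holds, every server could simply try and treat "already bootstrapped" as a normal outcome. A sketch under that assumption follows; the error text it matches is my recollection of Consul's wording and worth checking against your version, and the exit code convention is made up.

```shell
# Try to bootstrap ACLs. Exit codes (a convention made up for this sketch):
#   0  this node won; the SecretID is printed on stdout
#   2  ACLs were already bootstrapped by someone else
#   1  any other failure (no leader yet, agent down, ...)
try_bootstrap() {
  local out
  if out=$(consul acl bootstrap 2>&1); then
    printf '%s\n' "${out}" | awk '/^SecretID:/ {print $2}'
    return 0
  fi
  case "${out}" in
    *"bootstrap no longer allowed"*) return 2 ;;
    *) return 1 ;;
  esac
}
```

A caller would store the token on exit code 0, fetch it from wherever the winner stored it on 2, and retry after a pause on 1.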

Easy part:

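tempfile=$(mktemp)
# The file will hold a secret, so remove it when the script exits.
trap 'rm -f "${tempfile}"' EXIT

# "consul acl bootstrap" exits non-zero if ACLs were already bootstrapped.
if consul acl bootstrap >"${tempfile}"; then
  # The output has a line like "SecretID:  <uuid>"; keep the second field.
  CONSUL_HTTP_TOKEN=$(awk '/^SecretID:/ {print $2}' "${tempfile}")
  export CONSUL_HTTP_TOKEN
fi

# curl it up somewhere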
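
With a management token in `CONSUL_HTTP_TOKEN`, the "unique token per node" problem from the first post could be handled in the same style. A rough sketch, assuming a Consul version whose `consul acl token create` supports `-node-identity`; the function names are made up, and the node and datacenter names in the usage line are placeholders.

```shell
# Print the SecretID from the human-readable output of "consul acl ..." commands.
secret_id() {
  awk '/^SecretID:/ {print $2}'
}

# Mint a token scoped to one node's identity and print its SecretID.
# Usage: create_node_token NODE_NAME DATACENTER
create_node_token() {
  local out
  out=$(consul acl token create \
    -description "agent token for $1" \
    -node-identity "$1:$2") || return 1
  printf '%s\n' "${out}" | secret_id
}
```

On the node itself, something like `consul acl set-agent-token agent "$(create_node_token node1 dc1)"` would then install it, or the value could go through the same secret store as the bootstrap token.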