I’m planning a deployment in which Consul will run in two autoscaling groups: a “control plane” group whose nodes run the Consul server agents, and a separate working ASG where nodes run Consul client agents coordinating with the control plane agents.
I have this setup working, and I can terminate nodes in either ASG or mark them unhealthy and have new, replacement nodes come up successfully and join the cluster. However, for this prototype, the only security present is firewall (security group) rules - traffic from outside of the subnet cannot reach the Consul cluster - and the gossip key for the server nodes.
I’m looking at using the ACL system to lock down access, but I can’t figure out the intended path for unattended bootstraps. In particular, I can’t figure out how to time the creation of the necessary ACL policies so that I can generate node tokens while the control plane cluster is still starting up (that is, while there may not yet be enough nodes to form a quorum and start the cluster).
Here are the constraints:
- I can’t have any human steps involved in provisioning new nodes. Both autoscaling groups will be completely unattended.
- I don’t have Vault or similar configured (yet; it’s on my roadmap and if the answer is “solve that first” I won’t be offended).
- I do have more or less unfettered freedom to use other AWS services, such as secret stores, to distribute information between nodes.
- This has to work both during the initial part of a cluster’s life, when the number of nodes in existence is well below quorum, and during the normal working life of the cluster, when there is a quorum. It doesn’t have to work during loss of quorum after the cluster has started.
- Both clusters are created by Terraform, and there is presently no dependency between them to ensure that the control plane cluster is healthy before the general use cluster starts, but I can add one if controlling startup ordering makes this easier to solve.
Question 1: Is there a good pattern for bootstrapping the ACL system exactly once on a new cluster, without human intervention, that avoids the problems I describe below?
Here’s what I’ve tried so far:
- Providing an initial bootstrap token to nodes in the control plane group, which is set in `acl.tokens.master`. On the upside, this means that I know a priori a token that will work during the bootstrap process, and so long as nothing ever deletes that token, it can even be used as new Consul server nodes join the cluster. On the downside, it also ends up stored elsewhere - Terraform state, for example - and provides root access, so it’s a sensitive value.
- Checking AWS Secrets Manager for a management token, and bootstrapping the ACL system if one is not found. On the upside, this means that even I don’t know the management token - it gets stored directly in Secrets Manager once the cluster is far enough along for one of the bootstrap attempts to succeed. On the downside, all of the nodes sit there spinning their wheels in a blind loop, trying over and over again to bootstrap the ACL system until one of them happens to luck out and succeed. There doesn’t seem to be an easy way, at least one that’s usable from the CLI, to determine (1) whether the cluster is healthy enough to try, or (2) whether an attempt failed because the cluster isn’t ready yet or because the ACL system has already been bootstrapped.
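To make the two unknowns in the second approach concrete, here is a minimal sketch that steps outside the CLI and talks to the local agent's HTTP API instead. It assumes the agent listens on `127.0.0.1:8500`, that `GET /v1/status/leader` returns an empty string until a leader is elected, and that a repeated `PUT /v1/acl/bootstrap` is rejected with a 403 whose message contains "no longer allowed". The message text in particular is an assumption that should be checked against the Consul version in use, and the function names are made up for illustration.

```python
import json
import urllib.error
import urllib.request

CONSUL = "http://127.0.0.1:8500"  # assumption: local agent, default HTTP port


def has_leader(base=CONSUL):
    """Return True once GET /v1/status/leader reports a non-empty address."""
    try:
        with urllib.request.urlopen(f"{base}/v1/status/leader", timeout=5) as resp:
            return bool(json.loads(resp.read()))
    except (OSError, ValueError):
        return False  # agent unreachable or reply unparseable: treat as not ready


def classify_bootstrap(status, body):
    """Map a PUT /v1/acl/bootstrap reply onto one of three outcomes."""
    if status == 200:
        return "bootstrapped"
    # Assumption: the 403 message text below; verify for your Consul version.
    if status == 403 and "no longer allowed" in body:
        return "already_bootstrapped"
    return "not_ready"


def try_bootstrap(base=CONSUL):
    """Attempt the bootstrap once; returns (outcome, secret_id_or_None)."""
    if not has_leader(base):
        return "not_ready", None
    req = urllib.request.Request(f"{base}/v1/acl/bootstrap", method="PUT")
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            status, body = resp.status, resp.read().decode()
    except urllib.error.HTTPError as err:  # must precede OSError (its base class)
        status, body = err.code, err.read().decode()
    except OSError:
        return "not_ready", None
    outcome = classify_bootstrap(status, body)
    secret = json.loads(body)["SecretID"] if outcome == "bootstrapped" else None
    return outcome, secret
```

With something like this, a node would keep retrying only on `not_ready`, would store the secret in Secrets Manager on `bootstrapped`, and would fall back to waiting for the secret to appear on `already_bootstrapped`.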
Question 2: Assuming for the moment that I’m creating a default read-only policy for use by node tokens, and that each node token will additionally have an appropriate node ID, is there a good way to issue these tokens?
What I’ve tried so far involves having each node, on startup, issue itself a node token using the bootstrap token. This means that, in principle, every node has access to a credential that obviates the whole ACL system, and promises only to use it momentarily during startup. While this at least constrains the issue of unauthorized access, it’s still pretty unsatisfying.
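As a rough sketch of the self-issuing step, and not necessarily the exact calls used in the prototype, the request each node makes could look like the following. It assumes that "node ID" here corresponds to Consul's node identity feature (the `NodeIdentities` field of `PUT /v1/acl/token`, which requires a reasonably recent Consul release), and the policy name `node-read-only`, the node name, and the datacenter are hypothetical placeholders.

```python
import json
import urllib.request


def node_token_request(node_name, datacenter, policy_name):
    """Build the JSON body for PUT /v1/acl/token for a single node."""
    return {
        "Description": f"node token for {node_name}",
        # Link the shared read-only policy by name (hypothetical name).
        "Policies": [{"Name": policy_name}],
        # Assumption: "node ID" in the text means a node identity.
        "NodeIdentities": [{"NodeName": node_name, "Datacenter": datacenter}],
    }


def create_node_token(base, issuer_token, body):
    """Send the request using the issuing token; returns the new SecretID."""
    req = urllib.request.Request(
        f"{base}/v1/acl/token",
        data=json.dumps(body).encode(),
        headers={
            "X-Consul-Token": issuer_token,
            "Content-Type": "application/json",
        },
        method="PUT",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.loads(resp.read())["SecretID"]


if __name__ == "__main__":
    # Prints the request body only; nothing is sent to Consul here.
    print(json.dumps(node_token_request("worker-1", "dc1", "node-read-only")))
```

The weak point is the `issuer_token` argument: in the setup described above it is the bootstrap token itself, which is exactly the credential each node should not have to hold.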
Question 3: As above, is there a good time to create the policy? Right now, with the approach using Secrets Manager, I’m creating the policy on whichever server node happens to successfully run `consul acl bootstrap`, which works, but I can’t figure out if there’s a way to do it under the approach where the bootstrap token is preconfigured.