Unattended ACL bootstrapping in an Autoscaling Group

I’m planning a deployment in which Consul will run in two autoscaling groups: a “control plane” group whose nodes run the Consul server agents, and a separate working ASG where nodes run Consul client agents coordinating with the control plane agents.

I have this setup working, and I can terminate nodes in either ASG or mark them unhealthy and have new, replacement nodes come up successfully and join the cluster. However, for this prototype, the only security present is firewall (security group) rules - traffic from outside of the subnet cannot reach the consul cluster - and the gossip key for the server nodes.

I’m looking at using the ACL system to lock down access, but I can’t figure out the intended path for unattended bootstraps. In particular, I can’t figure out how to time the creation of the necessary ACL policies (so that I can generate node tokens) while the control plane cluster is initially starting up, when there may not yet be enough nodes to form a quorum and start the cluster.

Here are the constraints:

  • I can’t have any human steps involved in provisioning new nodes. Both autoscaling groups will be completely unattended.
  • I don’t have Vault or similar configured (yet; it’s on my roadmap and if the answer is “solve that first” I won’t be offended).
  • I do have more or less unfettered freedom to use other AWS services, such as secret stores, to distribute information between nodes.
  • This has to work both during the initial part of a cluster’s life, when the number of nodes in existence is well below quorum, and during the normal working life of the cluster, when there is a quorum. It doesn’t have to work during loss of quorum after the cluster has started.
  • Both clusters are created by Terraform, and there is presently no dependency between them to ensure that the control plane cluster is healthy before the general use cluster starts, but I can add one if controlling startup ordering makes this easier to solve.

Question 1: Is there a good pattern for consistently bootstrapping the ACL system exactly once on a new cluster, without relying on human intervention, that avoids these problems?

Here’s what I’ve tried so far:

  • Providing an initial bootstrap token to nodes in the control plane group, which is set in acl.token.master. On the upside, this means that I know a priori a token that will work during the bootstrap process, and so long as nothing ever deletes that token, it can even be used as new consul server nodes join the cluster. On the downside, it also ends up stored elsewhere - terraform state, for example - and provides root access, so it’s a sensitive value.

  • Checking AWS Secrets Manager for a management token, and bootstrapping the ACL system if one is not found. On the upside, this means that even I don’t know the management token - it gets stored directly in Secrets Manager once the cluster is far enough along for one of the bootstrap attempts to succeed. On the downside, all of the nodes sit there spinning their heels in a blind loop, trying over and over again to bootstrap the ACL system, unless they happen to luck out and succeed. There doesn’t seem to be an easy way, at least one usable from the CLI, to determine 1. whether the cluster is healthy enough to try, or 2. whether an attempt failed because the cluster isn’t ready yet or because the ACL system has already been bootstrapped. (A rough sketch of this loop follows below.)
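For concreteness, here’s roughly what that loop looks like in user data. The secret name and retry interval are illustrative, and it assumes the Secrets Manager secret itself already exists (e.g. created empty by Terraform):

SECRET_ID="consul/management-token"   # placeholder name

while true; do
  # If another server already bootstrapped and stored the token, we're done.
  if aws secretsmanager get-secret-value --secret-id "$SECRET_ID" \
       --query SecretString --output text >/dev/null 2>&1; then
    break
  fi

  # Otherwise try to bootstrap. This fails until the servers have elected a
  # leader, and fails again (for a different reason) once bootstrap has run.
  if OUTPUT=$(consul acl bootstrap 2>/dev/null); then
    TOKEN=$(echo "$OUTPUT" | awk '/SecretID/ {print $2}')
    aws secretsmanager put-secret-value --secret-id "$SECRET_ID" \
      --secret-string "$TOKEN"
    break
  fi

  sleep 10
done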

Question 2: Assuming for the moment that I’m creating a default read-only policy for use by node tokens, and that each node token will additionally have an appropriate node ID, is there a good way to issue these tokens?

What I’ve tried so far involves having each node, on startup, issue itself a node token using the bootstrap token (roughly as sketched below). This means that, in principle, every node has access to a credential that defeats the whole ACL system, and promises only to use it momentarily during startup. While this at least constrains the issue of unauthorized access, it’s still pretty unsatisfying.
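For reference, that startup step looks roughly like this. The secret name, policy name, and datacenter are placeholders, and the -node-identity flag needs Consul 1.8.1 or newer:

# Fetch the bootstrap token (this is the part I'd like to avoid).
BOOTSTRAP_TOKEN=$(aws secretsmanager get-secret-value \
  --secret-id consul/management-token --query SecretString --output text)

NODE_NAME=$(hostname -s)

# Issue a token tied to this node plus the shared read-only policy.
NODE_TOKEN=$(CONSUL_HTTP_TOKEN="$BOOTSTRAP_TOKEN" consul acl token create \
  -description "agent token for ${NODE_NAME}" \
  -policy-name node-read-only \
  -node-identity "${NODE_NAME}:dc1" \
  -format=json | jq -r '.SecretID')

# Hand the token to the local agent, then drop the bootstrap token.
CONSUL_HTTP_TOKEN="$BOOTSTRAP_TOKEN" consul acl set-agent-token agent "$NODE_TOKEN"
unset BOOTSTRAP_TOKEN NODE_TOKEN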

Question 3: As above, is there a good time to create the policy? Right now, with the approach using Secrets Manager, I’m creating the policy on whichever server node happens to successfully run consul acl bootstrap, which works, but I can’t figure out if there’s a way to do it under the approach where the bootstrap token is preconfigured.


Could you, please:

  1. Join the Shipyard Discord channel

  2. Post a link to this question and let @nic know you would be interested in his upcoming video series that touches on these topics?

This would not only provide feedback but also help him understand use cases.

Hey @owen,

Generally what I do in this instance is the following:

For the servers, generate the master token manually and store it in a secrets manager; when a server node in the autoscaling group boots, it retrieves the token and can then join as a server with the correct permissions. I take the same approach for the TLS certs needed to join. This way you can generate a bundle of the necessary certs, gossip key, and token, and make them available to the servers in the autoscaling group when they init.
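A rough sketch of what that init step might look like - the secret names and file paths below are just examples, not anything Consul prescribes:

# Assumes the instance profile allows secretsmanager:GetSecretValue on these.
GOSSIP_KEY=$(aws secretsmanager get-secret-value \
  --secret-id consul/gossip-key --query SecretString --output text)
MASTER_TOKEN=$(aws secretsmanager get-secret-value \
  --secret-id consul/management-token --query SecretString --output text)

# Write the sensitive pieces into their own config file; the rest of the
# server config lives alongside it in the same -config-dir.
cat > /etc/consul.d/secrets.hcl <<EOF
encrypt = "${GOSSIP_KEY}"

acl {
  enabled        = true
  default_policy = "deny"
  tokens {
    master = "${MASTER_TOKEN}"
    agent  = "${MASTER_TOKEN}"
  }
}
EOF
chmod 0600 /etc/consul.d/secrets.hcl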

For the client autoscaling groups, I would recommend using the JWT auth methods.

You can use the JWT available via AWS metadata to retrieve the token for the client with the correct permissions it needs. The auth method can be configured to only issue a token when the JWT has the correct claims, and these can be controlled through AWS IAM. Alternatively, if you are using the latest version you can use the auto_config feature, which works in a similar way to auth methods but will bootstrap the agent with all ACL tokens, the gossip key, and TLS.

Configuration | Consul by HashiCorp
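On the client side, the login step ends up looking something like this - the auth method name, JWT path, and sink path are placeholders, and it assumes the auth method and binding rules have already been created on the servers:

consul login \
  -method=aws-clients \
  -bearer-token-file=/run/consul/instance-jwt \
  -token-sink-file=/etc/consul.d/agent.token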


Really appreciate the input @nic. Thank you!

Thanks @nic. This is extremely helpful, and more or less confirms I at least understand the problem adequately. I’ll read through the JWT auth mechanism and try to understand how that plays out in an autoscaled universe.

Do I understand correctly that the secrets-manager-ed “master token” is effectively the value of acl.token.master, and that its value is stable over the life of the cluster?

That is correct.

You can add an ACL block like this to your config file; the token is just a GUID and can be generated with a number of different tools. Note that in addition to the master token you also need the agent token; this is for the Consul agent running inside the server to operate correctly. You can set these to the same value.

acl {
  enabled = true
  default_policy = "deny"
  down_policy    = "extend-cache"
  enable_token_persistence = true

  tokens {
    master = "00000000-0000-0000-0000-000000000000"
    agent  = "00000000-0000-0000-0000-000000000000"
  }
}

Generally I keep my core Consul config in one file and the ACL config in another. Consul will merge these configs on start when using the -config-dir flag. I find this makes my init scripts a little bit drier.
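For example (paths are illustrative):

#   /etc/consul.d/consul.hcl   core agent/server settings
#   /etc/consul.d/acl.hcl      the acl { ... } block above
# Consul merges everything in the directory at startup:
consul agent -config-dir=/etc/consul.d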


For the servers, generate the master token manually and store it in a secrets manager; when a server node in the autoscaling group boots, it retrieves the token and can then join as a server with the correct permissions.

Using AWS tooling, this might translate to storing the certs/gossip key/Consul token in AWS Secrets Manager. I think you’d then need to attach an instance profile (and policies) to the AWS Launch Template that allows the instance to access your secret stored in Secrets Manager. I believe this just lets the instance do things like: aws secretsmanager get-secret-value --etc... which is used to pull the secrets and stuff them into the right configs for Consul to use.

I think?

Okay, so assuming that’s all true: can’t anything running on that instance, at any point in the future, execute aws secretsmanager commands and gain access to the secrets?

I realize that’s all framed in an AWS context, but the same problem would exist no matter what “secret manager” was used, no?

I’m curious how/if you worked around this issue.

Cheers!