Consul production-ready config

hey all,

so, I’m trying to get a production-ready consul config set up, but it seems like consul has added a lot of functionality since the last time I looked at it, and the docs aren’t real clear on what a production-ready config should look like. (same for nomad and vault these days, really)

does Hashi have example production-ready configs that include acl policies and such? because following the docs as-is doesn’t seem to work, and I’m seeing some weird things. for instance, here’s my agent token policy:

acl = "read"

agent_prefix "" {
  policy = "read"
}
node_prefix "" {
  policy = "read"
}
service_prefix "" {
  policy = "read"
}
session_prefix "" {
  policy = "read"
}

node "{{ grains.host }}" {
  policy = "write"
}
agent "{{ grains.host }}" {
  policy = "write"
}

the datacenter deployment guide shows much more open permissions, and the access control setup tutorial shows even fewer permissions, but says that for production usage you should have “exact-match node rules”, so that’s what I’m attempting to do here.

however, when I use this policy attached to my agent token, I get the following errors in my logs, and things like service registration using CONSUL_HTTP_TOKEN=initial-management-token consul services register something.hcl don’t work, even though the command returns without error.

Feb 03 00:57:47 node-0 consul[3154926]: 2023-02-03T00:57:47.764Z [ERROR] agent.http: Request error: method=GET url=/v1/acl/policy/name/agent from=127.0.0.1:56756 error="ACL not found"
Feb 03 00:57:47 node-0 consul[3154926]: agent.http: Request error: method=GET url=/v1/acl/policy/name/agent from=127.0.0.1:56756 error="ACL not found"
Feb 03 00:57:48 node-0 consul[3154926]: 2023-02-03T00:57:48.116Z [ERROR] agent.http: Request error: method=GET url=/v1/acl/policy/name/readonly from=127.0.0.1:56774 error="ACL not found"
Feb 03 00:57:48 node-0 consul[3154926]: agent.http: Request error: method=GET url=/v1/acl/policy/name/readonly from=127.0.0.1:56774 error="ACL not found"
Feb 03 00:57:48 node-0 consul[3154926]: 2023-02-03T00:57:48.813Z [ERROR] agent.http: Request error: method=GET url=/v1/acl/policy/name/nomad from=127.0.0.1:56794 error="ACL not found"
Feb 03 00:57:48 node-0 consul[3154926]: agent.http: Request error: method=GET url=/v1/acl/policy/name/nomad from=127.0.0.1:56794 error="ACL not found"
Feb 03 00:57:49 node-0 consul[3154926]: 2023-02-03T00:57:49.707Z [ERROR] agent.anti_entropy: failed to sync remote state: error="ACL not found"
Feb 03 00:57:49 node-0 consul[3154926]: agent.anti_entropy: failed to sync remote state: error="ACL not found"

so my question is: what does a production agent ACL policy look like, and where is it documented? and is there similar documentation for basically all of a standard consul + nomad cluster?

I’ve had good experiences with other hashi stuff for a long time (terraform, packer, and even nomad in single-host configs without consul), but consul just seems incredibly opaque and incorrectly documented, so I’m hoping someone out there just has a working production-ready example which illuminates all the issues I’ve been having.

thanks!

In my experience, “ACL not found” is a confusing way of saying the ACL token supplied in the request is not known to the cluster, which leads me to believe that the content of the policy may not be the problem, but rather that the Consul agent hasn’t been supplied with a valid agent token in the first place.
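One quick sanity check, assuming the CLI on that node can reach the agent: ask Consul to describe the token you think the agent is using (the token value here is a placeholder):

CONSUL_HTTP_TOKEN=<the-agent-token-secret> consul acl token read -self

If that also comes back with “ACL not found”, the cluster genuinely has no record of that token, and no amount of policy editing will fix it.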

I am in the middle of an evaluative rollout of Consul and have a lot of similar feelings. There is a lot of great stuff here, but the documentation feels like a building blueprint that describes first all the windows, then the heating ducts, then the lights, and only later the foundation :slight_smile: It probably makes perfect sense to someone who already has a deep understanding of the system.

Let me share some of my setup so far in case it is helpful to you. I have also been targeting a production grade deployment. I don’t know if I am there yet so don’t assume this is all ideal or best practice. But, it is a working example with TLS and ACLs enabled.

There is something to keep in mind about the “docs” section if you’re ever feeling like you can’t find something you’re sure you were looking at before: there is a nav level above “Documentation” that has more documentation. Specifically, the API and CLI docs are not in the docs section.

Also, I can confirm that the tutorials have not kept pace with recent changes, especially with regard to TLS and gRPC.

Regarding your ACL question - I can’t speak to the syntax you are using, but Consul has a new concept of “node identities”. When a token is created with a node identity, it has the appropriate permissions for that node to register itself with the cluster.
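As a sketch, creating such a token looks roughly like this; the node name is a placeholder for your own value (the datacenter here is my “oliver”), and the command needs to run with a token that has acl write, e.g. the management token:

consul acl token create \
  -description "agent token for node-0" \
  -node-identity "node-0:oliver"

The format of the node identity is <node-name>:<datacenter>.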

The rest of my ACLs config is at the bottom of this post, but honestly it’s not much more than using node identities.

Other setup

Here is the rest of my config and some reflections on things I found confusing at first:

/etc/consul.d/consul.hcl - on both servers and clients

datacenter = "oliver"
data_dir   = "/opt/consul"
encrypt    = "SECRET REDACTED"
bind_addr  = "{{ GetAllInterfaces | include \"network\" \"fdbc:6a5c:a49a:1005::/64\" | attr \"address\" }}"

enable_script_checks = true

addresses {
  http = "unix:///var/run/consul/consul_http.sock"
  grpc = "unix:///var/run/consul/consul_grpc.sock"
}

ports {
  grpc     = 8502
  grpc_tls = -1
}

unix_sockets {
  user  = "20003"
  group = "20003"
  mode  = "0666"
}

tls {
  defaults {
    verify_incoming = true
    verify_outgoing = true
    ca_file         = "/etc/consul.d/consul-agent-ca.pem"
  }

  internal_rpc {
    verify_server_hostname = true
  }

  grpc {
    use_auto_cert = false
  }
}

retry_join = ["REDACTED DNS NAME", "REDACTED DNS NAME", "REDACTED DNS NAME"]

performance {
  raft_multiplier = 1
}

acl {
  enabled                  = true
  default_policy           = "deny"
  enable_token_persistence = true
}

Comments:

bind_addr is a go-sockaddr template. This is basically mandatory if using IPv6 ULAs, because the built-in logic will prefer GUAs, which is undesirable if the site is not using a PI address space. If you’re not using IPv6, you don’t need to care about any of that.
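For comparison, an IPv4 version of the same idea might look like the following; the 10.0.0.0/8 network is a placeholder for whatever your site actually uses:

bind_addr = "{{ GetPrivateInterfaces | include \"network\" \"10.0.0.0/8\" | attr \"address\" }}"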

Addresses are unix sockets for the client interfaces. This is done because it is easy to expose a socket to a container as a bind-mounted volume.
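For example, something along these lines; the image name and the in-container path are illustrative:

docker run --rm \
  -v /var/run/consul/consul_http.sock:/consul/http.sock \
  -e CONSUL_HTTP_ADDR=unix:///consul/http.sock \
  my-app-image

The Consul CLI accepts a unix:// HTTP address, so the container never needs network access to the agent.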

The gRPC stuff is really a rollercoaster right now in the docs. I can’t link all the relevant docs for this, but essentially here was the journey:

(consul is version 1.14.4, envoy is version 1.24.1)

  • You must specify a grpc port to use Consul Connect
  • The grpc port property is deprecated when TLS is enabled, set grpc_tls for Consul Connect
  • Consul Connect built-in proxy is not production suitable, use Envoy
  • Envoy does not support Consul Connect gRPC with TLS, set tls.grpc.use_auto_cert = false to disable TLS for that listener
  • I set use_auto_cert = false and removed the grpc_tls port and added a grpc port
  • You cannot use ports.grpc when gRPC with TLS is enabled
  • Undocumented, but I figured out that ports.grpc_tls = -1 must be explicitly set to allow setting ports.grpc in the configuration. The -1 port disables that listener, otherwise it starts on the default port.
  • All this makes less sense with UNIX sockets, as there are no “listen ports” with sockets. But you need to put the ports in the config to get the gRPC TLS listener disabled and the gRPC cleartext listener enabled (recapped below).
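To recap, this is the combination from the config above that ended up working for me on 1.14.4:

ports {
  grpc     = 8502   # cleartext gRPC listener for the local Envoy
  grpc_tls = -1     # explicitly disables the TLS gRPC listener
}

tls {
  grpc {
    use_auto_cert = false
  }
}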

/etc/consul.d/server.hcl - only on the three Consul server agents

server           = true
bootstrap_expect = 3

auto_encrypt {
  allow_tls = true
}

tls {
  defaults {
    cert_file = "/etc/consul.d/oliver-server-consul.pem"
    key_file  = "/etc/consul.d/oliver-server-consul-key.pem"
  }
}

connect {
  enabled = true
}

ui_config {
  enabled = true
}

ACLs

anon_read is assigned to the anonymous token (00000000-0000-0000-0000-000000000002)

node_prefix "" {
  policy = "read"
}

service_prefix "" {
  policy = "read"
}

The agent tokens have NO policy other than what they get from being created with -node-identity=<the node name>, as described above.
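For completeness, the secret from that token-create output gets handed to each agent at runtime rather than written into the config file; with enable_token_persistence = true the agent stores it across restarts. The secret below is a placeholder:

consul acl set-agent-token agent "<token-secret-from-create-output>"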

All together, this gets me a working cluster. Hope this helps!

I’ve definitely found a few cases where, if the token is missing the permissions it needs, I get the “ACL not found” message, which is pretty unclear. I think you probably get that message in both cases (missing token, missing permissions on the token), so it really doesn’t help in understanding what’s going wrong.

@hashi would love some better error messages in consul across the board, but especially about which permission is missing when a token can’t do the thing it’s trying to do. ideally it would say exactly which permission it’s missing.

this is great, thanks @thaddeus. I’ll dig into that this week and see how much I can improve my configs!

the way I read this, you have to disable TLS to get Consul Connect to work, which sounds really insecure and pretty much blocks adoption of Consul Connect, for me at least. @hashi can you confirm consul is broken in this interesting and undocumented way?

so I think that token metadata gets attached to the token in question after the cluster is started, and saved in consul, since the tokens themselves are just UUIDs and don’t contain encoded metadata from what I can see. however, this makes bootstrapping the cluster annoying: I can’t just put the agent token into the server config and start things up like the docs imply, I’ve got to first start the cluster, then create the token with the node identity as described here, and then assign it in the config and restart consul. seems like that makes seeding a token into the config pretty useless.
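for what it’s worth, the sequence I’ve ended up with looks roughly like this (management token, node name, and datacenter are placeholders):

# 1. start the servers with ACLs enabled and default_policy = "deny"
# 2. bootstrap the ACL system once, which prints the initial management token
consul acl bootstrap

# 3. create a per-node agent token using a node identity
CONSUL_HTTP_TOKEN=<management-token> consul acl token create -node-identity "node-0:dc1"

# 4. put the secret into the agent config (acl { tokens { agent = "<secret>" } })
#    and restart consul on that node
systemctl restart consul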

@hashi: what are the production steps to bootstrap a cluster with the least amount of effort? it seems like all the supposed time-saving things you’ve added don’t actually work if you care about cluster security?

unfortunately, doing the -node-identity thing for the token doesn’t seem sufficient; even with that, if I don’t set an ACL policy I get the following errors:

Feb 08 00:20:41 hostname consul[3506621]: 2023-02-08T00:20:41.313Z [ERROR] agent.anti_entropy: failed to sync remote state: error="ACL not found"
Feb 08 00:20:41 hostname consul[3506621]: agent.anti_entropy: failed to sync remote state: error="ACL not found"
[...]
Feb 08 00:20:56 hostname consul[3506621]: 2023-02-08T00:20:56.111Z [ERROR] agent: Coordinate update error: error="ACL not found"
Feb 08 00:20:56 hostname consul[3506621]: agent: Coordinate update error: error="ACL not found"

plus, I still can’t register service health checks with the management token if the agent token is set up this way, so I don’t think -node-identity is the whole picture, although I imagine it’s still required, even though that’s not documented. puts me one step closer, I guess. thanks!
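my current guess - and this is just me speculating from the behavior, not from the docs - is that the agent’s anti-entropy sync runs with the agent token, so the agent token itself needs service write for anything registered on that node. something like this on top of the node identity:

service_prefix "" {
  policy = "write"
}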

To my knowledge, this cleartext connection from Envoy to Consul is used to send configuration data and service info to Envoy on the local host. Consul tokens are transmitted over this connection, though. But the Envoy-to-Envoy connections that carry the proxied traffic are still TLS. These Consul-to-Envoy connections are localhost connections and so have a limited interception surface.
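For context, this is the connection Envoy gets bootstrapped over. With the socket setup above, launching the sidecar looks roughly like this (the service name is a placeholder, and I believe -grpc-addr accepts a unix socket path the same way the HTTP address does, so treat this as a sketch):

consul connect envoy \
  -sidecar-for my-service \
  -grpc-addr unix:///var/run/consul/consul_grpc.sock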

wouldn’t anything else using gRPC go over the actual network though, stuff that isn’t local to the consul server? I guess I don’t know how much that is, but I was under the impression that inter-server gossip is also gRPC, which would include a lot of sensitive info.