ConnectCA.Sign RPC call Permission denied

Hello guys,

I am trying to introduce Consul Connect at our company, but I am struggling with a strange error from the ConnectCA.Sign RPC call, about which I cannot find much information.

A bit of history of our deployment:

  • we already had two Consul clusters running v1.9.2 with ACLs enabled, used mainly for service discovery and as Vault storage (everything worked OK)
  • everything is deployed manually (or with Puppet); no orchestrator, Kubernetes or anything similar is involved
  • we upgraded to v1.9.8 and then to v1.10.1
  • we deployed 2 Envoy mesh gateways in each datacenter and joined one datacenter to the other with WAN federation via mesh gateways (I don’t have a direct connection between the two)

Now I am trying to deploy a POC with Consul Connect, but I cannot get it working. I am afraid I did something wrong when initializing Consul Connect, but everything I checked looks the way I believe it should (e.g. the Connect CAs in both datacenters).

The ACL tokens and policies also look fine to me. The setup is pretty much the same as for DNS service discovery (which works without any issues), except that an additional policy for the sidecar proxy service must be defined. The token is set in the agent configuration file as the default token, and the Envoy proxy doesn’t have its own token at all.

Agent token’s policies:

node_prefix "" {
  policy = "read"
}

service_prefix "" {
  policy = "read"
}

node_prefix "ceph1n" {
  policy = "write"
}

agent_prefix "ceph1n" {
  policy = "write"
}

service "ceph1" {
  policy = "write"
}

service "ceph1-sidecar-proxy" {
  policy = "write"
}
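A minimal sketch for sanity-checking the rules above, assuming Consul's documented precedence (an exact-match rule beats any prefix rule, and among prefix rules the longest prefix wins) and a default policy of deny. The helper `effective_policy` and the tuple layout are hypothetical and only model the rules; they are not Consul's implementation.

```python
# Simplified model of ACL rule precedence (illustration only, not Consul's code).
# Each entry is (resource kind, name, is_prefix, policy), copied from the policy above.
RULES = [
    ("node", "", True, "read"),
    ("service", "", True, "read"),
    ("node", "ceph1n", True, "write"),
    ("agent", "ceph1n", True, "write"),
    ("service", "ceph1", False, "write"),
    ("service", "ceph1-sidecar-proxy", False, "write"),
]


def effective_policy(kind, name, rules=RULES):
    """Return the policy that applies to `name` for the given resource kind."""
    # Exact-match rules take precedence over every prefix rule.
    for k, n, is_prefix, policy in rules:
        if k == kind and not is_prefix and n == name:
            return policy
    # Otherwise the longest matching prefix wins.
    matches = [(len(n), policy) for k, n, is_prefix, policy in rules
               if k == kind and is_prefix and name.startswith(n)]
    if not matches:
        return "deny"  # assumption: the ACL default policy is "deny"
    return max(matches, key=lambda m: m[0])[1]


if __name__ == "__main__":
    for kind, name in [("service", "ceph1"), ("service", "ceph1-sidecar-proxy"),
                       ("service", "web"), ("node", "ceph1n01")]:
        print(kind, name, effective_policy(kind, name))
```

Under this model both `ceph1` and `ceph1-sidecar-proxy` resolve to `write`, so if the signing request is rejected, the token actually presented with that request may not be the one carrying this policy.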

When I reload the Consul agent, I get the following logs:

2021-08-10T13:25:58.380+0200 [DEBUG] agent: Node info in sync
2021-08-10T13:25:58.380+0200 [DEBUG] agent: Service in sync: service=ceph1
2021-08-10T13:25:58.380+0200 [DEBUG] agent: Service in sync: service=ceph1-sidecar-proxy
2021-08-10T13:25:58.380+0200 [TRACE] agent.proxycfg: A blocking query returned; handling snapshot update: service_id=ceph1-sidecar-proxy
2021-08-10T13:25:58.380+0200 [TRACE] agent.proxycfg: A blocking query returned; handling snapshot update: service_id=ceph1-sidecar-proxy
2021-08-10T13:25:58.381+0200 [TRACE] agent.proxycfg: A blocking query returned; handling snapshot update: service_id=ceph1-sidecar-proxy
2021-08-10T13:25:58.383+0200 [ERROR] agent.client: RPC failed to server: method=ConnectCA.Sign server=192.168.1.130:8300 error="rpc error making call: rpc error making call: Permission denied"
2021-08-10T13:25:58.383+0200 [DEBUG] agent.router.manager: cycled away from server: server="consul3 (Addr: tcp/192.168.1.130:8300) (DC: tower)"
2021-08-10T13:25:58.383+0200 [WARN]  agent.cache: handling error in Cache.Notify: cache-type=connect-ca-leaf error="rpc error making call: rpc error making call: Permission denied" index=0
2021-08-10T13:25:58.383+0200 [TRACE] agent.proxycfg: A blocking query returned; handling snapshot update: service_id=ceph1-sidecar-proxy
2021-08-10T13:25:58.383+0200 [ERROR] agent.proxycfg: Failed to handle update from watch: service_id=ceph1-sidecar-proxy id=leaf error="error filling agent cache: rpc error making call: rpc error making call: Permission denied"
2021-08-10T13:25:58.385+0200 [ERROR] agent.client: RPC failed to server: method=ConnectCA.Sign server=192.168.1.128:8300 error="rpc error making call: rpc error making call: Permission denied"
2021-08-10T13:25:58.385+0200 [DEBUG] agent.router.manager: cycled away from server: server="consul2 (Addr: tcp/192.168.1.128:8300) (DC: tower)"
2021-08-10T13:25:58.385+0200 [WARN]  agent.cache: handling error in Cache.Notify: cache-type=connect-ca-leaf error="rpc error making call: rpc error making call: Permission denied" index=0
2021-08-10T13:25:58.385+0200 [TRACE] agent.proxycfg: A blocking query returned; handling snapshot update: service_id=ceph1-sidecar-proxy
2021-08-10T13:25:58.385+0200 [ERROR] agent.proxycfg: Failed to handle update from watch: service_id=ceph1-sidecar-proxy id=leaf error="error filling agent cache: rpc error making call: rpc error making call: Permission denied"
2021-08-10T13:25:58.386+0200 [ERROR] agent.client: RPC failed to server: method=ConnectCA.Sign server=192.168.1.100:8300 error="rpc error making call: Permission denied"
2021-08-10T13:25:58.386+0200 [DEBUG] agent.router.manager: cycled away from server: server="consul1 (Addr: tcp/192.168.1.100:8300) (DC: tower)"
2021-08-10T13:25:58.386+0200 [WARN]  agent.cache: handling error in Cache.Notify: cache-type=connect-ca-leaf error="rpc error making call: Permission denied" index=0
2021-08-10T13:25:58.386+0200 [TRACE] agent.proxycfg: A blocking query returned; handling snapshot update: service_id=ceph1-sidecar-proxy
2021-08-10T13:25:58.386+0200 [ERROR] agent.proxycfg: Failed to handle update from watch: service_id=ceph1-sidecar-proxy id=leaf error="error filling agent cache: rpc error making call: Permission denied"
2021-08-10T13:25:58.387+0200 [ERROR] agent.client: RPC failed to server: method=ConnectCA.Sign server=192.168.1.130:8300 error="rpc error making call: rpc error making call: Permission denied"
2021-08-10T13:25:58.387+0200 [DEBUG] agent.router.manager: cycled away from server: server="consul3 (Addr: tcp/192.168.1.130:8300) (DC: tower)"
2021-08-10T13:25:58.387+0200 [WARN]  agent.cache: handling error in Cache.Notify: cache-type=connect-ca-leaf error="rpc error making call: rpc error making call: Permission denied" index=0
2021-08-10T13:25:58.387+0200 [TRACE] agent.proxycfg: A blocking query returned; handling snapshot update: service_id=ceph1-sidecar-proxy
2021-08-10T13:25:58.387+0200 [ERROR] agent.proxycfg: Failed to handle update from watch: service_id=ceph1-sidecar-proxy id=leaf error="error filling agent cache: rpc error making call: rpc error making call: Permission denied"
2021-08-10T13:25:58.394+0200 [INFO]  agent: Synced check: check=service:ceph1-sidecar-proxy:2
2021-08-10T13:25:58.394+0200 [DEBUG] agent: Check in sync: check="Prometheus metrics endpoint"
2021-08-10T13:25:58.394+0200 [DEBUG] agent: Check in sync: check=service:ceph1-sidecar-proxy:1
2021-08-10T13:25:58.394+0200 [DEBUG] agent: Node info in sync
2021-08-10T13:25:58.394+0200 [DEBUG] agent: Service in sync: service=ceph1-sidecar-proxy
2021-08-10T13:25:58.394+0200 [DEBUG] agent: Service in sync: service=ceph1
2021-08-10T13:25:58.394+0200 [DEBUG] agent: Check in sync: check="Prometheus metrics endpoint"
2021-08-10T13:25:58.394+0200 [DEBUG] agent: Check in sync: check=service:ceph1-sidecar-proxy:1
2021-08-10T13:25:58.394+0200 [DEBUG] agent: Check in sync: check=service:ceph1-sidecar-proxy:2
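The log above can be condensed with a small script. This is only a sketch: the function name `summarize_rpc_failures` is made up, and the reading of the repeated "rpc error making call:" prefix as one extra RPC hop (the contacted server forwarding the request to another server) is an assumption, not something the logs state.

```python
import re
from collections import Counter

# Matches the "RPC failed to server" lines emitted by agent.client.
RPC_FAILED = re.compile(
    r'RPC failed to server: method=(?P<method>\S+) '
    r'server=(?P<server>\S+) error="(?P<error>[^"]*)"'
)


def summarize_rpc_failures(log_text):
    """Count failures per (method, server, number of error prefixes)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = RPC_FAILED.search(line)
        if match is None:
            continue
        # Assumption: each "rpc error making call:" prefix marks one RPC hop.
        hops = match.group("error").count("rpc error making call:")
        counts[(match.group("method"), match.group("server"), hops)] += 1
    return counts


if __name__ == "__main__":
    import sys
    for key, count in sorted(summarize_rpc_failures(sys.stdin.read()).items()):
        print(count, *key)
```

Fed with the log above, it shows that 192.168.1.130 and 192.168.1.128 report the error with two prefixes, while 192.168.1.100 reports it with one. If the assumption holds, the request is denied on consul1 and the other two servers only relay that answer.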

When I start the Envoy proxy, it just sits idle with only the admin port 19000 open, waiting to be configured.

I have run out of ideas for troubleshooting this. I have a test environment in Docker containers, which I used to test WAN federation, where everything works out of the box like a charm, and I don’t see any differences in configuration.

The only difference (insignificant, I believe) is that in the deployed environment we generated the Consul certificates (not the Connect CA) with Vault instead of using the commands

consul tls ca create
consul tls cert create ...

Any help, idea or different point of view is appreciated. I am quite stuck right now.

Thank you

I am still having this issue, so literally ANY idea will be appreciated.

What I have already tried or checked:

  • it fails in both the primary and the secondary datacenter
  • switched the Connect CA to the Vault provider and even turned on storing certificates in the leaf-cert role, but no certificate was generated
  • assigned the global-management policy to the default token on the client, but nothing changed
  • added the missing localhost, 127.0.0.1 and client.tower.consul SANs to the client agent’s TLS certificate (the connection to the Consul cluster and service discovery worked fine without them)
  • the firewall is not in the way; I can see encrypted communication between the agent and the servers without any TCP timeouts or resets
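One more check that could narrow down the token question is to request the leaf certificate from the local agent with an explicitly chosen token, via the agent HTTP API endpoint `/v1/agent/connect/ca/leaf/<service>`. A minimal sketch, assuming the agent's HTTP API listens on plain HTTP at 127.0.0.1:8500 (adjust for HTTPS-only agents); the function names are hypothetical.

```python
import json
import urllib.request


def leaf_cert_request(service, token, agent="http://127.0.0.1:8500"):
    """Build a GET request for the local agent's leaf certificate endpoint."""
    request = urllib.request.Request(f"{agent}/v1/agent/connect/ca/leaf/{service}")
    # The ACL token is passed explicitly instead of relying on the default token.
    request.add_header("X-Consul-Token", token)
    return request


def fetch_leaf_cert(service, token):
    """Return the parsed leaf certificate; raises urllib.error.HTTPError on denial."""
    with urllib.request.urlopen(leaf_cert_request(service, token), timeout=10) as response:
        return json.load(response)
```

Calling `fetch_leaf_cert("ceph1", token)` once with the agent token and once with a management token would show whether the denial depends on the token at all.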

I also tried to look into the source code, but it wasn’t much help. I am not familiar with the Consul source code; I just grep it for error messages and see where they lead me. I figured out (= guessed) that:

All the clues from the source code point to wrong ACLs, but I cannot understand why, since the service is properly registered (= the token has a write policy for it) with a single instance; only the Connect leaf certificate is not signed and Envoy is not configured.

In the lab environment everything worked (almost) on the first try, and I really don’t want to delete everything and re-bootstrap both Consul clusters because of this.