Hi Consul community,
I’m tearing my hair out (and I don’t have much to begin with) setting up a Consul environment.
First, a little background. I previously built a Consul cluster, which is still working. It was deployed with Ansible automation written early in my use of that tool: the deployment worked, but it was not idempotent, and subsequent re-runs of the playbook would damage the cluster. So I started reworking my Ansible to make it smarter and idempotent, able to detect the state of the Consul cluster and services and automatically run only the required tasks. I built a new Consul cluster environment so I could carry out this refactoring without wrecking the existing one.
The issues I am having:
- The cluster starts, a leader is elected, all looks merry. Bootstrap is successful and a node policy and token are created with the latter set across the cluster.
verify_[incoming|outgoing|server_hostname] are all set to true, ACLs are enabled and the default policy is deny, though I also tried allow.
Then I start the Consul client agents, which all fail to start with the error:
[ERROR] agent.auto_config: AutoEncrypt.Sign RPC failed: addr=10.12.0.8:8300 error="rpcinsecure error making call: rpcinsecure error making call: Permission denied"
[ERROR] agent.auto_config: AutoEncrypt.Sign RPC failed: addr=10.12.0.8:8300 error="rpcinsecure error making call: Permission denied"
[ERROR] agent.auto_config: No servers successfully responded to the auto-encrypt request
- I constantly get the below warning in my logs:
[WARN] agent.server.rpc: a TLS certificate with a CommonName of server.vault-consul-dev.consul-dev is required for this operation: from=10.12.0.10:44612 operation="raft RPC"
My consul server configuration:
datacenter = "vault-consul-dev"
domain = "consul-dev"
node_name = "vault-consul-dev2"
data_dir = "/opt/consul"
encrypt = "insert encrypted stuff here"
ca_file = "/etc/consul.d/certs/consul-dev-agent-ca.pem"
cert_file = "/etc/consul.d/certs/vault-consul-dev-server-consul-dev-0.pem"
key_file = "/etc/consul.d/certs/vault-consul-dev-server-consul-dev-0-key.pem"
verify_incoming = true
verify_outgoing = true
verify_server_hostname = true
auto_encrypt {
  allow_tls = true
}
# log_level = "DEBUG"
enable_syslog = true
retry_join = ["10.12.0.8", "10.12.0.10"]
acl = {
  enabled = true
  default_policy = "deny"
  down_policy = "extend-cache"
  enable_token_persistence = true
  token_ttl = "10s"
}
performance {
  raft_multiplier = 1
}
The CA and server certificates are created using ‘consul tls’ commands specifying the domain and the datacentre appropriately and resulting in a server certificate with the CN and SAN as below:
Subject: CN=server.vault-consul-dev.consul-dev
X509v3 Subject Alternative Name:
DNS:server.vault-consul-dev.consul-dev, DNS:localhost, IP Address:127.0.0.1
I compared this with the cert on my previous working environment and they are the same, apart from the different datacentre and domain of course.
- If I set verify_[incoming|outgoing] to false and the default policy to allow, the end result is a failure to elect a cluster leader:
Sep 28 12:03:13 vault-consul-dev2 consul[21516]: 2022-09-28T12:03:13.339+1000 [WARN] agent.server.raft: Election timeout reached, restarting election
Sep 28 12:03:13 vault-consul-dev2 consul: 2022-09-28T12:03:13.339+1000 [WARN] agent.server.raft: Election timeout reached, restarting election
Sep 28 12:03:13 vault-consul-dev2 consul: 2022-09-28T12:03:13.340+1000 [INFO] agent.server.raft: entering candidate state: node="Node at 10.12.0.9:8300 [Candidate]" term=2039
Sep 28 12:03:13 vault-consul-dev2 consul[21516]: 2022-09-28T12:03:13.340+1000 [INFO] agent.server.raft: entering candidate state: node="Node at 10.12.0.9:8300 [Candidate]" term=2039
Sep 28 12:03:13 vault-consul-dev2 consul[21516]: 2022-09-28T12:03:13.348+1000 [ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter b5b9e1e0-2c01-0625-3015-e63ce6e27c84 10.12.0.8:8300}" error=EOF
Sep 28 12:03:13 vault-consul-dev2 consul: 2022-09-28T12:03:13.348+1000 [ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter b5b9e1e0-2c01-0625-3015-e63ce6e27c84 10.12.0.8:8300}" error=EOF
Sep 28 12:03:13 vault-consul-dev2 consul: 2022-09-28T12:03:13.348+1000 [ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter a00e7b6e-9dc4-9e01-a385-464e22901d5f 10.12.0.10:8300}" error=EOF
Sep 28 12:03:13 vault-consul-dev2 consul[21516]: 2022-09-28T12:03:13.348+1000 [ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter a00e7b6e-9dc4-9e01-a385-464e22901d5f 10.12.0.10:8300}" error=EOF
But if I also set verify_server_hostname to false, it works. I don’t understand why setting all of verify_[incoming|outgoing|server_hostname] to true allows the cluster to elect a leader, while setting only verify_[incoming|outgoing] to false does not.
However, the consul client agents still cannot start. Only by commenting out auto_encrypt will they start and join.
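For context, a client agent configuration consistent with the setup described here could look like the sketch below. It is illustrative only: the node name, the CA path on the client, and the token placeholder are assumptions, not values copied from the actual deployment. The acl tokens stanza is also an assumption, included because a server with default_policy = "deny" rejects requests from agents that present no token with sufficient rights.

```hcl
# Sketch of a client agent config (assumed values are marked).
datacenter = "vault-consul-dev"
domain     = "consul-dev"
node_name  = "vault-client-1"            # assumed, hypothetical node name
data_dir   = "/opt/consul"
server     = false
encrypt    = "insert encrypted stuff here"

# Clients only need the CA; auto_encrypt requests the client cert from the servers.
ca_file                = "/etc/consul.d/certs/consul-dev-agent-ca.pem"  # assumed same path as on servers
verify_incoming        = false
verify_outgoing        = true
verify_server_hostname = true

auto_encrypt {
  tls = true                             # client side counterpart of allow_tls on the servers
}

retry_join = ["10.12.0.8", "10.12.0.10"]

acl {
  enabled        = true
  default_policy = "deny"
  tokens {
    agent = "insert node token here"     # assumed: token created from node_policy
  }
}
```

With this layout the only secrets that have to be distributed to a client are the gossip key and the agent token; the client certificate itself comes from the servers.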
Now regarding ACLs…
node_policy:
agent_prefix "" {
  policy = "write"
}
node_prefix "" {
  policy = "write"
}
service_prefix "" {
  policy = "read"
}
session_prefix "" {
  policy = "read"
}
vault-agent-policy:
node_prefix "vault" {
  policy = "write"
}
Am I correct that hosts with hostnames starting with vault should therefore be able to write to the cluster?
Am I correct that when I start the cluster, ACLs should be disabled until all servers and clients have joined, and only then enabled? Are the clients failing because, with ACLs enabled, they cannot get their certificates from the cluster when they join?
Currently my deployment process is:
- Common tasks on both vault and consul servers
- Install Consul on servers
- Generate a gossip key and store on ansible controller as an encrypted group var
- Deploy config, start service. Config has all verify_ params set to true and default policy deny
- Create certs and distribute across the cluster
- Identify the leader, set default policy to allow, restart and bootstrap the cluster, storing the created token as an encrypted group var on the ansible controller
- Create node policy and token and distribute across the cluster, set default policy to deny
- Create service agent policies and tokens
- Install consul client agents with CA cert and gossip key, start the service
Can anyone see where in this process I am going wrong? Obviously this is a cut-down version of those events, as I have a lot of checks and validations going on to see what the state of the cluster is, i.e. whether bootstrapping is required or not, whether things such as policies and tokens already exist, and whether they match those stored as encrypted group vars.
What am I doing wrong here please?