Hi everybody,
at our company we are having trouble getting Consul Connect proxies to work.
We are running Vault 1.4.1, Consul 1.8.0, and Nomad 0.12.1.
We use Vault for secrets management and PKI, and we set up Consul to use a root and intermediate CA from Vault. These are the commands we used to create the Vault PKI backends:
vault secrets enable pki
vault secrets tune -max-lease-ttl=87600h pki
vault write -field=certificate pki/root/generate/internal common_name="consul" ttl=87600h
vault write pki/config/urls issuing_certificates="http://127.0.0.1:8200/v1/pki/ca" crl_distribution_points="http://127.0.0.1:8200/v1/pki/crl"
vault secrets enable -path=pki_int pki
vault secrets tune -max-lease-ttl=43800h pki_int
vault write -format=json pki_int/intermediate/generate/internal common_name="consul Intermediate Authority" alt_names="localhost,127.0.0.1" ip_sans="127.0.0.1" | jq -r '.data.csr' > pki_intermediate.csr
vault write -format=json pki/root/sign-intermediate csr=@pki_intermediate.csr format=pem_bundle ttl="43800h" | jq -r '.data.certificate' > intermediate.cert.pem
vault write pki_int/intermediate/set-signed certificate=@intermediate.cert.pem
vault write pki_int/roles/consul allowed_domains="consul,127.0.0.1" allow_subdomains=true max_ttl="720h"
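As a sanity check of this chain (a rough sketch; root.pem and int.pem are just scratch file names we chose), the intermediate should verify against the root:
vault read -field=certificate pki/cert/ca > root.pem
vault read -field=certificate pki_int/cert/ca > int.pem
openssl verify -CAfile root.pem int.pem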
On the same hosts we run the Consul and Nomad servers, with the configuration below:
Consul:
datacenter = "hcpoc"
data_dir = "/var/lib/consul"
log_level = "DEBUG"
node_name = "${node_id}"
bootstrap_expect = 3
retry_join = [
%{ for n in setsubtract(keys("${cluster_nodes}"), [node_id]) ~}
"${cluster_nodes[n]}:8301",
%{ endfor ~}
]
ports {
grpc = 8502
https = 8501
http = 8500
}
server = true
telemetry = {
prometheus_retention_time = "2s"
statsite_address = "127.0.0.1:2180"
}
acl {
enabled = true
default_policy = "deny"
enable_token_persistence = true
{{ with secret "consul/creds/consul-agent-role" }}tokens = { default = "{{ .Data.token }}" }{{ end }}
}
ui = true
client_addr = "0.0.0.0"
connect {
enabled = true
ca_provider = "vault"
ca_config {
address = "http://localhost:8200"
token = "/etc/consul.d/vault_token"
root_pki_path = "pki"
intermediate_pki_path = "pki_int"
}
}
cert_file = "/etc/consul.d/cert"
key_file = "/etc/consul.d/keyfile"
ca_file = "/etc/consul.d/ca"
verify_outgoing = true
verify_incoming = true
verify_server_hostname = true
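After the agents start, the Vault provider should be visible in the Connect CA configuration; we would check it roughly like this (the token is a placeholder):
curl -s -H "X-Consul-Token: <token>" http://localhost:8500/v1/connect/ca/configuration | jq .Provider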
Nomad:
datacenter = "hcpoc"
data_dir = "/var/lib/nomad"
log_level = "DEBUG"
server {
enabled = true
bootstrap_expect = 3
}
server_join {
retry_join = [
%{ for n in setsubtract(keys("${cluster_nodes}"), [node_id]) ~}
"${cluster_nodes[n]}:4647",
%{ endfor ~}
]
}
plugin "raw_exec" {
config {
enabled = true
}
}
consul {
address = "127.0.0.1:8501"
server_service_name = "nomad"
client_service_name = "nomad-client"
auto_advertise = true
server_auto_join = true
client_auto_join = true
ca_file = "/etc/consul.d/ca"
cert_file = "/etc/consul.d/cert"
key_file = "/etc/consul.d/keyfile"
ssl = true
verify_ssl = true
token = "{{ with secret "secret/data/consul/nomad_server_token" }}{{ .Data.data.token }}{{ end }}"
}
tls {
http = true
rpc = true
ca_file = "/etc/consul.d/ca"
cert_file = "/etc/consul.d/cert"
key_file = "/etc/consul.d/keyfile"
verify_https_client = false
#verify_server_hostname = true
}
telemetry {
collection_interval = "1s"
disable_hostname = true
prometheus_metrics = true
publish_allocation_metrics = true
publish_node_metrics = true
}
The job that we deploy on nomad is this one:
job "tester" {
datacenters = ["hcpoc"]
group "curl" {
network {
mode = "bridge"
port "http" {}
}
service {
name = "curl-service"
port = "8080"
connect {
sidecar_service {
proxy {
upstreams {
destination_name = "echo-service"
local_bind_port = 8080
}
}
}
}
}
task "curl" {
driver = "docker"
config {
image = "<image_repo>"
command = "sleep"
args = ["10000000"]
}
}
}
group "echo-server" {
network {
mode = "bridge"
port "http" {}
}
service {
name = "echo-service"
tags = [ "web" ]
port = "http",
connect = {
sidecar_service {}
}
check {
type = "http"
port = "http"
path = "/health"
interval = "30s"
timeout = "2s"
}
}
task "echo" {
driver = "exec"
config {
command = "local/echo-server"
}
env {
PORT = "${NOMAD_PORT_http}"
}
artifact {
source = "gcs::<bucket-url>",
options = {
token = "<token>"
},
destination = "local/"
}
}
}
}
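Once the allocations are running, we exercise the upstream from inside the curl task roughly like this (the alloc ID is a placeholder, and we assume the image ships curl):
nomad alloc exec -task curl <alloc-id> curl -s http://127.0.0.1:8080/health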
The proxies are deployed and start running.
We noticed that Consul shows the proxies' health checks failing with:
dial tcp 127.0.0.1:27814: connect: connection refused
We then investigated, and in the Consul logs on the node where the proxies are deployed we found this:
2020-07-27T10:29:16.833Z [WARN] agent: grpc: Server.Serve failed to complete security handshake from "127.0.0.1:51200": remote error: tls: unknown certificate authority
Nomad shows this:
client.alloc_runner.runner_hook: error proxying to Consul: alloc_id=52406db1-7394-f56e-b1fd-1c7495c4ed26 error="readfrom tcp 127.0.0.1:52114->127.0.0.1:8502: splice: connection reset by peer" dest=127.0.0.1:8502 src_local=/var/lib/nomad/alloc/52406db1-7394-f56e-b1fd-1c7495c4ed26/alloc/tmp/consul_grpc.sock src_remote=@ bytes=138
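A quick way to see which CA actually signed the certificate Consul serves on the gRPC port (assuming openssl is available on the node):
openssl s_client -connect 127.0.0.1:8502 -showcerts </dev/null 2>/dev/null | openssl x509 -noout -issuer -subject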
We also queried the CA roots with
curl -s -H "X-Consul-Token: <token>" http://localhost:8500/v1/connect/ca/roots | jq .
and noticed only one root CA, whose certificate is different from the one created in Vault.
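For reference, this is roughly how the two roots can be compared by fingerprint (token and mount paths as in our setup above):
curl -s -H "X-Consul-Token: <token>" http://localhost:8500/v1/connect/ca/roots | jq -r '.Roots[0].RootCert' > consul_root.pem
vault read -field=certificate pki/cert/ca > vault_root.pem
openssl x509 -in consul_root.pem -noout -fingerprint
openssl x509 -in vault_root.pem -noout -fingerprint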
What are we missing?
Thank you
Francesco