Consul (ssl, with Vault CA) + Nomad (tls): Envoy proxies return "tls: unknown certificate authority"

Hi everybody,

at our company we are having trouble getting the Consul Connect proxies to work.

We have Vault 1.4.1, Consul 1.8.0 and Nomad 0.12.1.

We use Vault for secrets management and PKI. We then set up Consul to use a root and intermediate CA from Vault. These are the Vault PKI creation commands:

vault secrets enable pki
vault secrets tune -max-lease-ttl=87600h pki
vault write -field=certificate pki/root/generate/internal common_name="consul" ttl=87600h
vault write pki/config/urls issuing_certificates="http://127.0.0.1:8200/v1/pki/ca" crl_distribution_points="http://127.0.0.1:8200/v1/pki/crl"
vault secrets enable -path=pki_int pki
vault secrets tune -max-lease-ttl=43800h pki_int
vault write -format=json pki_int/intermediate/generate/internal common_name="consul Intermediate Authority" alt_names="localhost,127.0.0.1" ip_sans="127.0.0.1" | jq -r '.data.csr' > pki_intermediate.csr
vault write -format=json pki/root/sign-intermediate csr=@pki_intermediate.csr format=pem_bundle ttl="43800h" | jq -r '.data.certificate' > intermediate.cert.pem
vault write pki_int/intermediate/set-signed certificate=@intermediate.cert.pem
vault write pki_int/roles/consul allowed_domains="consul,127.0.0.1" allow_subdomains=true max_ttl="720h"

On the same hosts we run the Consul and Nomad servers, with the configs below:

Consul:

datacenter = "hcpoc"
data_dir = "/var/lib/consul"
log_level = "DEBUG"
node_name = "${node_id}"
bootstrap_expect = 3
retry_join = [
   %{ for n in setsubtract(keys("${cluster_nodes}"), [node_id]) ~}
   "${cluster_nodes[n]}:8301",
   %{ endfor ~}
]
ports {
  grpc  = 8502
  https = 8501
  http  = 8500
}
server = true
telemetry = {
   prometheus_retention_time = "2s"
   statsite_address = "127.0.0.1:2180"
}
acl {
   enabled = true
   default_policy = "deny"
   enable_token_persistence = true
   {{ with secret "consul/creds/consul-agent-role" }}tokens = { default = "{{ .Data.token }}" }{{ end }}
 }
ui = true
client_addr = "0.0.0.0"
connect {
   enabled = true
   ca_provider = "vault"
   ca_config {
        address = "http://localhost:8200"
        token = "/etc/consul.d/vault_token"
        root_pki_path = "pki"
        intermediate_pki_path = "pki_int"
   }
}
cert_file = "/etc/consul.d/cert"
key_file = "/etc/consul.d/keyfile"
ca_file = "/etc/consul.d/ca"
verify_outgoing = true
verify_incoming = true
verify_server_hostname = true

Nomad:

datacenter = "hcpoc"
data_dir = "/var/lib/nomad"
log_level = "DEBUG"
server {
  enabled = true
  bootstrap_expect = 3
}

server_join {
    retry_join = [
        %{ for n in setsubtract(keys("${cluster_nodes}"), [node_id]) ~}
        "${cluster_nodes[n]}:4647",
        %{ endfor ~}
    ]
}

plugin "raw_exec" {
  config {
    enabled = true
  }
}

consul {
  address = "127.0.0.1:8501"
  server_service_name = "nomad"
  client_service_name = "nomad-client"
  auto_advertise      = true
  server_auto_join    = true
  client_auto_join    = true

  ca_file    = "/etc/consul.d/ca"
  cert_file  = "/etc/consul.d/cert"
  key_file   = "/etc/consul.d/keyfile"
  ssl        = true
  verify_ssl = true

  token = "{{ with secret "secret/data/consul/nomad_server_token" }}{{ .Data.data.token }}{{ end }}"
}

tls {
    http = true
    rpc  = true

    ca_file    = "/etc/consul.d/ca"
    cert_file  = "/etc/consul.d/cert"
    key_file   = "/etc/consul.d/keyfile"

    verify_https_client    = false
    #verify_server_hostname = true
}

telemetry {
  collection_interval = "1s"
  disable_hostname = true
  prometheus_metrics = true
  publish_allocation_metrics = true
  publish_node_metrics = true
}

The job that we deploy on nomad is this one:

job "tester" {
    datacenters = ["hcpoc"]

    group "curl" {
        network {
            mode = "bridge"
            port "http" {}
        }

        service {
            name = "curl-service"
            port = "8080"

            connect {
                sidecar_service {
                    proxy {
                        upstreams {
                            destination_name = "echo-service"
                            local_bind_port = 8080
                        }
                    }
                }
            }
        }

        task "curl" {
          
            driver = "docker"

            config {
                image = "<image_repo>"
                command = "sleep"
                args = ["10000000"]
            }
        }
    }

    group "echo-server" {
        network {
            mode = "bridge"
            port "http" {}
        }

        service {
            name = "echo-service"
            tags = [ "web" ]
            port = "http"
            connect {
                sidecar_service {}
            }
            check {
                type = "http"
                port = "http"
                path = "/health"
                interval = "30s"
                timeout = "2s"
            }
        }

        task "echo" {
            driver = "exec"

            config {
                command = "local/echo-server"
            }
          
            env {
              PORT = "${NOMAD_PORT_http}"
            }

            artifact {
                source = "gcs::<bucket-url>"
                options {
                    token = "<token>"
                }
                destination = "local/"
            }
        }
    }
}

The proxies are deployed and start running.

We noticed that Consul shows the proxies' health check failing with
dial tcp 127.0.0.1:27814: connect: connection refused.

We then investigated, and in the Consul logs (on the node where the proxies are deployed) we found this:

2020-07-27T10:29:16.833Z [WARN]  agent: grpc: Server.Serve failed to complete security handshake from "127.0.0.1:51200": remote error: tls: unknown certificate authority

Nomad shows this:

client.alloc_runner.runner_hook: error proxying to Consul: alloc_id=52406db1-7394-f56e-b1fd-1c7495c4ed26 error="readfrom tcp 127.0.0.1:52114->127.0.0.1:8502: splice: connection reset by peer" dest=127.0.0.1:8502 src_local=/var/lib/nomad/alloc/52406db1-7394-f56e-b1fd-1c7495c4ed26/alloc/tmp/consul_grpc.sock src_remote=@ bytes=138

We also queried the CA config with curl -s -H "X-Consul-Token: <token>" http://localhost:8500/v1/connect/ca/roots | jq . and noticed only one root CA, whose certificate differs from the one created in Vault.

What are we missing?

Thank you

Francesco

Hi @efbar

Thanks for posting about this, and for hopping on a call! I just wanted to post some of the notes from our conversation, so that others following along may understand what is happening.

First, we recognized that you are using the same keys & certs for securing agent communication as well as for your Connect CA via Vault. As a group, we agreed to change this so that there are different keys & certs for each distinct purpose.

Second, your setup of the cluster initialized the default Connect CA. You should be able to start Connect with the Vault PKI without setting up the built-in Connect CA first.
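To make the "different keys & certs per purpose" idea concrete, here is a minimal sketch of what the Consul config might look like; the file paths and the connect_root/connect_intermediate PKI mount names are hypothetical, chosen only for illustration:

```hcl
# Sketch only: one set of certs for agent RPC/HTTPS, issued from a
# role dedicated to agent TLS (paths are made up for this example).
cert_file = "/etc/consul.d/tls/agent-cert.pem"
key_file  = "/etc/consul.d/tls/agent-key.pem"
ca_file   = "/etc/consul.d/tls/agent-ca.pem"

# The Connect CA uses its own, separate Vault PKI mounts,
# distinct from the ones backing the agent certificates above.
connect {
  enabled     = true
  ca_provider = "vault"
  ca_config {
    address               = "http://localhost:8200"
    token                 = "<vault-token>"
    root_pki_path         = "connect_root"
    intermediate_pki_path = "connect_intermediate"
  }
}
```

The point of the separation is that rotating or revoking agent TLS material then never disturbs the Connect CA chain, and vice versa.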

Please respond here about your findings, and thank you again for the detailed post!

Hi @jsosulska

thank you all for the help, and for the call too!

We proceeded to try something; just for testing, we did the following:

  1. We bootstrapped the cluster without Connect enabled, as you advised, and then enabled it once we had all the Vault certificates. This let us see that /v1/connect/ca/configuration returned the Vault configuration, whereas before it returned the one created by Consul.

  2. We can now see another intermediate (“leaf-cert”) in Vault.

  3. Deploying the job, the proxies still didn’t work, but after appending the “new” CA root certificate to the old one (so, one file with two certificates), restarting Nomad, and redeploying the Nomad job, we finally saw the proxies working!

This method is not rock solid; we will try to follow the approach you suggested for the resolution, i.e. different certs for different purposes.

I’ll keep you updated regarding this process.

Thank you

Francesco


Happy to help move your troubleshooting along, @efbar. I’ll be watching this for more updates 🙂

Hi @jsosulska,

some updates here. I think we have reached our goal.

After some testing, we ended up configuring Vault with a single PKI and two roles, one for Consul and one for Nomad.

So now, regarding TLS, we use the intermediate to issue a certificate for each of Consul and Nomad, requesting them from the respective roles (with the appropriate alt_names, etc.), so in their configs they have separate certificate files.

Regarding the CA, however, for both Nomad and Consul we actually put the two CA certificates in the same file, like:

-----BEGIN CERTIFICATE-----
....
-----END CERTIFICATE-----
-----BEGIN CERTIFICATE-----
....
-----END CERTIFICATE-----
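The bundle-building step above can be sketched end to end with throwaway openssl certificates (stand-ins for the real Vault root and intermediate; all names here are made up for illustration):

```shell
# Throwaway self-signed root CA (stand-in for the Vault root).
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout root.key -out root.pem -subj "/CN=demo-root"

# Intermediate key + CSR, then sign the CSR with the root.
openssl req -newkey rsa:2048 -nodes \
  -keyout int.key -out int.csr -subj "/CN=demo-intermediate"
openssl x509 -req -in int.csr -CA root.pem -CAkey root.key \
  -CAcreateserial -days 1 -out int.pem

# Bundle: root certificate first, then the intermediate, one file.
cat root.pem int.pem > ca-bundle.pem

# Sanity checks: the intermediate chains to the root, and the
# bundle contains exactly two certificates.
openssl verify -CAfile root.pem int.pem
grep -c "BEGIN CERTIFICATE" ca-bundle.pem
```

The resulting ca-bundle.pem has the same two-certificate shape as the file shown above; pointing ca_file at such a bundle is what made both CAs appear in Envoy's trusted_ca.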

The first one is the root CA, and the second is the intermediate we created. This way, inside the Envoy JSON config file we can find both of them under:

  ...
   "tls_context":{
     "common_tls_context":{
       "validation_context":{
         "trusted_ca": 

and the trusted_ca field contains both.
With this configuration, there are no more "unknown certificate authority" or "connection refused" errors.

On the Vault side, Connect creates its own intermediate by itself, with a role named leaf-cert configured with everything necessary (the spiffe://* SAN, etc.).
We only had to provide the path of the root, the path of the intermediate, and the token; with that token (and the right policies), Consul creates the intermediate on its own. One note regarding the token in the ca_config stanza: we had to put the token in directly, not as a path to a file containing it. It seems Consul couldn't read it that way and so couldn't access Vault. Is that expected (there was no file-permission problem)?
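In other words, the distinction that tripped us up looks like this (a sketch; the token value shown is a placeholder, not a real token):

```hcl
connect {
  enabled     = true
  ca_provider = "vault"
  ca_config {
    address               = "http://localhost:8200"
    # Works: the literal token string inline.
    token                 = "s.XXXXXXXXXXXX"
    # Did not work for us: a path to a file containing the token.
    # token               = "/etc/consul.d/vault_token"
    root_pki_path         = "pki"
    intermediate_pki_path = "pki_int"
  }
}
```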

Anyway, with all of this, we can now have every envoy proxy working.

Thank you

Francesco
