Trouble getting Consul Connect and Envoy to work

TLDR: After setting up a web server on one VM and configuring Connect and Envoy to access it from another VM, curl localhost:8889 results in: curl: (56) Recv failure: Connection reset by peer.


Based on both:

  1. https://learn.hashicorp.com/consul/developer-mesh/connect-envoy
  2. https://learn.hashicorp.com/consul/developer-mesh/connect-production

I set up a three-VM Consul cluster, with a web server and a client on two of the VMs, as follows:

VM-1: the web server

  1. Runs apache web server serving a static test page. Listens on 127.0.0.1:80

  2. Runs the consul-envoy docker image as described in https://learn.hashicorp.com/consul/developer-mesh/connect-envoy, but using Envoy 1.13.0, with the following service definition:

    # /consul/config/web.json
    {
        "service": {
            "connect": {
                "sidecar_service": {}
            },
            "name": "web",
            "port": 80
        }
    }
    
  3. The consul-envoy container is run using the following Ansible task:

    - name: Run web service proxy
      docker_container:
        name: web-proxy
        image: consul-envoy
        auto_remove: yes
        command: -sidecar-for web
        network_mode: host
        volumes:
          - /consul/certs:/consul/certs:ro
        env:
          CONSUL_HTTP_SSL: "true"
          CONSUL_CACERT: /consul/certs/consul-ca.pem
          CONSUL_CLIENT_CERT: /consul/certs/consul-1-cli.pem
          CONSUL_CLIENT_KEY: /consul/certs/consul-1-cli-key.pem
    
  4. From the consul-envoy container, curl localhost works fine and is able to access the web server on the VM.

  5. The only bit of the consul-envoy container logs that seemed relevant is:

    [1][info][upstream] [source/server/lds_api.cc:73] lds: add/update listener 'public_listener:0.0.0.0:21000'
    

VM-2: the web client

  1. Runs the same consul-envoy docker image as VM-1, with the following service definition:

    # /consul/config/web-client.json
    {
        "service": {
            "connect": {
                "sidecar_service": {
                    "proxy": {
                        "upstreams": [
                            {
                                "destination_name": "web",
                                "local_bind_port": 8889
                            }
                        ]
                    }
                }
            },
            "name": "web-client",
            "port": 8888
        }
    }
    
  2. The consul-envoy container is run using the following Ansible task:

    - name: Run web client service proxy
      docker_container:
        name: web-client-proxy
        image: consul-envoy
        auto_remove: yes
        command: -sidecar-for web-client
        network_mode: host
        volumes:
          - /consul/certs:/consul/certs:ro
        env:
          CONSUL_HTTP_SSL: "true"
          CONSUL_CACERT: /consul/certs/consul-ca.pem
          CONSUL_CLIENT_CERT: /consul/certs/consul-2-cli.pem
          CONSUL_CLIENT_KEY: /consul/certs/consul-2-cli-key.pem
    
  3. Running curl localhost:8889 results in:

    curl: (56) Recv failure: Connection reset by peer
    

    This is the problem I’m facing!
    Why isn’t the client able to get through to the server?!

    More generally, how can I go about debugging this situation? (Some commands I intend to poke at are sketched below.) I expected these tools to make it easier to trace issues, but I’m not sure where to start. The consul-envoy logs don’t seem to contain anything relevant, AFAICT. I was only able to find this line:

    [1][info][upstream] [source/server/lds_api.cc:73] lds: add/update listener 'web:127.0.0.1:8889'
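
    What I plan to poke at next is the Envoy admin interface, which consul connect envoy binds to 127.0.0.1:19000 by default; these are standard Envoy admin endpoints (a rough sketch, assuming curl is available inside the container):

    # Raise Envoy's log level at runtime through the admin interface
    docker exec web-client-proxy curl -s -X POST "localhost:19000/logging?level=debug"

    # Upstream cluster state for the "web" upstream
    docker exec web-client-proxy curl -s localhost:19000/clusters | grep -i web

    # Certificates Envoy is actually using
    docker exec web-client-proxy curl -s localhost:19000/certs

    # Alternatively, start the sidecar with debug logging from the beginning;
    # anything after -- is passed straight to the Envoy binary:
    #   consul connect envoy -sidecar-for web-client -- -l debug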
    

Consul Agent

Consul is run as a Docker container using the following Ansible task:

- name: Start consul container
  docker_container:
    name: "{{ inventory_hostname }}"  # consul-1, consul-2, consul-3
    image: consul
    network_mode: host
    command: agent -server -bind={{ ansible_default_ipv4.address }}
    volumes:
      - /consul/data:/consul/data:rw
      - /consul/certs:/consul/certs:ro
      - /consul/config:/consul/config:rw

Consul Agent Configuration

/consul/config/agent.json:

{
    "auto_encrypt": {
        "allow_tls": true
    },
    "bootstrap_expect": 3,
    "ca_file": "/consul/certs/consul-ca.pem",
    "cert_file": "/consul/certs/consul-1.pem",  # different file for each agent
    "connect": {
        "ca_config": {
            "private_key": "-----BEGIN EC PRIVATE KEY----- ...",
            "root_cert": "-----BEGIN CERTIFICATE----- ..."
        },
        "ca_provider": "consul",
        "enabled": true
    },
    "datacenter": "test",
    "encrypt": "<encryption-key>",
    "key_file": "/consul/certs/consul-1-key.pem",  # different file for each agent
    "performance": {
        "raft_multiplier": 1
    },
    "ports": {
        "grpc": 8502,
        "http": -1,
        "https": 8500
    },
    "retry_join": [
        "192.168.121.89",
        "192.168.121.2",
        "192.168.121.202"
    ],
    "server": true,
    "ui": true,
    "verify_incoming": true,
    "verify_incoming_rpc": true,
    "verify_outgoing": true,
    "verify_server_hostname": true
}

I hope I did not miss any relevant information. Please feel free to ask for any.

PS. I think adding a guide for a setup like this to the Learn articles would be valuable. The existing guides are far from production-ready, and are all based on containers on a single host, which is almost never what you want.

Is port 8889 exposed out of the container? Can you check whether the port is listening using netstat inside the container, and whether it is reachable using telnet from somewhere else?

The curl command is executed on the VM-2 host, where the web-client-proxy is running, and Envoy is indeed listening:

> netstat -lntp | grep 8889
tcp        0      0 127.0.0.1:8889          0.0.0.0:*               LISTEN      29354/envoy

In fact, the error I’m getting is different from the error I get on a port where nothing is listening:

> curl localhost:8889
curl: (56) Recv failure: Connection reset by peer
> curl localhost:8800
curl: (7) Failed to connect to localhost port 8800: Connection refused

All containers are run using Docker’s host networking.

netcat/telnet disconnects immediately, both from VM-2 and from within the consul-envoy client proxy container (web-client-proxy).

What I found curious is that for the four services:

  1. web
  2. web-client
  3. web-sidecar-proxy
  4. web-client-sidecar-proxy

The ServiceConnect key is empty:

> curl https://localhost:8500/v1/catalog/service/web?pretty
{
    ...
    "ServiceConnect": {},
    ...
}

Is that how it should be?

What about telnet-ing from vm-2 to vm-1 using the sidecar-proxy port (guessing 21000)? The connection should be established between the proxies.

Another idea is intentions, but it seems that your setup doesn’t have any configured…
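
You could also check which Connect-capable instances the catalog returns for web, since that is what the client proxy should be discovering (standard Consul API path; add client cert/CA curl flags if your agent requires them):

curl https://localhost:8500/v1/health/connect/web?pretty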

I’m able to netcat/telnet from vm-2 to vm-1:21000. I get a prompt.

I have no intentions configured:

> curl https://localhost:8500/v1/connect/intentions?pretty
[]

So why can’t I see any errors in either the Consul or the Envoy logs?
How can I debug issues like this?

BTW, here is an Ansible playbook that will reproduce my entire setup using three Debian 10 hosts:

After enabling debug logs on Envoy, I can now see certificate-related errors in both Envoy proxies:

On consul-1 (server):
[2020-03-15 13:39:57.341][31][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:226] [C2365] TLS error: 268436504:SSL routines:OPENSSL_internal:TLSV1_ALERT_UNKNOWN_CA

On consul-2 (client):
[2020-03-15 13:39:57.335][27][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:226] [C263] TLS error: 268435581:SSL routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED

Any clue about what may be causing these errors?

Are any of the certificates generated incorrectly?
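
One way I can think of to narrow this down is to compare what the server-side proxy actually presents on its public listener against what Connect reports as its roots (a sketch; <vm-1> is a placeholder for the server VM’s address):

echo | openssl s_client -connect <vm-1>:21000 -showcerts 2>/dev/null \
  | openssl x509 -noout -text | grep -E 'Issuer|Subject:|DNS:|URI:'
curl https://localhost:8500/v1/connect/ca/roots?pretty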

Figured out what the issue was :wink:

The Connect CA was the culprit. By removing my Connect CA certificate and key (connect.ca_config), I got Connect to generate its own CA, which did work!

The main difference I can see between my Connect CA and the one Connect generated is that Connect sets a SAN to the cluster identifier with the .consul TLD, as specified in the documentation.

Now, the documentation also states that the .consul TLD cluster identifier “can be found using the CA List Roots endpoint.”

How can I generate the CA certificate using an ID that can only be found by querying Connect after it has already created its own CA?!
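
As far as I can tell, the only possible flow is something like the sketch below: bootstrap once with the built-in CA, read the trust domain, generate a CA whose SAN includes that <cluster-id>.consul name, and then swap it in through the CA configuration endpoint rather than the agent config files (standard Consul API paths; ca-config.json is just a placeholder for the new provider config, and I haven’t validated this end to end):

# 1. Read the trust domain after the first bootstrap
curl https://localhost:8500/v1/connect/ca/roots?pretty | grep TrustDomain
#    => "TrustDomain": "<cluster-id>.consul"

# 2. Generate a CA whose SAN includes DNS:<cluster-id>.consul (e.g. with openssl)

# 3. Update the CA provider configuration at runtime
curl -X PUT --data @ca-config.json https://localhost:8500/v1/connect/ca/configuration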

I seem to be having a variant of this problem.

Did you ever make progress on it?

In my specific case, I am using Vault as the CA for Connect.

I have an existing PKI root mounted on Vault which is an intermediate CA from an offline root CA, and this is being used as the “root” from Connect’s perspective; it’s then standing up its own “intermediate” in Vault, which is then being used to sign the various service certificates.

The PKI root for Vault knows nothing about Connect; it’s just a regular CA; it doesn’t have any .consul DNS SAN or SPIFFE URI SAN. The Connect intermediate from Vault is signed with a couple of SANs, e.g.:

 DNS:pri-1gbfprc.vault.ca.1421127e.consul, URI:spiffe://1421127e-5e51-9ed2-8543-ab686fc4a16f.consul

The service leaf certs also look fine, and are signed with SANs like:

 DNS:echo.svc.default.1421127e.consul, URI:spiffe://1421127e-5e51-9ed2-8543-ab686fc4a16f.consul/ns/default/dc/dc1/svc/echo

I have validated the entire chain with openssl and it’s OK.

When you ask Connect for the root CA, it gives back the Vault root which (still) doesn’t have any of the Consul DNS or SPIFFE SANs - this conflicts with that documentation mentioned above.

This setup works fine for the builtin Connect proxy. It comes up with a message like:

 2020-05-08T12:21:02.073Z [INFO]  proxy: Parsed TLS identity: uri=spiffe://1421127e-5e51-9ed2-8543-ab686fc4a16f.consul/ns/default/dc/dc1/svc/echo roots=["<my Vault root CN>"]

I am able to get the builtin proxy to connect and behave just fine. When I try to switch to Envoy, the wheels come off and Envoy gives errors indicating the CA is untrusted exactly as described in the posts above.

If it makes a difference the offline root and the Vault PKI root are RSA, but the Connect certificates (intermediate and leaf) are ECDSA. I am using Vault 1.4.1, Consul 1.7.3 and Envoy 1.13.0.

No ideas?

Has anyone gotten Consul Connect to work with Envoy with Vault? Anyone at all?

Sorry, I don’t have any updates. I settled for the Connect-generated CA for now. I’m not using Vault, anyway.

I almost got this working fully, but by the time I was testing traffic through the Consul ingress gateway => sidecar proxy, the certs from Vault had expired; we have them on about a 30-day TTL and I was 4 days too late. I was able to see traffic hit Envoy on the sidecar proxy side, but it was throwing expired-cert errors in the Envoy trace logs, so I didn’t get far enough to see whether the errors you hit would have happened to me. I was close to seeing it work end to end, though. On the Vault side, our setup let Consul Connect create both the root and intermediate PKI mounts. I believe I originally bricked the cluster by trying to use an offline/out-of-band root for the CA, which, if I understand correctly, is what you’re doing.

To make matters worse around the expiration, I was unable to update the Connect CA via the CLI to point at new PKI mounts on the Vault side, nor get Connect to switch back to Consul as the CA provider. I was stuck, it hosed my cluster, and I ended up setting up new Consul server nodes. I ended up using Consul’s built-in CA provider and things have worked well so far going that route. Created the issue here about all that.

In my opinion, based on my experience with this kind of setup at this time, using Vault as the CA provider for Consul Connect can be a “cluster brick maker”; some of the failure modes I’ve hit seem pretty serious to me. I don’t feel comfortable using Vault as the Connect CA provider in a production setup anymore until some of these issues get fixed. It can take your Consul cluster down if you’re not careful. You also have to contend with the Vault periodic token and devise a way to refresh it yourself; if that expires, then as far as I know no more certs are issued in your mesh until you update it, which can be another cluster-down/app-down type of event.

Fuck, I FINALLY just got this all sorted and stable myself - here are some notes:

Just a note - if multiple Consul datacenters share ANY pki mounts under Vault, they will compete and overwrite the CA key+cert, eventually resulting in error fetching CA certificate: stored CA information not able to be parsed when they inevitably hit the following flow:

  • dc2 hits /pki_consul_intermediate/intermediate/generate/
  • dc3 hits /pki_consul_intermediate/intermediate/generate/
  • dc2 hits /pki_consul_intermediate/intermediate/set-signed

This issue (hashicorp/vault#2685) will eventually happen as a direct result.

This makes it extremely painful if you want to scale to N secondary datacenters with a single primary datacenter, as you end up having to have N pki mounts - when you are working with about 30+ datacenters it gets ridiculous.
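
Concretely, each datacenter ends up with its own dedicated mounts in its connect stanza, something like this (a sketch of what I mean; the address, token and mount names are made up):

connect = {
  enabled     = true
  ca_provider = "vault"
  ca_config   = {
    address               = "https://vault.example.internal:8200"
    token                 = "...periodic token..."
    root_pki_path         = "pki_consul_root_dc2"
    intermediate_pki_path = "pki_consul_intermediate_dc2"
  }
}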

I believe there may be a race condition of some sort (or it’s just broken; after the number of hours I’ve invested in troubleshooting Consul problems, I don’t care to track down the specifics anymore, only to have the issue closed by someone like @jefferai without looking very far into it).

Incidentally, there also seems to be another issue that can cause hashicorp/vault#2685: when using the Vault provider for Terraform, you can set /pki_consul_intermediate/config/urls prior to Consul hitting /pki_consul_intermediate/intermediate/generate/, which also seems to fuck up the CA keypair under some circumstances.

Lastly, make sure you are including both the Consul pki root AND intermediate certs in the SAME file under either ca_path or ca_file, otherwise you’ll be up against remote error: tls: bad certificate until you realize that only a self-signed CA certificate can ever be considered, well, a certificate authority (specifically a root). You CANNOT have /etc/consul.d/ssl.ca.d/root.pem AND /etc/consul.d/ssl.ca.d/intermediate.pem - they MUST be in a single file as a chain.
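
In other words, build the chain file yourself, intermediate first and root second (same ordering as the -ca-file note below):

cat /etc/consul.d/ssl.ca.d/intermediate.pem /etc/consul.d/ssl.ca.d/root.pem \
  > /etc/consul.d/ssl.chain.pem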

Fuck, this was frustrating to sort out, and Consul/Envoy do very little to help with their logs.

Also, you MUST have the following for any sidecars (this particular example happens to register a mesh gateway, but the TLS flags are the same):

# -ca-file must be INTERMEDIATE/ROOT in the same file, in that order; you can use
# `-ca-path` instead, but it's the same deal: a *SINGLE* chain file such as
# `/etc/consul.d/ssl.ca.d/ssl.chain.pem` with INTERMEDIATE/ROOT in it.
# -client-cert must be ONLY the client cert, or it will fuck up.
consul connect envoy \
       -grpc-addr=https://127.0.0.1:8502 \
       -ca-file=/etc/consul.d/ssl.chain.pem \
       -client-cert=/etc/consul.d/ssl.crt.pem \
       -client-key=/etc/consul.d/ssl.key.pem \
       -http-addr=https://127.0.0.1:8501 \
       -tls-server-name=server.dc1.consul \
       -token=...gateway_token... \
       -admin-bind 127.0.0.1:19005 \
       -envoy-version=1.14.2 \
       -gateway=mesh \
       -register \
       -address "...lan...:8080" \
       -wan-address "...wan...:8080" \
       -service "gateway-primary"

My last major complaint is that there is absolutely no way to have a deterministic cluster id, so it’s literally impossible to bootstrap a cluster with a static:

connect = {
  enabled = true
  ca_provider = "consul"
  ca_config   = {
    private_key        = "...CONSUL_CA_KEY_CONTENTS..."
    root_cert          = "...CONSUL_CA_CRT_CONTENTS..."
  }
}

or the equivalent with Vault and vault write pki/config/ca pem_bundle="@/tmp/vault.bundle.pem", as the first bootstrap MUST happen before the cluster id is ever known in order to sign a CA cert with a SAN of URI: spiffe://...cluster id....consul

Consul Connect (specifically the PKI) is easily one of the most frustrating tools I’ve ever worked with - it’s right up there with Asterisk and complex multiple layer NAT traversal.


Hi @akhayyat,

How did you generate a custom Connect CA cert with the SAN set to the cluster identifier with the .consul TLD?

Seems like I am facing the same issue.

FYI: I am trying to create a self-signed root CA here (not an intermediate).

Should I create an intermediate CA from this self-signed root CA and use that intermediate as the root in the Consul Connect CA configuration, provided that I add the SAN to the intermediate cert while generating it?

Thanks

I just tried two things here:

  1. Upgraded OpenSSL to version 1.1.1 and created a self-signed root CA with the SAN added to it like below (a rough openssl invocation is shown after this list):

      X509v3 Subject Alternative Name: 
             DNS:42cd22f0-b7e2-53d4-774a-d681cd04b921.consul
    
  2. Created an intermediate CA and added the .consul TLD SAN only to it, not to the root CA (since the root may be a pre-existing/legacy CA)
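
(For reference, this is roughly the OpenSSL 1.1.1 invocation I mean for adding the SAN while generating the self-signed root; the key type and validity here are arbitrary:)

openssl req -new -x509 -nodes -days 3650 \
  -newkey ec -pkeyopt ec_paramgen_curve:prime256v1 \
  -subj "/CN=Connect CA" \
  -addext "subjectAltName = DNS:42cd22f0-b7e2-53d4-774a-d681cd04b921.consul" \
  -keyout connect-ca-key.pem -out connect-ca.pem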

Both methods give:

curl: (56) Recv failure: Connection reset by peer

What could possibly be wrong here?

I agree. We need to bootstrap with defaults just to get the cluster id.

Are you using Vault as the Connect CA (ie. with root pki + intermediate pki paths mounted), or the internal Consul CA?

If you are using Vault, you can allow Consul to generate the first CA keypairs for root+intermediate, then pull the values from those (along with the TrustDomain from https://www.consul.io/api/connect/ca#list-ca-root-certificates) for use when bringing in your own (air-gapped) root CA outside Vault and limiting Vault to the intermediate exclusively.

If you are using Consul’s built-in CA you only need a root CA keypair - you can still use a proper air-gapped root, sign an intermediate, and use the intermediate as Consul’s “root”; you just need to be careful to include the full chain in a single file (ie. INTERMEDIATE_2/INTERMEDIATE_1/ROOT same file).

curl: (56) Recv failure: Connection reset by peer is quite frustrating, as it gives little to go off of; if you check the logs on the consul server side (journalctl -efu consul) you’ll likely see something along the lines of untrusted CA, though not under every circumstance, I believe.

The biggest thing is making sure the -ca-file parameter for consul connect envoy ... is correct (I wasn’t able to get -ca-path to work at all, even with a single chain file), as well as the equivalent on the consul server side. The consul agent https ca_path must include the same CA chain file (I find it difficult to achieve full TLS with validation while using ca_file on the server side, as there are only a few restrictive paths to using a single CA for both securing the agent with https AND Connect), and you need to be sure you are passing a valid client cert to consul connect envoy ... as well (ie. -client-cert=... + -client-key=...).

Ultimately, you should be able to run the following commands successfully:

# on servers and clients
echo "Q" | openssl s_client -connect 127.0.0.1:8501 -showcerts | openssl x509 -in - > [server|client].consul.dc1.example.internal.crt.pem
openssl verify -verbose -CAfile ca.example.internal.chain.crt [server|client].consul.dc1.example.internal.crt.pem
# should be valid

echo "Q" | openssl s_client -connect 127.0.0.1:8502 -showcerts | openssl x509 -in - > [server|client].consul.dc1.example.internal.crt.pem
openssl verify -verbose -CAfile ca.example.internal.chain.crt [server|client].consul.dc1.example.internal.crt.pem
# should be valid

as well as:

openssl verify -verbose -CAfile ca.example.internal.chain.crt client.consul.dc1.example.internal.crt.pem
# should be valid, with `client.consul.dc1.example.internal.crt.pem` being your `consul connect envoy ... -client-cert=client.consul.dc1.example.internal.crt.pem`