Trouble getting Consul Connect and Envoy to work

TL;DR: After setting up a web server on one VM and configuring Connect and Envoy so it can be reached from another VM, curl localhost:8889 on the client VM fails with: curl: (56) Recv failure: Connection reset by peer.


Based on both:

  1. https://learn.hashicorp.com/consul/developer-mesh/connect-envoy
  2. https://learn.hashicorp.com/consul/developer-mesh/connect-production

I set up a three-VM Consul cluster, with a web server and a web client on two of the VMs, as follows:

VM-1: the web server

  1. Runs an Apache web server serving a static test page, listening on 127.0.0.1:80.

  2. Runs the consul-envoy Docker image as described in https://learn.hashicorp.com/consul/developer-mesh/connect-envoy, but using Envoy 1.13.0, with the following service definition:

    # /consul/config/web.json
    {
        "service": {
            "connect": {
                "sidecar_service": {}
            },
            "name": "web",
            "port": 80
        }
    }
    
  3. The consul-envoy container is run using the following Ansible task:

    - name: Run web service proxy
      docker_container:
        name: web-proxy
        image: consul-envoy
        auto_remove: yes
        command: -sidecar-for web
        network_mode: host
        volumes:
          - /consul/certs:/consul/certs:ro
        env:
          CONSUL_HTTP_SSL: "true"
          CONSUL_CACERT: /consul/certs/consul-ca.pem
          CONSUL_CLIENT_CERT: /consul/certs/consul-1-cli.pem
          CONSUL_CLIENT_KEY: /consul/certs/consul-1-cli-key.pem
    
  4. From inside the consul-envoy container, curl localhost works fine and reaches the web server on the VM (a few more verification commands I use are shown after this list).

  5. The only bit of the consul-envoy container logs that seemed relevant is:

    [1][info][upstream] [source/server/lds_api.cc:73] lds: add/update listener 'public_listener:0.0.0.0:21000'
    
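For reference, here is how I verify what the web sidecar registered and what Envoy actually loaded. The 19000 admin port is the default that consul connect envoy binds on localhost; adjust this if you override -admin-bind, and add the usual CA/client-cert options to curl if your agent enforces them over HTTPS:

    # On VM-1: confirm that both web and web-sidecar-proxy are registered locally
    curl https://localhost:8500/v1/agent/services?pretty

    # Inspect the listeners and clusters Envoy received over xDS
    curl -s localhost:19000/listeners
    curl -s localhost:19000/clusters | grep web

    # Check which certificates Envoy has loaded (issuer, SANs, expiry)
    curl -s localhost:19000/certs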

VM-2: the web client

  1. Runs the same consul-envoy Docker image as VM-1, with the following service definition:

    # /consul/config/web-client.json
    {
        "service": {
            "connect": {
                "sidecar_service": {
                    "proxy": {
                        "upstreams": [
                            {
                                "destination_name": "web",
                                "local_bind_port": 8889
                            }
                        ]
                    }
                }
            },
            "name": "web-client",
            "port": 8888
        }
    }
    
  2. The consul-envoy container is run using the following Ansible task:

    - name: Run web client service proxy
      docker_container:
        name: web-client-proxy
        image: consul-envoy
        auto_remove: yes
        command: -sidecar-for web-client
        network_mode: host
        volumes:
          - /consul/certs:/consul/certs:ro
        env:
          CONSUL_HTTP_SSL: "true"
          CONSUL_CACERT: /consul/certs/consul-ca.pem
          CONSUL_CLIENT_CERT: /consul/certs/consul-2-cli.pem
          CONSUL_CLIENT_KEY: /consul/certs/consul-2-cli-key.pem
    
  3. Running curl localhost:8889 results in:

    curl: (56) Recv failure: Connection reset by peer
    

    This is the problem I’m facing!
    Why isn’t the client able to get through to the server?!

    More generally, how can I go about debugging this situation? I expected these tools to make it easier to trace issues, but I’m not sure where to start. The consul-envoy logs don’t seem to contain anything relevant, AFAICT (a few checks I have tried are listed after this section). I was only able to find this line:

    [1][info][upstream] [source/server/lds_api.cc:73] lds: add/update listener 'web:127.0.0.1:8889'
    
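In case it helps, these are the checks I have poked at on VM-2 so far (again assuming the default localhost:19000 Envoy admin bind; the exact stat and cluster names Consul generates for the upstream may differ, so I grep loosely):

    # Is the upstream listener on 8889 there, and what does the web cluster look like?
    curl -s localhost:19000/listeners
    curl -s localhost:19000/clusters | grep web

    # TLS and connection counters for the upstream; non-zero ssl.fail_verify_* or
    # upstream_cx_connect_fail values would point at a certificate or reachability problem
    curl -s localhost:19000/stats | grep web | grep -E 'ssl|upstream_cx'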

Consul Agent

Consul is run as a Docker container using the following Ansible task:

- name: Start consul container
  docker_container:
    name: "{{ inventory_hostname }}"  # consul-1, consul-2, consul-3
    image: consul
    network_mode: host
    command: agent -server -bind={{ ansible_default_ipv4.address }}
    volumes:
      - /consul/data:/consul/data:rw
      - /consul/certs:/consul/certs:ro
      - /consul/config:/consul/config:rw

Consul Agent Configuration

/consul/config/agent.json:

{
    "auto_encrypt": {
        "allow_tls": true
    },
    "bootstrap_expect": 3,
    "ca_file": "/consul/certs/consul-ca.pem",
    "cert_file": "/consul/certs/consul-1.pem",  # different file for each agent
    "connect": {
        "ca_config": {
            "private_key": "-----BEGIN EC PRIVATE KEY----- ...",
            "root_cert": "-----BEGIN CERTIFICATE----- ..."
        },
        "ca_provider": "consul",
        "enabled": true
    },
    "datacenter": "test",
    "encrypt": "<encryption-key>",
    "key_file": "/consul/certs/consul-1-key.pem",  # different file for each agent
    "performance": {
        "raft_multiplier": 1
    },
    "ports": {
        "grpc": 8502,
        "http": -1,
        "https": 8500
    },
    "retry_join": [
        "192.168.121.89",
        "192.168.121.2",
        "192.168.121.202"
    ],
    "server": true,
    "ui": true,
    "verify_incoming": true,
    "verify_incoming_rpc": true,
    "verify_outgoing": true,
    "verify_server_hostname": true
}
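
For completeness, this is how I look at the CA roots that Connect itself reports as active (the TrustDomain field is the <uuid>.consul cluster identifier that shows up in the SPIFFE URIs):

    # Which root certificates does Connect currently trust?
    curl https://localhost:8500/v1/connect/ca/roots?pretty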

I hope I did not miss any relevant information. Please feel free to ask for any.

PS. I think adding a guide for a setup like this to the Learn articles would be valuable. The existing guides are far from production-ready, and are all based on containers on a single host, which is almost never what you want.

Is port 8889 exposed outside of the container? Can you check whether the port is listening, using netstat inside the container, and whether it is reachable, using telnet from somewhere else?

The curl command is executed on the VM-2 host, where the web-client-proxy is running, and Envoy is indeed listening:

> netstat -lntp | grep 8889
tcp        0      0 127.0.0.1:8889          0.0.0.0:*               LISTEN      29354/envoy

In fact, the error I’m getting is different from the error I get on a port where nothing is listening:

> curl localhost:8889
curl: (56) Recv failure: Connection reset by peer
> curl localhost:8800
curl: (7) Failed to connect to localhost port 8800: Connection refused

All containers are run using Docker’s host networking.

netcat/telnet disconnects immediately, both from VM-2 and from within the consul-envoy client proxy container (web-client-proxy).

What I found curious is that for the four services:

  1. web
  2. web-client
  3. web-sidecar-proxy
  4. web-client-sidecar-proxy

The ServiceConnect key is empty:

> curl https://localhost:8500/v1/catalog/service/web?pretty
{
    ...
    "ServiceConnect": {},
    ...
}

Is that how it should be?
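
For reference, here is where I would expect the Connect/proxy details to show up; happy to paste the output if it helps:

    # The sidecar registration itself (ServiceKind / ServiceProxy fields)
    curl https://localhost:8500/v1/catalog/service/web-sidecar-proxy?pretty

    # Connect-capable instances of web, i.e. what a client proxy would discover
    curl https://localhost:8500/v1/health/connect/web?pretty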

What about telnetting from VM-2 to VM-1 on the sidecar proxy port (guessing 21000)? The connection should be established between the proxies.

Another idea is intentions, but it seems that your setup doesn't have any configured…

I’m able to netcat/telnet from VM-2 to VM-1:21000. I get a prompt.

I have no intentions configured:

> curl https://localhost:8500/v1/connect/intentions?pretty
[]

So how come I don't see any error in either the Consul or the Envoy logs?
How can I debug such issues?
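
One more sanity check I can run (with CONSUL_HTTP_ADDR and the TLS environment variables pointed at the local agent, as in the proxy containers); it prints Allowed or Denied for the source/destination pair, so it should show whether the empty intention list is what is blocking me:

    # Would Connect authorize web-client -> web with the current intentions?
    consul intention check web-client web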

BTW, here is an Ansible playbook that will reproduce my entire setup using three Debian 10 hosts:

After enabling debug logs on Envoy, I can now see certificate-related errors in both Envoy proxies:

On consul-1 (server):
[2020-03-15 13:39:57.341][31][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:226] [C2365] TLS error: 268436504:SSL routines:OPENSSL_internal:TLSV1_ALERT_UNKNOWN_CA

On consul-2 (client):
[2020-03-15 13:39:57.335][27][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:226] [C263] TLS error: 268435581:SSL routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED
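
(For anyone following along: two ways to get Envoy debug logs, assuming the default localhost:19000 admin bind. Anything after -- on the consul connect envoy command line is passed straight to Envoy; how to thread that through the custom consul-envoy image depends on its entrypoint.)

    # Option 1: start the sidecar with Envoy's own log-level flag
    consul connect envoy -sidecar-for web-client -- -l debug

    # Option 2: raise the level at runtime through the Envoy admin API
    curl -s -X POST 'localhost:19000/logging?level=debug'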

Any clue about what may be causing these errors?

Are any of the certificates generated incorrectly?

Figured out what the issue was :wink:

The Connect CA was the culprit. By removing my Connect CA certificate and key (connect.ca_config), I let Connect generate its own CA, and that did work!

The main difference I can see between my Connect CA and the one Connect generated is that Connect sets a SAN to the cluster identifier with the .consul TLD, as specified in the documentation.

Now, the documentation also states that the .consul TLD cluster identifier “can be found using the CA List Roots endpoint.”

How can I generate the CA certificate using an ID that can only be found by querying Connect after it has already created its own CA?!
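
The only workaround I can think of (not tried yet) would be to bootstrap with the built-in CA, read the trust domain back, mint a CA whose URI SAN uses it, and then swap the CA configuration afterwards:

    # Read the trust domain Connect generated at bootstrap
    curl https://localhost:8500/v1/connect/ca/roots?pretty | grep TrustDomain

    # Then replace the provider configuration at runtime; ca-config.json is a placeholder
    # whose body mirrors the connect ca_provider / ca_config stanza from the agent config
    curl -X PUT --data @ca-config.json \
        https://localhost:8500/v1/connect/ca/configuration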

I seem to be having a variant of this problem.

Did you ever make progress on it?

In my specific case, I am using Vault as the CA for Connect.

I have an existing PKI root mounted on Vault which is an intermediate CA from an offline root CA, and this is being used as the “root” from Connect’s perspective; it’s then standing up its own “intermediate” in Vault, which is then being used to sign the various service certificates.
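
For reference, the Connect CA stanza in my agent configuration looks roughly like this (the address, token, and mount paths below are placeholders, not my real values):

    "connect": {
        "enabled": true,
        "ca_provider": "vault",
        "ca_config": {
            "address": "https://vault.example.internal:8200",
            "token": "<vault-token>",
            "root_pki_path": "pki/",
            "intermediate_pki_path": "connect-intermediate/"
        }
    }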

The PKI root for Vault knows nothing about Connect; it’s just a regular CA; it doesn’t have any .consul DNS SAN or SPIFFE URI SAN. The Connect intermediate from Vault is signed with a couple of SANs, e.g.:

 DNS:pri-1gbfprc.vault.ca.1421127e.consul, URI:spiffe://1421127e-5e51-9ed2-8543-ab686fc4a16f.consul

The service leaf certs also look fine, and are signed with SANs like:

 DNS:echo.svc.default.1421127e.consul, URI:spiffe://1421127e-5e51-9ed2-8543-ab686fc4a16f.consul/ns/default/dc/dc1/svc/echo

I have validated the entire chain with openssl and it’s OK.
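
(Roughly like this, with the PEMs exported from the offline root, the Vault PKI mount, and Consul; the file names are just placeholders:)

    # offline root -> Vault PKI "root" -> Connect intermediate -> service leaf
    cat vault-pki-root.pem connect-intermediate.pem > chain.pem
    openssl verify -CAfile offline-root.pem -untrusted chain.pem echo-leaf.pem

    # SANs on the leaf
    openssl x509 -in echo-leaf.pem -noout -text | grep -A1 'Subject Alternative Name'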

When you ask Connect for the root CA, it gives back the Vault root, which (still) doesn't have any of the Consul DNS or SPIFFE SANs; this conflicts with the documentation mentioned above.

This setup works fine for the builtin Connect proxy. It comes up with a message like:

 2020-05-08T12:21:02.073Z [INFO]  proxy: Parsed TLS identity: uri=spiffe://1421127e-5e51-9ed2-8543-ab686fc4a16f.consul/ns/default/dc/dc1/svc/echo roots=["<my Vault root CN>"]

I am able to get the builtin proxy to connect and behave just fine. When I try to switch to Envoy, the wheels come off and Envoy gives errors indicating the CA is untrusted, exactly as described in the posts above.

If it makes a difference the offline root and the Vault PKI root are RSA, but the Connect certificates (intermediate and leaf) are ECDSA. I am using Vault 1.4.1, Consul 1.7.3 and Envoy 1.13.0.
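
If it helps anyone reproduce this, the validation context and certificates Envoy was actually pushed can be pulled from its admin interface and compared against what Vault holds (19000 is the default admin bind):

    # The CA bundle Envoy uses to validate peers (trusted_ca in the TLS contexts)
    curl -s localhost:19000/config_dump | grep -B2 -A5 trusted_ca

    # The certificates Envoy currently holds, with issuers and SANs
    curl -s localhost:19000/certs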

No ideas?

Has anyone gotten Consul Connect to work with Envoy with Vault? Anyone at all?

Sorry, I don’t have any updates. I settled for the Connect-generated CA for now. I’m not using Vault, anyway.