Verify_server_hostname doesn't work?

According to the documentation: “If verify_server_hostname is set, then outgoing connections perform hostname verification. All servers must have a certificate valid for server.<datacenter>.<domain> or the client will reject the handshake. This is a new configuration as of 0.5.1, and it is used to prevent a compromised client from being able to restart in server mode and perform a MITM (Man-In-The-Middle) attack. New deployments should set this to true, and generate the proper certificates, but this is defaulted to false to avoid breaking existing deployments.”
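The server agents in this example use TLS settings along these lines (a minimal sketch; the paths assume the certificates that the repo generates with consul tls cert create):

{
    "server": true,
    "verify_incoming": true,
    "verify_outgoing": true,
    "verify_server_hostname": true,
    "ca_file": "/consul/config/certs/consul-agent-ca.pem",
    "cert_file": "/consul/config/certs/dc1-server-consul-0.pem",
    "key_file": "/consul/config/certs/dc1-server-consul-0-key.pem"
}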

I tried to test this behaviour before deploying a production cluster, using the ‘learn-repo’. I generated a client cert and used the example from the ‘datacenter-deploy-secure’ folder.

As a result, I see that the client joined the cluster as a server:
/ # consul members
Node            Address          Status  Type    Build   Protocol  DC   Partition  Segment
consul-client   172.19.0.2:8301  alive   server  1.11.2  2         dc1  default    <all>
consul-server1  172.19.0.4:8301  alive   server  1.11.2  2         dc1  default    <all>
consul-server2  172.19.0.5:8301  alive   server  1.11.2  2         dc1  default    <all>
consul-server3  172.19.0.3:8301  alive   server  1.11.2  2         dc1  default    <all>

And it is even a voter:
/ # consul operator raft list-peers
Node            ID                                    Address          State     Voter  RaftProtocol
consul-server2  29b4df30-023c-440a-e4fa-52d8a8e7ea36  172.19.0.5:8300  follower  true   3
consul-server1  636bb8d9-102c-c208-5d25-f7f451739c70  172.19.0.4:8300  leader    true   3
consul-client   0c3e5c6a-1f52-7eab-5b0b-87de3b90f1c1  172.19.0.2:8300  follower  true   3
consul-server3  4ae9f80c-d082-943b-7521-67226206c265  172.19.0.3:8300  follower  true   3

The only place where I see any problems is the client logs, which contain messages like:
2024-05-08T09:18:56.166Z [ERROR] agent.server.rpc: failed to read byte: conn=from=172.19.0.4:53752 error="remote error: tls: bad certificate"
2024-05-08T09:18:58.185Z [ERROR] agent.server.rpc: failed to read byte: conn=from=172.19.0.4:53760 error="remote error: tls: bad certificate"
2024-05-08T09:18:58.188Z [ERROR] agent.server.rpc: failed to read byte: conn=from=172.19.0.4:53770 error="remote error: tls: bad certificate"
2024-05-08T09:18:58.515Z [ERROR] agent.server.rpc: failed to read byte: conn=from=172.19.0.4:53786 error="remote error: tls: bad certificate"
2024-05-08T09:18:58.592Z [ERROR] agent: Coordinate update error: error="No cluster leader"

This looks very strange to me, so I want to ask: is the ‘verify_server_hostname’ option intended to work this way, or have I misconfigured something?

Hi @victorvoronin,

Welcome to the HashiCorp Forum!

I had a quick look at the repo, and it would work as expected as long as you make the following changes:

  1. Create new client certificates inside the certs folder. Run the following command from the certs directory so that the new certs are created using the same CA certificate. (You can verify the resulting SANs with the openssl check shown after the diff below.)

    $ consul tls cert create -client
    
  2. Modify client.json with the following changes:

    • Add "server": true
    • Fix the certificate path to point to the client certificate.
    $ git diff client.json 
    diff --git a/datacenter-deploy-secure/client.json b/datacenter-deploy-secure/client.json
    index 09aca64..ba0bd77 100644
    --- a/datacenter-deploy-secure/client.json
    +++ b/datacenter-deploy-secure/client.json
    @@ -1,4 +1,5 @@
     {
    +    "server": true,
         "node_name": "consul-client",
         "data_dir": "/consul/data",
         "retry_join":[
    @@ -11,6 +12,6 @@
         "verify_outgoing": true,
         "verify_server_hostname": true,
         "ca_file": "/consul/config/certs/consul-agent-ca.pem",
    -    "cert_file": "/consul/config/certs/dc1-server-consul-0.pem",
    -    "key_file": "/consul/config/certs/dc1-server-consul-0-key.pem"
    +    "cert_file": "/consul/config/certs/dc1-client-consul-0.pem",
    +    "key_file": "/consul/config/certs/dc1-client-consul-0-key.pem"
     }
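
To confirm what each certificate is actually valid for, you can inspect the SANs with openssl (assuming openssl is available in your environment; the names shown below are what I would expect from consul tls cert create, and they match the errors quoted later in this thread):

$ openssl x509 -in dc1-client-consul-0.pem -noout -text | grep -A1 "Subject Alternative Name"
    X509v3 Subject Alternative Name:
        DNS:client.dc1.consul, DNS:localhost, IP Address:127.0.0.1

$ openssl x509 -in dc1-server-consul-0.pem -noout -text | grep -A1 "Subject Alternative Name"
    X509v3 Subject Alternative Name:
        DNS:server.dc1.consul, DNS:localhost, IP Address:127.0.0.1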
    

Once you make the above changes and start all the agents, you will find that the consul-server-* containers throw the following errors and that consul-client never becomes part of the Raft pool.

consul-server2  | 2024-05-13T13:30:20.733Z [ERROR] agent.server.raft: failed to heartbeat to: peer=172.19.0.5:8300 error="x509: certificate is valid for client.dc1.consul, localhost, not server.dc1.consul"
consul-server2  | 2024-05-13T13:30:22.174Z [WARN]  agent: error getting server health from server: server=consul-client error="rpc error getting client: failed to get conn: x509: certificate is valid for client.dc1.consul, localhost, not server.dc1.consul"

Please note that you will still see the client join as a server when you run consul members (Serf), but when you query the Raft status, you will see that it never successfully joins the Raft pool (and therefore never gets access to the full cluster data).

For example:

sudo docker exec -it consul-server1 sh
/ # consul members
Node            Address          Status  Type    Build   Protocol  DC   Partition  Segment
consul-client   172.19.0.5:8301  alive   server  1.11.2  2         dc1  default    <all>
consul-server1  172.19.0.4:8301  alive   server  1.11.2  2         dc1  default    <all>
consul-server2  172.19.0.3:8301  alive   server  1.11.2  2         dc1  default    <all>
consul-server3  172.19.0.2:8301  alive   server  1.11.2  2         dc1  default    <all>
/ # consul operator raft list-peers
Node            ID                                    Address          State     Voter  RaftProtocol
consul-server2  5cf676be-a0e4-b60f-7d0f-692816b698bd  172.19.0.3:8300  leader    true   3
consul-server3  477f04ca-b34c-3181-7fdd-f34310e0fa82  172.19.0.2:8300  follower  true   3
consul-server1  e6d209e0-e8df-89ab-9104-df160e591fb1  172.19.0.4:8300  follower  true   3

NOTE: Please also note that it would still have worked (the client would have become a server) if you had only added "server": true to the config and restarted the consul-client container. This is because, in the repo, the client reuses the server certificate. I hope you noticed this, but I wanted to highlight it in case you haven't.
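For reference, that risky variant is just this one-line change (a sketch against the repo's original client.json, which keeps the server certificate paths):

$ git diff client.json
--- a/datacenter-deploy-secure/client.json
+++ b/datacenter-deploy-secure/client.json
@@ -1,4 +1,5 @@
 {
+    "server": true,
     "node_name": "consul-client",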
I hope this helps!

@Ranjandas Thank you for your answer!

However, I forgot to mention that I did, of course, complete the step of issuing client certificates. Furthermore, I have enabled HTTPS only. And I do see messages like yours in the log, about "x509: certificate is valid for client.dc1.consul, localhost, not server.dc1.consul", but on the other hand the client is always present in raft list-peers, sometimes as a voter and sometimes as a non-voter.
Here is an example:

client.json:

{
    "node_name": "consul-client",
    "server": true,
    "data_dir": "/consul/data",
    "retry_join":[
        "consul-server1",
        "consul-server2",
        "consul-server3"
     ],
    "encrypt": "aPuGh+5UDskRAbkLaXRzFoSOcSM+5vAK+NEYOWHJH7w=",
    "verify_incoming": true,
    "verify_outgoing": true,
    "verify_server_hostname": true,
    "verify_incoming_rpc": true,
    "verify_incoming_https": true,
    "ca_file": "/consul/config/certs/consul-agent-ca.pem",
    "cert_file": "/consul/config/certs/dc1-client-consul-0.pem",
    "key_file": "/consul/config/certs/dc1-client-consul-0-key.pem"
}

Command output:

/ # CONSUL_HTTP_SSL=true consul members -http-addr=https://127.0.0.1:8501 -ca-file=/consul/config/certs/consul-agent-ca.pem -client-cert=/consul/config/certs/dc1-server-consul-0.pem -client-key=/consul/config/certs/dc1-server-consul-0-key.pem
Node            Address          Status  Type    Build   Protocol  DC   Partition  Segment
consul-client   172.22.0.5:8301  alive   server  1.11.2  2         dc1  default    <all>
consul-server1  172.22.0.4:8301  alive   server  1.11.2  2         dc1  default    <all>
consul-server2  172.22.0.2:8301  alive   server  1.11.2  2         dc1  default    <all>
consul-server3  172.22.0.3:8301  alive   server  1.11.2  2         dc1  default    <all>
/ # CONSUL_HTTP_SSL=true consul operator raft list-peers -http-addr=https://127.0.0.1:8501 -ca-file=/consul/config/certs/consul-agent-ca.pem -client-cert=/consul/config/certs/dc1-server-consul-0.pem -client-key=/consul/config/certs/dc1-server-consul-0-key.pem
Node            ID                                    Address          State     Voter  RaftProtocol
consul-server2  2ee1f361-1505-89a5-fc73-14b0d7ebd270  172.22.0.2:8300  leader    true   3
consul-server1  a1f42be7-d356-5c28-43d7-09821d000517  172.22.0.4:8300  follower  true   3
consul-server3  6db92907-2111-c523-6041-56f6d63ed427  172.22.0.3:8300  follower  true   3
consul-client   6e72d36c-32b2-eee6-8a1b-1edda10f1c1a  172.22.0.5:8300  follower  false  3
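
(Aside: the same commands can be written more compactly with Consul's standard CLI environment variables; the paths below are the same as above:)

# With an https:// address, TLS is implied, so CONSUL_HTTP_SSL is not needed.
export CONSUL_HTTP_ADDR=https://127.0.0.1:8501
export CONSUL_CACERT=/consul/config/certs/consul-agent-ca.pem
export CONSUL_CLIENT_CERT=/consul/config/certs/dc1-server-consul-0.pem
export CONSUL_CLIENT_KEY=/consul/config/certs/dc1-server-consul-0-key.pem
consul members
consul operator raft list-peers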

A piece of the logs:

2024-05-14T07:41:47.982Z [ERROR] agent.server.raft: failed to appendEntries to: peer="{Nonvoter 6e72d36c-32b2-eee6-8a1b-1edda10f1c1a 172.22.0.5:8300}" error="x509: certificate is valid for client.dc1.consul, localhost, not server.dc1.consul"
2024-05-14T07:41:48.539Z [WARN]  agent: error getting server health from server: server=consul-client error="context deadline exceeded"
2024-05-14T07:41:49.075Z [ERROR] agent.server.raft: failed to heartbeat to: peer=172.22.0.5:8300 error="x509: certificate is valid for client.dc1.consul, localhost, not server.dc1.consul"
2024-05-14T07:41:49.540Z [WARN]  agent: error getting server health from server: server=consul-client error="rpc error getting client: failed to get conn: x509: certificate is valid for client.dc1.consul, localhost, not server.dc1.consul"
2024-05-14T07:41:50.538Z [WARN]  agent: error getting server health from server: server=consul-client error="context deadline exceeded"
2024-05-14T07:41:51.540Z [WARN]  agent: error getting server health from server: server=consul-client error="rpc error getting client: failed to get conn: x509: certificate is valid for client.dc1.consul, localhost, not server.dc1.consul"
2024-05-14T07:41:52.538Z [WARN]  agent: error getting server health from server: server=consul-client error="context deadline exceeded"
2024-05-14T07:41:53.541Z [WARN]  agent: error getting server health from server: server=consul-client error="rpc error getting client: failed to get conn: x509: certificate is valid for client.dc1.consul, localhost, not server.dc1.consul"
2024-05-14T07:41:54.539Z [WARN]  agent: error getting server health from server: server=consul-client error="context deadline exceeded"
2024-05-14T07:41:55.540Z [WARN]  agent: error getting server health from server: server=consul-client error="rpc error getting client: failed to get conn: x509: certificate is valid for client.dc1.consul, localhost, not server.dc1.consul"
2024-05-14T07:41:56.538Z [WARN]  agent: error getting server health from server: server=consul-client error="context deadline exceeded"
2024-05-14T07:41:57.540Z [WARN]  agent: error getting server health from server: server=consul-client error="rpc error getting client: failed to get conn: x509: certificate is valid for client.dc1.consul, localhost, not server.dc1.consul"

For example, I can even see the following Raft status, where it is one of the real servers (consul-server1) that ends up as the non-voter:

Node            ID                                    Address          State     Voter  RaftProtocol
consul-server2  0d84045e-e514-deec-9fd3-051ba10196cc  172.22.0.5:8300  follower  true   3
consul-server3  75b38da9-26ac-8018-8a1d-85f818bbd832  172.22.0.4:8300  leader    true   3
consul-client   b7c95830-7a09-64ed-4955-0d1a8916ad2c  172.22.0.2:8300  follower  true   3
consul-server1  9cdcb68d-1bf4-8f5b-c0b6-2670d90bf623  172.22.0.3:8300  follower  false  3

That is why I'm really concerned about this.