Failed to handshake request

Hi there!

I am really excited about your solution and wanted to thank you first. I really believe it will solve today's security problems around tunneling, bastion hosts, access credential rotation, etc.

Now, I wanted to test it on AWS in a VPC using two EC2 instances, one for the controller and another for the worker. Both are in the same subnet, and the security groups are configured so that the instances can reach each other on the needed ports (9201 and 9202) and can also reach the internet. I also gave myself full access to all ports (from my public IP address).

Below you will see the configurations for the controller & the worker hosts.

Configuration of the Controller host:

# Disable memory lock: https://www.man7.org/linux/man-pages/man2/mlock.2.html
disable_mlock = true

# Controller configuration block
controller {
  # This name attr must be unique across all controller instances if running in HA mode
  name = "boundary-controller-0"

  # Database URL for postgres. This can be a direct "postgres://"
  # URL, or it can be "file://" to read the contents of a file to
  # supply the url, or "env://" to name an environment variable
  # that contains the URL.
  database {
    url = "postgresql://postgres:postgres@localhost:5432/boundary?sslmode=disable"     
  }
}

# API listener configuration block
listener "tcp" {
  # Should be the address of the NIC that the controller server will be reached on     
  address = "0.0.0.0:9200"
  # The purpose of this listener block
  purpose = "api"

  tls_disable = true
}

# Data-plane listener configuration block (used for worker coordination)
listener "tcp" {
  # Should be the IP of the NIC that the worker will connect on
  address = "172.0.0.11:9201"
  # The purpose of this listener
  purpose = "cluster"

  tls_disable = true
}

# Root KMS configuration block: this is the root key for Boundary
# Use a production KMS such as AWS KMS in production installs
kms "aead" {
  purpose = "root"
  aead_type = "aes-gcm"
  key = "sP1fnF5Xz85RrXyELHFeZg9Ad2qt4Z4bgNHVGtD6ung="
  key_id = "global_root"
}

# Worker authorization KMS
# Use a production KMS such as AWS KMS for production installs
# This key is the same key used in the worker configuration
kms "aead" {
  purpose = "worker-auth"
  aead_type = "aes-gcm"
  key = "8fZBjCUfN0TzjEGLQldGY4+iE9AkOvCfjh7+p0GtRBQ="
  key_id = "global_worker-auth"
}

Configuration of the Worker host:

listener "tcp" {
  purpose = "proxy"
  address = "172.0.0.13:9202"
  tls_disable = true
}

worker {
  # Name attr must be unique across workers
  name = "boundary-worker-0"
  address = "172.0.0.13"

  # Workers must be able to reach controllers on :9201
  controllers = [
    "172.0.0.11",
  ]

  public_addr = "172.0.0.13"
}

# Worker authorization KMS
# Use a production KMS such as AWS KMS for production installs
# This key is the same key used in the controller configuration
kms "aead" {
  purpose = "worker-auth"
  aead_type = "aes-gcm"
  key = "8fZBjCUfN0TzjEGLQldGY4+iE9AkOvCfjh7+p0GtRBQ="
  key_id = "global_worker-auth"
}

From my computer I can authenticate to the API hosted by the controller host, so there is no problem on that side. And judging by the Boundary logs from both hosts, they have been able to sync.

Controller logs:

Dec 18 13:09:17 ip-172-0-0-11.eu-west-3.compute.internal boundary[10335]: Cgo: disabled
Dec 18 13:09:17 ip-172-0-0-11.eu-west-3.compute.internal boundary[10335]: Listener 1: tcp (addr: "0.0.0.0:9200", max_request_duration: "1m30s", purpose: "api")
Dec 18 13:09:17 ip-172-0-0-11.eu-west-3.compute.internal boundary[10335]: Listener 2: tcp (addr: "172.0.0.11:9201", max_request_duration: "1m30s", purpose: "cluster")
Dec 18 13:09:17 ip-172-0-0-11.eu-west-3.compute.internal boundary[10335]: Log Level: info
Dec 18 13:09:17 ip-172-0-0-11.eu-west-3.compute.internal boundary[10335]: Mlock: supported: true, enabled: false
Dec 18 13:09:17 ip-172-0-0-11.eu-west-3.compute.internal boundary[10335]: Public Cluster Addr: 172.0.0.11:9201
Dec 18 13:09:17 ip-172-0-0-11.eu-west-3.compute.internal boundary[10335]: Version: Boundary v0.1.2
Dec 18 13:09:17 ip-172-0-0-11.eu-west-3.compute.internal boundary[10335]: Version Sha: d8020842ae8b6c742b94538baada313d7eb52809
Dec 18 13:09:17 ip-172-0-0-11.eu-west-3.compute.internal boundary[10335]: ==> Boundary server started! Log data will stream in below:
Dec 18 13:12:42 ip-172-0-0-11.eu-west-3.compute.internal boundary[10335]: 2020-12-18T13:12:42.847Z [INFO]  controller: worker successfully authed: name=boundary-worker-0

Worker logs:

-- Logs begin at Thu 2020-12-17 17:51:02 UTC. --
Dec 18 13:12:42 ip-172-0-0-13.eu-west-3.compute.internal boundary[5175]: [Worker-Auth] AEAD Type: aes-gcm
Dec 18 13:12:42 ip-172-0-0-13.eu-west-3.compute.internal boundary[5175]: Cgo: disabled
Dec 18 13:12:42 ip-172-0-0-13.eu-west-3.compute.internal boundary[5175]: Listener 1: tcp (addr: "172.0.0.13:9202", max_request_duration: "1m30s", purpose: "proxy")
Dec 18 13:12:42 ip-172-0-0-13.eu-west-3.compute.internal boundary[5175]: Log Level: info
Dec 18 13:12:42 ip-172-0-0-13.eu-west-3.compute.internal boundary[5175]: Mlock: supported: true, enabled: true
Dec 18 13:12:42 ip-172-0-0-13.eu-west-3.compute.internal boundary[5175]: Public Addr: 172.0.0.13:9202
Dec 18 13:12:42 ip-172-0-0-13.eu-west-3.compute.internal boundary[5175]: Version: Boundary v0.1.2
Dec 18 13:12:42 ip-172-0-0-13.eu-west-3.compute.internal boundary[5175]: Version Sha: d8020842ae8b6c742b94538baada313d7eb52809
Dec 18 13:12:42 ip-172-0-0-13.eu-west-3.compute.internal boundary[5175]: ==> Boundary server started! Log data will stream in below:
Dec 18 13:12:42 ip-172-0-0-13.eu-west-3.compute.internal boundary[5175]: 2020-12-18T13:12:42.841Z [INFO] worker: connected to controller: address=172.0.0.11:9201

Then I wanted to run an SSH request against the worker host (target = worker host 172.0.0.13, port 22), just to test whether what I have deployed works correctly. I expect to get an access denied from the SSH server on the worker host, since there is no Vault or SSH key stored anywhere; again, this is just to test whether the controller can put a session job in the worker's queue.

But I keep getting this error message:

> boundary connect ssh -target-id ttcp_h4xC8NcmFU -addr="<The public ALB domain name>"
> Error dialing the worker: failed to WebSocket dial: failed to send handshake request: Get "https://172.0.0.13:9202/v1/proxy": dial tcp 172.0.0.13:9202: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
> ssh_exchange_identification: read: Connection reset

What I don’t understand is why it is failing, since the hosts can reach each other on the needed ports and I have disabled TLS on both listeners.

In addition to the Boundary logs, here is the output of the netcat command on each host:

From Controller host:

[ec2-user@ip-172-0-0-11 ~]$ nc -zv 172.0.0.13 9202
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 172.0.0.13:9202.
Ncat: 0 bytes sent, 0 bytes received in 0.01 seconds.

From Worker Host:

nc -zv 172.0.0.11 9201
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 172.0.0.11:9201.
Ncat: 0 bytes sent, 0 bytes received in 0.01 seconds.

UPDATE #1

I was wondering whether the listeners also listen on UDP in addition to TCP?

I don’t work for HashiCorp, just trying to be helpful…

What matters is whether your user node (the machine you are running boundary connect on) can reach the worker on 9202, and that’s not shown. But the type of test matters too, see next bit:

A netcat null test (nc -z) only tests that you can get a SYN-ACK response. Given that you are getting a localized text response in the error, there may be a firewall doing TCP intercept of outbound packets. In that scenario, the SYN-ACK test is insufficient. You can confirm this by doing a packet capture on the worker and checking whether it saw the SYN at all.
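For the capture itself, something along these lines might do. It is only a sketch and assumes the worker is a Linux host with tcpdump installed and sudo available; capture_syns is a made-up helper name, and the only value taken from your post is port 9202.

```shell
#!/usr/bin/env bash
# Run on the worker host. PORT comes from the post; the rest is an assumption.
PORT=9202

# pcap filter: TCP segments addressed to PORT with SYN set and ACK clear,
# i.e. the first packet of each new inbound connection attempt.
FILTER="tcp dst port ${PORT} and tcp[tcpflags] & (tcp-syn|tcp-ack) == tcp-syn"

# Hypothetical helper: needs root and tcpdump.
# -n skips name resolution, -i any listens on all interfaces (Linux only).
capture_syns() {
  sudo tcpdump -n -i any "$FILTER"
}
```

Start capture_syns on the worker, rerun boundary connect from your machine, and watch for a line carrying your client's source address. If nothing shows up, the SYN most likely never reached the worker.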

As for UDP: the docs are unambiguous, as would be the output of netstat -nalp | grep 920 :wink:
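Coming back to the first point, here is a rough way to check from the machine that runs boundary connect whether the worker answers on 9202 within a few seconds. It assumes a bash shell with the coreutils timeout command is available there (on Windows that could be WSL or Git Bash); probe is a hypothetical helper name. Like nc -z, it only proves that the TCP handshake completed, so it is best paired with the packet capture on the worker.

```shell
#!/usr/bin/env bash
# probe HOST PORT [SECONDS]: try to open a TCP connection, giving up after a timeout.
# Uses bash's built-in /dev/tcp pseudo-device, so no extra tools beyond timeout.
probe() {
  local host=$1 port=$2 secs=${3:-5}
  # Host and port are passed as $0 and $1 to the inner bash to avoid quoting issues.
  if timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null; then
    echo "reachable: $host:$port"
  else
    echo "unreachable: $host:$port"
    return 1
  fi
}
```

For example, probe 172.0.0.13 9202 from the client machine would test the same address that appears in the error message.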