Can't connect 2 servers to each other on Windows 10

Nomad version

Nomad v1.1.2 (60638a086ef9630e2a9ba1e237e8426192a44244)

Operating system and Environment details

2 Azure VMs running Windows 10 Enterprise 2019 LTSC in the same vNET.

Issue

Machine 2 can’t connect to machine 1 even though the port is indeed reachable.

Reproduction steps

Create 2 VMs in Azure using the Windows 10 Enterprise LTSC image and put them in the same vNET.
Run a Nomad server on the first one, then try to join it from the second.

Configs:

Machine 1

data_dir  = "C:\\ProgramData\\nomad\\data\\"

log_file = "C:\\ProgramData\\nomad\\logs\\"
log_rotate_duration = "24h"
log_rotate_max_files = 30

bind_addr = "172.18.0.4"

server {
  enabled          = true
  bootstrap_expect = 2
}

plugin "raw_exec" {
  config {
    enabled = true
  }
}

Machine 2

data_dir  = "C:\\ProgramData\\nomad\\data\\"

log_file = "C:\\ProgramData\\nomad\\logs\\"
log_rotate_duration = "24h"
log_rotate_max_files = 30

bind_addr = "172.18.0.5"

server {
  enabled          = true
  bootstrap_expect = 2
  
  server_join {
    retry_join = ["172.18.0.4:4647"]
  }
}

plugin "raw_exec" {
  config {
    enabled = true
  }
}

Expected Result

Server 2 can properly join server 1.

Actual Result

==> Loaded configuration from C:\ProgramData\nomad\conf\client.hcl
==> Starting Nomad agent...
==> Nomad agent configuration:

       Advertise Addrs: HTTP: 172.18.0.5:4646; RPC: 172.18.0.5:4647; Serf: 172.18.0.5:4648
            Bind Addrs: HTTP: 172.18.0.5:4646; RPC: 172.18.0.5:4647; Serf: 172.18.0.5:4648
                Client: false
             Log Level: INFO
                Region: global (DC: dc1)
                Server: true
               Version: 1.1.2

==> Nomad agent started! Log data will stream in below:

    2021-08-09T09:43:56.954Z [WARN]  agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=C:\ProgramData\nomad\data\plugins
    2021-08-09T09:43:57.008Z [INFO]  agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
    2021-08-09T09:43:57.008Z [INFO]  agent: detected plugin: name=java type=driver plugin_version=0.1.0
    2021-08-09T09:43:57.008Z [INFO]  agent: detected plugin: name=docker type=driver plugin_version=0.1.0
    2021-08-09T09:43:57.008Z [INFO]  agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
    2021-08-09T09:43:57.008Z [INFO]  agent: detected plugin: name=exec type=driver plugin_version=0.1.0
    2021-08-09T09:43:57.039Z [INFO]  nomad.raft: initial configuration: index=0 servers=[]
    2021-08-09T09:43:57.039Z [INFO]  nomad.raft: entering follower state: follower="Node at 172.18.0.5:4647 [Follower]" leader=
    2021-08-09T09:43:57.041Z [INFO]  nomad: serf: EventMemberJoin: site-test-2.global 172.18.0.5
    2021-08-09T09:43:57.041Z [INFO]  nomad: starting scheduling worker(s): num_workers=4 schedulers=[service, batch, system, _core]
    2021-08-09T09:43:57.041Z [INFO]  nomad: adding server: server="site-test-2.global (Addr: 172.18.0.5:4647) (DC: dc1)"
    2021-08-09T09:43:57.049Z [INFO]  agent.joiner: starting retry join: servers=172.18.0.4:4647
    2021-08-09T09:43:57.053Z [WARN]  agent.joiner: join failed: error="1 error occurred:
        * Failed to join 172.18.0.4: read tcp 172.18.0.5:51277->172.18.0.4:4647: wsarecv: An existing connection was forcibly closed by the remote host.

" retry=30s
    2021-08-09T09:43:58.043Z [ERROR] nomad: error looking up Nomad servers in Consul: error="server.nomad: unable to query Consul datacenters: Get "http://127.0.0.1:8500/v1/catalog/datacenters": dial tcp 127.0.0.1:8500: connectex: No connection could be made because the target machine actively refused it."
    2021-08-09T09:43:58.367Z [WARN]  nomad.raft: no known peers, aborting election

Remarks

Running Test-NetConnection -ComputerName "172.18.0.4" -Port 4647 from machine 2 while Nomad is running on machine 1 produces the output below, which shows that server 1 is reachable at 172.18.0.4 on TCP port 4647.
Running the same command while Nomad is NOT running on server 1 results in TcpTestSucceeded: False.
So the TCP connection itself works, but Nomad is still unable to join for some reason.

ComputerName     : 172.18.0.4
RemoteAddress    : 172.18.0.4
RemotePort       : 4647
InterfaceAlias   : Ethernet 3
SourceAddress    : 172.18.0.5
TcpTestSucceeded : True

Hi @dg-eparizzi :wave:

Thank you for the detailed information.

While this is probably not the cause of the issue, keep in mind that an odd number of servers is recommended for a cluster. With two servers, both must be up to elect a leader, so the cluster has no fault tolerance. Your bootstrap_expect should ideally be at least 3.
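The arithmetic behind the server-count advice can be sketched as follows, assuming the standard Raft majority rule for leader election. The symbols N (number of servers), q (quorum size), and f (tolerated server failures) are introduced here for illustration only.

```latex
q(N) = \left\lfloor \frac{N}{2} \right\rfloor + 1, \qquad f(N) = N - q(N)

N = 2:\quad q = 2,\ f = 0 \\
N = 3:\quad q = 2,\ f = 1 \\
N = 5:\quad q = 3,\ f = 2
```

Going from 2 to 3 servers keeps the quorum at 2 but allows one server to fail. An even count adds no tolerance over the odd count below it (N = 4 gives q = 3, f = 1).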

Check out this page for more details:

Now, back at the network connectivity issues :slightly_smiling_face:

Unfortunately I don’t have much to offer here. "An existing connection was forcibly closed by the remote host." is a generic error message indicating that the remote host reset the connection, and there are several possible causes.
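One possible cause that may be worth ruling out, offered as an assumption rather than a confirmed diagnosis: the agent output lists RPC on port 4647 and Serf on port 4648, and server-to-server joins go over Serf, while the retry_join entry on machine 2 targets 4647. A minimal sketch of the machine 2 server block with the port left off, so that the default Serf port applies:

```hcl
# Machine 2, server block only. Sketch based on an assumption, not a
# confirmed fix: join via the Serf port instead of the RPC port.
server {
  enabled          = true
  bootstrap_expect = 2

  server_join {
    # No port given, so the default Serf port (4648) is used.
    # Equivalent explicit form: "172.18.0.4:4648"
    retry_join = ["172.18.0.4"]
  }
}
```

If this is tried, port 4648 would also need to be reachable between the two machines, over both TCP and UDP.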

Do you see anything in the server 1 logs?

Could you try again with a more verbose log_level? Perhaps TRACE, so we can gather more information.
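As a minimal sketch, the log level can be raised with one extra line in the agent configuration on both machines. The rest of each config is assumed to stay as posted above.

```hcl
# Added at the top level of the agent configuration file.
# TRACE is the most verbose level; the log above was captured at INFO.
log_level = "TRACE"
```

For a single run, the same can be set with the -log-level=TRACE flag on the nomad agent command.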