Nomad version
Nomad v1.1.2 (60638a086ef9630e2a9ba1e237e8426192a44244)
Operating system and Environment details
2 Azure VMs running Windows 10 Enterprise 2019 LTSC in the same vNET.
Issue
Machine 2 can’t connect to machine 1 even though the port is indeed reachable.
Reproduction steps
Create 2 VMs in Azure using the Windows 10 Enterprise LTSC image and put them in the same vNET.
Run a Nomad server on the first VM, then start a second Nomad server on the other VM and have it try to join the first.
Configs:
Machine 1
data_dir = "C:\\ProgramData\\nomad\\data\\"
log_file = "C:\\ProgramData\\nomad\\logs\\"
log_rotate_duration = "24h"
log_rotate_max_files = 30
bind_addr = "172.18.0.4"
server {
  enabled = true
  bootstrap_expect = 2
}
plugin "raw_exec" {
  config {
    enabled = true
  }
}
Machine 2
data_dir = "C:\\ProgramData\\nomad\\data\\"
log_file = "C:\\ProgramData\\nomad\\logs\\"
log_rotate_duration = "24h"
log_rotate_max_files = 30
bind_addr = "172.18.0.5"
server {
  enabled = true
  bootstrap_expect = 2
  server_join {
    retry_join = ["172.18.0.4:4647"]
  }
}
plugin "raw_exec" {
  config {
    enabled = true
  }
}
Expected Result
Server 2 can properly join server 1.
Actual Result
==> Loaded configuration from C:\ProgramData\nomad\conf\client.hcl
==> Starting Nomad agent...
==> Nomad agent configuration:
Advertise Addrs: HTTP: 172.18.0.5:4646; RPC: 172.18.0.5:4647; Serf: 172.18.0.5:4648
Bind Addrs: HTTP: 172.18.0.5:4646; RPC: 172.18.0.5:4647; Serf: 172.18.0.5:4648
Client: false
Log Level: INFO
Region: global (DC: dc1)
Server: true
Version: 1.1.2
==> Nomad agent started! Log data will stream in below:
2021-08-09T09:43:56.954Z [WARN] agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=C:\ProgramData\nomad\data\plugins
2021-08-09T09:43:57.008Z [INFO] agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
2021-08-09T09:43:57.008Z [INFO] agent: detected plugin: name=java type=driver plugin_version=0.1.0
2021-08-09T09:43:57.008Z [INFO] agent: detected plugin: name=docker type=driver plugin_version=0.1.0
2021-08-09T09:43:57.008Z [INFO] agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
2021-08-09T09:43:57.008Z [INFO] agent: detected plugin: name=exec type=driver plugin_version=0.1.0
2021-08-09T09:43:57.039Z [INFO] nomad.raft: initial configuration: index=0 servers=[]
2021-08-09T09:43:57.039Z [INFO] nomad.raft: entering follower state: follower="Node at 172.18.0.5:4647 [Follower]" leader=
2021-08-09T09:43:57.041Z [INFO] nomad: serf: EventMemberJoin: site-test-2.global 172.18.0.5
2021-08-09T09:43:57.041Z [INFO] nomad: starting scheduling worker(s): num_workers=4 schedulers=[service, batch, system, _core]
2021-08-09T09:43:57.041Z [INFO] nomad: adding server: server="site-test-2.global (Addr: 172.18.0.5:4647) (DC: dc1)"
2021-08-09T09:43:57.049Z [INFO] agent.joiner: starting retry join: servers=172.18.0.4:4647
2021-08-09T09:43:57.053Z [WARN] agent.joiner: join failed: error="1 error occurred:
* Failed to join 172.18.0.4: read tcp 172.18.0.5:51277->172.18.0.4:4647: wsarecv: An existing connection was forcibly closed by the remote host.
" retry=30s
2021-08-09T09:43:58.043Z [ERROR] nomad: error looking up Nomad servers in Consul: error="server.nomad: unable to query Consul datacenters: Get "http://127.0.0.1:8500/v1/catalog/datacenters": dial tcp 127.0.0.1:8500: connectex: No connection could be made because the target machine actively refused it."
2021-08-09T09:43:58.367Z [WARN] nomad.raft: no known peers, aborting election
Remarks
Running Test-NetConnection -ComputerName "172.18.0.4" -Port 4647 from machine 2 while Nomad is running on machine 1 outputs the following, which means machine 1 is reachable at 172.18.0.4 on TCP port 4647:
ComputerName : 172.18.0.4
RemoteAddress : 172.18.0.4
RemotePort : 4647
InterfaceAlias : Ethernet 3
SourceAddress : 172.18.0.5
TcpTestSucceeded : True
Running the same command while Nomad is NOT running on machine 1 results in TcpTestSucceeded: False.
So the TCP connection itself works, yet Nomad is unable to join for some reason.
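One thing that may be worth ruling out, as an untested sketch: the agent output above lists RPC on 4647 and Serf on 4648, and Nomad servers join each other over Serf gossip, so the retry_join entry on machine 2 could point at the Serf port instead. The fragment below is only a variation of the Machine 2 server block under that assumption (default ports on machine 1, nothing else changed). It has not been tried in this environment.

```hcl
server {
  enabled          = true
  bootstrap_expect = 2

  server_join {
    # Assumption: machine 1 listens on the default Serf port (4648).
    # The config in this report points at 4647, which is the RPC port.
    retry_join = ["172.18.0.4:4648"]
  }
}
```

The matching reachability check would be Test-NetConnection with -Port 4648. Serf also uses UDP on that port, which this TCP-only test does not cover.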