Create 2 independent consul clusters under same network subnet

Dear Consul community,

I am learning Consul and I have 8 nodes under the same network subnet. I would like to configure two independent Consul clusters on that network; each cluster would have three of the nodes running as servers.

I tried this configuration:

Cluster 1:

# cat /etc/nomad.d/server.hcl
server {
  enabled = true
  bootstrap_expect = 3
  server_join {
    retry_join = ["nid001308", "nid001309", "nid001310"]
  }
}

and

# cat /etc/nomad.d/nomad.hcl
log_level = "DEBUG"
datacenter = "test1"
data_dir = "/opt/nomad"

tls {
  http = true
  rpc  = true

  ca_file   = "/root/nomad-agent-ca.pem"
  cert_file = "/root/global-server-nomad.pem"
  key_file  = "/root/global-server-nomad-key.pem"

  verify_server_hostname = true
  verify_https_client    = true
}

acl {
  enabled = true
}

Cluster 2:

# cat /etc/nomad.d/server.hcl
server {
  enabled = true
  bootstrap_expect = 3
  server_join {
    retry_join = ["nid002588", "nid002590", "nid002832"]
  }
}

and

# cat /etc/nomad.d/nomad.hcl
log_level = "DEBUG"
datacenter = "test1"
data_dir = "/opt/nomad"

tls {
  http = true
  rpc  = true

  ca_file   = "/root/nomad-agent-ca.pem"
  cert_file = "/root/global-server-nomad.pem"
  key_file  = "/root/global-server-nomad-key.pem"

  verify_server_hostname = true
  verify_https_client    = true
}

acl {
  enabled = true
}

I am reusing the same TLS certs across both clusters and telling each agent which nodes are its servers.

This is not working for me: the first cluster runs fine, but the second fails because it can’t elect a leader.

Below are the logs from one of the Nomad servers in the second cluster that is affected by this problem:

Jul 26 11:46:37 nid002588 nomad[35853]: ==> Nomad agent configuration:
Jul 26 11:46:37 nid002588 nomad[35853]:        Advertise Addrs: HTTP: 172.17.0.1:4646; RPC: 172.17.0.1:4647; Serf: 172.17.0.1:4648
Jul 26 11:46:37 nid002588 nomad[35853]:             Bind Addrs: HTTP: [0.0.0.0:4646]; RPC: 0.0.0.0:4647; Serf: 0.0.0.0:4648
Jul 26 11:46:37 nid002588 nomad[35853]:                 Client: false
Jul 26 11:46:37 nid002588 nomad[35853]:              Log Level: DEBUG
Jul 26 11:46:37 nid002588 nomad[35853]:                 Region: global (DC: psitds)
Jul 26 11:46:37 nid002588 nomad[35853]:                 Server: true
Jul 26 11:46:37 nid002588 nomad[35853]:                Version: 1.5.6
Jul 26 11:46:37 nid002588 nomad[35853]: ==> Nomad agent started! Log data will stream in below:
Jul 26 11:46:37 nid002588 nomad[35853]:     2023-07-26T11:46:37.529+0200 [INFO]  nomad: setting up raft bolt store: no_freelist_sync=false
Jul 26 11:46:37 nid002588 nomad[35853]:     2023-07-26T11:46:37.529+0200 [INFO]  nomad.raft: initial configuration: index=0 servers=[]
Jul 26 11:46:37 nid002588 nomad[35853]:     2023-07-26T11:46:37.529+0200 [INFO]  nomad.raft: entering follower state: follower="Node at 172.17.0.1:4647 [Follower]" leader-address= leader-id=
Jul 26 11:46:37 nid002588 nomad[35853]:     2023-07-26T11:46:37.529+0200 [INFO]  nomad: serf: EventMemberJoin: nid002588.global 172.17.0.1
Jul 26 11:46:37 nid002588 nomad[35853]:     2023-07-26T11:46:37.529+0200 [INFO]  nomad: starting scheduling worker(s): num_workers=128 schedulers=["service", "batch", "system", "sysbatch", "_core"]
Jul 26 11:46:37 nid002588 nomad[35853]:     2023-07-26T11:46:37.529+0200 [DEBUG] nomad: started scheduling worker: id=c262719b-a3de-f607-cff5-9b0fbce5197c index=1 of=128
Jul 26 11:46:37 nid002588 nomad[35853]:     2023-07-26T11:46:37.529+0200 [DEBUG] nomad: started scheduling worker: id=5ebc69f3-c238-663b-abd4-acee3e267302 index=2 of=128
Jul 26 11:46:37 nid002588 nomad[35853]:     2023-07-26T11:46:37.529+0200 [DEBUG] worker: running: worker_id=c262719b-a3de-f607-cff5-9b0fbce5197c
Jul 26 11:46:37 nid002588 nomad[35853]:     2023-07-26T11:46:37.529+0200 [DEBUG] worker: running: worker_id=79cd4b6c-cf06-cce7-7083-eee5d0e17546
Jul 26 11:46:37 nid002588 nomad[35853]:     2023-07-26T11:46:37.529+0200 [DEBUG] nomad: started scheduling worker: id=79cd4b6c-cf06-cce7-7083-eee5d0e17546 index=3 of=128
...
Jul 26 11:46:37 nid002588 nomad[35853]:     2023-07-26T11:46:37.532+0200 [DEBUG] worker: running: worker_id=e3cd3b1a-b6d0-b6ed-563b-627fd6f19901
Jul 26 11:46:37 nid002588 nomad[35853]:     2023-07-26T11:46:37.532+0200 [DEBUG] worker: running: worker_id=9fe4456b-80e0-d61e-96be-f4dc282cc263
Jul 26 11:46:37 nid002588 nomad[35853]:     2023-07-26T11:46:37.532+0200 [DEBUG] worker: running: worker_id=c3082011-4742-5173-2634-e27e59b55023
Jul 26 11:46:37 nid002588 nomad[35853]:     2023-07-26T11:46:37.532+0200 [INFO]  agent.joiner: starting retry join: servers="nid002588 nid002590 nid002832"
Jul 26 11:46:37 nid002588 nomad[35853]:     2023-07-26T11:46:37.532+0200 [DEBUG] nomad: lost contact with Nomad quorum, falling back to Consul for server list
Jul 26 11:46:37 nid002588 nomad[35853]:     2023-07-26T11:46:37.532+0200 [INFO]  nomad: adding server: server="nid002588.global (Addr: 172.17.0.1:4647) (DC: psitds)"
Jul 26 11:46:37 nid002588 nomad[35853]:     2023-07-26T11:46:37.532+0200 [DEBUG] nomad.keyring.replicator: starting encryption key replication
Jul 26 11:46:37 nid002588 nomad[35853]:     2023-07-26T11:46:37.533+0200 [ERROR] nomad: error looking up Nomad servers in Consul: error="server.nomad: unable to query Consul datacenters: Get \"http://127.0.0.1:8500/v1/catalog/datacenters\": dial tcp 127.0.0.1:8500: connect: connection refused"
Jul 26 11:46:37 nid002588 nomad[35853]:     2023-07-26T11:46:37.534+0200 [DEBUG] nomad: memberlist: Initiating push/pull sync with:  148.187.115.35:4648
Jul 26 11:46:37 nid002588 nomad[35853]:     2023-07-26T11:46:37.534+0200 [DEBUG] nomad: memberlist: Stream connection from=148.187.115.35:40920
...
Jul 26 11:46:37 nid002588 nomad[35853]:     2023-07-26T11:46:37.540+0200 [DEBUG] nomad: memberlist: Initiating push/pull sync with:  148.187.115.12:4648
Jul 26 11:46:37 nid002588 nomad[35853]:     2023-07-26T11:46:37.540+0200 [DEBUG] nomad: memberlist: Initiating push/pull sync with:  148.187.115.23:4648
Jul 26 11:46:37 nid002588 nomad[35853]:     2023-07-26T11:46:37.541+0200 [DEBUG] nomad: memberlist: Initiating push/pull sync with:  148.187.115.24:4648
Jul 26 11:46:37 nid002588 nomad[35853]:     2023-07-26T11:46:37.542+0200 [INFO]  nomad: found expected number of peers, attempting to bootstrap cluster...: peers="172.17.0.1:4647,172.17.0.1:4647,172.17.0.1:4647"
Jul 26 11:46:37 nid002588 nomad[35853]:     2023-07-26T11:46:37.542+0200 [ERROR] nomad: failed to bootstrap cluster: error="found duplicate address in configuration: 172.17.0.1:4647"
...
Jul 26 11:46:37 nid002588 nomad[35853]:     2023-07-26T11:46:37.543+0200 [DEBUG] nomad: memberlist: Initiating push/pull sync with:  148.187.114.213:4648
Jul 26 11:46:37 nid002588 nomad[35853]:     2023-07-26T11:46:37.543+0200 [DEBUG] nomad: memberlist: Initiating push/pull sync with:  148.187.114.214:4648
Jul 26 11:46:37 nid002588 nomad[35853]:     2023-07-26T11:46:37.544+0200 [DEBUG] nomad: memberlist: Initiating push/pull sync with:  148.187.114.225:4648
Jul 26 11:46:37 nid002588 nomad[35853]:     2023-07-26T11:46:37.544+0200 [DEBUG] nomad: memberlist: Initiating push/pull sync with:  148.187.114.226:4648
Jul 26 11:46:37 nid002588 nomad[35853]:     2023-07-26T11:46:37.545+0200 [INFO]  agent.joiner: retry join completed: initial_servers=12 agent_mode=server
Jul 26 11:46:38 nid002588 nomad[35853]:     2023-07-26T11:46:38.030+0200 [DEBUG] nomad: serf: messageJoinType: nid002588.global
Jul 26 11:46:38 nid002588 nomad[35853]:     2023-07-26T11:46:38.030+0200 [DEBUG] nomad: serf: messageJoinType: nid002588.global
Jul 26 11:46:38 nid002588 nomad[35853]:     2023-07-26T11:46:38.530+0200 [DEBUG] nomad: serf: messageJoinType: nid002588.global
Jul 26 11:46:38 nid002588 nomad[35853]:     2023-07-26T11:46:38.530+0200 [DEBUG] nomad: serf: messageJoinType: nid002588.global
Jul 26 11:46:38 nid002588 nomad[35853]:     2023-07-26T11:46:38.921+0200 [WARN]  nomad.raft: no known peers, aborting election
Jul 26 11:46:42 nid002588 nomad[35853]:     2023-07-26T11:46:42.530+0200 [WARN]  nomad: memberlist: Got ping for unexpected node 'nid002832.global' from=10.100.24.11:4648
...
Jul 26 11:46:45 nid002588 nomad[35853]:     2023-07-26T11:46:45.531+0200 [DEBUG] nomad: memberlist: Failed UDP ping: nid002832.global (timeout reached)
Jul 26 11:46:45 nid002588 nomad[35853]:     2023-07-26T11:46:45.531+0200 [WARN]  nomad: memberlist: Got ping for unexpected node 'nid002832.global' from=10.100.24.11:4648
Jul 26 11:46:45 nid002588 nomad[35853]:     2023-07-26T11:46:45.531+0200 [DEBUG] nomad: memberlist: Stream connection from=10.100.24.11:56900
Jul 26 11:46:45 nid002588 nomad[35853]:     2023-07-26T11:46:45.531+0200 [WARN]  nomad: memberlist: Got ping for unexpected node nid002832.global from=10.100.24.11:56900
Jul 26 11:46:45 nid002588 nomad[35853]:     2023-07-26T11:46:45.532+0200 [ERROR] nomad: memberlist: Failed fallback TCP ping: EOF
Jul 26 11:46:46 nid002588 nomad[35853]:     2023-07-26T11:46:46.533+0200 [DEBUG] nomad: lost contact with Nomad quorum, falling back to Consul for server list
Jul 26 11:46:47 nid002588 nomad[35853]:     2023-07-26T11:46:47.531+0200 [INFO]  nomad: memberlist: Suspect nid002832.global has failed, no acks received
Jul 26 11:46:47 nid002588 nomad[35853]:     2023-07-26T11:46:47.531+0200 [WARN]  nomad: memberlist: Got ping for unexpected node 'nid002590.global' from=10.100.24.11:4648

Any hint?

Hi @masuberu,

Ideally you would separate the Consul clusters onto different VPCs or VLANs so that each gossip pool resides in its own layer 2 broadcast domain.

If you’re unable to do this, you’ll need to configure each cluster (Nomad included) to use a different gossip encryption key, so that nodes in cluster 2 are prevented from joining cluster 1’s gossip pool.

The gossip encryption key can be specified in the Consul agent configuration file using the encrypt configuration parameter, or at the CLI using the -encrypt flag.
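
For example, a minimal sketch for the Consul agents (the file path is just an example and the key is a placeholder; generate a separate key for each cluster and reuse it only on that cluster’s members):

# consul keygen
<base64-encoded key printed here; run once per cluster>

# cat /etc/consul.d/consul.hcl
datacenter = "test1"
data_dir   = "/opt/consul"
# A different key per cluster keeps the two gossip pools from merging:
encrypt    = "<output of consul keygen for this cluster>"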

See the following tutorials for a step-by-step walkthrough of how to configure this in Consul and Nomad.
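
For the Nomad servers the equivalent setting lives in the server block. A minimal sketch for cluster 2 (the key is a placeholder; on recent Nomad versions you can generate one with nomad operator gossip keyring generate, or reuse consul keygen, since both expect a base64-encoded key):

# cat /etc/nomad.d/server.hcl
server {
  enabled          = true
  bootstrap_expect = 3
  # Must differ from the key used by cluster 1 so the two Serf pools stay separate:
  encrypt          = "<gossip key for cluster 2>"
  server_join {
    retry_join = ["nid002588", "nid002590", "nid002832"]
  }
}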

The "dial tcp 127.0.0.1:8500: connect: connection refused" error in your log indicates that the local Consul agent is not listening on port 8500. Take a look at the Consul agent’s logs to see why it may not be binding to this interface on startup.
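
If Consul is supposed to be running on these hosts, a quick way to check (assuming the agent runs under systemd as a unit named consul; adjust to your setup):

# Is anything listening on the HTTP API port?
ss -lntp | grep 8500

# Does the agent answer on the local HTTP API?
curl http://127.0.0.1:8500/v1/status/leader

# If not, inspect the service and its logs:
systemctl status consul
journalctl -u consul -e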