Getting Started clustering example not working

Hi, I recently wanted to use Nomad clustering on two separate servers, one running a Nomad server agent and the other running a Nomad client agent.

So I basically started following the example from this part of the Nomad Getting Started clustering section:

However, I’ve found that it doesn’t seem to actually work. I’m able to start the server with server.hcl and it runs fine, and the client started with client1.hcl comes up fine and reaches the ready state. But then it hits a heartbeat error and degrades to the down state. I see these errors in the client log:

    2020-11-12T03:38:59.074Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: failed to get conn: dial tcp 10.10.0.5:4647: connect: connection refused" rpc=Node.Register server=10.10.0.5:4647
    2020-11-12T03:38:59.074Z [ERROR] client: error registering: error="rpc error: failed to get conn: dial tcp 10.10.0.5:4647: connect: connection refused"
    2020-11-12T03:39:03.249Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: failed to get conn: dial tcp 10.10.0.5:4647: connect: connection refused" rpc=Node.UpdateStatus server=10.10.0.5:4647
    2020-11-12T03:39:03.249Z [ERROR] client: error heartbeating. retrying: error="failed to update status: rpc error: failed to get conn: dial tcp 10.10.0.5:4647: connect: connection refused" period=1.238717384s
    2020-11-12T03:39:03.250Z [ERROR] client: error discovering nomad servers: error="client.consul: unable to query Consul datacenters: Get "http://127.0.0.1:8500/v1/catalog/datacenters": dial tcp 127.0.0.1:8500: connect: connection refused"

I’m not sure what caused it to fail on heartbeats, but I’ve retried all the steps several times and it happens consistently. If anyone else has run into this and figured it out, I would really appreciate it if you could tell me how you got around the issue.

kennetpostigo,

Firstly, sorry to hear that you are having trouble with the Getting Started tutorial.

The specific error that you’ve highlighted only indicates that Nomad cannot use Consul to locate a server. For the getting-started scenario, Consul is unnecessary, because the Nomad server to join is specified directly in the client’s configuration.
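
For reference, the client side of that setup looks roughly like the sketch below; the data_dir and the server address are assumptions drawn from the tutorial and the logs above, not a copy of your actual file:

    # client1.hcl -- a minimal sketch, not your actual file.
    data_dir = "/tmp/client1"

    client {
      enabled = true

      # The client dials the server's RPC port directly, so Consul
      # is not needed to find the server in this scenario.
      servers = ["10.10.0.5:4647"]
    }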

Seeing the connection refused messages on the Node RPCs makes me wonder whether a firewall might be getting in the way; Nomad clients need to reach the server’s RPC port, TCP 4647 (the HTTP API uses 4646, and serf uses 4648 TCP and UDP). You might find the Vagrant environment more welcoming, since it removes a significant number of variables that depend on the machine you are running on.

If you could provide more information about your environment, that would help me make more targeted suggestions for getting Nomad working on your machine. A gist of your configuration files and your client and server logs would also be helpful if you are willing to provide them.

Hopefully we can get you unjammed.

Kind Regards,
Charlie Voiselle
Product Education Engineer, Nomad

Hi @angrycub

The reason I’m not using Vagrant is that I’m actually trying this out on active droplets. I’ve checked, and none of the DigitalOcean droplets has a firewall up.

I’m using three $5 DigitalOcean droplets, with one running a server agent and two running client agents. The configuration files I have are the exact same ones from the clustering tutorial.

I don’t have the server/client logs anymore at this time, but later today I can spin things back up and try to capture them.

Ah, DigitalOcean… I see. I was able to reproduce your issue using Droplets. When bind_addr is set to 0.0.0.0, as it is in the sample configuration, Nomad will look for the first private address and use it as its advertise address. On a DO droplet, that tends to be the machine’s private (but not VPC) address.
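
For context, here is a one-line sketch of the relevant setting from the sample server.hcl; the comment summarizes the behavior just described:

    # Binding to all interfaces forces Nomad to pick an advertise
    # address on its own; on a DO droplet that ends up being the first
    # private address (10.10.x.x) rather than the VPC address (10.116.x.x).
    bind_addr = "0.0.0.0"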

I imagine that you, like me, configured your clients to talk to your Nomad server over the VPC version of the address and started them up. As you mentioned, it works fine for a minute, until this event happens:

    2020-11-12T21:39:28.307Z [DEBUG] client.server_mgr: new server list: new_servers=[10.10.0.5:4647] old_servers=[10.116.0.2:4647]

The server has now updated the client with an address that the client cannot use to reach it, and that breaks things.

To resolve this, you need to add a little additional configuration to your server and wipe its state. Wiping the state is simpler than the production mitigation, which is peers.json recovery.

In your server.hcl file, at the top level (I placed mine right below the name value), add:

    advertise {
      http = "10.116.0.2"
      rpc  = "10.116.0.2"
      serf = "10.116.0.2"
    }

setting the values to your server’s VPC address. If your droplets look like mine did, the VPC address is bound on eth1. You can also use go-sockaddr templating to get the value dynamically:

    advertise {
      http = "{{GetInterfaceIP \"eth1\"}}"
      rpc  = "{{GetInterfaceIP \"eth1\"}}"
      serf = "{{GetInterfaceIP \"eth1\"}}"
    }

Once you do this, your server’s advertised address will no longer match the sole member’s IP address in the saved state, which will prevent the node from starting. Delete the state directory:

    $ rm -rf /tmp/server1/*

Then you should be able to start the server with nomad agent -config server.hcl and start the two clients without issue.

The way that guide presents the material is really aimed at someone who doesn’t want to build actual infrastructure. Since you are willing to build actual compute elements, I would encourage you to visit the Connect Nodes into a Cluster tutorial, which covers this process in a more realistic way.

I will also see about adding a note to that page to explain the context and intention of those configurations. We definitely don’t need more adventurous folks falling into that trap.

Thanks for giving Nomad a whirl. Hope this gets you unblocked!

@angrycub thanks for the help and the link to the other clustering section!

I have a question about the nomad clustering and the consul clustering.

  1. Do you need a Consul datacenter to use Consul to “auto-cluster” a Nomad cluster?

  2. If what I’m reading is correct, is Consul clustering basically the same thing as creating a “Consul datacenter”? Is there a guide you recommend for setting up both Nomad and Consul clusters on machines (DigitalOcean droplets)? I don’t mind using Terraform, but I don’t have any experience with Vagrant :frowning:

  3. From the diagrams I’m reading, do Nomad and Consul detect each other because the Nomad and Consul client agents run on the same machine?

If you want to zoom out quite a bit further, you can look at the Reference Architectures for Nomad, but to answer your questions here:

  1. Yes, you would need a Consul cluster configured for Nomad to be able to auto-discover the other nodes using the Consul integration (see the configuration sketch after this list).

  2. Yep, a Consul cluster and a Consul datacenter are essentially synonymous at small scale. A Consul datacenter is a gossip pool of agents and is parallel to a Nomad region.

  3. In a typical deployment, you have a Consul server cluster, and then you deploy local Consul agents on all of your other nodes. These local Consul agents are what applications like Nomad should be configured to talk to.

    Also note that Nomad nodes cannot share a Consul agent, because each Nomad client expects to be authoritative for all of the “nomad-shaped” Consul services and checks on its agent and keeps that state synchronized with an internal table. If more than one Nomad client shares the same Consul agent, each sees the checks the others registered and starts deleting them as the sync processes compete with one another.
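
As a sketch of that wiring, each Nomad agent points at its local Consul agent through the consul stanza; the address below assumes a Consul agent running locally on the default port:

    # In each Nomad agent's configuration -- a sketch assuming a local
    # Consul agent on the default port.
    consul {
      # The local Consul agent this Nomad node talks to.
      address = "127.0.0.1:8500"

      # Register the Nomad agent's own services in Consul.
      auto_advertise = true

      # Let servers and clients find one another through Consul
      # (the auto-discovery from question 1).
      server_auto_join = true
      client_auto_join = true
    }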

In your layout, you could create a single-node Consul server on your Nomad server and deploy Consul agents to your Nomad client nodes. For Consul, you will have to either manually join your nodes, as discussed above for Nomad, or use cloud auto-join for DigitalOcean to simplify having your Consul nodes discover one another.
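
As a sketch of the cloud auto-join option, Consul’s retry_join accepts a go-discover string for DigitalOcean; the region, tag name, and token below are placeholders, not values from this thread:

    # In each Consul agent's configuration -- a sketch; replace the
    # region, tag_name, and api_token placeholders with your own.
    retry_join = ["provider=digitalocean region=nyc1 tag_name=nomad-cluster api_token=YOUR_DO_API_TOKEN"]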
