Unable to join Nomad cluster

Hello!

I have a new Nomad server, C, that I want to join to an existing Nomad cluster of two servers, A and B. I start the agent on C with:

nomad agent -config /etc/nomad.d/nomad.hcl -config /etc/nomad.d/server.hcl

However, when I start the new server C, server A logs the error “failed to receive: No installed keys could decrypt the message”.

This could be caused either by mTLS, which is enabled (and working fine) on the two existing servers A and B, or by the gossip protocol encryption.

The certificate for mTLS was generated using

nomad tls cert create -ca path-to-ca.pem -key path-to-ca-key.pem -server -additional-ipaddress "192.168.1.48" -additional-dnsname "nugget.global.nomad"

and was then moved to the new server C.
(Note that the IP address in the certificate does not correspond to the IP address of the new host, but this is not a problem, since server B has the same “mistake” and works just fine.)

The gossip encryption key is the same in all three server config files (I used vimdiff to make sure there was no typo), so my case does not match the only other thread I found online about this error.

I tried running both servers A and C with -log-level TRACE but didn’t get any more logs.

I tried joining A both while it was the leader and while it was a follower, with no luck.

I even tried to join B instead of A, still the same error.

You can find the two config files for the new server C below, as well as the logs from both A and C when trying to join the cluster.

I hope someone can help, or at least point me in the right direction, since the logs say so little.

Best regards,
Virgile.

C - nomad.hcl
# Copyright (c) HashiCorp, Inc.
# SPDX-License-Identifier: BUSL-1.1

# Full configuration options can be found at https://developer.hashicorp.com/nomad/docs/configuration

data_dir  = "/opt/nomad/data"
bind_addr = "0.0.0.0"
datacenter = "paris"
enable_syslog = true

tls {
  http = true
  rpc = true

  ca_file = "/etc/nomad.d/security/nomad-agent-ca.pem"

  verify_server_hostname = true
  verify_https_client = true
}

acl {
  enabled = true
}
C - server.hcl
name = "nugget"

server {
  enabled = true
  bootstrap_expect = 3

  encrypt = "********************************************"

  raft_boltdb {
    no_freelist_sync = true
  }
  raft_multiplier = 10

  server_join {
    retry_join = ["192.168.1.21:4648"]
  }

  job_gc_threshold = "168h"
  eval_gc_threshold = "168h"
  batch_eval_gc_threshold = "168h"
  deployment_gc_threshold = "168h"
}

tls {
  cert_file = "/etc/nomad.d/security/global-nugget-nomad.pem"
  key_file = "/etc/nomad.d/security/global-nugget-nomad-key.pem"
}
C - logs
    2025-09-24T13:33:39.158+0200 [INFO]  agent.joiner: starting retry join: agent_mode=server servers=192.168.1.21:4648
    2025-09-24T13:33:39.160+0200 [DEBUG] nomad: memberlist: Initiating push/pull sync with:  192.168.1.21:4648
    2025-09-24T13:33:39.163+0200 [DEBUG] nomad: memberlist: Failed to join 192.168.1.21:4648: No installed keys could decrypt the message
    2025-09-24T13:33:39.163+0200 [WARN]  agent.joiner: join failed: agent_mode=server
  error=
  | 1 error occurred:
  | \t* Failed to join 192.168.1.21:4648: No installed keys could decrypt the message
  |
   retry=30s
A - logs
    2025-09-24T13:33:39.222+0200 [DEBUG] nomad: memberlist: Stream connection from=192.168.1.133:45006
    2025-09-24T13:33:39.223+0200 [ERROR] nomad: memberlist: failed to receive: No installed keys could decrypt the message from=192.168.1.133:45006

Hello 3ligriv

Thank you for posting on the HashiCorp Discuss forum.

You are correct: this does look like a gossip encryption issue. Can you please run the command below on all three servers to check that they are all using the same encryption key:

nomad operator gossip keyring list

If you’re okay with sharing the output, please mask the encryption keys so that they are mostly hidden. The idea is to see whether a stale or wrong key is in use.
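For anyone who would rather not paste even a partly masked key, one option is to compare short hashes of the keys instead. The sketch below is only an illustration: the function name keyring_fingerprints is made up for this example, and it assumes the output layout shown later in this thread (a "Gathering ..." line, a "Key" header, then one key per line) and that sha256sum is available.

```shell
# keyring_fingerprints (hypothetical helper): read the output of
# `nomad operator gossip keyring list` on stdin and print a 12-character
# SHA-256 fingerprint for each key, so keys can be compared without being shown.
# Assumes the layout seen in this thread: "Gathering ..." line, "Key" header, keys.
keyring_fingerprints() {
  while IFS= read -r line || [ -n "$line" ]; do
    # Drop any whitespace; base64 keys contain none.
    line=$(printf '%s' "$line" | tr -d '[:space:]')
    case "$line" in
      ""|Gathering*|Key) continue ;;   # skip blank, status, and header lines
    esac
    printf '%s' "$line" | sha256sum | cut -c1-12
  done
}

# Usage, on each server (with the usual ACL token and TLS environment set):
#   nomad operator gossip keyring list | keyring_fingerprints
```

Identical keys give identical fingerprints, so a server whose fingerprint differs from the others is the one holding a different key.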

Looking forward to hearing from you.

On server A:

Gathering installed encryption keys...
Key
m4TQg5Xd8Vj/4293lOq6yN/koHvzX***************

On server B:

Gathering installed encryption keys...
Key
m4TQg5Xd8Vj/4293lOq6yN/koHvzX***************

On server C (the new one):

Gathering installed encryption keys...
Key
m4TQg5Xd8Vj/4296lOq6yN/koHvzX***************

The key on C is not right: there is a typo (4296 instead of 4293). It dates from the first time I set up the server, although I corrected it immediately (I can’t remember whether I had this issue right from the start or whether it was a different one). I did not know that it would persist across multiple restarts, though.

There is a warning about this in the logs, but it is the very first line after the agent starts:

Logs
sept. 24 15:43:04 lamd64-1 systemd[1]: Starting nomad-server.service - Nomad server...
░░ Subject: Unit nomad-server.service has begun start-up
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ Unit nomad-server.service has begun starting up.
sept. 24 15:43:04 lamd64-1 nomad[35354]: WARNING: keyring exists but -encrypt given, using keyring
sept. 24 15:43:04 lamd64-1 nomad[35354]: ==> Config enable_syslog is `true` with log_level=INFO
sept. 24 15:43:04 lamd64-1 nomad[35354]: ==> Loaded configuration from /etc/nomad.d/nomad.hcl, /etc/nomad.d/server.hcl
sept. 24 15:43:04 lamd64-1 nomad[35354]: ==> Starting Nomad agent...
sept. 24 15:43:04 lamd64-1 nomad[35354]:  nomad: setting up raft bolt store: no_freelist_sync=true
sept. 24 15:43:04 lamd64-1 nomad[35354]:  nomad.raft: initial configuration: index=0 servers=[]
sept. 24 15:43:04 lamd64-1 nomad[35354]:  nomad.raft: entering follower state: follower="Node at 192.168.1.133:4647 [Follower]" leader-address= leader-id=
sept. 24 15:43:04 lamd64-1 nomad[35354]:  nomad: serf: EventMemberJoin: nugget.global 192.168.1.133
sept. 24 15:43:04 lamd64-1 nomad[35354]:  nomad: starting scheduling worker(s): num_workers=4 schedulers=["service", "batch", "system", "sysbatch", "_core"]
sept. 24 15:43:04 lamd64-1 nomad[35354]:  nomad: serf: Failed to re-join any previously known node
sept. 24 15:43:04 lamd64-1 nomad[35354]:  nomad: started scheduling worker(s): num_workers=4 schedulers=["service", "batch", "system", "sysbatch", "_core"]
sept. 24 15:43:04 lamd64-1 nomad[35354]:  agent: not registering Nomad HTTPS Health Check because verify_https_client enabled
sept. 24 15:43:04 lamd64-1 nomad[35354]:  nomad: adding server: server="nugget.global (Addr: 192.168.1.133:4647) (DC: paris)"
sept. 24 15:43:04 lamd64-1 nomad[35354]: nomad: error looking up Nomad servers in Consul: error="server.nomad: unable to query Consul datacenters: Get \"http://127.0.0.1:8500/v1/catalog/datacenters\": dial tcp 127.0.0.1:8500: connect: connection refused"
sept. 24 15:43:04 lamd64-1 nomad[35354]: ==> Nomad agent configuration:
sept. 24 15:43:04 lamd64-1 nomad[35354]:        Advertise Addrs: HTTP: 192.168.1.133:4646; RPC: 192.168.1.133:4647; Serf: 192.168.1.133:4648
sept. 24 15:43:04 lamd64-1 nomad[35354]:             Bind Addrs: HTTP: [0.0.0.0:4646]; RPC: 0.0.0.0:4647; Serf: 0.0.0.0:4648
sept. 24 15:43:04 lamd64-1 nomad[35354]:                 Client: false
sept. 24 15:43:04 lamd64-1 nomad[35354]:              Log Level: INFO
sept. 24 15:43:04 lamd64-1 nomad[35354]:                Node Id: 5817b990-802b-2f9d-f1a4-541e04d3f451
sept. 24 15:43:04 lamd64-1 nomad[35354]:                 Region: global (DC: paris)
sept. 24 15:43:04 lamd64-1 nomad[35354]:                 Server: true
sept. 24 15:43:04 lamd64-1 nomad[35354]:                Version: 1.10.5
sept. 24 15:43:04 lamd64-1 nomad[35354]: ==> Nomad agent started! Log data will stream in below:
sept. 24 15:43:04 lamd64-1 nomad[35354]:     2025-09-24T15:43:04.708+0200 [INFO]  nomad: setting up raft bolt store: no_freelist_sync=true
sept. 24 15:43:04 lamd64-1 nomad[35354]:     2025-09-24T15:43:04.709+0200 [INFO]  nomad.raft: initial configuration: index=0 servers=[]
sept. 24 15:43:04 lamd64-1 nomad[35354]:     2025-09-24T15:43:04.709+0200 [INFO]  nomad.raft: entering follower state: follower="Node at 192.168.1.133:4647 [Follower]" leader-address= leader-id=
sept. 24 15:43:04 lamd64-1 nomad[35354]:     2025-09-24T15:43:04.710+0200 [INFO]  nomad: serf: EventMemberJoin: nugget.global 192.168.1.133
sept. 24 15:43:04 lamd64-1 nomad[35354]:     2025-09-24T15:43:04.710+0200 [INFO]  nomad: starting scheduling worker(s): num_workers=4 schedulers=["service", "batch", "system", "sysbatch", "_core"]
sept. 24 15:43:04 lamd64-1 nomad[35354]:     2025-09-24T15:43:04.710+0200 [WARN]  nomad: serf: Failed to re-join any previously known node
sept. 24 15:43:04 lamd64-1 nomad[35354]:     2025-09-24T15:43:04.710+0200 [INFO]  nomad: started scheduling worker(s): num_workers=4 schedulers=["service", "batch", "system", "sysbatch", "_core"]
sept. 24 15:43:04 lamd64-1 nomad[35354]:     2025-09-24T15:43:04.710+0200 [WARN]  agent: not registering Nomad HTTPS Health Check because verify_https_client enabled
sept. 24 15:43:04 lamd64-1 nomad[35354]:     2025-09-24T15:43:04.710+0200 [INFO]  nomad: adding server: server="nugget.global (Addr: 192.168.1.133:4647) (DC: paris)"
sept. 24 15:43:04 lamd64-1 nomad[35354]:     2025-09-24T15:43:04.710+0200 [ERROR] nomad: error looking up Nomad servers in Consul: error="server.nomad: unable to query Consul datacenters: Get \"http://127.0.0.1:8500/v1/catalog/datacenters\": dial tcp 127.0.0.1:8500: connect: con>
sept. 24 15:43:04 lamd64-1 nomad[35354]:     2025-09-24T15:43:04.711+0200 [INFO]  agent.joiner: starting retry join: agent_mode=server servers=192.168.1.21:4648
sept. 24 15:43:04 lamd64-1 systemd[1]: Started nomad-server.service - Nomad server.
░░ Subject: Unit nomad-server.service has finished start-up
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ Unit nomad-server.service has finished starting up, with result done.
sept. 24 15:43:04 lamd64-1 nomad[35354]:  agent.joiner: starting retry join: agent_mode=server servers=192.168.1.21:4648
sept. 24 15:43:04 lamd64-1 nomad[35354]:     2025-09-24T15:43:04.716+0200 [WARN]  agent.joiner: join failed: agent_mode=server
sept. 24 15:43:04 lamd64-1 nomad[35354]:   error=
sept. 24 15:43:04 lamd64-1 nomad[35354]:   | 1 error occurred:
sept. 24 15:43:04 lamd64-1 nomad[35354]:   | \t* Failed to join 192.168.1.21:4648: No installed keys could decrypt the message
sept. 24 15:43:04 lamd64-1 nomad[35354]:   |
sept. 24 15:43:04 lamd64-1 nomad[35354]:    retry=30s
sept. 24 15:43:04 lamd64-1 nomad[35354]:  agent.joiner: join failed: agent_mode=server
                                           error=
                                           | 1 error occurred:
                                           | \t* Failed to join 192.168.1.21:4648: No installed keys could decrypt the message
                                           |
                                            retry=30s
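The first nomad line in the log above is the real clue: the agent found an existing keyring and used it instead of the encrypt value from the config. A minimal sketch for checking whether an agent is in that state follows. The function name has_keyring_warning is made up for this example, the unit name is the one from the log above, and the keyring path in the comment is an assumption about where Nomad stores the file under data_dir, so verify it on your own host.

```shell
# has_keyring_warning (hypothetical helper): succeed if the log text on stdin
# contains the warning Nomad prints when a stored keyring overrides `encrypt`.
has_keyring_warning() {
  grep -q 'keyring exists but -encrypt given'
}

# Usage with the systemd unit name from this thread:
#   journalctl -u nomad-server.service -b | has_keyring_warning \
#     && echo "encrypt from the config is being ignored; check the keyring"
#
# ASSUMPTION: the stored keyring is expected under the server's data_dir, e.g.
#   /opt/nomad/data/server/serf.keyring
# This path is not confirmed anywhere in this thread.
```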

I installed the correct key using nomad operator gossip keyring install <right key>, followed by ... use <right key>, and finally ... remove <wrong key>, and it eventually worked!
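Spelled out, the three steps above look roughly like the sketch below. The wrapper function replace_gossip_key and its argument names are made up for this example; the nomad subcommands are the ones named in this thread. It is written so that a failed step stops the sequence, meaning the wrong key is only removed once the right one is installed and in use.

```shell
# replace_gossip_key (hypothetical wrapper): swap a wrong gossip key for the
# right one, in the order described in this thread.
#   $1 = the correct key (as used by the existing servers)
#   $2 = the wrong key currently installed on this server
replace_gossip_key() {
  right_key=$1
  wrong_key=$2
  nomad operator gossip keyring install "$right_key" &&
    nomad operator gossip keyring use "$right_key" &&
    nomad operator gossip keyring remove "$wrong_key" &&
    nomad operator gossip keyring list   # confirm only the right key remains
}

# Usage, on the server with the bad key (with its ACL token and TLS environment set):
#   replace_gossip_key '<right key>' '<wrong key>'
```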

Many thanks for the help!

Best regards,
Virgile.

No problem, good to hear that.
