Failed to join raft cluster through Ansible

hidariyume · March 23, 2022, 1:08pm

Hi, we are trying to create a Vault cluster with Raft as the storage backend. Terraform provisions the hosts and, as its last step, a local-exec in a null_resource runs an Ansible playbook with a Vault role we’ve created.

The master node is initialized correctly and unsealed without any issues, but when we attempt to join the secondary nodes to the cluster we get the following error:

"Error joining the node to the Raft cluster: Error making API request.

URL: POST http://node2.ip.address:8200/v1/sys/storage/raft/join Code: 500. Errors:
* failed to join raft cluster: failed to join any raft leader node"

If I SSH to the host and manually rerun the exact same command that Ansible runs, the node joins the master without any issues, and I can proceed with unsealing the secondary nodes.

This is the error output from Ansible:

 failed: [test-deploy3] (item=test-deploy1) => {
│     "ansible_loop_var": "item",
│     "changed": true,
│     "cmd": [
│         "vault",
│         "operator",
│         "raft",
│         "join",
│         "-address=http://node3.ip.address:8200",
│         "http://node1.ip.address:8200"
│     ],
│     "delta": "0:00:00.076292",
│     "end": "2022-03-23 07:42:31.598838",
│     "invocation": {
│         "module_args": {
│             "_raw_params": "vault operator raft join -address=\"http://node3.ip.address:8200\" http://node1.ip.address:8200",
│             "_uses_shell": false,
│             "argv": null,
│             "chdir": null,
│             "creates": null,
│             "executable": null,
│             "removes": null,
│             "stdin": null,
│             "stdin_add_newline": true,
│             "strip_empty_ends": true,
│             "warn": true
│         }
│     },
│     "item": "test-deploy1",
│     "msg": "non-zero return code",
│     "rc": 2,
│     "start": "2022-03-23 07:42:31.522546",
│     "stderr": "Error joining the node to the Raft cluster: Error making API request.\n\nURL: POST http://node3.ip.address:8200/v1/sys/storage/raft/join\nCode: 500. Errors:\n\n* failed to join raft cluster: failed to join any raft leader node",
│     "stderr_lines": [
│         "Error joining the node to the Raft cluster: Error making API request.",
│         "",
│         "URL: POST http://node3.ip.address:8200/v1/sys/storage/raft/join",
│         "Code: 500. Errors:",
│         "",
│         "* failed to join raft cluster: failed to join any raft leader node"
│     ],
│     "stdout": "",
│     "stdout_lines": []
│ }
│ failed: [test-deploy2] (item=test-deploy1) => {
│     "ansible_loop_var": "item",
│     "changed": true,
│     "cmd": [
│         "vault",
│         "operator",
│         "raft",
│         "join",
│         "-address=http://node2.ip.address:8200",
│         "http://node1.ip.address:8200"
│     ],
│     "delta": "0:00:00.068823",
│     "end": "2022-03-23 07:42:31.050537",
│     "invocation": {
│         "module_args": {
│             "_raw_params": "vault operator raft join -address=\"http://node2.ip.address:8200\" http://node1.ip.address:8200",
│             "_uses_shell": false,
│             "argv": null,
│             "chdir": null,
│             "creates": null,
│             "executable": null,
│             "removes": null,
│             "stdin": null,
│             "stdin_add_newline": true,
│             "strip_empty_ends": true,
│             "warn": true
│         }
│     },
│     "item": "test-deploy1",
│     "msg": "non-zero return code",
│     "rc": 2,
│     "start": "2022-03-23 07:42:30.981714",
│     "stderr": "Error joining the node to the Raft cluster: Error making API request.\n\nURL: POST http://node2.ip.address:8200/v1/sys/storage/raft/join\nCode: 500. Errors:\n\n* failed to join raft cluster: failed to join any raft leader node",
│     "stderr_lines": [
│         "Error joining the node to the Raft cluster: Error making API request.",
│         "",
│         "URL: POST http://node2.ip.address:8200/v1/sys/storage/raft/join",
│         "Code: 500. Errors:",
│         "",
│         "* failed to join raft cluster: failed to join any raft leader node"
│     ],
│     "stdout": "",
│     "stdout_lines": []
│ }
│ META: noop
│ META: noop
│ META: ran handlers
│ META: ran handlers

This is the task that Ansible is running:

- name: Join raft cluster to leader
  command:
    cmd: vault operator raft join -address="http://{{ ansible_default_ipv4.address }}:8200" http://{{ hostvars[item]['ansible_default_ipv4']['address'] }}:8200
    chdir: /opt/vault
  with_items: "{{ groups.vault_leader }}"
  environment:
    VAULT_ADDR: "http://{{ ansible_default_ipv4.address }}:8200"

I have also tried with this version instead:

- name: Join raft cluster to leader
  command:
    cmd: vault operator raft join -tls-skip-verify http://{{ hostvars[item]['ansible_default_ipv4']['address'] }}:8200
    chdir: /opt/vault
  with_items : "{{ groups.vault_leader }}"
  environment:
    VAULT_ADDR: "http://{{ ansible_default_ipv4.address }}:8200"

This is the template being used for the vault config:

# Define the storage node-ID and path
storage "raft" {
  node_id           = "{{ ansible_hostname }}"
  path              = "/opt/vault/data"
}

# Set up externally available APIs
api_addr            = "http://{{ ansible_default_ipv4.address }}:8200"
cluster_addr        = "http://{{ ansible_default_ipv4.address }}:8201"

# Specify the cluster name
cluster_name        = "{{ vault_cluster_name }}"

# Start TCP listener for APIs
listener "tcp" {
  address           = "0.0.0.0:8200"
  cluster_address   = "0.0.0.0:8201"
  tls_disable       = true
}

default_lease_ttl   = "20m"
disable_mlock       = true
disable_cache       = true 
ui                  = true

I have also tried joining a node manually, then manually removing it from the peers and re-running the playbook. In those cases the Ansible task manages to join the node to the cluster without any issues.

I have been trying to troubleshoot this for a few days now and can’t seem to figure out where it’s going wrong. I don’t know if I’m just missing something obvious here.

The Git repository for the role is in a private GitLab instance, but I have uploaded the relevant files to my GitHub.

Any help or ideas are greatly appreciated!

aram · March 24, 2022, 10:51am

I have never tried this, but I think the problem is that you’re setting up Vault, but not actually setting up a cluster before trying to join nodes to it, which isn’t going to work. For a node to join a cluster, the cluster must be initialized, up, and healthy.

Depending on where you’re deploying to (cloud, Kubernetes, instances, etc.), there are different methods of setting up a Vault cluster from scratch.

My suggestion is to do the steps manually, figure out what works for your deployment, and then automate it afterwards.

hidariyume · March 24, 2022, 11:01am

When I run the playbook and role excluding the followers.yml tasks, and then run the join command manually on each host, it works fine and I get a working cluster, though.

So it’s specifically when I’m running it through the Ansible task that it fails.

Am I missing any steps in setting up the master node, from what you can see? My coworker, who wrote the tasks for initializing the master node, followed the documentation here: Vault HA Cluster with Integrated Storage | Vault - HashiCorp Learn

But as stated, if I run the procedure step by step manually there are no issues; it’s specifically when the join command is run through Ansible that it fails.

Thanks in advance!

aram · March 24, 2022, 11:13am

Okay, great. Then I would ask in an Ansible forum. There is nothing about an Ansible playbook that Vault would know anything about.

jeffsanicola · March 24, 2022, 11:59am

I don’t have much experience with Ansible either. However, one key difference between running the commands manually and running them through automation is how quickly the commands are issued. Would you need to add a short pause after deploying the first node, before the 2nd/3rd nodes attempt to join?

hidariyume · March 24, 2022, 1:53pm

Hmm, that might be worth a try actually. I’ll give it a shot, thanks!

hidariyume · March 24, 2022, 5:21pm

Putting a 1 minute pause seems to have done the trick.
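
The thread does not show the task that was actually added. As a minimal sketch, assuming the pause sits between the leader's init/unseal tasks and the join task, it might look something like this (the task name is illustrative, not from the original role):

```yaml
# Hypothetical task: give the freshly unsealed leader time to become
# active before the followers attempt "vault operator raft join".
- name: Wait for the leader to settle before joining
  pause:
    minutes: 1
```

A fixed pause is simple but either wastes time or is too short on a slow host, which is why a readiness check is usually preferable.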

Are there any polling commands, for example, that I could use to check when the master is ready to accept nodes/commands in Vault, instead of just a straight-up pause? Some sort of health check would be preferable to the pause command, if it is possible.

Thanks a ton for the initial idea though!

jeffsanicola · March 24, 2022, 6:03pm

I’m not sure offhand.

I’d probably start with sys/storage/raft/configuration and/or sys/health. Depending on what your process does, perhaps reading sys/init would be helpful.
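
As a rough sketch of the sys/health idea, something like the following Ansible task could poll the leader before the join runs. It is untested against this role; it reuses the `vault_leader` group and address lookup from the tasks earlier in the thread, while the task name, the `leader_health` variable, and the retry/delay values are assumptions:

```yaml
# Sketch only: poll the leader's health endpoint from each follower.
# By default /v1/sys/health returns 200 only when the node is initialized,
# unsealed, and active (429 = standby, 501 = not initialized, 503 = sealed).
- name: Wait for the leader to report healthy
  uri:
    url: "http://{{ hostvars[item]['ansible_default_ipv4']['address'] }}:8200/v1/sys/health"
    method: GET
    status_code: 200
  register: leader_health          # hypothetical variable name
  until: leader_health.status == 200
  retries: 30                      # assumed: up to roughly 60 seconds in total
  delay: 2
  with_items: "{{ groups.vault_leader }}"
```

Placed directly before the join task, this should make the play wait only as long as the leader actually needs, rather than a fixed minute.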

You’ll probably need to experiment a bit to get it sorted out, however.
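
One thing that might be worth experimenting with is retrying the join itself instead of checking health separately. This is a sketch based on the second task variant posted above, with Ansible's `until`/`retries`/`delay` added; the `raft_join` variable name and the retry values are assumptions:

```yaml
# Sketch only: re-run the join until the leader accepts it.
- name: Join raft cluster to leader (with retries)
  command:
    cmd: vault operator raft join http://{{ hostvars[item]['ansible_default_ipv4']['address'] }}:8200
    chdir: /opt/vault
  with_items: "{{ groups.vault_leader }}"
  environment:
    VAULT_ADDR: "http://{{ ansible_default_ipv4.address }}:8200"
  register: raft_join              # hypothetical variable name
  until: raft_join.rc == 0         # the failing runs above returned rc 2
  retries: 10                      # assumed values; tune for your environment
  delay: 6
```

This keeps the role to a single task, at the cost of hiding why the early attempts failed unless you inspect the registered output.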