Hi, we are trying to create a Vault cluster with Raft as the storage backend. Provisioning is done through Terraform, which at the end runs a local-exec in a null_resource to execute an Ansible playbook using a Vault role we’ve created.
The master node is initialized correctly and unsealed without any issues, but when we attempt to join the secondary nodes to the cluster we get the following error:
"Error joining the node to the Raft cluster: Error making API request.
URL: POST http://node2.ip.address:8200/v1/sys/storage/raft/join Code: 500. Errors:
* failed to join raft cluster: failed to join any raft leader node"
If I SSH to the host and manually rerun the exact same command that Ansible runs, the node joins the master without any issues, and I can proceed with unsealing the secondary nodes.
This is the error output from Ansible:
failed: [test-deploy3] (item=test-deploy1) => {
    "ansible_loop_var": "item",
    "changed": true,
    "cmd": [
        "vault",
        "operator",
        "raft",
        "join",
        "-address=http://node3.ip.address:8200",
        "http://node1.ip.address:8200"
    ],
    "delta": "0:00:00.076292",
    "end": "2022-03-23 07:42:31.598838",
    "invocation": {
        "module_args": {
            "_raw_params": "vault operator raft join -address=\"http://node3.ip.address:8200\" http://node1.ip.address:8200",
            "_uses_shell": false,
            "argv": null,
            "chdir": null,
            "creates": null,
            "executable": null,
            "removes": null,
            "stdin": null,
            "stdin_add_newline": true,
            "strip_empty_ends": true,
            "warn": true
        }
    },
    "item": "test-deploy1",
    "msg": "non-zero return code",
    "rc": 2,
    "start": "2022-03-23 07:42:31.522546",
    "stderr": "Error joining the node to the Raft cluster: Error making API request.\n\nURL: POST http://node3.ip.address:8200/v1/sys/storage/raft/join\nCode: 500. Errors:\n\n* failed to join raft cluster: failed to join any raft leader node",
    "stderr_lines": [
        "Error joining the node to the Raft cluster: Error making API request.",
        "",
        "URL: POST http://node3.ip.address:8200/v1/sys/storage/raft/join",
        "Code: 500. Errors:",
        "",
        "* failed to join raft cluster: failed to join any raft leader node"
    ],
    "stdout": "",
    "stdout_lines": []
}
failed: [test-deploy2] (item=test-deploy1) => {
    "ansible_loop_var": "item",
    "changed": true,
    "cmd": [
        "vault",
        "operator",
        "raft",
        "join",
        "-address=http://node2.ip.address:8200",
        "http://node1.ip.address:8200"
    ],
    "delta": "0:00:00.068823",
    "end": "2022-03-23 07:42:31.050537",
    "invocation": {
        "module_args": {
            "_raw_params": "vault operator raft join -address=\"http://node2.ip.address:8200\" http://node1.ip.address:8200",
            "_uses_shell": false,
            "argv": null,
            "chdir": null,
            "creates": null,
            "executable": null,
            "removes": null,
            "stdin": null,
            "stdin_add_newline": true,
            "strip_empty_ends": true,
            "warn": true
        }
    },
    "item": "test-deploy1",
    "msg": "non-zero return code",
    "rc": 2,
    "start": "2022-03-23 07:42:30.981714",
    "stderr": "Error joining the node to the Raft cluster: Error making API request.\n\nURL: POST http://node2.ip.address:8200/v1/sys/storage/raft/join\nCode: 500. Errors:\n\n* failed to join raft cluster: failed to join any raft leader node",
    "stderr_lines": [
        "Error joining the node to the Raft cluster: Error making API request.",
        "",
        "URL: POST http://node2.ip.address:8200/v1/sys/storage/raft/join",
        "Code: 500. Errors:",
        "",
        "* failed to join raft cluster: failed to join any raft leader node"
    ],
    "stdout": "",
    "stdout_lines": []
}
META: noop
META: noop
META: ran handlers
META: ran handlers
This is the task that Ansible is running:
- name: Join raft cluster to leader
  command:
    cmd: vault operator raft join -address="http://{{ ansible_default_ipv4.address }}:8200" http://{{ hostvars[item]['ansible_default_ipv4']['address'] }}:8200
    chdir: /opt/vault
  with_items: "{{ groups.vault_leader }}"
  environment:
    VAULT_ADDR: "http://{{ ansible_default_ipv4.address }}:8200"
I have also tried with this version instead:
- name: Join raft cluster to leader
  command:
    cmd: vault operator raft join -tls-skip-verify http://{{ hostvars[item]['ansible_default_ipv4']['address'] }}:8200
    chdir: /opt/vault
  with_items: "{{ groups.vault_leader }}"
  environment:
    VAULT_ADDR: "http://{{ ansible_default_ipv4.address }}:8200"
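Since the same command succeeds when rerun by hand, timing may be a factor; that is a guess, not something confirmed by the output above. A minimal sketch of how the join could be tested against that guess: first poll the leader's `sys/health` endpoint (which returns 200 only when the node is initialized, unsealed and active), then retry the join itself. The task names, the registered variable names and the `retries`/`delay` values are assumptions for illustration and are not part of the role.

```yaml
# Sketch only: retry counts, delays and variable names are assumed values.
- name: Wait until the leader reports initialized, unsealed and active
  uri:
    url: "http://{{ hostvars[item]['ansible_default_ipv4']['address'] }}:8200/v1/sys/health"
    status_code: 200
  register: leader_health
  until: leader_health.status == 200
  retries: 10
  delay: 5
  with_items: "{{ groups.vault_leader }}"

- name: Join raft cluster to leader (retried)
  command:
    cmd: vault operator raft join http://{{ hostvars[item]['ansible_default_ipv4']['address'] }}:8200
    chdir: /opt/vault
  environment:
    VAULT_ADDR: "http://{{ ansible_default_ipv4.address }}:8200"
  register: raft_join
  until: raft_join.rc == 0
  retries: 5
  delay: 10
  with_items: "{{ groups.vault_leader }}"
```

If the join still fails on every retry while the health check passes, timing is unlikely to be the cause and the difference between the Ansible and manual runs lies elsewhere.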
This is the template being used for the Vault config:
# Define the storage node-ID and path
storage "raft" {
  node_id = "{{ ansible_hostname }}"
  path    = "/opt/vault/data"
}

# Set up externally available APIs
api_addr     = "http://{{ ansible_default_ipv4.address }}:8200"
cluster_addr = "http://{{ ansible_default_ipv4.address }}:8201"

# Specify the cluster name
cluster_name = "{{ vault_cluster_name }}"

# Start TCP listener for APIs
listener "tcp" {
  address         = "0.0.0.0:8200"
  cluster_address = "0.0.0.0:8201"
  tls_disable     = true
}

default_lease_ttl = "20m"
disable_mlock     = true
disable_cache     = true
ui                = true
I have also tried joining a node manually, then manually removing it from the Raft peers and re-running the playbook. In those cases the Ansible task manages to join the node to the cluster without any issues.
I have been trying to troubleshoot this for a few days now and can’t figure out where it’s going wrong. I don’t know if I’m just missing something obvious here.
The Git repository for the role is currently in a project on a private GitLab instance, but I have uploaded the relevant files to my GitHub.
Any help or ideas are greatly appreciated!