Vault operator raft join seems to work, but operator raft list-peers shows only leader

I am using the helm chart to deploy 3 pods to an on-prem k8s cluster. I want to use raft HA so I am using the following config:

ui = true

listener "tcp" {
    address = "[::]:8200"
    cluster_address = "[::]:8201"
    tls_cert_file = "/vault/userconfig/vault/vault.pem"
    tls_key_file = "/vault/userconfig/vault/vault-key.pem"
    tls_client_ca_file = "/vault/userconfig/vault/vault-ca.pem"
}
storage "raft" {
    path = "/vault/data"
}
service_registration "kubernetes" {}

I execute the following commands on the 1st pod and everything seems to work correctly (I am kubetailing all the logs to watch for any errors):

vault operator init
vault operator unseal
vault operator raft list-peers

list-peers shows the pod I am connected to as leader…all good so far.

Node                                    Address                              State     Voter
----                                    -------                              -----     -----
1b50fad0-ccad-8330-672c-62cb8e0d63fe    vault-0.vault-internal:8201    leader    true

Then I connect to the next pod and enter:

vault operator init
vault operator raft join https://vault-0.vault-internal:8200
vault operator unseal

…and raft join outputs:

Key       Value
---       -----
Joined    true

But when I then connect back to the leader and check the peers again I still see only the leader and not a 2nd node as I would expect:

/ $ vault operator raft list-peers
Node                                    Address                              State     Voter
----                                    -------                              -----     -----
1b50fad0-ccad-8330-672c-62cb8e0d63fe    vault-0.vault-internal:8201    leader    true

The logs don’t show any error and raft join shows “Joined=true”, but still it seems like the join has not worked.

Is there another way I can troubleshoot this?
Does anybody see an error in my config?

Thanks!

I don’t know why it’s reporting success with the ‘raft join’, but I suspect part of your problem is the ‘init’ on the second pod. You only ‘init’ the initial node in the cluster; the subsequent nodes join the cluster and are unsealed with the same unseal keys as the initial leader.
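The order described above can be sketched as two small shell functions. This is only an illustration, not an official procedure: it assumes the pods are named vault-0, vault-1, ... (as the addresses in the question suggest), that you drive them with kubectl exec from outside the cluster, and that any TLS environment the vault CLI needs is already set inside each pod. The function names are made up for this example.

```shell
# bootstrap_leader: init and unseal happen ONLY on the first pod.
# `vault operator unseal` prompts for one key share; repeat it until the
# unseal threshold is reached.
bootstrap_leader() {
    kubectl exec -ti vault-0 -- vault operator init
    kubectl exec -ti vault-0 -- vault operator unseal
}

# join_follower POD: no init here. The pod joins the existing cluster and is
# then unsealed with the key shares printed by the init on vault-0.
join_follower() {
    kubectl exec -ti "$1" -- vault operator raft join https://vault-0.vault-internal:8200
    kubectl exec -ti "$1" -- vault operator unseal
}
```

You would call bootstrap_leader once, then join_follower vault-1 and join_follower vault-2.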

Thanks @nhw76!
Now I only did an init and unseal on the first node and then a raft join on the 2nd node. After this I still don’t see the node in raft list-peers, but when I do an unseal on the 2nd node I finally see it listed as a follower. So far so good.

…but :)
When I check the logs I see the following INFO/WARN/ERROR messages coming from the leader:

[vault-0] 2020-05-12T11:24:32.373Z [INFO]  storage.raft: updating configuration: command=AddStaging server-id=801e9c2c-e2a2-f650-1e8d-c3aed9b361f6 server-addr=vault-1.vault-internal:8201 servers="[{Suffrage:Voter ID:f21d4e79-2597-ebe0-23ee-e8d629f5c327 Address:vault-0.vault-internal:8201} {Suffrage:Voter ID:801e9c2c-e2a2-f650-1e8d-c3aed9b361f6 Address:vault-1.vault-internal:8201}]" 
[vault-0] 2020-05-12T11:24:32.378Z [INFO]  storage.raft: added peer, starting replication: peer=801e9c2c-e2a2-f650-1e8d-c3aed9b361f6 
[vault-0] 2020-05-12T11:24:32.380Z [ERROR] storage.raft: failed to appendEntries to: peer="{Voter 801e9c2c-e2a2-f650-1e8d-c3aed9b361f6 vault-1.vault-internal:8201}" error="dial tcp 192.168.2.139:8201: connect: connection refused" 
[vault-0] 2020-05-12T11:24:32.380Z [INFO]  system: follower node answered the raft bootstrap challenge: follower_server_id=801e9c2c-e2a2-f650-1e8d-c3aed9b361f6 
[vault-0] 2020-05-12T11:24:32.616Z [WARN]  storage.raft: appendEntries rejected, sending older logs: peer="{Voter 801e9c2c-e2a2-f650-1e8d-c3aed9b361f6 vault-1.vault-internal:8201}" next=2 
[vault-0] 2020-05-12T11:24:32.623Z [INFO]  storage.raft: pipelining replication: peer="{Voter 801e9c2c-e2a2-f650-1e8d-c3aed9b361f6 vault-1.vault-internal:8201}" 

So on the CLI it seems that the cluster is working, but those events still make me nervous. Any ideas?

Thank you!

In my experience, there’s a bit of noise immediately after the cluster join while the new follower proves it has unsealed so that log replication can start, but it settles down quickly.

I think that looks OK.
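If you would rather check the peer table than watch the logs, a small helper like the one below can count peers by state. It is a rough sketch that assumes the default table output shown earlier in this thread (two header lines, with State as the third column); count_peers is a made-up name, not a vault subcommand.

```shell
# count_peers STATE: read `vault operator raft list-peers` table output on
# stdin and print how many rows have the given State (leader or follower).
count_peers() {
    # Skip the two header lines, then compare the third column (State).
    awk -v state="$1" 'NR > 2 && $3 == state { n++ } END { print n + 0 }'
}
```

For example, vault operator raft list-peers | count_peers follower should print 2 once both of the other pods have joined and unsealed.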


Great! Thanks so much for the help! Appreciate it! :)

Good shout on “settling down”.
It totally worked: I had to give it a minute, and then all the logs went quiet. I added a KV entry on the leader node, logged in to the follower nodes, and the value had been replicated to them almost immediately.

Config used:

/etc/vault.d/vault.hcl

storage "raft" {
    path = "/data/vault.d/raft/"
    node_id = "node1.domain.com"
}

listener "tcp" {
    address = "0.0.0.0:8200"
    cluster_address = "0.0.0.0:8201"
    tls_disable = false
    tls_cert_file = "/etc/vault.d/certs/vaultepcom.pem"
    tls_key_file = "/etc/vault.d/certs/vaultepcom.key"
}

api_addr = "https://node1.domain.com:8200"
cluster_addr = "https://node1.domain.com:8201"
ui = true

/etc/systemd/system/vault.service

[Unit]
Description=a tool for managing secrets
Documentation=https://vaultproject.io/docs/
After=network.target
ConditionFileNotEmpty=/etc/vault.d/vault.hcl

[Service]
User=vault
Group=vault
ExecStart=/usr/local/sbin/vault server -config=/etc/vault.d/vault.hcl
ExecReload=/usr/local/bin/kill --signal HUP $MAINPID
CapabilityBoundingSet=CAP_SYSLOG CAP_IPC_LOCK
Capabilities=CAP_IPC_LOCK+ep
SecureBits=keep-caps
NoNewPrivileges=yes
KillSignal=SIGINT

[Install]
WantedBy=multi-user.target