Vault cluster locking up when adding new nodes

We’re attempting to move a cluster to AWS IAM instance-attached roles for accessing our KMS unseal key, replacing a stored key and secret. We are running a 3-node HA cluster using the Raft storage backend.

Our procedure is as follows (a command-level sketch follows the list):

  • Create 3 new Vault nodes to replace the existing 3 in the HA cluster
  • Bring the new nodes up one by one, wait for autopilot to show that they are alive
  • Remove the old voter nodes from the cluster
  • With the old voter nodes gone, step down the old leader
  • Shut the old machines down entirely
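A minimal sketch of that procedure at the CLI (node IDs are placeholders, and the commands assume appropriate credentials against the active node):

# Check autopilot's view of the cluster after each new node joins
vault operator raft autopilot state

# Once the new nodes report healthy, remove an old voter by its node ID
vault operator raft remove-peer <old_node_id>

# With the old voters removed, ask the old leader to give up leadership
vault operator step-down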

We run Vault in four different environments and have completed this transition in one of them so far without much trouble.

The next environment is a bit different: it has a much larger Raft DB, which has meant very sluggish startup times since upgrading to 1.8. This is a result of the change to the freelist sync behaviour that affected write performance in older versions of Vault (see “Raft startup is very slow after upgrading to Vault Enterprise from Vault OSS” – HashiCorp Help Center). Nodes can take over an hour to become ready; the Raft DB is 21 GB on disk, mostly due to the regular creation of certificates and tokens for mTLS.
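A rough way to watch for a node becoming ready during these slow startups is simply to poll the standard status command (the interval below is arbitrary):

# Poll the local node until it reports as unsealed
until vault status -format=json | grep -q '"sealed": false'; do
  sleep 30
done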

The hope with the gradual rollout of new nodes and spin-down of old ones was that we’d avoid restarting any existing nodes until the new ones were fully ready. Unfortunately, after all the new nodes were brought up (6 in the cluster in total), the request timed out when we tried to run vault operator raft remove-peer for one of the old nodes. At that point the entire Vault cluster appeared to be unresponsive. We waited perhaps 20 minutes before deciding to restart all of the Vault services, including the leader, and then had to wait the usual hour-plus for the cluster to become responsive again.

We’re not entirely certain that the remove-peer operation is what caused the cluster to freeze up; that might have happened anyway some time after the new nodes joined. It’s also possible that if we’d simply waited an hour in the first instance the cluster would have come back. Ultimately, the remove-peer operation never actually did anything: at present all 6 nodes are still in the cluster.
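(For context, the peer set and leadership can be confirmed with the standard commands below; nothing here is specific to our setup.)

# List the raft peers the leader currently knows about (still shows all 6)
vault operator raft list-peers

# Confirm seal status and which node is currently the active one
vault status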

The question we have is why the entire cluster would lock up in this situation. The new nodes had retry_join blocks in the raft stanza of config.hcl for both the old nodes and the new ones.

Below is an exemplar config that represents what we have on each of the new nodes:

storage "raft" {
	node_id = “vault-1-integration-alpha”
	path = "/var/vault"
    retry_join {
     leader_api_addr = "https://<old_node_1>:8200"
     leader_ca_cert_file = “<some_path>/intermediate.bundle"
    }
    retry_join {
     leader_api_addr = "https://<old_node_2>:8200"
     leader_ca_cert_file = “<some_path>/intermediate.bundle”
    }
    retry_join {
     leader_api_addr = "https://<old_node_3>:8200"
     leader_ca_cert_file = "<some_path>/intermediate.bundle"
    }
    retry_join {
     leader_api_addr = "https://<new_node_2>:8200"
     leader_ca_cert_file = "<some_path>/intermediate.bundle"
    }
    retry_join {
     leader_api_addr = "https://<new_node_3>:8200"
     leader_ca_cert_file = “<some_path>/intermediate.bundle"
    }
}
seal "awskms" {
 kms_key_id = “<kms_key_arn>”
 region   = "eu-west-2"
}
telemetry {
 prometheus_retention_time = "10m"
 disable_hostname = true
}
listener "tcp" {
 address = "0.0.0.0:8200"
 tls_disable = 0
 tls_cert_file = “…/vault.cert"
 tls_key_file = “…/vault.key"
}
api_addr = "https://<this_node_ip>:8200"
cluster_addr = "https://<this_node_ip>:8201"
disable_mlock = true

Crikey. At some point, you really should investigate whether you could move to usage patterns which involve zero storage for this use case. But that’s a separate issue.

This ordering stands out as not making sense… conceptually, the old leader can’t still be the leader to step down if it has already been removed from the cluster. I’d like to think Raft would already be stepping it down automatically at the time of removal, but it’s something to double-check, in case it isn’t being handled gracefully.
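If the intent is what I think it is, a safer ordering would look roughly like the below (placeholder node IDs; a sketch of the idea rather than a tested runbook):

# 1. Remove the old nodes that are NOT currently the leader
vault operator raft remove-peer <old_node_2_id>
vault operator raft remove-peer <old_node_3_id>

# 2. Ask the old leader to step down so one of the new nodes takes over
vault operator step-down

# 3. Once a new node is the active leader, remove the old ex-leader
vault operator raft remove-peer <old_node_1_id>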

Vault logs from multiple nodes, for the period concerned, would be a necessity for trying to assess what happened.

Generally, I don’t think people should be manually setting node_ids.

I assume the values were different between the old and new nodes?