How to migrate from a non-HA backend to a non-HA backend + raft ha_backend

Hi, I’m trying to migrate from an S3 backend (non-HA) to a solution where I keep S3, and introduce ha_backend “raft” so I can run in HA.

Currently I’m facing some issues: once I add the ha_backend to my configuration and restart the server, it goes into standby mode.
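For reference, the relevant part of my configuration looks roughly like this (a sketch; the bucket name, raft path, and node_id are placeholders):

storage "s3" {
  bucket = "my-vault-bucket"
}

ha_storage "raft" {
  path    = "/vault/raft"
  node_id = "vault-0"
}

api_addr     = "http://172.16.0.70:8200"
cluster_addr = "https://vault-0.vault-internal:8201"

After restarting with that configuration, the server comes up like this: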

==> Vault server configuration:

          HA Storage: raft
         Api Address: http://172.16.0.70:8200
                 Cgo: disabled
     Cluster Address: https://vault-0.vault-internal:8201
          Go Version: go1.14.7
          Listener 1: tcp (addr: "[::]:8200", cluster address: "[::]:8201", max_request_duration: "1m30s", max_request_size: "33554432", tls: "disabled")
           Log Level: info
               Mlock: supported: true, enabled: false
       Recovery Mode: false
             Storage: s3
             Version: Vault v1.5.2
         Version Sha: 685fdfa60d607bca069c09d2d52b6958a7a2febd

==> Vault server started! Log data will stream in below:

2020-09-02T16:18:43.934Z [INFO] proxy environment: http_proxy= https_proxy= no_proxy=
2020-09-02T16:19:15.515Z [INFO] core.cluster-listener.tcp: starting listener: listener_address=[::]:8201
2020-09-02T16:19:15.516Z [INFO] core.cluster-listener: serving cluster requests: cluster_listen_address=[::]:8201
2020-09-02T16:19:15.588Z [INFO] core: vault is unsealed
2020-09-02T16:19:15.589Z [INFO] core: entering standby mode

When I try to log in, I get this error:

vault login $VAULT_TOKEN
Error authenticating: error looking up token: Error making API request.

URL: GET http://127.0.0.1:8200/v1/auth/token/lookup-self
Code: 500. Errors:

  • local node not active but active cluster node not found
    command terminated with exit code 2

How can I migrate my vault storage from non-HA storage to non-HA storage + ha_storage?


A similar issue was encountered on 1.5.3: a Vault backed by S3 with KMS auto-unseal fails to become the active node when HA storage is set to raft. The cluster automatically goes into standby mode.

This could be a bug; I would recommend creating an issue on GitHub.

I spent a lot of time on this yesterday. The problem is that raft, when used only as HA storage, doesn’t bootstrap the raft cluster in Vault automatically.

To get around that problem, you need to bootstrap the cluster yourself:

VAULT_TOKEN=<root_token> VAULT_ADDR=https://127.0.0.1:8200 vault write -tls-skip-verify=true -f sys/storage/raft/bootstrap

For that you will need the root token. I say root token because, in your setup, HA will be broken and keep you from logging in once you run with the new ha_storage stanza.

There are more challenges you will face eventually. For example, if the node_id value changes, a new bootstrap is needed again. I am not sure why, but if you bootstrap a single-instance cluster with node_id=theone and later change the config to node_id=thesecond, HA will be broken again. Note: I have a similar setup to yours: KMS auto-unseal + S3 storage backend, running OSS 1.5.3.

You will need to re-bootstrap, and to do that you will need to remove the raft TLS key from the storage backend. I made a backup first to ensure I didn’t break anything.
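With an S3 storage backend, that looks something like the following with the AWS CLI (a sketch; the bucket name is a placeholder, and the exact key under core/raft/ may differ, so list it first):

# list the raft TLS keyring entries; exact key names may differ
aws s3 ls s3://my-vault-bucket/core/raft/
# back up the keyring before touching it
aws s3 cp s3://my-vault-bucket/core/raft/tls ./raft-tls.backup
# remove it so a fresh bootstrap can generate a new keyring
aws s3 rm s3://my-vault-bucket/core/raft/tls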

Even if you don’t change node_id, you will eventually run into an issue where taking down the standby breaks HA on the active node, or vice versa. This is the biggest issue right now. Even when I take down the active node, the standby never becomes the new active. If I take down the standby, HA on the active is broken. It is as if the entire quorum is needed to keep raft going: take one node down, and the whole thing collapses, which seems like the opposite of the HA design. It could be a bug, or just a lack of knowledge on our part about how this is supposed to work.

I would really like someone from HashiCorp to comment on this behavior. So far this experiment has made me think that raft as HA-only storage is probably not a good option on Vault 1.5.3. I’m not sure about other versions.

I spent a lot of time on this yesterday. The problem is that raft, when used only as HA storage, doesn’t bootstrap the raft cluster in Vault automatically.

This is correct. If you’re using raft as the ha_backend mechanism, the cluster needs to be bootstrapped manually. You will need to call sys/storage/raft/bootstrap on one of the nodes to initiate the bootstrapping process, and sys/storage/raft/join on the rest of the nodes once that’s done (no leader address needs to be provided, since Vault can get this from the shared storage backend). Both of these operations need to be performed after the nodes are unsealed.
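Concretely, the sequence looks something like this (a sketch; the node addresses are placeholders):

# on one unsealed node: initiate the bootstrap
VAULT_ADDR=https://vault-0:8200 vault write -f sys/storage/raft/bootstrap

# on every other unsealed node: join the cluster
# (no leader address needed; it is discovered via the shared storage backend)
VAULT_ADDR=https://vault-1:8200 vault write -f sys/storage/raft/join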

For a detailed walkthrough, refer to this Learn guide.

Thanks for the info. Do you know why stopping the Vault process on any raft peer makes the cluster’s HA unhealthy? During my testing, I found that HA was broken whether I stopped the active or the standby peer. Is that expected? Starting the stopped peer back up brought HA back to life. Is this how it usually works? With other HA backends, losing individual instances still kept HA alive. I tested with a 2-node cluster here.

Also, with a node_id="node_${ip_address}" setup, a new bootstrap was needed again, which was a bit weird as well. In a cloud deployment model, a unique node ID is necessary and won’t necessarily stay the same over time. I’d much appreciate your thoughts on the two questions above as well.

I was able to see HA become active, but with raft it wasn’t truly HA, in my view.

If you are using a 2-node cluster, there is no ability to handle failure.

For raft, like many consensus-based clustering systems, a cluster of n nodes can tolerate the failure of (n - 1)/2 nodes (rounded down) and remain available.

So with 1 or 2 nodes, any failure takes the cluster down. You need at least 3 nodes.

The quorum required by raft to establish leadership is floor(N/2) + 1, where N is the total number of nodes in the cluster. This is, in essence, the minimum number of nodes that must be up for Vault to be operational. You need at least 3 nodes so that the failure of 1 node still leaves you within the quorum requirement.
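A few concrete values make this clearer, with quorum = floor(N/2) + 1 and tolerated failures = N - quorum:

N (nodes)   quorum   failures tolerated
1           1        0
2           2        0
3           2        1
5           3        2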

As far as node_id goes, these are intended to be unique identifiers for the nodes. If the nodes are ephemeral in nature, you would not have to bootstrap again whenever they are spun up, but you would have to join them to the existing cluster, and probably remove any dangling nodes via remove-peer whenever they are torn down, to keep the number of nodes in the cluster from growing over time.
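For example, something like this (a sketch; the node_id shown is hypothetical):

# on a freshly spun-up node, after unsealing: join the existing cluster
vault write -f sys/storage/raft/join

# after tearing a node down: remove it so the peer list doesn't grow
vault operator raft remove-peer node_10-0-1-23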

Awesome info shared above. I am trying the suggestion as I am replying. Stay tuned!

@calvn I may be way off here, but it seems to me that if I destroy an existing raft cluster and spin up a new one, I need to re-bootstrap every time? Also, the new bootstrap fails:

Error writing data to sys/storage/raft/bootstrap: Error making API request.
URL: PUT https://127.0.0.1:8200/v1/sys/storage/raft/bootstrap
Code: 500. Errors:
* could not generate TLS keyring during bootstrap: TLS keyring already present

I need to manually remove the TLS file from core/raft/ for the bootstrap to work again. Is the case of an already-bootstrapped cluster followed by a complete new rollout covered?

You might be re-using the storage backend, since TLS information is stored on storage and not ha_storage. You will have to do a rolling update to prevent all of the cluster’s nodes from being destroyed at the same time (or wipe out the storage backend’s data and start from scratch, if that’s desired). If all cluster nodes are destroyed before a new one joins, the storage backend is left with stale state indicating that a cluster has already been bootstrapped.

Yup, I guess you can’t get around this one, but it makes sense. I was hoping it would somehow let you re-bootstrap if you happened to kill the previous quorum entirely, without fiddling with the storage backend. Anyway, I came to the same conclusion as above. Usually if I go with DynamoDB or something else, I don’t need to worry about previous HA state. Thanks!