Failing to migrate from Consul to integrated storage

I’m following the Storage Migration tutorial (Consul to Integrated Storage) to migrate our Vault/Consul installation to integrated storage.

I shut down our three Vault nodes and then ran the migration on one node with a migrate.hcl file, moving the data from Consul to the configured raft path.
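
For reference, the migrate.hcl and the command I ran looked roughly like this (addresses and paths here are placeholders rather than our real values):

storage_source "consul" {
  scheme  = "https"
  address = "node-1:8500"
  path    = "vault/"
}

storage_destination "raft" {
  path = "/srv/vault/raft/"
}

# cluster_addr is required when the destination is raft
cluster_addr = "https://node-1:8201"

followed by:

vault operator migrate -config=migrate.hcl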

I then started that single node back up (we’re using AWS KMS to auto-unseal). However, when I try to start the next node, it fails to join the raft cluster:

Dec 29 14:23:45 ip-172-31-10-101 vault[32704]: 2020-12-29T14:23:45.350Z [INFO]  core: stored unseal keys supported, attempting fetch
Dec 29 14:23:45 ip-172-31-10-101 vault[32704]: 2020-12-29T14:23:45.351Z [WARN]  failed to unseal core: error="stored unseal keys are supported, but none were found"
Dec 29 14:23:45 ip-172-31-10-101 vault[32704]: 2020-12-29T14:23:45.351Z [INFO]  core: raft retry join initiated
Dec 29 14:23:45 ip-172-31-10-101 vault[32704]: 2020-12-29T14:23:45.351Z [INFO]  core: security barrier not initialized
Dec 29 14:23:45 ip-172-31-10-101 vault[32704]: 2020-12-29T14:23:45.352Z [INFO]  core: security barrier not initialized
Dec 29 14:23:45 ip-172-31-10-101 vault[32704]: 2020-12-29T14:23:45.352Z [INFO]  core: attempting to join possible raft leader node: leader_addr=https://<redacted>:8200
Dec 29 14:23:45 ip-172-31-10-101 vault[32704]: 2020-12-29T14:23:45.665Z [WARN]  core: join attempt failed: error="could not retrieve raft bootstrap package"

This is the output of vault status on the first node:

Key                      Value
---                      -----
Recovery Seal Type       shamir
Initialized              true
Sealed                   false
Total Recovery Shares    1
Threshold                1
Version                  1.6.1
Storage Type             raft
Cluster Name             vault-cluster-4ab2cc7b
Cluster ID               3d4caa24-e92e-bf89-3b63-af7691172106
HA Enabled               true
HA Cluster               https://<redacted>:8201
HA Mode                  standby
Active Node Address      https://<redacted>:8200
Raft Committed Index     590502
Raft Applied Index       590501

On the second node, however, it is this:

Key                      Value
---                      -----
Recovery Seal Type       awskms
Initialized              false
Sealed                   true
Total Recovery Shares    0
Threshold                0
Unseal Progress          0/0
Unseal Nonce             n/a
Version                  1.6.1
Storage Type             raft
HA Enabled               true

I’ve tried using vault operator raft join and also adding a retry_join stanza to the configuration (sketched below), but it doesn’t make any difference.
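
For reference, the retry_join attempt on the second node looked roughly like this (leader address is a placeholder):

storage "raft" {
  path    = "/srv/vault/raft/"
  node_id = "node-2"

  retry_join {
    leader_api_addr = "https://node-1:8200"
  }
}

and the manual join was simply:

vault operator raft join https://node-1:8200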


Can anyone help with this, please? We’ve just experienced an outage because Consul broke after our Let’s Encrypt certs got renewed. I desperately want to simplify our infrastructure by removing Consul and focusing on just Vault.

Thanks.

Hi there. I just had this problem. In my case it was a dumb configuration strategy I was pursuing: I was trying to be “clever” about automating deployments by using env vars not to override config values but to BE the config values (it turns out the agents really want those config blocks to actually be present in the HCL files).

Can you post some redacted versions of your server agents’ HCL files and a rough description of how things are set up?

Sure … at the risk of getting slammed over how I’ve gone about this :smiley:

We’ve got three servers. All three are running Vault and Consul.

Here is a sample Consul config:

{
    "server": true,
    "node_name": "node-1",
    "datacenter": "aws",
    "data_dir": "/srv/consul/data",
    "bind_addr": "0.0.0.0",
    "client_addr": "0.0.0.0",
    "advertise_addr": "NODE-1-IP",
    "bootstrap_expect": 3,
    "retry_join": ["NODE-1-IP", "NODE-2-IP", "NODE-3-IP"],
    "ui": false,
    "log_level": "info",
    "enable_syslog": true,
    "cert_file": "/etc/letsencrypt/live/blah/fullchain.pem",
    "key_file": "/etc/letsencrypt/live/blah/privkey.pem",
    "ca_file": "/etc/letsencrypt/live/blah/chain.pem",
    "verify_outgoing": true,
    "verify_incoming": false,
    "ports": {
        "https": 8500,
        "http": -1
    },
    "acl": {
        "enabled": true,
        "default_policy": "allow",
        "enable_token_persistence": true
    }
}

and here is a sample Vault config:

listener "tcp" {
  address          = "node-1:8200"
  cluster_address  = "node-1:8201"
  tls_cert_file    = "/etc/letsencrypt/live/blah/fullchain.pem"
  tls_key_file     = "/etc/letsencrypt/live/blah/privkey.pem"
}

storage "consul" {
  scheme = "https"
  address = "node-1:8500"
  path    = "vault/"
}

seal "awskms" {
  region = "us-east-1"
  kms_key_id = "REDACTED"
}

api_addr =  "https://node-1:8200"
cluster_addr = "https://node-1:8201"
disable_mlock = true
ui = true

When trying to switch to integrated storage, I commented out the storage "consul" block and introduced this new block:

storage "raft" {
  path = "/srv/vault/raft/"
  node_id = "node-1"
}

Many thanks!

So, a few things:

  1. You should probably not be using Let’s Encrypt for your agent certificates. As a TLS cert for a reverse proxy to terminate with, it should work fine, but you’re better off using a self-signed certificate that nothing from the outside world uses besides other Consul and Vault client agents; i.e. not a browser (unless you are requiring the browser to present it in order to access the UI, but it sounds like you are not going in that direction).

  2. You might try spinning up a quick set of Docker or Vagrant images with your config, but with TLS turned off. If I had to guess, the issue you are experiencing is likely not due to the migration but to the config itself. You might also have some env var still pointing to the old Consul data directory. Without seeing the actual environment I’m just guessing, but setting up a quick sandbox to test your configuration away from the current install might help rule out a few things. On a clean install (minus your Let’s Encrypt certs and TLS) that config looks like it would work. If it does not, there are probably some rudiments in your config that need adjusting; if it does, then you might need help specifically with the migration. I started my journey using Raft storage and only later migrated to KMS auto-unseal.

  3. As intimated above, double-check all the env vars being used and make sure you’re not doing something unintended. If something is set in an env var that is not reflected in your config, try unsetting the var and setting it explicitly in the config template. Specifically, you may want to try putting your AWS creds in the seal block of your template file, just while you are troubleshooting (see the sketch after this list). Also double-check that the credentials being used have access to the KMS key being used.

  4. If you have cluster_addr at the top level of the config, you don’t need to set the cluster address again in the listener block. I don’t, anyway, and haven’t had any connection problems; it may have no effect either way.
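
For point 3, a minimal sketch of what I mean, with placeholder values (and don’t leave static credentials in the file once you’re done troubleshooting):

seal "awskms" {
  region     = "us-east-1"
  kms_key_id = "YOUR-KEY-ID"

  # troubleshooting only; normally these come from the instance profile or environment
  access_key = "AKIA..."
  secret_key = "..."
}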

Post back if that gains you any ground. As I mentioned, for me it was something dumb I was doing to try to streamline deployment and configuration. If you have anything in your setup that you feel might live in that category, try undoing it, at least temporarily, to see if you get unstuck. The primary node was where the configuration issue was for me: even though it was properly initialized and unsealed, my having veered from the proper config patterns made it unreachable by the other nodes.

Thanks, @dehuszar, for your comments and suggestions.

We’re running all of our infrastructure on AWS, which has led to Vault being operated as a publicly reachable service. That is why Vault is using publicly issued certs. Since both Vault and Consul are running on the same servers, it just seemed simpler to use the same certs for Consul.

I’ve checked the ENV vars on all three nodes and there is nothing relevant to Consul. From our installation documentation, everything is defined within the configuration files.

Looking back at the info/warn messages I got when I tried the migration, it looks like the issue may be related to unsealing the data, but I may be misinterpreting the messages. As I initially stated, we’re using AWS KMS to auto-unseal. I could try migrating back to a Shamir seal before trying to migrate to integrated storage …

So the problem was that I was using AWS auto-unseal. I had to migrate back to a manual unseal key, then migrate to integrated storage, then migrate back to AWS auto-unseal.
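
In case it helps anyone else, the seal migration back and forth went roughly like this (from memory, with placeholder values):

# 1. Mark the awskms seal as disabled in the Vault config and restart Vault:
seal "awskms" {
  region     = "us-east-1"
  kms_key_id = "YOUR-KEY-ID"
  disabled   = "true"
}

# 2. Migrate to Shamir unseal keys by supplying the recovery key(s):
vault operator unseal -migrate

# 3. Run the storage migration to raft (as above) and restart with the raft storage block.

# 4. Remove disabled = "true", restart, and migrate back to auto-unseal
#    by supplying the unseal key(s):
vault operator unseal -migrate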

Apart from that, the migration went swimmingly. Phew!


If you are using AWS, you could use an ALB for your public-facing usage, with an automatic ACM certificate on the listener. Internally you could use self-signed certs, as ALBs don’t validate backend certificates against root CAs (so self-signed doesn’t cause an issue). You also get the advantage of one less thing to manage, since ACM is fully automated and hosted, while Let’s Encrypt needs you to run a regular renewal command/service. A rough sketch of the idea is below.
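
Purely as an illustration, a Terraform sketch of that pattern, assuming an existing VPC, subnets, and ACM certificate (all names, variables, and ARNs are placeholders):

resource "aws_lb" "vault" {
  name               = "vault-public"
  load_balancer_type = "application"
  subnets            = var.public_subnet_ids
  security_groups    = [var.alb_security_group_id]
}

resource "aws_lb_target_group" "vault" {
  name     = "vault-nodes"
  port     = 8200
  protocol = "HTTPS"               # the ALB won't validate the nodes' self-signed certs
  vpc_id   = var.vpc_id

  health_check {
    path    = "/v1/sys/health"
    matcher = "200,429"            # treat active and standby nodes as healthy
  }
}

resource "aws_lb_listener" "vault" {
  load_balancer_arn = aws_lb.vault.arn
  port              = 443
  protocol          = "HTTPS"
  certificate_arn   = var.acm_certificate_arn   # automatic ACM certificate

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.vault.arn
  }
}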