Migration from Consul to Integrated Storage, using Consul for service discovery

We are moving from Consul to Integrated Storage on Vault OSS 1.6.7, and this is our Vault config. We are having an issue joining the followers to the leader node: the join command reports success, but after unsealing, both nodes show their status as leader.
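For reference, this is roughly how we are checking each node after unsealing (only standard CLI commands; the address is per node):

export VAULT_ADDR=https://10.0.0.1:8200          # 10.0.0.2 on the second node
vault status -tls-skip-verify | grep -E "HA Enabled|HA Mode|HA Cluster"
vault operator raft list-peers -tls-skip-verify  # should list every node once the cluster has formed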

cat /etc/vault.d/vault-config

ui = true
cluster_addr = "https://10.0.0.1:8201"
api_addr = "https://10.0.0.1:8200"
disable_mlock = true

storage "raft" {
  path = "/vault/raft/data"
  node_id = "10.0.0.1"

  retry_join {
    auto_join = "provider=aws tag_key=Application tag_value=Vault addr_type=private_v4"
    auto_join_scheme = "https"
    leader_tls_servername = "vault.service.domain.com"
    leader_client_cert_file = "/etc/ssl/vault/fullchain.pem"
    leader_client_key_file = "/etc/ssl/vault/privkey.pem"
  }
}

listener "tcp" {
  address = "10.0.0.1:8200"
  cluster_addr = "10.0.0.1:8201"
  tls_cipher_suites = ""
  tls_prefer_server_cipher_suites = "true"
  tls_min_version = "tls12"
  tls_cert_file = "/etc/ssl/vault/fullchain.pem"
  tls_key_file = "/etc/ssl/vault/privkey.pem"
}

service_registration "consul" {
  token = "xxxxxxxxxxx"
}

telemetry {
  statsd_address = "127.0.0.1:8125"
}

I'm going to assume these are in different zones but using the same SG? Can you post the config from the second node? I'd also start the nodes in debug mode to get logging, and post that as well.
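Something along these lines for the logging (assuming a systemd-managed install; adjust paths and unit names to your setup):

# run a node in the foreground with debug logging to watch the join attempts
vault server -config=/etc/vault.d/vault-config -log-level=debug

# or set log_level = "debug" in the config and follow the service logs
journalctl -u vault -f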

My guess is that either you have a conflict or your strict join parameters are throwing something off.

A small side note: if this is a production environment, I'd highly recommend upgrading to at least 1.8 before using Integrated Storage. I'm not aware of any specific issues, but even HashiCorp doesn't encourage it before 1.8+ because of the maturity of the code.

Everything is deployed with the same IAM role and the same SG, in the same region, as part of a single deployment. Two nodes are in the same AZ and the third is in a different AZ.
We are moving from Vault 1.6.7 with the Consul backend to Integrated Storage.

Here is the second node's config:
cat /etc/vault.d/vault-config

ui = true
cluster_addr = "https://10.0.0.2:8201"
api_addr = "https://10.0.0.2:8200"
disable_mlock = true

storage "raft" {
  path = "/vault/raft/data"
  node_id = "10.0.0.2"

  retry_join {
    auto_join = "provider=aws tag_key=Application tag_value=Vault addr_type=private_v4"
    auto_join_scheme = "https"
    leader_tls_servername = "vault.service.domain.com"
    leader_client_cert_file = "/etc/ssl/vault/fullchain.pem"
    leader_client_key_file = "/etc/ssl/vault/privkey.pem"
  }
}

listener "tcp" {
  address = "10.0.0.2:8200"
  cluster_addr = "10.0.0.2:8201"
  tls_cipher_suites = "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA,TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA,TLS_RSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA"
  tls_prefer_server_cipher_suites = "true"
  tls_min_version = "tls12"
  tls_cert_file = "/etc/ssl/vault/fullchain.pem"
  tls_key_file = "/etc/ssl/vault/privkey.pem"
}

service_registration "consul" {
  token = "xxxxxxxxxxx"
}

telemetry {
  statsd_address = "127.0.0.1:8125"
}

We cannot use the KMS auto-unseal option, so we are trying different approaches here. We are facing a chicken-and-egg problem: the nodes don't have a shared view of storage until the Raft cluster has been formed, but we're trying to form the Raft cluster with TLS enabled.
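One workaround we are considering (only a sketch on our side; the CA path below is a placeholder) is dropping auto_join and listing the peers statically, so the join no longer depends on discovery while TLS is being brought up:

retry_join {
  leader_api_addr       = "https://10.0.0.1:8200"
  leader_tls_servername = "vault.service.domain.com"
  leader_ca_cert_file   = "/etc/ssl/vault/ca.pem"   # CA that issued the node certificates (placeholder path)
}

# one retry_join block like this per existing node, including the one in the other AZ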

The issue is that you're telling each node to be its own cluster.

  • api_addr should be the LB that provides access to all nodes (see the sketch after this list).
  • I'd recommend removing node_id and letting Vault set it automatically; there is no reason to force it.
  • tls_cipher_suites can't be empty: either set it or don't set the variable at all.
  • license_path via config may be deprecated; you may want to start the cluster and then set the license with an API PUT.
  • statsd should point to 127.0.0.1:8125 (not 127.0.0.2).
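To illustrate the first two points, something along these lines on every node (the LB hostname below is just a placeholder for whatever fronts your cluster):

api_addr     = "https://vault-lb.example.com:8200"   # shared LB address, identical on every node
cluster_addr = "https://10.0.0.1:8201"               # still per-node

storage "raft" {
  path = "/vault/raft/data"
  # no node_id here: Vault generates one automatically

  retry_join {
    auto_join             = "provider=aws tag_key=Application tag_value=Vault addr_type=private_v4"
    auto_join_scheme      = "https"
    leader_tls_servername = "vault.service.domain.com"
  }
}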

Thank you, Aram, for getting back.
A couple of things: api_addr has been changed to our Vault service URL, similar to leader_tls_servername.

We use Consul's service discovery functionality, so we get the vault.consul-service URL.

tls_cipher_suites is not empty; I just removed it for simplicity, and statsd is also 127.0.0.1:8125.

Note: I followed this guide for migration: Storage Migration tutorial - Consul to Integrated Storage | Vault - HashiCorp Learn

The setup after migration runs fine with TLS disabled; we are just trying to enable TLS now.

The CA file is missing in the retry_join and listener stanzas. Could that be what's causing this issue?
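Roughly what we have in mind, if that's the right direction (the CA path is a guess on our side; leader_ca_cert_file and tls_client_ca_file are the documented parameters we are looking at):

retry_join {
  auto_join             = "provider=aws tag_key=Application tag_value=Vault addr_type=private_v4"
  auto_join_scheme      = "https"
  leader_tls_servername = "vault.service.domain.com"
  leader_ca_cert_file   = "/etc/ssl/vault/ca.pem"   # CA that issued the node certificates (guessed path)
}

listener "tcp" {
  # ...existing listener settings...
  tls_client_ca_file = "/etc/ssl/vault/ca.pem"      # only relevant if client certificates are being verified
}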

After changing the api_addr variable and adding the LB, this is what we see:

Code: 503. Errors:

  • Vault is sealed

2022-03-09T20:42:57.198Z [INFO]  core: security barrier not initialized
2022-03-09T20:42:57.198Z [INFO]  core: attempting to join possible raft leader node: leader_addr=https://10.0.0.1:8200
2022-03-09T20:42:57.206Z [WARN]  core: join attempt failed: error="error during raft bootstrap init call: Put \"https://10.0.0.1:8200/v1/sys/storage/raft/bootstrap/challenge\": Forbidden"
2022-03-09T20:42:57.206Z [ERROR] core: failed to retry join raft cluster: retry=2s

@aram I am facing exactly the same issue, and my setup is just like @pshinde's: we are using Consul as a service discovery mechanism and don't have an LB. My configs are also identical to @pshinde's. Any help would really unblock me and @pshinde. I have been stuck on this error for about 5 days now.

Yes, the SSL connection must succeed with the full certificate chain available, otherwise the connection will fail.

My suggestion is to turn off SSL and validate your environment before adding SSL back in.

Service discovery shouldn't have anything to do with Raft connectivity; the Raft peers need to come either from a cloud provider query (auto_join) or be statically defined.

The most likely cause is an SSL or network connectivity issue. Remove SSL until you have connectivity and everything is up and running, then add SSL back in.
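For that validation step, the quickest toggle is on the listener (plus switching the related addresses back to http), roughly:

listener "tcp" {
  address      = "10.0.0.1:8200"
  cluster_addr = "10.0.0.1:8201"
  tls_disable  = "true"   # temporary, only while validating raft connectivity
}

api_addr = "http://10.0.0.1:8200"

# and auto_join_scheme = "http" in retry_join while TLS is off

Once all nodes join cleanly and list-peers looks right, reintroduce the TLS settings one piece at a time.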


Everything was working when TLS was disabled, but now I can't seem to get the second node to join the cluster. I will post if I see any new issues.

So I guess we just have to bring the cluster up with TLS disabled to get around this.

Thank you!

We tried enabling TLS with these configs but no luck so far.

cluster_addr = "https://${IP}.vault.service.consul-internal.com::8201"
api_addr = "https://.vault.service.consul-internal.com::8200"

We also tried adding the IP in the SANs, but in the follower node's vault status we do not see the active node's address.

vault status -tls-skip-verify

Key                     Value
---                     -----
Seal Type               shamir
Initialized             true
Sealed                  false
Total Shares            5
Threshold               3
Version                 1.6.7
Storage Type            raft
Cluster Name            vault-cluster-01234-absdcfd
Cluster ID              01234xxxxxxxx
HA Enabled              true
HA Cluster              n/a
HA Mode                 standby
Active Node Address
Raft Committed Index    1233456

vault operator raft list-peers -tls-skip-verify
Error reading the raft cluster configuration: Error making API request.

URL: GET https://10.0.0.2:8200/v1/sys/storage/raft/configuration
Code: 500. Errors:

  • local node not active but active cluster node not found
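For reference, that 500 means this node is a standby that cannot find an active node; running the same command against whichever node currently reports HA Mode active (a sketch, substituting the actual address) should return the peer list once the cluster has genuinely formed:

export VAULT_ADDR=https://10.0.0.1:8200   # whichever node is currently active
vault operator raft list-peers -tls-skip-verify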