Migration from Consul to Integrated Storage, using Consul for service discovery

We are moving from Consul to Integrated Storage on Vault OSS 1.6.7, and this is our Vault config. We are having an issue joining the followers to the leader node: the join command reports success, but after unsealing, both nodes show their status as leader.
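For reference, this is roughly how we are checking each node after unsealing (only standard CLI commands; the address is per node):

export VAULT_ADDR=https://10.0.0.1:8200          # 10.0.0.2 on the second node
vault status -tls-skip-verify | grep -E "HA Enabled|HA Mode|HA Cluster"
vault operator raft list-peers -tls-skip-verify  # should list every node once the cluster has formed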

cat /etc/vault.d/vault-config

ui = true
cluster_addr = "https://10.0.0.1:8201"
api_addr = "https://10.0.0.1:8200"
disable_mlock = true

storage "raft" {
  path = "/vault/raft/data"
  node_id = "10.0.0.1"

  retry_join {
    auto_join = "provider=aws tag_key=Application tag_value=Vault addr_type=private_v4"
    auto_join_scheme = "https"
    leader_tls_servername = "vault.service.domain.com"
    leader_client_cert_file = "/etc/ssl/vault/fullchain.pem"
    leader_client_key_file = "/etc/ssl/vault/privkey.pem"
  }
}

listener "tcp" {
  address = "10.0.0.1:8200"
  cluster_addr = "10.0.0.1:8201"
  tls_cipher_suites = ""
  tls_prefer_server_cipher_suites = "true"
  tls_min_version = "tls12"
  tls_cert_file = "/etc/ssl/vault/fullchain.pem"
  tls_key_file = "/etc/ssl/vault/privkey.pem"
}

service_registration "consul" {
  token = "xxxxxxxxxxx"
}

telemetry {
  statsd_address = "127.0.0.1:8125"
}

I'm going to assume these are in different zones but using the same SG? Can you post the config from the second node? I'd also start the nodes in debug mode to get logging, and post that as well.
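Something along these lines for the logging (assuming a systemd-managed install; adjust paths and unit names to your setup):

# run a node in the foreground with debug logging to watch the join attempts
vault server -config=/etc/vault.d/vault-config -log-level=debug

# or set log_level = "debug" in the config and follow the service logs
journalctl -u vault -f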

My guess is that either you have a conflict or your strict join parameters are throwing something off.

A small side note: if this is a production environment, I'd highly recommend upgrading to at least 1.8 before using Integrated Storage. I'm not aware of any specific issues, but even HashiCorp doesn't encourage it before 1.8+ because of the maturity of the code.

Everything is deployed with the same IAM role and the same SG, in the same region, as part of a single deployment. Two nodes are in the same AZ and the third is in a different AZ.
We are moving from Vault 1.6.7 with the Consul backend to Integrated Storage.

Here is the second node's config:
cat /etc/vault.d/vault-config

ui = true
cluster_addr = "https://10.0.0.2:8201"
api_addr = "https://10.0.0.2:8200"
disable_mlock = true

storage "raft" {
  path = "/vault/raft/data"
  node_id = "10.0.0.2"

  retry_join {
    auto_join = "provider=aws tag_key=Application tag_value=Vault addr_type=private_v4"
    auto_join_scheme = "https"
    leader_tls_servername = "vault.service.domain.com"
    leader_client_cert_file = "/etc/ssl/vault/fullchain.pem"
    leader_client_key_file = "/etc/ssl/vault/privkey.pem"
  }
}

listener "tcp" {
  address = "10.0.0.2:8200"
  cluster_addr = "10.0.0.2:8201"
  tls_cipher_suites = "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA,TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA,TLS_RSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA"
  tls_prefer_server_cipher_suites = "true"
  tls_min_version = "tls12"
  tls_cert_file = "/etc/ssl/vault/fullchain.pem"
  tls_key_file = "/etc/ssl/vault/privkey.pem"
}

service_registration "consul" {
  token = "xxxxxxxxxxx"
}

telemetry {
  statsd_address = "127.0.0.1:8125"
}

We cannot use the KMS auto-unseal option, so we are trying different approaches here. We are facing a chicken-and-egg problem: the nodes don't have a shared view of storage until the Raft cluster has been formed, but we're trying to form the Raft cluster with TLS enabled.
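One workaround we are considering (only a sketch on our side; the CA path below is a placeholder) is dropping auto_join and listing the peers statically, so the join no longer depends on discovery while TLS is being brought up:

retry_join {
  leader_api_addr       = "https://10.0.0.1:8200"
  leader_tls_servername = "vault.service.domain.com"
  leader_ca_cert_file   = "/etc/ssl/vault/ca.pem"   # CA that issued the node certificates (placeholder path)
}

# one retry_join block like this per existing node, including the one in the other AZ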

The issue is that you're telling each node to be its own cluster.

  • api_addr should be the LB that provides access to all nodes (see the sketch after this list).
  • I'd recommend removing node_id and letting Vault set it automatically; there is no reason to force it.
  • tls_cipher_suites can't be empty: either set it or don't set the variable at all.
  • license_path via config may be deprecated; you may want to start the cluster and then set the license with an API PUT.
  • statsd should point to 127.0.0.1:8125 (not 127.0.0.2).
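To illustrate the first two points, something along these lines on every node (the LB hostname below is just a placeholder for whatever fronts your cluster):

api_addr     = "https://vault-lb.example.com:8200"   # shared LB address, identical on every node
cluster_addr = "https://10.0.0.1:8201"               # still per-node

storage "raft" {
  path = "/vault/raft/data"
  # no node_id here: Vault generates one automatically

  retry_join {
    auto_join             = "provider=aws tag_key=Application tag_value=Vault addr_type=private_v4"
    auto_join_scheme      = "https"
    leader_tls_servername = "vault.service.domain.com"
  }
}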

Thank you, Aram, for getting back.
A couple of things: api_addr has been changed to our Vault service URL, similar to leader_tls_servername.

We use Consul's service discovery functionality, so we get the vault.consul-service URL.

tls_cipher_suites is not empty; I just removed it for simplicity, and statsd is also 127.0.0.1:8125.

Note: I followed this guide for migration: Storage Migration tutorial - Consul to Integrated Storage | Vault - HashiCorp Learn

The setup after migration runs fine with TLS disabled; we are just trying to enable TLS now.

The CA file is missing in the retry_join and listener stanzas. Could that be what's causing this issue?
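Roughly what we have in mind, if that's the right direction (the CA path is a guess on our side; leader_ca_cert_file and tls_client_ca_file are the documented parameters we are looking at):

retry_join {
  auto_join             = "provider=aws tag_key=Application tag_value=Vault addr_type=private_v4"
  auto_join_scheme      = "https"
  leader_tls_servername = "vault.service.domain.com"
  leader_ca_cert_file   = "/etc/ssl/vault/ca.pem"   # CA that issued the node certificates (guessed path)
}

listener "tcp" {
  # ...existing listener settings...
  tls_client_ca_file = "/etc/ssl/vault/ca.pem"      # only relevant if client certificates are being verified
}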

After changing the api_addr variable and adding the LB, this is what we see:

Code: 503. Errors:

  • Vault is sealed

2022-03-09T20:42:57.198Z [INFO]  core: security barrier not initialized
2022-03-09T20:42:57.198Z [INFO]  core: attempting to join possible raft leader node: leader_addr=https://10.0.0.1:8200
2022-03-09T20:42:57.206Z [WARN]  core: join attempt failed: error="error during raft bootstrap init call: Put \"https://10.0.0.1:8200/v1/sys/storage/raft/bootstrap/challenge\": Forbidden"
2022-03-09T20:42:57.206Z [ERROR] core: failed to retry join raft cluster: retry=2s

@aram I am facing exactly the same issue, and my setup is just like @pshinde's: we are using Consul as a service discovery mechanism and don't have an LB. My configs are also identical to @pshinde's. Any help would really unblock me and @pshinde. I have been stuck on this error for about 5 days now.

Yes, the SSL connection must succeed with the full certificate chain available, otherwise the connection will fail.

My suggestion is to turn off SSL and validate your environment before adding SSL back in.

Service discovery shouldn't have anything to do with Raft connectivity; the Raft peers need to come either from a cloud provider query (auto_join) or be statically defined.

The most likely cause is an SSL or network connectivity issue. Remove SSL until you have connectivity and everything is up and running, then add SSL back in.
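For that validation step, the quickest toggle is on the listener (plus switching the related addresses back to http), roughly:

listener "tcp" {
  address      = "10.0.0.1:8200"
  cluster_addr = "10.0.0.1:8201"
  tls_disable  = "true"   # temporary, only while validating raft connectivity
}

api_addr = "http://10.0.0.1:8200"

# and auto_join_scheme = "http" in retry_join while TLS is off

Once all nodes join cleanly and list-peers looks right, reintroduce the TLS settings one piece at a time.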


Everything was working when TLS was disabled, but now I can't seem to get the second node to join the cluster. I will post if I see any new issues.

So I guess we just have to bring the cluster up with TLS disabled to get around this.

Thank you!

We tried enabling TLS with these configs but no luck so far.

cluster_addr = "https://${IP}.vault.service.consul-internal.com::8201"
api_addr = "https://.vault.service.consul-internal.com::8200"

We also tried adding the IP in the SANs, but in the follower node's vault status we do not see the active node's address.

vault status -tls-skip-verify

Key                     Value
---                     -----
Seal Type               shamir
Initialized             true
Sealed                  false
Total Shares            5
Threshold               3
Version                 1.6.7
Storage Type            raft
Cluster Name            vault-cluster-01234-absdcfd
Cluster ID              01234xxxxxxxx
HA Enabled              true
HA Cluster              n/a
HA Mode                 standby
Active Node Address
Raft Committed Index    1233456

vault operator raft list-peers -tls-skip-verify
Error reading the raft cluster configuration: Error making API request.

URL: GET https://10.0.0.2:8200/v1/sys/storage/raft/configuration
Code: 500. Errors:

  • local node not active but active cluster node not found
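For reference, that 500 means this node is a standby that cannot find an active node; running the same command against whichever node currently reports HA Mode active (a sketch, substituting the actual address) should return the peer list once the cluster has genuinely formed:

export VAULT_ADDR=https://10.0.0.1:8200   # whichever node is currently active
vault operator raft list-peers -tls-skip-verify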