IP subjectAltName makes Cloud use really difficult

I am using Vault 1.6.0 with Raft storage and have Cloud (AWS) auto_join working, with nodes talking to each other over HTTPS, but I’m wondering why there is a requirement to embed the IP address in the certificate’s subjectAltName (SAN). It makes Vault virtually impossible to use with DHCP and/or auto-scaling groups.

Requiring that the SSL cert have the IP address embedded seems at odds with how the public Internet works, where the Certificate Authority vouches for the authenticity of the cert. Each node is now a “pet” instead of “cattle”, to borrow that analogy, because the node has to be built and its IP known (and it cannot change) before creating and signing the CSR.

Is there some way to turn off the IP SAN requirement? Does it support wildcards? I know about -tls-skip-verify for CLI commands, but that doesn’t seem to help with the server itself.

I’m confused. I didn’t think that was a requirement.

I know you can include 127.0.0.1 in some infrastructures, to enable CLI and cURL usage of certificates. But, as you say, anything beyond that is very limiting.

Can you provide more information about what’s failing in your infrastructure, before you took the step of including IP SANs?

Vault itself seems to be requiring the IP SAN entries. 127.0.0.1 alone isn’t enough.

An indicator of Vault’s reliance on the IP SAN field is shown in this error message (I changed the IP address of the node, causing the error):
core: join attempt failed: error="error during raft bootstrap init call: Put "https://10.0.11.68:8200/v1/sys/storage/raft/bootstrap/challenge": x509: certificate is valid for 10.0.11.63, 127.0.0.1, not 10.0.11.68"

Wow. Now I’m really confused.

Can you share what those certs look like? What’s the CN? FQDNs as SANs too?

Just to clarify, when you say you’re confused, is it because this isn’t supposed to be a requirement? Or something else?

No, that’s correct: not a requirement. I’m just getting up to speed, mind; no expert. But I have a Consul-backed cluster and a Raft-backed one, both communicating over TLS exclusively – one even auto-seals the other via transit! – and I haven’t had to worry about IP addresses at all.

Very strange indeed! I wonder if it’s specific to the Cloud auto_join feature? Are you doing this on prem?

H’m, no, good point, I’m not. Still seems a strange thing to introduce with that functionality, but I’m probably missing something.

I’m going to try again without any SAN at all and I’ll post the specific error message shortly.

With no IP SAN entries at all, this is the error message:

Error initializing: Put "https://127.0.0.1:8200/v1/sys/init": x509: cannot validate certificate for 127.0.0.1 because it doesn't contain any IP SANs

With only 127.0.0.1 as the IP SAN entry, this is the error message:

Dec  4 17:22:51 ip-10-0-11-68 vault[54576]: 2020-12-04T17:22:51.591Z [WARN]  core: join attempt failed: error="error during raft bootstrap init call: Put "https://10.0.85.90:8200/v1/sys/storage/raft/bootstrap/challenge": x509: certificate is valid for 127.0.0.1, not 10.0.85.90"

At this point it looks to me like IP SAN is required by Vault. That’s a non-Cloud approach, in my view, and I’d love to know how to work around or fix it.

Is a name service/DNS available in your environment? I wonder whether Raft is falling back on IPs in the absence of fully qualified domain names.

I am using DNS with both forward and reverse entries. However, the auto-discovery process uses IP addresses, so perhaps that’s a piece of this puzzle?

Dec  4 17:22:51 ip-10-0-11-68 vault[54576]: 2020-12-04T17:22:51.518Z [INFO]  core: [INFO] discover-aws: Filter instances with Project=Vault
Dec  4 17:22:51 ip-10-0-11-68 vault[54576]: 2020-12-04T17:22:51.572Z [INFO]  core: [DEBUG] discover-aws: Found 3 reservations
Dec  4 17:22:51 ip-10-0-11-68 vault[54576]: 2020-12-04T17:22:51.572Z [INFO]  core: [DEBUG] discover-aws: Reservation r-09e7e503a0b5545e8 has 1 instances
Dec  4 17:22:51 ip-10-0-11-68 vault[54576]: 2020-12-04T17:22:51.572Z [INFO]  core: [DEBUG] discover-aws: Found instance i-0709a7f1947e7175b
Dec  4 17:22:51 ip-10-0-11-68 vault[54576]: 2020-12-04T17:22:51.572Z [INFO]  core: [INFO] discover-aws: Instance i-0709a7f1947e7175b has private ip 10.0.85.90
Dec  4 17:22:51 ip-10-0-11-68 vault[54576]: 2020-12-04T17:22:51.572Z [INFO]  core: [DEBUG] discover-aws: Reservation r-02bce70c1349c66f3 has 1 instances
Dec  4 17:22:51 ip-10-0-11-68 vault[54576]: 2020-12-04T17:22:51.572Z [INFO]  core: [DEBUG] discover-aws: Found instance i-03204bc58099a88a5
Dec  4 17:22:51 ip-10-0-11-68 vault[54576]: 2020-12-04T17:22:51.572Z [INFO]  core: [INFO] discover-aws: Instance i-03204bc58099a88a5 has private ip 10.0.44.77
Dec  4 17:22:51 ip-10-0-11-68 vault[54576]: 2020-12-04T17:22:51.572Z [INFO]  core: [DEBUG] discover-aws: Reservation r-02fb871e4cea2b69c has 1 instances
Dec  4 17:22:51 ip-10-0-11-68 vault[54576]: 2020-12-04T17:22:51.572Z [INFO]  core: [DEBUG] discover-aws: Found instance i-0f68f8f757cdaf388
Dec  4 17:22:51 ip-10-0-11-68 vault[54576]: 2020-12-04T17:22:51.572Z [INFO]  core: [INFO] discover-aws: Instance i-0f68f8f757cdaf388 has private ip 10.0.11.68
Dec  4 17:22:51 ip-10-0-11-68 vault[54576]: 2020-12-04T17:22:51.572Z [INFO]  core: [DEBUG] discover-aws: Found ip addresses: [10.0.85.90 10.0.44.77 10.0.11.68]
Dec  4 17:22:51 ip-10-0-11-68 vault[54576]: 2020-12-04T17:22:51.572Z [INFO]  core: security barrier not initialized
Dec  4 17:22:51 ip-10-0-11-68 vault[54576]: 2020-12-04T17:22:51.572Z [INFO]  core: attempting to join possible raft leader node: leader_addr=https://10.0.85.90:8200
Dec  4 17:22:51 ip-10-0-11-68 vault[54576]: 2020-12-04T17:22:51.591Z [WARN]  core: join attempt failed: error="error during raft bootstrap init call: Put "https://10.0.85.90:8200/v1/sys/storage/raft/bootstrap/challenge": x509: certificate is valid for 127.0.0.1, not 10.0.85.90"

I’m confused here.
Are you trying to use https://10.0.11.68:8200 and expecting a cert for vault.yourco.com to work?
Vault should be served up and consumed by clients with a domain name, not the IP.

This is the automatic discovery feature in 1.6. As the logs above show, the hosts themselves are querying the environment for specific tags that indicate a potential Vault peer.

If Vault requires a DNS name, then I would expect Vault to attempt a DNS lookup based on the IP that the Vault discovery mechanism found.

@mikegreen’s comment had an interesting clue. Using the DNS name and manually joining the node seems to work. However, this completely defeats the purpose of Cloud auto_join.

Right now this looks like a design flaw in the discover-aws module. It scans and finds the other EC2 instances, but tries to connect via IP address, which leads to the IP SAN rabbit hole.

It seems like a simple solution would be to perform a reverse lookup on the IP address before connecting.

To close the loop on this, I’ve filed a bug report: