I’ve been trial-and-erroring my way through spinning up a Vault/Consul cluster, and it’s about ready to go live with my user base.
The only problem remaining is that I'm having a very difficult time getting everything to work with the commercially issued TLS cert I'm trying to use.
If I use the (wildcard) cert as it is provided, I can get the frontend working with no cert errors in the browser.
However, the backend is a bit of a mess. Any CLI Vault command gives me an error stating there are no IP SANs in my cert for 127.0.0.1. I can get around this by running export VAULT_ADDR='vault.site.com:8200' and adding an entry to the server(s) hosts file to pair the FQDN with 0.0.0.0. For some reason, this doesn't persist after a reboot. I also tried switching any references to port 8200 in defaults.hcl to the FQDN.
So, that’s a bit of a kludge, but that resolves my CLI errors (at least until that machine is rebooted).
However, even after all of that, API access doesn't work unless I give the API user my root cert and they add it as part of their curl command. So far, that's the only way I've managed to fix that problem.
All of this seems to relate back to the commercial cert. From what I've read, it seems like Vault wants you to use a self-signed cert with a SAN for vault.service.consul and 127.0.0.1 and supply the cert to all users to install on their machines. Without installing the cert, the users obviously get cert errors in the browser, which I need to avoid. I tried using a Terraform script to insert my own SANs into the existing cert, but as you might imagine, that essentially just turned it into a self-signed cert.
Am I the only one who kind of needs to use a commercial cert? Is there a way to install one properly in a way that all methods of access will work?
If you don't set VAULT_ADDR, the CLI will default to https://127.0.0.1:8200, as you mention. You can set that value automatically when you log in to a shell - take a look at things like .bashrc.
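A minimal sketch of that approach, assuming bash and reusing the vault.site.com address from earlier in the thread:

```shell
# Persist VAULT_ADDR across reboots and new shells by appending it to ~/.bashrc.
# The FQDN here is the one from this thread - substitute your own.
echo "export VAULT_ADDR='https://vault.site.com:8200'" >> ~/.bashrc

# Set it for the current shell as well, so you don't need to re-login:
export VAULT_ADDR='https://vault.site.com:8200'
```

That removes the need for any per-session export, and survives the reboots that were wiping out the hosts-file workaround.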
I’m not sure what you mean about “host file to pair the FQDN to 0.0.0.0”? Normally you would add a vault record to your site.com DNS zone, with that being an A/AAAA/CNAME record pointing to your cluster. For example it could be pointing at the IP address of a load balancer. If you are using Consul on each server for service discovery you could just point the vault DNS record to 127.0.0.1.
If you are having to provide a TLS certificate when you use commands such as curl, that suggests either that Vault isn't serving the full certificate chain or that the root being used isn't included in the server's truststore (for example, if it is fairly new).
For a commercial certificate you need to be using FQDNs that you control and are globally unique. So you could use vault.example.com (if you owned example.com) but you can’t use IP SANs for 127.0.0.1, 192.168.1.200 or name SANs for localhost, vault.local or vault.service.consul.
I can say that using a commercial certificate works well once configured, so I wouldn’t say that there is any expectation around using self-signed or internal CA certificates, although that is of course possible.
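For reference, a commercial cert goes into the listener stanza like any other - a sketch only, with illustrative file paths (not taken from this thread):

```hcl
# Sketch: Vault TCP listener using a commercially issued cert.
# tls_cert_file must contain the server (leaf) cert first, then the
# intermediate(s); paths here are examples.
listener "tcp" {
  address       = "0.0.0.0:8200"
  tls_cert_file = "/etc/vault/tls/combined.pem"
  tls_key_file  = "/etc/vault/tls/vault.key"
}
```

With the full chain served there, clients that already trust the commercial root shouldn't need any extra curl flags.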
Thanks for your reply. In dev parlance: if nothing else, I feel like I am one semicolon away from glory.
This morning, I rebuilt my AMI with a combo cert that is the server, int, and root certs (in that order from top to bottom) and pushed it out. I’m still having all the same issues. If I add that same cert to my curl query, the query succeeds without issue.
I tried setting the cert using export VAULT_CACERT=/path/to/cert to no avail. Not sure what I’m doing wrong. Is the combination of certs supposed to be in the opposite order? This is a Digicert wildcard cert, so even if the root cert wasn’t part of the chain, it’s definitely installed on the server already.
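The order you describe (server, then intermediate, then root, top to bottom) is the correct one. Here's a self-contained demonstration with a throwaway chain - all the names are illustrative, not your real DigiCert files - showing the bundle order and how to check that the leaf actually chains up:

```shell
# Generate a throwaway root CA, intermediate, and leaf (demo names only).
openssl req -x509 -newkey rsa:2048 -nodes -keyout root.key -out root.crt \
  -subj "/CN=Demo Root" -days 1
openssl req -newkey rsa:2048 -nodes -keyout int.key -out int.csr \
  -subj "/CN=Demo Intermediate"
printf 'basicConstraints=critical,CA:TRUE\n' > int.ext
openssl x509 -req -in int.csr -CA root.crt -CAkey root.key -CAcreateserial \
  -out int.crt -days 1 -extfile int.ext
openssl req -newkey rsa:2048 -nodes -keyout server.key -out server.csr \
  -subj "/CN=vault.site.com"
openssl x509 -req -in server.csr -CA int.crt -CAkey int.key -CAcreateserial \
  -out server.crt -days 1

# Bundle order for the TLS listener: server cert first, then intermediate,
# then root - top to bottom.
cat server.crt int.crt root.crt > combined.crt

# Sanity-check that the leaf verifies when the intermediate is supplied:
openssl verify -CAfile root.crt -untrusted int.crt server.crt
```

Running the same `openssl verify` against your real leaf and intermediate should tell you whether the bundle itself is sound before you point VAULT_CACERT (or the listener config) at it.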
I will admit to having DNS issues that may aggravate this problem (though I don't think they're directly responsible). The whole thing is on a private network using an internally facing ELB (this is all in AWS). Clients can currently connect to the UI as designed, though (I'd explain why it's set up this way, but I don't think it's relevant). But obviously that means I can't just throw an A record on my public domain registrar. The nodes themselves are not resolving the FQDN, which is why I had to put it in the hosts file. But you're saying the server hosts file entry should point to 127.0.0.1, not 0.0.0.0? Worth a shot.
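For anyone following along, the suggested hosts entry would look like this (shown against a copy of the file so it runs unprivileged; in practice you'd edit /etc/hosts itself, and the FQDN is the one from this thread):

```shell
# Map the Vault FQDN to loopback - 127.0.0.1, not 0.0.0.0, which is a
# non-routable placeholder. Working on a copy to avoid needing root here.
cp /etc/hosts hosts.demo
echo '127.0.0.1 vault.site.com' >> hosts.demo
tail -n 1 hosts.demo
```

Note that a hosts-file entry is still a per-machine workaround; an internal DNS record (e.g. a private Route 53 zone for the ELB) is the durable fix for the resolution problem described above.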