I’m not really sure where I went wrong. I’m attempting to stand up a Vault environment with HA, backed by Consul, following the recommended two-Availability-Zone model. This is my first time using HashiCorp products. Most of the troubleshooting I’ve seen online assumes these products are running in a containerized environment, but mine are currently running in AWS as traditional servers.
I have managed to install Consul and Vault on all the respective servers, and I’m at least getting somewhere, judging from my primary Vault server - I can at least get to the UI. But the backend is a mess and I don’t know where I went wrong.
The Vault service is running properly on my primary Vault server (which I’m using as a test before fixing the things I’ve figured out are wrong on the other servers). The Consul agent service fails with the error in the subject. Syslog is just spitting out the following two messages over and over, and the Vault is (obviously) sealed:
Nov 13 06:25:07 vault-alpha vault[2791]: 2020-11-13T06:25:07.230Z [WARN] service_registration.consul: check unable to talk with Consul backend: error="Unexpected response code: 500 (Unknown check "vault:172.31.119.46:8200:vault-sealed-check")"
Nov 13 06:25:07 vault-alpha vault[2791]: 2020-11-13T06:25:07.779Z [WARN] service_registration.consul: reconcile unable to talk with Consul backend: error="service registration failed: Unexpected response code: 403 (Permission denied)"
Can someone help me figure out how to right this ship? I would be very grateful.
Is there a reason you’re using Consul vs Integrated Storage? IS is much simpler.
First glance looks like Vault can’t talk to the Consul agent client…
For anyone to help troubleshooting, it’d be good to post:
Your config files
Output of vault status
Output of consul members
Output of consul operator raft list-peers
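A minimal sketch for gathering those outputs in one go, assuming a POSIX shell. The `run` helper is a hypothetical name; the three commands are the ones listed above. It only prints what each command reports and notes any binary that is not on the PATH.

```shell
#!/bin/sh
# run CMD [ARGS...]
# Print a header, then the command's output (stdout and stderr together).
# A missing binary or a non-zero exit is reported instead of aborting.
run() {
  echo "== $* =="
  if command -v "$1" >/dev/null 2>&1; then
    "$@" 2>&1 || echo "(exit status $?)"
  else
    echo "($1 not found on PATH)"
  fi
  echo
}

run vault status
run consul members
run consul operator raft list-peers
```

Redirecting the whole thing to a file (for example `sh collect.sh > report.txt`, where `collect.sh` is whatever you name the script) gives you something easy to paste into a reply. Config files still need to be attached by hand, with any tokens removed.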
Also, the learn.hashicorp.com site has tutorials for non-containerized deployments. Unsure which you’re following…
Error checking seal status: Get "https://127.0.0.1:8200/v1/sys/seal-status": x509: cannot validate certificate for 127.0.0.1 because it doesn't contain any IP SANs
Output of consul members:
Node                 Address              Status  Type    Build  Protocol  DC     Segment
consul-alpha-leader  172.31.105.254:8301  alive   server  1.8.4  2         us-e1  <all>
consul-bravo         172.31.119.198:8301  alive   server  1.8.4  2         us-e1  <all>
consul-charlie       172.31.119.98:8301   alive   server  1.8.4  2         us-e1  <all>
consul-delta         172.31.101.101:8301  alive   server  1.8.4  2         us-e1  <all>
consul-echo          172.31.107.165:8301  alive   server  1.8.4  2         us-e1  <all>
vault-alpha          172.31.119.46:8301   alive   client  1.8.5  2         us-e1  <default>
vault-bravo          172.31.126.217:8301  alive   client  1.8.5  2         us-e1  <default>
vault-charlie        172.31.105.114:8301  alive   client  1.8.5  2         us-e1  <default>
That vault status command is handy, and that gives me something new to chase down. I generated a self-signed cert based on the DNS name I intend to use for this service, but right now everything is just routing via IP addresses. I’ll try to fix that and see what happens.
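For anyone hitting the same "doesn't contain any IP SANs" error, here is a rough sketch of generating a throwaway self-signed cert that lists both a DNS name and IP addresses as SANs, then printing the SAN extension to confirm. Assumptions: OpenSSL 1.1.1 or newer (for `-addext`), the placeholder name `vault.example.internal`, and the helper names `make_san_cert` and `show_sans`, none of which come from this thread. The IP 172.31.119.46 is vault-alpha's address from the output above.

```shell
#!/bin/sh
# make_san_cert OUT_DIR COMMON_NAME SAN_LIST
# Writes OUT_DIR/vault.key and OUT_DIR/vault.crt (self-signed, 30 days).
# SAN_LIST uses openssl syntax, e.g. "DNS:name,IP:10.0.0.1".
make_san_cert() {
  openssl req -x509 -newkey rsa:2048 -nodes \
    -keyout "$1/vault.key" -out "$1/vault.crt" \
    -days 30 -subj "/CN=$2" \
    -addext "subjectAltName=$3"
}

# show_sans CERT_FILE
# Prints the Subject Alternative Name extension of the certificate.
show_sans() {
  openssl x509 -in "$1" -noout -text | grep -A1 "Subject Alternative Name"
}

# Demo in a temporary directory; the DNS name is a placeholder.
dir=$(mktemp -d)
make_san_cert "$dir" vault.example.internal \
  "DNS:vault.example.internal,IP:127.0.0.1,IP:172.31.119.46"
show_sans "$dir/vault.crt"
```

If the addresses you connect to (here 127.0.0.1 and the instance IP) show up as `IP Address:` entries, the IP SAN complaint should go away. Whether the client then trusts the cert is a separate question.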
Update:
vault status now shows the following message after I (temporarily) set an IP SAN cert in place (seems to have cleared the original error, but this may still be related):
Error checking seal status: Get "https://127.0.0.1:8200/v1/sys/seal-status": dial tcp 127.0.0.1:8200: connect: connection refused
From your first post I would have said it’s the wrong ACL policy attached to the token. But your default_policy is to allow everything, so that shouldn’t be the blocking factor.
You are using IPs instead of hostnames… Maybe this is wrong:
verify_server_hostname = true
But I don’t really know if your error message could point to this.
Your api_addr points to the non-loopback address. Did you try running the status check against that address?
VAULT_ADDR="https://172.31.119.46:8200" vault status
Thanks everyone for helping out.
The ‘connection refused’ error was because I needed to chmod my cert key file properly. Once I did that, I got another error:
x509: certificate signed by unknown authority
I played around with this for a while, and gave up trying to use a self-signed cert. I was planning on using a DNS cert anyway (but hadn’t during the setup phase), and once I installed it, I got rid of that error.
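For later readers who do want to stay on a self-signed cert: "signed by unknown authority" generally means the client has not been told to trust the signer. A rough sketch, assuming the Vault CLI's `VAULT_CACERT` environment variable and a PEM file at a hypothetical path; `check_trust` is an illustrative helper, not part of Vault.

```shell
#!/bin/sh
# check_trust CA_FILE CERT_FILE
# Asks openssl whether CERT_FILE chains to CA_FILE. For a self-signed
# cert, pass the same file for both arguments.
check_trust() {
  openssl verify -CAfile "$1" "$2"
}

# Hypothetical path; adjust to wherever your cert actually lives.
CERT=/etc/vault.d/tls/vault.crt
if [ -r "$CERT" ] && check_trust "$CERT" "$CERT"; then
  # Point the vault CLI at this certificate for TLS verification.
  export VAULT_CACERT="$CERT"
fi
```

With `VAULT_CACERT` exported in the same shell, `vault status` can be retried. This only covers the CLI; other clients keep their own trust settings.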
Once I finally got the vault.hcl properly reconfigured for the DNS name and redid all my export declarations, it looks like it’s working. Both services are currently running without errors. I can get to the UI, which shows the vault as sealed, but I assume it’s because I haven’t done the initial setup yet. I think we’re good on this one for now. Thanks again!
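In case it helps someone landing here with the same ‘connection refused’ symptom, this is a sketch of what a key-file permission fix can look like. The details are assumptions, not values from this thread: the 640 mode, the optional owner argument, and the `fix_key_perms` name are all illustrative, and the right owner depends on which account your Vault service runs as.

```shell
#!/bin/sh
# fix_key_perms KEY_FILE [OWNER:GROUP]
# Sets the key to owner read/write, group read, no access for others.
# The chown step only runs when an owner is given, and needs root.
fix_key_perms() {
  if [ -n "$2" ]; then
    chown "$2" "$1" || return 1
  fi
  chmod 640 "$1"
}

# Demo on a throwaway file so nothing real is touched.
key=$(mktemp)
chmod 666 "$key"
fix_key_perms "$key"
ls -l "$key"
```

On a real server you would pass the actual key path and, as root, an owner such as `root:vault` (again an assumption about the setup), then restart the Vault service so it re-reads the file.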