Hi. I’m facing a weird situation between Vault and Consul; maybe someone here can help.
I have a 5-node Consul cluster and a 5-node Vault cluster, both on the latest versions. They share 5 machines, with each machine hosting one member of each cluster. Vault reports directly to the local Consul server agent. The 5 machines span 3 “geographic/network zones”, one of which contains only a single node.
There was a temporary issue with one of the zones that isolated two nodes from the other 3. The problem I’m seeing now is that although there is only one active/leader Vault node, Consul DNS and the service check metrics insist on reporting two active Vault nodes, which is not true. For example, DNS queries for active.vault.service.mydc.consul alternate between two Vault nodes, and the service check metrics collected from Consul report the same two nodes.
I have no idea what’s going on here. Any ideas? TIA.
I’m assuming you have a 5-node Vault system (using integrated storage?) and a 5-node Consul cluster for service discovery, with 1 Vault node and 1 Consul node on each server.
Look at the Vault logs to see whether the leadership is stable or whether it really does change.
The documentation specifies latency limits for communication between HA nodes, both for Consul (see the Consul Reference Architecture on HashiCorp Developer) and for Vault (https://developer.hashicorp.com/vault/tutorials/day-one-raft/raft-reference-architecture#network-latency-and-bandwidth). Replicating state to a quorum of nodes (3 of 5) takes time, and if this is too slow you may still see stale state.
For your DNS results, look at the TTL values returned (use dig if you can). DNS caches a lot, so unless you are querying the Consul node directly, you may be getting data from caches.
Hi, thanks for replying, first of all.
Leadership is currently stable with no changes, judging by the logs, vault status and vault operator raft list-peers.
I do use some frontend recursors, but when querying Consul nodes directly (and individually) they all alternate between one active node and another.
This goes along with the fact that Consul’s internal service check metrics also alternate between the two leaders.
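To see exactly what Consul has on record, independent of DNS, the health API can be queried directly. As far as I know, active.vault.service.mydc.consul is a tag-based DNS lookup, so the interesting question is which registered instances of the service carry the "active" tag. Below is a minimal sketch of that check, assuming the service is registered under the default name "vault", the agent is reachable over plain HTTP, and no ACL token is needed; the function names are made up for illustration.

```python
import json
import urllib.request


def fetch_vault_service(consul_addr, service="vault"):
    """Return Consul's health entries for every instance of the service."""
    url = f"{consul_addr}/v1/health/service/{service}"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


def nodes_tagged_active(entries):
    """Names of the nodes whose service instance carries the 'active' tag."""
    return sorted(
        entry["Node"]["Node"]
        for entry in entries
        # Tags can be null in the JSON, hence the "or []".
        if "active" in (entry["Service"].get("Tags") or [])
    )
```

Calling nodes_tagged_active(fetch_vault_service("http://127.0.0.1:8500")) against each Consul agent in turn should show whether they all agree on the same two nodes. The address is a placeholder for a local agent.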
I’m starting to suspect this is not a Consul issue at all, and that the non-active Vault node is instead still insisting on reporting itself as active to Consul. I’ve tried to find evidence of that in the logs, but nothing related to the built-in service check mentions anything other than being sealed or not.
However, when querying /sys/health on each Vault node, only one reports as active. So this is probably just Vault’s internal Consul client being confused and reporting the wrong state, although I’m only speculating here.
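For reference, the /sys/health comparison can be scripted, since the endpoint encodes the node state in its HTTP status code. This is only a sketch under assumptions: the code table reflects the documented defaults (they can be overridden with query parameters), and the helper names are hypothetical.

```python
import urllib.error
import urllib.request

# Default status codes documented for GET /v1/sys/health.
HEALTH_CODES = {
    200: "active",
    429: "standby",
    472: "DR secondary",
    473: "performance standby",
    501: "not initialized",
    503: "sealed",
}


def describe(code):
    """Human-readable meaning of a /sys/health status code."""
    return HEALTH_CODES.get(code, f"unknown ({code})")


def health_code(vault_addr):
    """Status code returned by one node's /v1/sys/health (no token needed)."""
    try:
        with urllib.request.urlopen(f"{vault_addr}/v1/sys/health", timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # urllib raises on non-2xx codes such as 429


def active_nodes(codes_by_node):
    """Nodes whose health code says 'active'."""
    return sorted(node for node, code in codes_by_node.items() if code == 200)
```

Collecting health_code() for all five nodes into a dict and passing it to active_nodes() gives Vault’s own answer to compare against what Consul reports.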
You never mentioned whether you have restarted anything, but these would be my next steps to “clear things up”. It’s your system and you know it best, though.
There is an API call to have a vault node “step down” from active. This forces an election - do that on the active nodes one at a time. This might make the system converge.
Vault CLI: vault operator step-down
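The same step-down is available over the HTTP API as an update request to /sys/step-down on the active node. A minimal sketch of building that request, assuming a token with sufficient privileges; the function name and the address are placeholders:

```python
import urllib.request


def step_down_request(vault_addr, token):
    """Build (but do not send) the request asking the active node to step down."""
    return urllib.request.Request(
        f"{vault_addr}/v1/sys/step-down",
        method="PUT",
        headers={"X-Vault-Token": token},
    )
```

Sending it is a matter of urllib.request.urlopen(step_down_request("https://vault-1:8200", token), timeout=5), directed at the node that currently holds the active role.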
Since you are in an odd state, and the above does require a root token or sudo policy, you could also stop the Vault service you think is not reporting correctly and observe the results once it is down. Next would be to do the same with the Consul nodes; maybe the local nodes are not all talking to each other. The final step would be to reboot hosts, making sure you always maintain quorum.
A quick thought: since you are crossing geo/network zones, maybe a network path is being blocked for the updates. Something to keep an eye out for as you restart services. Don’t be too aggressive!
Also, while looking through the sys API endpoints, I found a few other calls that may provide more information: /sys/ha-status and /sys/leader, along with the standard /sys/seal-status and /sys/health.
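As a rough sketch of how /sys/leader could be used here: ask every node who it thinks the leader is, then check whether more than one node claims the role or more than one leader address shows up. The is_self and leader_address fields are part of the endpoint’s response; the function names and node addresses are hypothetical.

```python
import json
import urllib.request


def fetch_leader(vault_addr):
    """GET /v1/sys/leader from one Vault node (unauthenticated endpoint)."""
    with urllib.request.urlopen(f"{vault_addr}/v1/sys/leader", timeout=5) as resp:
        return json.load(resp)


def summarize(leader_by_node):
    """Return (nodes claiming leadership, distinct leader addresses seen)."""
    claimants = sorted(
        node for node, view in leader_by_node.items() if view.get("is_self")
    )
    addresses = sorted(
        {view.get("leader_address", "") for view in leader_by_node.values()}
    )
    return claimants, addresses
```

In a healthy cluster, summarize() over all five nodes should return exactly one claimant and exactly one leader address. Two claimants would point at Vault itself rather than at Consul.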
Thanks, I’m going to try the other endpoints and see what I get from them. And the step-down is a good idea indeed. I’m trying to avoid any change until I can get a better sense of what’s going on.
Forgot to mention: I have restarted all the Consul nodes, and that made no difference. That’s what made me think the problem lies in the Vault service check reporting on the fake leader node.
I wish you good luck. I would love to know how it turns out. Short of connecting to the systems, I’m not certain I can add much more.
Unfortunately I inadvertently caused a restart of the “fake” leader due to my Puppet automation, and that cleared the issue, so not much more to analyze now (until next time). Thanks for your help!