I have a Consul cluster with Vault deployed to Docker Swarm, like this:
```yaml
version: "3.9"
services:
  consul:
    image: consul:latest
    hostname: consul{{.Task.Slot}}
    environment:
      CONSUL_BIND_INTERFACE: eth1
    ports:
      - 8500:8500
    command: agent -server -bootstrap-expect 3 -ui -client 0.0.0.0 -retry-join consul1 -retry-join consul2 -retry-join consul3
    deploy:
      replicas: 3
  vault:
    image: vault:latest
    command: server
    hostname: vault{{.Task.Slot}}
    environment:
      VAULT_LOCAL_CONFIG: '{"storage": {"consul": {"address": "http://consul:8500"}}, "listener": {"tcp": {"address": "0.0.0.0:8200", "tls_disable": 1}}, "ui": true, "default_lease_ttl": "168h", "max_lease_ttl": "720h", "disable_mlock": true}'
      VAULT_CLUSTER_INTERFACE: eth0
      SKIP_SETCAP: 1
      VAULT_ADDR: http://localhost:8200
      VAULT_API_ADDR: http://vault{{.Task.Slot}}:8200
    deploy:
      replicas: 2
    ports:
      - 8200:8200
```
After this has been running for a while, the Consul dashboard will show three or more Vault instances: the 2 healthy entries, plus a number of spare unhealthy entries whose checks have exceeded their timeouts. As far as I can tell, the differentiating factor is that at some point Docker recreated a Vault task (vault:vault2:8200, for example), the new task registered through a different Consul instance, and the old vault:vault2:8200 entry never got deregistered.
But now these dead entries are stuck and I don't know how to get rid of them. I also don't know whether there's something I should change to make my deployment more resilient, so they stop appearing in the first place.
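For reference, this is roughly how I've been poking at the stale entries via Consul's HTTP API (a sketch, not a fix: the host, the `dc1` datacenter, and the node/service IDs are assumptions from my setup; the endpoints are Consul's `/v1/health/state/critical` and `/v1/catalog/deregister`):

```python
import json
from urllib import request  # stdlib only

# Assumption: Consul's HTTP API is reachable here (port 8500 is published above).
CONSUL = "http://localhost:8500"

def critical_checks(checks):
    """Given the parsed JSON list from GET /v1/health/state/critical,
    return (Node, ServiceID, ServiceName) for each failing check."""
    return [(c["Node"], c["ServiceID"], c["ServiceName"]) for c in checks]

def deregister_payload(node, service_id, datacenter="dc1"):
    """Request body for PUT /v1/catalog/deregister, which force-removes
    a service instance from the catalog even when the agent that
    registered it is gone (as with my recreated Vault tasks)."""
    return {"Datacenter": datacenter, "Node": node, "ServiceID": service_id}

def deregister(node, service_id):
    """Issue the catalog deregistration (destructive: run deliberately)."""
    body = json.dumps(deregister_payload(node, service_id)).encode()
    req = request.Request(f"{CONSUL}/v1/catalog/deregister",
                          data=body, method="PUT")
    with request.urlopen(req) as resp:
        return resp.status == 200
```

That clears them out by hand, but I'd much rather the stale registrations never appear than keep force-deregistering them.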