Question: Consequences of Push/Pull Errors between Clients

Hello Consul Community,

I am currently researching the security aspects of Consul in an enterprise environment. My task is to evaluate whether the security requirements can be satisfied with the open-source version of Consul. I am planning to use Consul as the KV store for a Patroni cluster.

The critical part is the communication (gossip) between all nodes in one datacenter. This means clients belonging to different customers (databases) talk to each other over UDP and TCP (port 8301). It is not possible to prohibit gossip between clients using ACLs, so I decided to use firewall rules to limit communication between clients of different customers, roughly like the sketch below.
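
The rules look roughly like this (the subnets are made up; this is only a sketch of the approach, not my exact rule set):

```
# Allow gossip (port 8301) from the Consul servers (hypothetical subnet)
iptables -A INPUT -p tcp --dport 8301 -s 10.0.0.0/24 -j ACCEPT
iptables -A INPUT -p udp --dport 8301 -s 10.0.0.0/24 -j ACCEPT

# Allow gossip from clients of the same customer (hypothetical subnet)
iptables -A INPUT -p tcp --dport 8301 -s 10.0.1.0/24 -j ACCEPT
iptables -A INPUT -p udp --dport 8301 -s 10.0.1.0/24 -j ACCEPT

# Drop gossip from everything else, i.e. clients of other customers
iptables -A INPUT -p tcp --dport 8301 -j DROP
iptables -A INPUT -p udp --dport 8301 -j DROP
```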

This results in the expected “memberlist: Push/Pull” errors between those clients.
The documentation warns about “health flapping”: https://learn.hashicorp.com/consul/day-2-operations/network-segments

However, I was not able to observe those problems in my current test system.

So that is my question: are there any known issues caused by limiting client-to-client communication?

Hi Fablip,

Preventing Consul agents from talking to each other does harm your cluster.

The gossip between agents only exchanges membership and health information. It forms the baseline, resilient failure-detection mechanism that Consul relies on; no end-user data is exchanged between agents.

In general, Consul is designed as an infrastructure tool: it runs on all your nodes and forms a resilient control plane to build on top of. The intended pattern is to let Consul agents gossip in a full mesh and to restrict individual client access to state via ACLs instead; the gossip doesn’t contain sensitive per-client data. All actual KV requests are made via RPC directly to the servers, authenticated by ACL tokens (and over TLS if Consul is set up securely).
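
To illustrate that pattern from a tenant’s point of view, KV access is an authenticated request to the servers and never travels over the gossip channel. A minimal sketch (the address, token, and key path are placeholders I made up):

```
# Talk to the servers over HTTPS, authenticated by an ACL token
export CONSUL_HTTP_ADDR=https://consul.example.internal:8501
export CONSUL_HTTP_TOKEN=<tenant-a-token>

# Patroni (or anything else) reading and writing its own prefix
consul kv put tenant-a/patroni/config '{"ttl": 30}'
consul kv get tenant-a/patroni/config
```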

You can definitely use Consul KV for multiple tenants and use ACLs to restrict what each tenant can access, but limiting gossip between agents will at least weaken the failure detector, and in our experience it results in health checks failing more often, which can affect your availability. At worst it breaks the core design of Consul as a reliable cluster manager, which means you could run into issues with things like rotating gossip keys, having nodes reliably join and leave, and so on.
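
For the multi-tenant KV part, the per-tenant restriction is just an ACL policy scoped to a key prefix, roughly like this (the policy name and prefix are placeholders, and these commands need a management token):

```
# Policy limiting a tenant to its own KV prefix; everything else is denied
# by a default-deny ACL configuration
consul acl policy create -name "tenant-a-kv" \
  -rules 'key_prefix "tenant-a/" { policy = "write" }'

# Token for that tenant's Patroni clients, carrying only that policy
consul acl token create -description "tenant-a patroni" \
  -policy-name "tenant-a-kv"
```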

Does that help?
