Fault tolerance in service registration?

Hi, I have a question about making service registrations and health checks more fault tolerant.

If I:

  1. Register a service and health check on an agent (using the HTTP endpoint /agent/service/register)
  2. Simulate a fault on the agent (e.g. kill -9)

Then the service and health check from that agent are marked as failed or deregistered, even though the service itself is still running fine.

This matches my understanding based on the docs, eg:


What is the recommended approach to build in fault tolerance to our service discovery for clients?

Currently our service just registers itself on startup. Is it expected to also poll the catalog, detect that its registration is missing, and re-register itself? Or is there a better approach?
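
To make the question concrete, here is a rough Python sketch of the polling idea. It is only an illustration under assumptions: it assumes a Consul-style agent HTTP API at http://127.0.0.1:8500 with GET /v1/agent/services and PUT /v1/agent/service/register. It polls the local agent's service list rather than the catalog. The service ID, name, port, health URL, and all function names are placeholders, not our real setup.

```python
import json
import time
import urllib.request

AGENT = "http://127.0.0.1:8500"  # assumed local agent address

# Hypothetical service definition; ID, name, port and check URL are placeholders.
SERVICE = {
    "ID": "web-1",
    "Name": "web",
    "Port": 8080,
    "Check": {"HTTP": "http://127.0.0.1:8080/health", "Interval": "10s"},
}


def list_services():
    """Return the services the local agent currently lists, keyed by service ID."""
    with urllib.request.urlopen(AGENT + "/v1/agent/services", timeout=5) as resp:
        return json.load(resp)


def register(service):
    """Send the service definition (including its check) to the local agent."""
    req = urllib.request.Request(
        AGENT + "/v1/agent/service/register",
        data=json.dumps(service).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req, timeout=5):
        pass


def ensure_registered(service, list_fn=list_services, register_fn=register):
    """Re-register the service if the agent no longer lists it.

    Returns True if a registration was sent, and False if none was needed
    or the agent could not be reached (the next poll simply tries again).
    """
    try:
        if service["ID"] in list_fn():
            return False
        register_fn(service)
        return True
    except OSError:
        # Agent down or restarting; urllib's URLError is a subclass of OSError.
        return False


def watch(service, interval=30.0):
    """Poll forever; meant to run in a background thread of the service."""
    while True:
        ensure_registered(service)
        time.sleep(interval)
```

The service would call register(SERVICE) on startup as it does today, then run watch(SERVICE) in a background thread. The 30 second interval is arbitrary, and the lookup and registration functions are passed in as parameters only so the check can be exercised without a running agent.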

Thanks