Nomad Peer Healthchecks ~ 0.9.2

mikeblum · June 21, 2019, 10:38pm

Hi,

I recently upgraded our Nomad clusters from 0.8.6 to 0.9.2 and am seeing sporadic healthcheck failures from the Nomad servers. I’ve got the following healthcheck configured on an AWS ELB that drives an Auto-Scaling Group:

Instances: 3

Protocol: HTTP
Path: /v1/status/peers
Port: 4646
Healthy Threshold: 3
Unhealthy Threshold: 3
Timeout: 10 (seconds)
Interval: 60 (seconds)
Success codes: 200-308
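
For anyone trying to reproduce this outside the ELB, here is a minimal sketch of the same check run by hand. It assumes the server's HTTP API is reachable at a base URL you supply. The helper names `status_in_range` and `probe` are made up for illustration and are not part of Nomad or AWS tooling.

```python
import urllib.error
import urllib.request


def status_in_range(code, low=200, high=308):
    """Mirror the 'Success codes: 200-308' rule from the check above."""
    return low <= code <= high


def probe(base_url, path="/v1/status/peers", timeout=10.0):
    """Issue one GET and return (status_code, body).

    status_code is None when the request times out or cannot connect,
    which the load balancer would also count as a failed check.
    Note: urllib follows redirects, so a 3xx is normally not seen here.
    """
    try:
        with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive as exceptions; keep the code and body.
        return err.code, err.read().decode("utf-8", "replace")
    except (urllib.error.URLError, OSError):
        return None, ""


if __name__ == "__main__":
    # Hypothetical address; replace with one of the Nomad servers.
    code, body = probe("http://127.0.0.1:4646")
    ok = code is not None and status_in_range(code)
    print("healthy" if ok else "unhealthy", code, body[:200])
```

Running this in a loop from a server in the same subnet should show whether the sporadic failures are timeouts or non-2xx responses.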

This configuration worked with 0.8.6.

I see in the syslog that the server is attempting to reach Nomad servers in other regions, which it cannot connect to:

Jun 21 22:56:21 ip-10-132-12-215 nomad[1416]:     2019-06-21T22:56:21.511Z [INFO ] nomad: memberlist: Suspect ip-10-132-37-100.prod-us-west-2 has failed, no acks received
Jun 21 22:56:21 ip-10-132-12-215 nomad[1416]:     2019-06-21T22:56:21.851Z [ERROR] nomad: memberlist: Push/Pull with ip-10-132-53-227.prod-eu-west-2 failed: dial tcp 10.132.53.227:4648: i/o timeout
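
To see how many of these messages involve other regions, the log lines can be tallied with a short script. This is a rough sketch: it assumes each gossip member name has the form `<node>.<region>` (as the two lines above suggest) and only matches the two message shapes shown here. The function name `foreign_regions` is hypothetical.

```python
import re
from collections import Counter

# Matches the two memberlist message shapes seen in the log above and
# splits the member name at its last dot into (node, region).
MEMBER_RE = re.compile(r"memberlist: (?:Suspect|Push/Pull with) (\S+?)\.([\w-]+) ")


def foreign_regions(lines, local_region):
    """Count memberlist messages per region, skipping the local region."""
    counts = Counter()
    for line in lines:
        match = MEMBER_RE.search(line)
        if match and match.group(2) != local_region:
            counts[match.group(2)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "2019-06-21T22:56:21.511Z [INFO ] nomad: memberlist: Suspect "
        "ip-10-132-37-100.prod-us-west-2 has failed, no acks received",
        "2019-06-21T22:56:21.851Z [ERROR] nomad: memberlist: Push/Pull with "
        "ip-10-132-53-227.prod-eu-west-2 failed: dial tcp 10.132.53.227:4648: i/o timeout",
    ]
    print(foreign_regions(sample, "internal-us-west-2"))
```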

I think this might be what is causing the peers endpoint to fail. Is there a way to have it return only the peers within its own region (internal-us-west-2)? Is there a more appropriate healthcheck endpoint to use?

I’ve read in the docs that Nomad regions should be federated with all other regions, but we need to keep the different regions segmented.
