I recently upgraded our Nomad clusters from 0.8.6 to 0.9.2 and am seeing sporadic healthcheck failures from the Nomad servers. I’ve got the following healthcheck configured on an AWS ELB that drives an Auto-Scaling Group:
Healthy Threshold: 3
Unhealthy Threshold: 3
Timeout: 10 (seconds)
Interval: 60 (seconds)
Success codes: 200-308
This configuration worked with 0.8.6.
I see in the syslog that the server is attempting to connect to Nomad servers in other regions that it can’t reach:
Jun 21 22:56:21 ip-10-132-12-215 nomad: 2019-06-21T22:56:21.511Z [INFO ] nomad: memberlist: Suspect ip-10-132-37-100.prod-us-west-2 has failed, no acks received
Jun 21 22:56:21 ip-10-132-12-215 nomad: 2019-06-21T22:56:21.851Z [ERROR] nomad: memberlist: Push/Pull with ip-10-132-53-227.prod-eu-west-2 failed: dial tcp 10.132.53.227:4648: i/o timeout
I think this might be what is causing the peers endpoint to fail. Is there a way to have it return only the peers within its own region (internal-us-west-2)? Is there a more appropriate healthcheck endpoint to use?
I’ve read in the docs that Nomad regions should be federated to all other regions, but we need to keep the different regions segmented.
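For anyone who wants to reproduce the check outside the ELB, here is a minimal sketch of the same probe in Python. It assumes the check targets /v1/status/peers on Nomad's default HTTP port 4646 on the local server; the path and port are not part of the settings listed above, so adjust `DEFAULT_URL` to match. The names `probe` and `is_success` are just illustrative helpers, not anything provided by Nomad or AWS.

```python
import urllib.error
import urllib.request

# Mirrors the ELB settings listed above.
TIMEOUT_SECONDS = 10
SUCCESS_CODES = range(200, 309)  # 200-308 inclusive

# Assumption: default Nomad HTTP port and the peers endpoint on the local server.
DEFAULT_URL = "http://127.0.0.1:4646/v1/status/peers"


def is_success(status: int) -> bool:
    """Return True if the status code falls in the configured success range."""
    return status in SUCCESS_CODES


def probe(url: str = DEFAULT_URL, timeout: float = TIMEOUT_SECONDS) -> bool:
    """Issue one GET and report whether it would count as a healthy check.

    Note: unlike the ELB, urllib follows redirects, so a 301/302 is judged
    by the final response rather than by the redirect itself.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return is_success(resp.status)
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (e.g. 500).
        return is_success(err.code)
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout.
        return False
```

Calling `probe()` in a loop on one of the affected servers, and logging the timestamp of each failure, should show whether the failures line up with the memberlist timeouts in the syslog.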