We are using Consul to register Nomad services via the `service` and `check` stanzas.
After a service fails and moves to another node, the old (and failed) check ID still shows up in Consul.
I'm not sure whether it helps, but I read about the `DeregisterCriticalServiceAfter` parameter in the Consul check documentation.
I also tried my luck adding it to the `check` stanza, but unfortunately Consul supports it while Nomad does not:
```hcl
service {
  name = "test"
  port = 80

  check {
    name     = "alive"
    port     = "http"
    type     = "http"
    path     = "/"
    interval = "10s"
    timeout  = "2s"
    deregister_critical_service_after = "1m"
  }
}
```
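For comparison, when a service is registered with Consul directly (in an agent service definition file rather than through Nomad), the field is accepted. A minimal sketch, assuming a local service listening on port 80 (the name and address here are placeholders):

```hcl
# Consul agent service definition (e.g. loaded from the agent's
# config directory) -- NOT a Nomad jobspec. Here Consul itself
# honors deregister_critical_service_after.
service {
  name = "test"
  port = 80

  check {
    http     = "http://localhost:80/"
    interval = "10s"
    timeout  = "2s"
    deregister_critical_service_after = "1m"
  }
}
```

This is why the parameter "works in Consul but not Nomad": Nomad's `check` stanza simply does not pass the field through when it registers the service on your behalf.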
Hey @Dgotlieb, sorry for the late response on this. It fell through the cracks and I'm just now catching it.
This probably wouldn’t be super difficult to add technically, but I think it leads the user down a bad path. If the Consul service gets deregistered after failing, the Nomad jobspec is no longer representative of the reality in Consul, and becomes less “declarative”.
I think you've correctly identified an underlying bug which should probably be reported on its own: Nomad is not properly cleaning up old service health checks when a node moves. If that were fixed, am I right in thinking that you wouldn't need `DeregisterCriticalServiceAfter` support at all? Do you have any more info on how often this happens and/or repro steps? If so, a GitHub issue would be great, or I can open one for you with whatever info you provide.
I’m thinking that ideally the underlying issue in Nomad is fixed and then the jobspec can always represent reality.
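In the meantime, stale entries can be cleaned up out-of-band against Consul's HTTP API (`GET /v1/health/state/critical` to find critical checks, then `PUT /v1/agent/service/deregister/<service-id>` for the stale ones). A sketch of the selection logic, assuming the service-ID format Nomad uses when registering with Consul (`_nomad-task-<alloc-id>-...`, where the alloc ID is a UUID); the function name is mine and you should verify the ID format against your own cluster:

```python
# Identify Nomad-registered Consul services whose backing allocation
# no longer exists. Feed it the JSON entries from
# GET /v1/health/state/critical and the set of allocation IDs that
# Nomad still knows about; the returned service IDs are candidates
# for PUT /v1/agent/service/deregister/<service-id>.

NOMAD_PREFIX = "_nomad-task-"
UUID_LEN = 36  # 8-4-4-4-12 hex groups with dashes


def stale_nomad_service_ids(critical_checks, live_alloc_ids):
    stale = []
    for check in critical_checks:
        service_id = check.get("ServiceID", "")
        if not service_id.startswith(NOMAD_PREFIX):
            continue  # not a Nomad-managed service; leave it alone
        # The alloc ID is the UUID immediately after the prefix.
        alloc_id = service_id[len(NOMAD_PREFIX):][:UUID_LEN]
        if alloc_id not in live_alloc_ids:
            stale.append(service_id)
    return stale
```

This only selects IDs; the actual deregister call (and how you fetch live allocation IDs from Nomad) is left to the operator, since blindly deregistering checks that are merely flapping would be worse than the original problem.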