Cluster is unhealthy after weekly patching/reboot

sammy676776 · May 18, 2023, 10:17pm

My Nomad cluster comes up fine after weekly patching and reboot, but Consul does not.
The Consul agents all come up within 2-3 minutes, but then they go into a bad state and I have to restart them manually.

Observation: My Nomad cluster works fine and jobs run fine; however, I cannot get into the Consul UI, and when I run “systemctl status consul” it shows errors. Only a restart of all 3 nodes makes it normal.
Here is the error. I think the Consul agents shut themselves down since there is no cluster leader?

}. Err: connection error: desc = "transport: Error while dialing dial tcp <nil>->xxxxx:8303: operation was canceled"
2023-05-10T22:38:25.889-0400 [WARN]  agent: [core][Channel #1 SubChannel #6588] grpc: addrConn.createTransport failed to connect to {
  "Addr": "xxxx:8303",
  "ServerName": "xxxxxx",
  "Attributes": null,
  "BalancerAttributes": null,
  "Type": 0,
  "Metadata": null
}. Err: connection error: desc = "transport: Error while dialing dial tcp <nil>->xxxx:8303: operation was canceled"
2023-05-10T22:40:39.292-0400 [INFO]  agent: Deregistered service: service=_nomad-server-jox3pd5xjcasphhcxmrqk2tk5ymoieo7
2023-05-10T22:40:39.364-0400 [INFO]  agent: Caught: signal=interrupt
2023-05-10T22:40:39.364-0400 [INFO]  agent: Graceful shutdown disabled. Exiting
2023-05-10T22:40:39.364-0400 [INFO]  agent: Requesting shutdown
2023-05-10T22:40:39.385-0400 [INFO]  agent.server: shutting down server
2023-05-10T22:40:39.385-0400 [WARN]  agent.server.serf.lan: serf: Shutdown without a Leave
2023-05-10T22:40:39.388-0400 [WARN]  agent.server.serf.wan: serf: Shutdown without a Leave
2023-05-10T22:40:39.389-0400 [INFO]  agent.router.manager: shutting down
2023-05-10T22:40:39.389-0400 [INFO]  agent.router.manager: shutting down
2023-05-10T22:40:39.403-0400 [INFO]  agent: consul server down
2023-05-10T22:40:39.403-0400 [INFO]  agent: shutdown complete
2023-05-10T22:40:39.403-0400 [INFO]  agent: Stopping server: protocol=DNS address=0.0.0.0:8600 network=tcp
2023-05-10T22:40:39.403-0400 [WARN]  agent.cache: handling error in Cache.Notify: cache-type=connect-ca-root error="rpc error making call: EOF" index=9
2023-05-10T22:40:39.404-0400 [INFO]  agent: Stopping server: protocol=DNS address=0.0.0.0:8600 network=udp
2023-05-10T22:40:39.404-0400 [INFO]  agent: Stopping server: address=[::]:8501 network=tcp protocol=https
2023-05-10T22:40:39.404-0400 [WARN]  agent: Deregistering service failed.: service=_nomad-server-sqq5ryvwb6hlkab3ymogypudt7ckxbge error="No cluster leader"
2023-05-10T22:40:39.404-0400 [ERROR] agent: failed to sync changes: error="No cluster leader"
2023-05-10T22:40:39.404-0400 [INFO]  agent: Stopping server: address=127.0.0.1:8500 network=tcp protocol=http
2023-05-10T22:40:39.406-0400 [INFO]  agent: Waiting for endpoints to shut down
2023-05-10T22:40:39.406-0400 [INFO]  agent: Endpoints down
2023-05-10T22:40:39.406-0400 [INFO]  agent: Exit code: code=1

Any advice?

maxb · May 19, 2023, 8:48am

There is not enough information here to give much advice.

To me, the logging you have shown looks like nothing more than the Consul agent being shut down for the reboot, and never being restarted afterwards.

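No, Consul agents don’t shut themselves down because there is no cluster leader. The cause of the shutdown above is this externally delivered signal:

2023-05-10T22:40:39.364-0400 [INFO]  agent: Caught: signal=interrupt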

sammy676776 · May 23, 2023, 7:01pm

Thanks. After looking deeper into the issue, I see that Consul tries several times and then quits while the bare-metal host is still running some network scripts. I am going to add a dependency check to the systemd unit file so that it waits until the network and everything else is up before we start Nomad and Consul.
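
For anyone landing here later, one possible shape for such a dependency check is a small script that blocks until a peer server port accepts TCP connections. This is only a sketch, not what was actually deployed: the function name wait_for_tcp and its default timeout and interval are made up for illustration, and the peer address and port are whatever your own cluster uses (8303 is the port the agent failed to dial in the log above).

```python
import socket
import time


def wait_for_tcp(host, port, timeout=120.0, interval=2.0):
    """Return True once a TCP connection to host:port succeeds.

    Returns False if no connection could be made before `timeout`
    seconds have elapsed. Hypothetical helper, for illustration only.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means the network path to the peer is up.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Network not ready yet, or the peer is still booting; retry.
            time.sleep(interval)
    return False
```

A wrapper that exits non-zero when wait_for_tcp returns False could be run from an ExecStartPre= line in the Consul unit, so the agent only starts once a peer is reachable. The usual systemd-native complement is After=network-online.target together with Wants=network-online.target, though that only helps if the host's network scripts actually hold network-online.target back until they finish.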
