Nomad servers crash with a SIGSEGV panic within a few seconds of starting

Hi,

I’m having trouble with my 3 Nomad servers: each of them refuses to run for more than a few seconds before panicking. I recently started doing some testing with CSI volumes, which as far as I can tell is what’s causing the panic, as a CSI error is the last thing in the logs before the SIGSEGV signal is thrown.

My client nodes seem to be fine (aside from the fact that there are no active servers right now).

All 3 servers boot and then fail with very similar errors:

Sep 11 14:52:18 proxima-b nomad[547]:     2021-09-11T14:52:18.387+0200 [INFO]  nomad: adding server: server="proxima-b.global (Addr: 192.168.20.98:4647) (DC: proxima)"
Sep 11 14:52:18 proxima-b nomad[547]:     2021-09-11T14:52:18.390+0200 [INFO]  nomad: serf: EventMemberJoin: proxima-f.global 192.168.20.94
Sep 11 14:52:18 proxima-b nomad[547]:     2021-09-11T14:52:18.391+0200 [INFO]  nomad: serf: EventMemberJoin: proxima-e.global 192.168.20.95
Sep 11 14:52:18 proxima-b nomad[547]:     2021-09-11T14:52:18.391+0200 [INFO]  nomad: serf: Re-joined to previously known node: proxima-e.global: 192.168.20.95:4648
Sep 11 14:52:18 proxima-b nomad[547]:     2021-09-11T14:52:18.391+0200 [INFO]  nomad: adding server: server="proxima-f.global (Addr: 192.168.20.94:4647) (DC: proxima)"
Sep 11 14:52:18 proxima-b nomad[547]:     2021-09-11T14:52:18.392+0200 [INFO]  nomad: adding server: server="proxima-e.global (Addr: 192.168.20.95:4647) (DC: proxima)"
Sep 11 14:52:18 proxima-b nomad[547]:     2021-09-11T14:52:18.593+0200 [WARN]  nomad.raft: failed to get previous log: previous-index=484732 last-index=484718 error="log not found"
Sep 11 14:52:18 proxima-b nomad[547]:     2021-09-11T14:52:18.774+0200 [WARN]  nomad.raft: rejecting vote request since we have a leader: from=192.168.20.95:4647 leader=192.168.20.94:4647
Sep 11 14:52:18 proxima-b nomad[547]:     2021-09-11T14:52:18.845+0200 [WARN]  nomad.raft: failed to get previous log: previous-index=484734 last-index=484718 error="log not found"
Sep 11 14:52:19 proxima-b nomad[547]:     2021-09-11T14:52:19.136+0200 [ERROR] nomad.fsm: CSIVolumeClaim failed: error=unschedulable
Sep 11 14:52:19 proxima-b nomad[547]: panic: runtime error: invalid memory address or nil pointer dereference
Sep 11 14:52:19 proxima-b nomad[547]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x14 pc=0x702fa8]
Sep 11 14:52:19 proxima-b nomad[547]: goroutine 12 [running]:
Sep 11 14:52:19 proxima-b nomad[547]: github.com/hashicorp/go-immutable-radix.(*Iterator).Next(0x440f490, 0x0, 0x0, 0x0, 0x0, 0x47cfb64, 0x0)
Sep 11 14:52:19 proxima-b nomad[547]:         github.com/hashicorp/go-immutable-radix@v1.3.0/iter.go:178 +0x9c
Sep 11 14:52:19 proxima-b nomad[547]: github.com/hashicorp/go-memdb.(*radixIterator).Next(0x467fd30, 0x46cd040, 0x42ab1a0)
Sep 11 14:52:19 proxima-b nomad[547]:         github.com/hashicorp/go-memdb@v1.3.0/txn.go:895 +0x20
Sep 11 14:52:19 proxima-b nomad[547]: github.com/hashicorp/nomad/nomad/state.upsertNodeCSIPlugins(0x4612180, 0x44a8fd0, 0x7639f, 0x0, 0x440f3c0, 0x0)
Sep 11 14:52:19 proxima-b nomad[547]:         github.com/hashicorp/nomad/nomad/state/state_store.go:1251 +0x27c
Sep 11 14:52:19 proxima-b nomad[547]: github.com/hashicorp/nomad/nomad/state.upsertNodeTxn(0x4612180, 0x7639f, 0x0, 0x44a8fd0, 0x4612180, 0x4068a80)
Sep 11 14:52:19 proxima-b nomad[547]:         github.com/hashicorp/nomad/nomad/state/state_store.go:856 +0x500
Sep 11 14:52:19 proxima-b nomad[547]: github.com/hashicorp/nomad/nomad/state.(*StateStore).UpsertNode(0x40c5d70, 0x1500, 0x7639f, 0x0, 0x44a8fd0, 0x0, 0x0)
Sep 11 14:52:19 proxima-b nomad[547]:         github.com/hashicorp/nomad/nomad/state/state_store.go:804 +0x9c
Sep 11 14:52:19 proxima-b nomad[547]: github.com/hashicorp/nomad/nomad.(*nomadFSM).applyUpsertNode(0x3ca5c00, 0x76300, 0x4458801, 0x1557, 0x1557, 0x7639f, 0x0, 0x0, 0x0)
Sep 11 14:52:19 proxima-b nomad[547]:         github.com/hashicorp/nomad/nomad/fsm.go:352 +0x140
Sep 11 14:52:19 proxima-b nomad[547]: github.com/hashicorp/nomad/nomad.(*nomadFSM).Apply(0x3ca5c00, 0x43e1710, 0x6ba71202, 0x3)
Sep 11 14:52:19 proxima-b nomad[547]:         github.com/hashicorp/nomad/nomad/fsm.go:211 +0x190
Sep 11 14:52:19 proxima-b nomad[547]: github.com/hashicorp/raft.(*Raft).runFSM.func1(0x41b3dd0)
Sep 11 14:52:19 proxima-b nomad[547]:         github.com/hashicorp/raft@v1.1.3-0.20200211192230-365023de17e6/fsm.go:90 +0x204
Sep 11 14:52:19 proxima-b nomad[547]: github.com/hashicorp/raft.(*Raft).runFSM.func2(0x468a600, 0x40, 0x40)
Sep 11 14:52:19 proxima-b nomad[547]:         github.com/hashicorp/raft@v1.1.3-0.20200211192230-365023de17e6/fsm.go:113 +0x5c
Sep 11 14:52:19 proxima-b nomad[547]: github.com/hashicorp/raft.(*Raft).runFSM(0x4110800)
Sep 11 14:52:19 proxima-b nomad[547]:         github.com/hashicorp/raft@v1.1.3-0.20200211192230-365023de17e6/fsm.go:219 +0x27c
Sep 11 14:52:19 proxima-b nomad[547]: github.com/hashicorp/raft.(*raftState).goFunc.func1(0x4110800, 0x3cdf060)
Sep 11 14:52:19 proxima-b nomad[547]:         github.com/hashicorp/raft@v1.1.3-0.20200211192230-365023de17e6/state.go:146 +0x50

This then repeats over and over as systemd restarts the nomad server again.
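
In case the Go runtime message itself is unclear, here is a minimal, self-contained sketch (my own illustration, not Nomad’s actual code) of the class of failure the trace points at: a method is called through a pointer that was never initialised, the first field access dereferences nil, and the runtime aborts the process with exactly this "invalid memory address or nil pointer dereference" SIGSEGV.

package main

// Hypothetical iterator type, standing in for the memdb/radix iterator in the
// trace above; the names are illustrative only.
type iterator struct {
	items []string
	pos   int
}

// Next advances the iterator. If the receiver is nil, reading it.pos
// dereferences a nil pointer and panics the same way the server logs show.
func (it *iterator) Next() (string, bool) {
	if it.pos >= len(it.items) {
		return "", false
	}
	v := it.items[it.pos]
	it.pos++
	return v, true
}

func main() {
	var it *iterator // nil: an iterator that was never initialised
	it.Next()        // panic: runtime error: invalid memory address or nil pointer dereference
}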

I’m not really sure how to proceed here: all 3 of my servers are unable to stay up for more than a few seconds, so the cluster has gone down (luckily this is a homelab, so there are no mission-critical services running).

edit: this is Nomad v1.1.4, which I forgot to mention :+1:

Hi @CarbonCollins, and apologies you’re running into this problem. This certainly looks like a problem within Nomad that we should fix. To help reproduce it, would you be able to provide the configuration you are using for both the server and client agents? If you also have the time, could you please raise this as a bug against the Nomad repository, including your original post and any further information? This will help get visibility and help us roadmap and prioritise a fix.

Thanks,
jrasell and the Nomad team

I’ve raised the issue on GitHub; it can be found here: https://github.com/hashicorp/nomad/issues/11174


Thanks @CarbonCollins.