Nomad 1.0.2 facing Dimension Disk Exhausted on 3 nodes

anburethy · April 16, 2021, 6:21pm

Hi, I’ve got a cluster with 3 server nodes and 5 client nodes. Allocations were running fine, but sometimes I get the “Dimension Disk Exhausted on 3 nodes” error. I came across a Nomad bug report saying this happens because the Nomad client node uses a static fingerprint: it sends its resource stats to the Nomad servers only at startup. I would like to understand the design behind client node resource fingerprinting and how it works. Thanks.

lgfa29 · April 21, 2021, 12:11am

Hi @anburethy

I think the most common cause of this is if you are using the default values for logs. If you don’t specify a logs block in your task, it will reserve 100 MB of disk for each task (10 files x 10 MB per file).

Try specifying lower values and see if it helps.
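
For illustration, a smaller log reservation could look like the sketch below. The logs block and its max_files and max_file_size parameters are part of the Nomad job specification; the job, group, and task names, the driver, the image, and the chosen values are placeholders and not taken from this thread.

```hcl
# Hypothetical job; only the logs block is the point of this sketch.
job "example" {
  datacenters = ["dc1"]

  group "app" {
    task "server" {
      driver = "docker"

      config {
        image = "redis:6" # placeholder image
      }

      # Defaults are max_files = 10 and max_file_size = 10 (MB),
      # i.e. 100 MB of log storage per task.
      logs {
        max_files     = 3 # keep at most 3 rotated files
        max_file_size = 5 # each file up to 5 MB
      }
    }
  }
}
```

With these values the task would keep at most 3 files x 5 MB = 15 MB of logs instead of the default 100 MB.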

Your observation is correct. When a Nomad client starts, it fingerprints the host it’s running on and sends this information to the servers so they can make scheduling decisions. You can see what kind of data is fingerprinted from the UI (Inspect the Cluster | Nomad | HashiCorp Developer).

Fingerprinting is only done when clients start, to avoid constant rescheduling. When something changes in your cluster (a client is added or goes down), the servers need to reassess their scheduling decisions and potentially move things around. This is an expensive operation, so it’s only done when required. Continuous fingerprinting would cause this reassessment to happen all the time.

anburethy · April 22, 2021, 10:52am

Thanks for your reply @lgfa29

> I think the most common cause of this is if you are using the default values for logs. If you don’t specify a logs block in your task, it will reserve 100 MB of disk for each task (10 files x 10 MB per file).
>
> Try specifying lower values and see if it helps.

My concern here is that this disk allocation issue happens intermittently, and I have verified that there is enough space on each client node. Typically, when this error pops up, I drain the node and then try deploying again, and it works. I feel that the Nomad server doesn’t know the real status of the client nodes, and that’s why it thinks there aren’t enough resources on the client node when it evaluates the job. So this doesn’t seem to be an actual disk space shortage but rather a miscalculation on the server side. We need a fix for this.

lgfa29 · April 22, 2021, 1:27pm

Hum…that’s interesting. Maybe the fingerprint is not reading the right disk? Could you check with the nomad node status command whether the disk output matches what you would expect to see on all your client nodes?

Also, if you could provide any server logs when this happens again it could be helpful.

Thanks!

EDIT: another job setting that you might want to adjust is the ephemeral_disk size.
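
As a rough sketch of that setting: ephemeral_disk is a group-level block in the job specification, and its size parameter is given in MB (300 by default). Everything else below (job, group, and task names, driver, image, and the chosen numbers) is a placeholder for illustration. As far as I know, task logs are stored in the allocation directory and count against this disk, so the sketch also lowers the log settings to stay well under the requested size.

```hcl
# Hypothetical job; the ephemeral_disk block is the point of this sketch.
job "example" {
  datacenters = ["dc1"]

  group "app" {
    # Disk requested for the whole allocation, in MB (default is 300).
    ephemeral_disk {
      size = 100
    }

    task "server" {
      driver = "docker"

      config {
        image = "redis:6" # placeholder image
      }

      # 3 files x 5 MB = 15 MB of logs, below the 100 MB requested above.
      logs {
        max_files     = 3
        max_file_size = 5
      }
    }
  }
}
```

A smaller size here means each allocation asks the scheduler for less disk on a client node, which may make placements succeed more often when the fingerprinted disk capacity is tight.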
