Nomad 1.0.2 facing Dimension Disk Exhausted on 3 nodes

Hi, I’ve got a 3-node server cluster with 5 client nodes. Allocations were running fine, but sometimes I get the “Dimension Disk Exhausted on 3 nodes” error. I came across a Nomad bug report saying this is because the Nomad client uses static fingerprinting, which sends the client node’s resource stats to the Nomad servers only at startup, hence this issue. I would like to understand the design behind client node resource fingerprinting and how it works. Thanks.

Hi @anburethy :wave:

I think the most common cause of this is if you are using the default values for logs. If you don’t specify a logs block in your task, it will reserve 100 MB of disk for each task (10 files x 10 MB per file).

Try specifying lower values and see if it helps.
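For example, a logs block along these lines should lower the per-task reservation (the task name, driver, image, and values below are just placeholders for illustration):

```hcl
task "example" {
  driver = "docker"

  config {
    image = "nginx:1.25"
  }

  # Override the defaults of 10 files x 10 MB (100 MB reserved per task)
  # with 2 files x 10 MB, i.e. 20 MB reserved per task.
  logs {
    max_files     = 2
    max_file_size = 10
  }
}
```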

Your observation is correct. When a Nomad client starts, it fingerprints the host it’s running on and sends this information to the servers so they can make scheduling decisions. You can see what kind of data is fingerprinted from the UI (Inspect the Cluster | Nomad | HashiCorp Developer).

Fingerprinting is only done when clients start in order to avoid constant rescheduling. When something changes in your cluster (a client is added, or goes down), the servers need to re-assess their scheduling decisions and potentially move things around. This is an expensive operation, so it’s only done when required. Continuous fingerprinting would cause this reassessment to happen all the time.


Thanks for your reply @lgfa29

I think the most common cause of this is if you are using the default values for logs. If you don’t specify a logs block in your task, it will reserve 100 MB of disk for each task (10 files x 10 MB per file).

Try specifying lower values and see if it helps.

My concern here is that this disk allocation issue happens intermittently, and I have also verified that there is enough space on each client node. Usually, when this error pops up, I drain the node, try deploying again, and it works. I feel that the Nomad servers don’t know the real status of the client nodes, and that’s why they think there isn’t enough of the resource on the client node during evaluation. So this doesn’t seem to be an actual disk space shortage; rather, it’s a miscalculation on the server side. We need a fix in this aspect.

Hmm…that’s interesting. Maybe the fingerprint is not reading the right disk? Could you check with the nomad node status command whether the disk output matches what you would expect to see on all your client nodes?

Also, if you could provide any server logs when this happens again it could be helpful.

Thanks!

EDIT: another job setting that you might want to adjust is the ephemeral_disk size.
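For example, a smaller ephemeral_disk block at the group level might look like this (the group name and size here are just placeholders; the default size is 300 MB):

```hcl
group "example" {
  # Request less scratch space per allocation than the 300 MB default.
  ephemeral_disk {
    size = 150 # MB
  }

  # ... tasks omitted ...
}
```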