I have a service that requires a DB to be up and running. As they are intrinsically related, I am trying to keep them together in the same job.
However, if I put them in the same group as separate tasks, I have observed the following:
1. The db starts coming up.
2. The service comes up and attempts to connect to the db.
3. The connection fails, and the service crashes.
4. Nomad notes the failure and terminates both tasks in the group to try again.
Obviously, the second attempt has the same result, as the DB now must go through initialization once again.
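One common workaround for the startup race described above is to have the service wait for the database port instead of crashing on the first failed connection. The following is a minimal Python sketch of that idea, not a Nomad feature. The function name `wait_for_tcp` and its parameters are hypothetical, and a successful TCP connect only shows that the port is accepting connections, not that the database has finished initializing.

```python
import socket
import time


def wait_for_tcp(host, port, timeout=60.0, interval=1.0):
    """Return True once host:port accepts a TCP connection, False on timeout.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError while nothing is listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A service could call something like `wait_for_tcp(db_host, db_port)` before opening its real database connection and exit only if it returns False, so a slow database initialization does not immediately crash the service and restart the whole group.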
So I tried splitting it into two task groups. However, whichever group I place first is allocated and executed, and the job apparently waits for that group to finish before starting the next one. So even if I put the DB first, the service is never allocated and thus never comes up.
This seems like such a common use case that I assume I am missing something obvious. What is the recommended approach for this in Nomad?
I split the job into two separate jobs, and one of the two is still not being allocated. So it’s possible there is some underlying issue which is causing confusion about groups and tasks.
The output of the eval status is:
host# nomad eval status dc4cb8eb
ID = dc4cb8eb
Create Time = 13h21m ago
Modify Time = 13h21m ago
Status = blocked
Status Description = created to place remaining allocations
Type = system
TriggeredBy = queued-allocs
Priority = 50
Placement Failures = N/A - In Progress
Task Group "test" (failed to place 1 allocation):
Unfortunately, I am not sure how to proceed to debug this.
I am not specifying any resources in the task definition; I was not aware of that requirement. Do you have more info on it?
I found the cause of that particular problem: I had not removed the db port from the network stanza of my second job. The second job therefore could not be allocated, since the port was already in use by the first job. The reporting could be improved here, since there is currently no information to help track down the problem.
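The kind of conflict described above can be illustrated with a small Python sketch. It is only a model: the dictionary below is a hypothetical, simplified stand-in for job definitions (not Nomad's job format or API), the port numbers are assumed examples, and it assumes both jobs would be placed on the same node.

```python
def find_port_conflicts(jobs):
    """Map each static port claimed by more than one job to the claiming job names.

    `jobs` is a hypothetical structure: {job_name: [static_port, ...]}.
    """
    claims = {}
    for name, ports in jobs.items():
        for port in set(ports):  # ignore duplicates within a single job
            claims.setdefault(port, []).append(name)
    return {port: sorted(names) for port, names in claims.items() if len(names) > 1}


# Assumed example: the service job still lists the db port it was copied with.
jobs = {
    "db": [5432],
    "service": [5432, 8080],
}
print(find_port_conflicts(jobs))  # prints {5432: ['db', 'service']}
```

Running a check like this over the ports each job requests would have flagged the leftover db port before submitting the second job.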
Unfortunately, it does not move me closer to an understanding of why the groups are behaving the way they are.
But I had placed it at the job level, not the group level, and it was preventing the allocation from being deployed. Unfortunately, as with the ports, there does not seem to be much visibility into why a decision like that is being made, so silly errors like mine must be solved through experimentation and thinking really hard, rather than debug information.