Allocation lost after client restart

sevensolutions · January 11, 2024, 9:47am

While working on my IIS plugin, I’m seeing some behavior I didn’t expect.
I follow this procedure:

I’m running a server and a client agent
I schedule a simple IIS job and wait for it to become healthy
I stop the client agent
The alloc gets lost
I restart the client agent
The job gets recovered

But now the strange thing happens:
The job gets recovered successfully, but immediately after recovering, it gets stopped and rescheduled.

I also tried to disable rescheduling completely via

reschedule {
    attempts  = 0
    unlimited = false
}

but this also doesn’t help.

From my plugin logs I can see:

10:29:01 [] INF NomadIIS.Services.Grpc.DriverService: Recovering task e0d01b3f-7991-8676-8e8b-fffa8e9fb168/iis-test/550e0870 (Alloc: e0d01b3f-7991-8676-8e8b-fffa8e9fb168)...
10:29:01 [] INF NomadIIS.Services.Grpc.DriverService: Recovered task e0d01b3f-7991-8676-8e8b-fffa8e9fb168/iis-test/550e0870 from state: Website: nomad-e0d01b3f-7991-8676-8e8b-fffa8e9fb168-iis-test, AppPool: nomad-e0d01b3f-7991-8676-8e8b-fffa8e9fb168-iis-test, StartDate: 11.01.2024 10:27:31
10:29:01 [] INF NomadIIS.Services.Grpc.DriverService: Stopping task e0d01b3f-7991-8676-8e8b-fffa8e9fb168/iis-test/550e0870 (Alloc: e0d01b3f-7991-8676-8e8b-fffa8e9fb168)...
10:29:01 [] INF NomadIIS.Services.Grpc.DriverService: Starting task e2a30607-3405-a31f-66c6-85850cfa9560/iis-test/e663bc93 (Alloc: e2a30607-3405-a31f-66c6-85850cfa9560)...
10:29:01 [] INF NomadIIS.Services.Grpc.DriverService: Task e2a30607-3405-a31f-66c6-85850cfa9560/iis-test/e663bc93: Creating AppPool with name nomad-e2a30607-3405-a31f-66c6-85850cfa9560-iis-test...
10:29:01 [] INF NomadIIS.Services.Grpc.DriverService: Task e2a30607-3405-a31f-66c6-85850cfa9560/iis-test/e663bc93: Creating Website with name nomad-e2a30607-3405-a31f-66c6-85850cfa9560-iis-test...

Does anyone have an idea why this is happening?
Is this a bug or intended behavior?

sevensolutions · January 11, 2024, 10:55am

I also tried switching on the “RemoteTaskDriver” mode, but that doesn’t work either because of this issue: Remote Task Driver fails to propagate task handles when no clients are immediately available · Issue #10592 · hashicorp/nomad · GitHub.

In theory I could base the Website and AppPool names on the job/task name instead of the alloc ID, but the main problem is that the new alloc also gets a new network port assigned.

sevensolutions · January 11, 2024, 11:24am

I think I found a solution. There’s a job setting called prevent_reschedule_on_lost.

This will mark the alloc as “unknown” and reconnect it once the client comes back.
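
For anyone landing here later, a minimal sketch of where the setting might go. This assumes prevent_reschedule_on_lost is set on the group block (as far as I understand the Nomad docs) and that the Nomad version in use supports it. The job, group, and task names and the datacenter are placeholders, and the driver name "iis" and its config are assumptions about the plugin, so adjust them to your setup:

```hcl
# Sketch only: names below are illustrative, not taken from a real job file.
job "iis-test" {
  datacenters = ["dc1"] # placeholder datacenter

  group "iis-test" {
    # Assumption: the setting lives at the group level. When the client
    # stops heartbeating, the alloc is marked "unknown" instead of being
    # replaced, and it reconnects once the client comes back.
    prevent_reschedule_on_lost = true

    task "iis-test" {
      driver = "iis" # assumed name of the IIS task driver plugin

      config {
        # driver-specific settings omitted
      }
    }
  }
}
```

Because no replacement alloc is created, the recovered task should keep its original alloc ID and its assigned network port.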