Inconsistent nomad plan with a CSI volume plugin

Occasionally, when using a CSI storage plugin, I get inconsistent nomad job plan output from back-to-back runs against the same job file.

I think there’s some latency or leftover state in the plan check that isn’t being cleaned up between independent plan runs. I know the storage features are in beta; I just thought I’d post about it.

> nomad job plan tt-rss.nomad
+ Job: "tt-rss"
+ Task Group: "tt-rss" (1 create)
  + Task: "pgsql-database" (forces create)
  + Task: "ttrss-app" (forces create)
  + Task: "ttrss-app-updater" (forces create)

+ Task Group: "web" (1 create/destroy update)
  + Task: "connect-proxy-ttrss-web-nginx" (forces create)
  + Task: "web-nginx" (forces create)

Scheduler dry-run:
- All tasks successfully allocated.

Job Modify Index: 0
To submit the job with version verification run:

nomad job run -check-index 0 tt-rss.nomad

When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.
> nomad job plan tt-rss.nomad
+ Job: "tt-rss"
+ Task Group: "tt-rss" (1 create)
  + Task: "pgsql-database" (forces create)
  + Task: "ttrss-app" (forces create)
  + Task: "ttrss-app-updater" (forces create)

+ Task Group: "web" (1 create/destroy update)
  + Task: "connect-proxy-ttrss-web-nginx" (forces create)
  + Task: "web-nginx" (forces create)

Scheduler dry-run:
- WARNING: Failed to place all allocations.
  Task Group "tt-rss" (failed to place 1 allocation):
    * Constraint "${attr.cpu.arch} = amd64": 1 nodes excluded by filter
    * Constraint "CSI volume ttrss-database has exhausted its available writer claims": 1 nodes excluded by filter

Job Modify Index: 0
To submit the job with version verification run:

nomad job run -check-index 0 tt-rss.nomad

When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.
> nomad job plan tt-rss.nomad
+ Job: "tt-rss"
+ Task Group: "tt-rss" (1 create)
  + Task: "pgsql-database" (forces create)
  + Task: "ttrss-app" (forces create)
  + Task: "ttrss-app-updater" (forces create)

+ Task Group: "web" (1 create/destroy update)
  + Task: "connect-proxy-ttrss-web-nginx" (forces create)
  + Task: "web-nginx" (forces create)

Scheduler dry-run:
- WARNING: Failed to place all allocations.
  Task Group "tt-rss" (failed to place 1 allocation):
    * Constraint "${attr.cpu.arch} = amd64": 1 nodes excluded by filter
    * Constraint "CSI volume ttrss-database has exhausted its available writer claims": 1 nodes excluded by filter

Job Modify Index: 0
To submit the job with version verification run:

nomad job run -check-index 0 tt-rss.nomad

When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.
> nomad job plan tt-rss.nomad
+ Job: "tt-rss"
+ Task Group: "tt-rss" (1 create)
  + Task: "pgsql-database" (forces create)
  + Task: "ttrss-app" (forces create)
  + Task: "ttrss-app-updater" (forces create)

+ Task Group: "web" (1 create/destroy update)
  + Task: "connect-proxy-ttrss-web-nginx" (forces create)
  + Task: "web-nginx" (forces create)

Scheduler dry-run:
- WARNING: Failed to place all allocations.
  Task Group "tt-rss" (failed to place 1 allocation):
    * Constraint "CSI volume ttrss-database has exhausted its available writer claims": 1 nodes excluded by filter
    * Constraint "${attr.cpu.arch} = amd64": 1 nodes excluded by filter

Job Modify Index: 0
To submit the job with version verification run:

nomad job run -check-index 0 tt-rss.nomad

When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.
> nomad job plan tt-rss.nomad
+ Job: "tt-rss"
+ Task Group: "tt-rss" (1 create)
  + Task: "pgsql-database" (forces create)
  + Task: "ttrss-app" (forces create)
  + Task: "ttrss-app-updater" (forces create)

+ Task Group: "web" (1 create/destroy update)
  + Task: "connect-proxy-ttrss-web-nginx" (forces create)
  + Task: "web-nginx" (forces create)

Scheduler dry-run:
- All tasks successfully allocated.

Job Modify Index: 0
To submit the job with version verification run:

nomad job run -check-index 0 tt-rss.nomad

When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.
>

Hi @BeepDog!

“I think there’s some latency, or state stuff happening when it does the plan check that isn’t quite cleaning up between independent plan runs.”

The CSI volume claim has to be cleaned up after the allocation has been stopped, and we can only give the claim back to the scheduler once that operation is done. For some storage plugins (e.g., AWS EBS), this can take quite a while, because we have to detach the volume and the cloud provider API can take tens of seconds to do so.
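As a toy model of that ordering (this is not Nomad's actual code; the class and method names are made up purely for illustration), a single-writer volume only becomes placeable again once the detach has finished, not when the allocation stops:

```python
class SingleWriterVolume:
    """Toy model of a volume that allows one writer claim at a time.

    Hypothetical illustration only; it does not mirror Nomad's internals.
    """

    def __init__(self, volume_id: str) -> None:
        self.volume_id = volume_id
        self.writer = None       # allocation ID currently holding the claim
        self.detaching = False   # True while the plugin is still detaching

    def has_free_writer_claim(self) -> bool:
        """The scheduler can place a writer only when nobody holds the claim."""
        return self.writer is None

    def claim(self, alloc_id: str) -> None:
        """Take the writer claim, or fail if it is already held."""
        if not self.has_free_writer_claim():
            raise RuntimeError(
                f"CSI volume {self.volume_id} has exhausted its available writer claims"
            )
        self.writer = alloc_id

    def stop_allocation(self, alloc_id: str) -> None:
        """Stopping the allocation starts the detach but keeps the claim held."""
        if self.writer == alloc_id:
            self.detaching = True

    def detach_finished(self) -> None:
        """Only after the detach completes does the claim become free again."""
        if self.detaching:
            self.writer = None
            self.detaching = False
```

In this model, a placement attempted between stop_allocation() and detach_finished() fails with the same "exhausted its available writer claims" message, which is the window described above.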

Even during a plan, when I haven’t allocated anything yet?

Oh, sorry. So you’re running nomad job plan and then nomad job plan again without having ever run the job? In that case there definitely should not have been any volume claims made! Can you open a GitHub issue with the Nomad version and a job spec?
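If it helps when writing up the issue, here is a rough sketch of a script that plans the same job several times and counts how often each result shows up. It assumes the nomad binary is on your PATH and that the two marker strings match the dry-run output pasted above; the function names are hypothetical helpers, not part of any Nomad tooling.

```python
import subprocess
from collections import Counter

# Marker strings copied from the dry-run output earlier in this thread.
FAILED_MARKER = "Failed to place all allocations"
PLACED_MARKER = "All tasks successfully allocated"


def classify_plan_output(text: str) -> str:
    """Label one `nomad job plan` output as 'failed', 'placed', or 'unknown'."""
    if FAILED_MARKER in text:
        return "failed"
    if PLACED_MARKER in text:
        return "placed"
    return "unknown"


def run_plan(job_file: str) -> str:
    """Run `nomad job plan <job_file>` once and return stdout plus stderr."""
    result = subprocess.run(
        ["nomad", "job", "plan", job_file],
        capture_output=True,
        text=True,
        check=False,  # plan exits non-zero when it would create allocations
    )
    return result.stdout + result.stderr


def tally_plans(job_file: str, runs: int = 10, runner=run_plan) -> Counter:
    """Plan the same job `runs` times and count each outcome."""
    return Counter(classify_plan_output(runner(job_file)) for _ in range(runs))
```

Calling tally_plans("tt-rss.nomad") on a job that has never been run should, in principle, report only "placed" results; a mix of "placed" and "failed" is the inconsistency described in this thread. The runner argument exists only so the counting logic can be exercised without a cluster.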
