Nomad job allocs frequently fail intermittently with "failed to create an alloc vault token"

Hi all,

We recently ran into an issue with our Nomad + Vault integration, roughly one month after we started running Nomad with Vault.

From a high-level infrastructure overview: we have a centralized Vault server in one AWS region (us-west-2), and two Nomad clusters, one in us-west-2 and one in us-east-1.

Nomad is integrated with Vault for secret management.

Recently we started seeing intermittent errors like the following in our Nomad server logs:

Oct 19 12:44:20 ip-10-12-67-130 nomad[16176]:     2022-10-19T12:44:20.742Z [WARN]  nomad.vault: failed to revoke tokens. Will reattempt until TTL:
Oct 19 12:44:20 ip-10-12-67-130 nomad[16176]:   error=
Oct 19 12:44:20 ip-10-12-67-130 nomad[16176]:   | failed to revoke token (alloc: "c51853ba-0dc2-84da-3322-23157d3a5bea", node: "03fc3339-baac-9021-848d-43e7e977d3c2", task: "armada-worker"): Error making API request.
Oct 19 12:44:20 ip-10-12-67-130 nomad[16176]:   |
Oct 19 12:44:20 ip-10-12-67-130 nomad[16176]:   | URL: POST https://vault-server-us-west-2-88002.development.armada.accelbyte.io:8200/v1/auth/token/revoke-accessor
Oct 19 12:44:20 ip-10-12-67-130 nomad[16176]:   | Code: 403. Errors:
Oct 19 12:44:20 ip-10-12-67-130 nomad[16176]:   |
Oct 19 12:44:20 ip-10-12-67-130 nomad[16176]:   | * permission denied
Oct 19 12:44:20 ip-10-12-67-130 nomad[16176]:   
Oct 19 12:44:21 ip-10-12-67-130 nomad[16176]:     2022-10-19T12:44:21.070Z [WARN]  nomad.vault: failed to revoke tokens. Will reattempt until TTL:
Oct 19 12:44:21 ip-10-12-67-130 nomad[16176]:   error=
Oct 19 12:44:21 ip-10-12-67-130 nomad[16176]:   | failed to revoke token (alloc: "c51853ba-0dc2-84da-3322-23157d3a5bea", node: "03fc3339-baac-9021-848d-43e7e977d3c2", task: "armada-worker"): Error making API request.
Oct 19 12:44:21 ip-10-12-67-130 nomad[16176]:   |
Oct 19 12:44:21 ip-10-12-67-130 nomad[16176]:   | URL: POST https://vault-server-us-west-2-88002.development.armada.accelbyte.io:8200/v1/auth/token/revoke-accessor
Oct 19 12:44:21 ip-10-12-67-130 nomad[16176]:   | Code: 403. Errors:
Oct 19 12:44:21 ip-10-12-67-130 nomad[16176]:   |
Oct 19 12:44:21 ip-10-12-67-130 nomad[16176]:   | * permission denied
Oct 19 12:44:21 ip-10-12-67-130 nomad[16176]:   
Oct 19 12:44:37 ip-10-12-67-130 nomad[16176]:     2022-10-19T12:44:37.071Z [ERROR] nomad.client: Vault token creation for alloc failed: alloc_id=cba883f3-4804-4161-43c6-3ba8ed526ae8
Oct 19 12:44:37 ip-10-12-67-130 nomad[16176]:   error=
Oct 19 12:44:37 ip-10-12-67-130 nomad[16176]:   | failed to create an alloc vault token: Error making API request.
Oct 19 12:44:37 ip-10-12-67-130 nomad[16176]:   |
Oct 19 12:44:37 ip-10-12-67-130 nomad[16176]:   | URL: POST https://vault-server-us-west-2-88002.development.armada.accelbyte.io:8200/v1/auth/token/create/nomad-server-aws-us-east-1-001
Oct 19 12:44:37 ip-10-12-67-130 nomad[16176]:   | Code: 403. Errors:
Oct 19 12:44:37 ip-10-12-67-130 nomad[16176]:   |
Oct 19 12:44:37 ip-10-12-67-130 nomad[16176]:   | * permission denied
Oct 19 12:44:37 ip-10-12-67-130 nomad[16176]:   

Nomad job alloc events:

Oct 19, '22 21:12:32 +0700	Template	Missing: vault.read(aws/creds/nomad-autoscaler), vault.read(secrets/data/nomad/management-token), vault.write(pki_intermediate/issue/nomad-cli -> b07ed58d)
Oct 19, '22 21:12:29 +0700	Alloc Unhealthy	Unhealthy because of failed task
Oct 19, '22 21:12:29 +0700	Killing	Vault: server failed to derive vault token: failed to create an alloc vault token: Error making API request. URL: POST https://vault-server-us-west-2-88002.example.io:8200/v1/auth/token/create/nomad-server-aws-us-east-1-001 Code: 403. Errors: * permission denied

The problem above disappears when we renew the token, update the token specified in the config file, and then send SIGHUP to the Nomad process on the instance.
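For reference, the manual workaround looks roughly like this (the config file path /etc/nomad.d/vault.hcl is an assumption for illustration; the role name is the one from our setup, and GNU sed is assumed):

```shell
# Mint a fresh server token from the same token role
NEW_TOKEN=$(vault token create -role=nomad-server-aws-us-west-2-001 -field=token)

# Swap the token value in the Nomad server's vault config
# (path and file layout are assumptions; adjust for your environment)
sudo sed -i "s/^\(\s*token\s*=\s*\).*/\1\"${NEW_TOKEN}\"/" /etc/nomad.d/vault.hcl

# SIGHUP makes Nomad reload the config, including the new Vault token,
# without a full restart
sudo pkill -HUP nomad
```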

We hit this very frequently, roughly every 30-45 minutes.

vault block config example for the Nomad us-west-2 region:

...
vault {
  enabled = true
  address = "https://vault-server-us-west-2-88002.example.io:8200"

  ca_file         = "/opt/vault/tls/ca.crt"
  cert_file       = "/opt/vault/tls/tls.crt"
  key_file        = "/opt/vault/tls/tls.key"
  tls_server_name = "vault"

  allow_unauthenticated = true
  create_from_role      = "nomad-server-aws-us-west-2-001"
  token                 = "hvs.XXXXXZZZZZZZZ"
}
...

vault policy example for nomad us-west-2 region:

path "secrets/data/nomad-server/aws/us-west-2/001/*" {
  capabilities = [ "create", "read" , "update" ]
}

path "secrets/data/nomad/*" {
  capabilities = [ "create", "read" , "update" ]
}

# Allow creating tokens under "nomad-server-aws-us-west-2-001" role. The role name should be
# updated if "nomad-server-aws-us-west-2-001" is not used.
path "auth/token/create/nomad-server-aws-us-west-2-001" {
  capabilities = ["create", "update"]
}

# Allow looking up "nomad-server-aws-us-west-2-001" role. The role name should be updated if
# "nomad-server-aws-us-west-2-001" is not used.
path "auth/token/roles/nomad-server-aws-us-west-2-001" {
  capabilities = ["read"]
}

# Allow looking up the token passed to Nomad to validate the token has the
# proper capabilities. This is provided by the "default" policy.
path "auth/token/lookup-self" {
  capabilities = ["read"]
}

# Allow looking up incoming tokens to validate they have permissions to access
# the tokens they are requesting. This is only required if
# `allow_unauthenticated` is set to false.
path "auth/token/lookup" {
  capabilities = ["update"]
}

# Allow revoking tokens that should no longer exist. This allows revoking
# tokens for dead tasks.
path "auth/token/revoke-accessor" {
  capabilities = ["update"]
}

# Allow checking the capabilities of our own token. This is used to validate the
# token upon startup.
path "sys/capabilities-self" {
  capabilities = ["update"]
}

# Allow our own token to be renewed.
path "auth/token/renew-self" {
  capabilities = ["update"]
}

We already use separate Vault token roles per Nomad server region, as shown above:

  • us-west-2 : nomad-server-us-west-2-88002-001
  • us-east-1 : nomad-server-us-east-1-88002-001

vault token roles, example output :

❯ vault read auth/token/roles/nomad-server-aws-us-east-1-001
Key                         Value
---                         -----
allowed_entity_aliases      []
allowed_policies            []
allowed_policies_glob       []
disallowed_policies         []
disallowed_policies_glob    [*consul-server* *nomad-server*]
explicit_max_ttl            0s
name                        nomad-server-aws-us-east-1-001
orphan                      true
path_suffix                 n/a
period                      0s
renewable                   true
token_explicit_max_ttl      0s
token_no_default_policy     false
token_period                72h
token_type                  default-service

❯ vault read auth/token/roles/nomad-server-aws-us-west-2-001
Key                         Value
---                         -----
allowed_entity_aliases      []
allowed_policies            []
allowed_policies_glob       []
disallowed_policies         []
disallowed_policies_glob    [*consul-server* *nomad-server*]
explicit_max_ttl            0s
name                        nomad-server-aws-us-west-2-001
orphan                      true
path_suffix                 n/a
period                      0s
renewable                   true
token_explicit_max_ttl      0s
token_no_default_policy     false
token_period                72h
token_type                  default-service

Terraform code that handles token creation:

resource "vault_policy" "nomad_server" {
  name = format("%s-%s-%s-%s", var.service, var.provider_id, local.aws_region, var.identifier_id)

  policy = templatefile(format("%s/manifest/nomad-server-vault-policy.hcl.tmpl", path.module), {
    vault_kv2_path          = var.nomad_server_config.vault_kv2_path
    service                 = var.service
    provider_id             = var.provider_id
    region                  = local.aws_region
    identifier_id           = var.identifier_id
    nomad_server_vault_role = vault_token_auth_backend_role.nomad_server.role_name
  })
}

resource "vault_token_auth_backend_role" "nomad_server" {
  role_name    = format("%s-%s-%s-%s", var.service, var.provider_id, local.aws_region, var.identifier_id)
  orphan       = true
  token_period = "259200"
  renewable    = true
  disallowed_policies_glob = [
    "*nomad-server*",
    "*consul-server*",
  ]
  token_explicit_max_ttl = 0
}

resource "vault_token" "nomad_server" {
  policies = flatten(concat(
    var.nomad_server_config.additional_vault_token_policies,
    [vault_policy.nomad_server.name],
    ["default"],
  ))

  renewable = true
  period    = "72h"
  metadata = {
    "purpose" = "nomad-server-token"
  }
}
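Since the Terraform-created token is a periodic token (period = 72h), it must be renewed within every period or it expires, which would match the 403s appearing after a while. A quick way to check how much life the configured token has left (jq is assumed to be installed; the placeholder token is from the config above):

```shell
# Inspect the configured server token: remaining TTL, period, policies.
# A ttl near 0 on a period token means renewal stopped happening.
VAULT_TOKEN="hvs.XXXXXZZZZZZZZ" vault token lookup -format=json \
  | jq '.data | {ttl, period, renewable, policies}'
```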

Can anyone help point out what's wrong with our setup?
Any suggested solution is appreciated. It worked fine for almost a month, but now we keep hitting this issue.

Does using a Nomad management token in our job stanza affect this behavior?

Did you find a solution? I'm having the same issue.