All jobs failing after upgrade from 1.9.7 to 1.10.2

Hi All,
I upgraded my Nomad from 1.9.7 to 1.10.2 and now all my jobs are failing with this error:

Task received by client
Sent interrupt. Waiting 5s before force killing

My Docker and exec drivers are detected, and Nomad and Consul seem to be working just fine.
I don't even know where to start looking for potential errors.
I would like to know if there are any config changes I need to make.
My Docker version is 28.3.0.

Here is my nomad.hcl on all RHEL 9 clients

datacenter = "dev"
data_dir  = "/opt/nomad/data"
name      = "server2"
bind_addr = "0.0.0.0"

client {
  enabled = true
  servers = ["server1:4647", "server2:4647", "server3:4647"]
  node_pool = "rhel-9x"
  meta {
    environment  = "dev"
    os           = "linux"
    consul_token = "<CONSUL_TOKEN>"
  }
}

server {
  enabled = false
}


advertise {
  http  = "{{ GetInterfaceIP \"ens192\" }}"
  rpc   = "{{ GetInterfaceIP \"ens192\" }}"
  serf  = "{{ GetIntrfaceIP \"ens192\" }}"
}


consul {
  address                = "localhost:8500"
  server_service_name    = "nomad"
  client_service_name    = "nomad-client"
  checks_use_advertise   = true
  auto_advertise         = true
  server_auto_join       = true
  client_auto_join       = true
  token                  = "<CONSUL_TOKEN>"
  allow_unauthenticated  = false
  ssl                    = false
}

acl {
  enabled = true
}

plugin "docker" {
  config {
    auth {
      config = "/etc/docker-auth.json"
    }
  }
}

Please let me know if you need more information.
Thanks in advance

Any help please? I desperately need help if someone has faced a similar issue.

Hi @andy-22,

I can’t tell you what the problem is from the information given, but the Nomad website has an upgrade guide which covers all the topics that need to be considered when upgrading between versions.

Thanks,
jrasell and the Nomad team

Hi @jrasell,
Thanks for your response.
The major change I see in 1.10 is that it now requires service_identity and task_identity in the consul block. I added them, with no effect.
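For reference, this is roughly what I added to the consul block on the servers (the aud and ttl values are just what I tried based on the docs):

consul {
  service_identity {
    aud = ["consul.io"]
    ttl = "1h"
  }

  task_identity {
    aud = ["consul.io"]
    ttl = "1h"
  }
}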

I want to start by seeing the errors. Can you point me to what causes the message “Sent interrupt. Waiting 5s before force killing”?

Again, I appreciate your response

Hi @andy-22,

I think the best place to start will be the client logs on one of the agents where the workload is having problems. These should give some indication of what actions the client is taking and why.

Thanks,
jrasell and the Nomad team

@jrasell,
Looks like I am getting closer. Thanks for the pointer to look at the agent service logs.
I found an error saying the CNI plugins were not found. I downloaded https://github.com/containernetworking/plugins/releases/download/v1.7.1/cni-plugins-linux-amd64-v1.7.1.tgz and installed them in /opt/cni/bin.
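In case it helps anyone else: from what I can tell, Nomad looks for the plugins in /opt/cni/bin by default, and the path can also be set explicitly in the client block if you install them somewhere else:

client {
  # Default is /opt/cni/bin; only needed if the plugins live elsewhere.
  cni_path = "/opt/cni/bin"
}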

Now something has changed around getting Consul secrets, I think. As shown in my nomad.hcl, I am using a node identity token generated in Consul with the following policy.

agent_prefix "" {
  policy = "read"
}

  key_prefix "" {
    policy = "read"
  }

  node_prefix "" {
    policy = "read"
  }

  service_prefix "" {
    policy = "write"
  }

  acl = "read"

Here is my template stanza in my job specification:

template {
  destination = "${NOMAD_SECRETS_DIR}/envs_1.txt"
  env         = true
  data        = <<EOH
{{range ls "arch"}}
{{.Key}}={{.Value}}
{{end}}
EOH
}

The job was working earlier and is now failing because it relies on secrets from the Consul KV store.
My Consul version is 1.21.1 (also upgraded, from 1.20.4).

Can you please tell me if any changes are needed to get values from the Consul KV?

Hi @andy-22,

I’m glad you’ve managed to move it forward. What is the error you are seeing when attempting to read from Consul KV?

Do you have a consul block in your job specification, instructing Nomad to give it a workload identity? If not, I would try adding this.
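For example, a minimal version inside the task might look like this (task name is a placeholder):

task "mytask" {
  consul {}
}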

Thanks,
jrasell and the Nomad team

Hi @jrasell,
Thanks for your response and help.
Currently I don't have any consul block in my job spec. All I have is a template to get values from Consul KV, as seen below:

template {
  destination = "${NOMAD_SECRETS_DIR}/envs_1.txt"
  env         = true
  data        = <<EOH
{{range ls "arch"}}
{{.Key}}={{.Value}}
{{end}}
EOH
}

I was looking at the following:

as well as the following:

But I don't understand what changes I need to make for workload identity and task identity.
Here is my typical consul.hcl from consul server:

datacenter = "dev"
data_dir = "/opt/consul"
client_addr = "0.0.0.0"
ui_config{
  enabled = true
}
server = true
bind_addr = "0.0.0.0"
bootstrap_expect=3
retry_join = ["server1_fqdn", "server2_fqdn", "server3_fqdn"]
node_name = "server1"
log_level = "warn"
log_file = "/var/log/consul/"
log_rotate_duration = "24h"
log_json = false
log_rotate_max_files = 10

Here is my Nomad Server Config:

data_dir  = "/opt/nomad/data"
bind_addr = "0.0.0.0"
log_level = "INFO"
name = "server1"
advertise {
  http  = "{{ GetInterfaceIP \"ens192\" }}"
  rpc   = "{{ GetInterfaceIP \"ens192\" }}"
  serf  = "{{ GetInterfaceIP \"ens192\" }}"
}

server { 
  enabled          = true
  bootstrap_expect = 3
}

consul {
  address                = "127.0.0.1:8500"
  server_service_name    = "nomad"
  client_service_name    = "nomad-client"
  auto_advertise         = true
  server_auto_join       = true
  client_auto_join       = true
  token                  = "<NODE_IDENTITY_TOKEN>"
  allow_unauthenticated  = true
}

acl {
  enabled = true
}

Here is my Nomad Client configuration:

datacenter = "dev"
data_dir  = "/opt/nomad/data"
name      = "client1"
bind_addr = "0.0.0.0"

client {
  enabled = true
  servers = ["server1_fqdn:4647", "server2_fqdn:4647", "server3_fqdn:4647"]
  node_pool = "rhel-9x"

  meta {
    environment = "dev"
    os          = "linux"
  }

}


plugin "docker" {
    config {
      endpoint = "unix:///var/run/docker.sock"
      auth {
        config = "/etc/docker-auth.json"
    }
  }
}

server {
  enabled = false
}


advertise {
  http  = "{{ GetInterfaceIP \"ens192\" }}"
  rpc   = "{{ GetInterfaceIP \"ens192\" }}"
  serf  = "{{ GetInterfaceIP \"ens192\" }}"
}


consul {
  address                = "localhost:8500"
  server_service_name    = "nomad"
  client_service_name    = "nomad-client"
  checks_use_advertise   = true
  auto_advertise         = true
  server_auto_join       = true
  client_auto_join       = true
  token                  = "<NOMAD_AGENT_NODE_IDENTITY_TOKEN>"
  allow_unauthenticated  = false
  ssl                    = false
}

acl {
  enabled = true
}

How can I configure it so that all jobs and tasks can use the node identity of the Nomad client, which already has access to the KV store via this policy:

agent_prefix "" {
  policy = "read"
}

  key_prefix "" {
    policy = "read"
  }

  node_prefix "" {
    policy = "read"
  }

  service_prefix "" {
    policy = "write"
  }

  acl = "read"

Or, under the new standard since 1.7.x, if I have to configure workload identity, what changes do I need to make to my Consul and Nomad configs as well as my job specs?

I truly appreciate your help with this. Without our Nomad jobs running, we are completely lost right now.

Hi @andy-22,

I would suggest taking a look through our Consul identity tutorial which includes details of all the items you’ll need to ensure are present for identities to work. In particular, configuration items such as Consul auth-methods and Consul binding-rules will be required, which you may not already have.

You will need to add a consul block to the job specifications that need a Consul identity; looking at your setup, a consul {} declaration in the task block should be OK.
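As a rough sketch (job and task names are placeholders; your template stays as it is):

job "example" {
  group "app" {
    task "app" {
      driver = "docker"

      config {
        image = "busybox:1.36"
      }

      # An empty consul block asks Nomad to provision a Consul
      # workload identity for this task, which the template runner
      # then uses to read from Consul KV.
      consul {}

      template {
        destination = "${NOMAD_SECRETS_DIR}/envs_1.txt"
        env         = true
        data        = <<EOH
{{range ls "arch"}}
{{.Key}}={{.Value}}
{{end}}
EOH
      }
    }
  }
}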

Thanks,
jrasell and the Nomad team

@jrasell,
Finally, after your pointer to the interactive tutorial, I brought my cluster back to life.
I have also gained insight into workload identities.
That one link was a game changer.

I really really appreciate your help.