Java job error="rpc error: code = Unavailable desc = transport is closing"

Hi, I’m using the official Nomad binary package, version 0.10.2, on CentOS 7. I’ve installed Oracle Java 8 under /opt/java, and all the environment variables are set correctly.
To test the Nomad Java driver, I created a simple Java app that just prints the date and time every second in a loop. Then I started a Nomad server and client with the following configuration:

    datacenter = "my_center"
    data_dir   = "/opt/nomad/data1"

    addresses {
      http = "0.0.0.0"
      rpc  = "0.0.0.0"
      serf = "0.0.0.0"
    }

    ports {
      http = 4646
      rpc  = 4647
      serf = 4648
    }

    server {
      enabled          = true
      bootstrap_expect = 1
    }

    client {
      enabled = true

      server_join {
        retry_join     = ["192.168.0.14:4647"]
        retry_max      = 3
        retry_interval = "15s"
      }

      reserved {
        cpu    = 500
        memory = 1024
        disk   = 1024
      }

      chroot_env {
        "/opt/java"   = "/opt/java"
        "/etc/passwd" = "/etc/passwd"
      }
    }

Then I uploaded the jar file to a file-hosting service and created the following Nomad job file:

    job "Java-Test" {
      datacenters = ["my_center"]

      group "test-group" {
        count = 1

        constraint {
          operator = "distinct_hosts"
          value    = "true"
        }

        task "java-task" {
          driver = "java"

          config {
            jar_path    = "test.jar"
            jvm_options = ["-Xmx2048m", "-Xms256m"]
          }

          artifact {
            source = "https://transfer.sh/tvlnY/NomadJavaTest.jar"
          }
        }
      }
    }

But when I submit the job to the Nomad cluster, I get the following error:

    2019-12-21T03:10:27.055+0100 [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=b52c55bc-e248-602a-0a39-97d51508b04f task=java-task @module=logmon path=/opt/nomad/data1/alloc/b52c55bc-e248-602a-0a39-97d51508b04f/alloc/logs/.java-task.stdout.fifo timestamp=2019-12-21T03:10:27.055+0100
    2019-12-21T03:10:27.055+0100 [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=b52c55bc-e248-602a-0a39-97d51508b04f task=java-task @module=logmon path=/opt/nomad/data1/alloc/b52c55bc-e248-602a-0a39-97d51508b04f/alloc/logs/.java-task.stderr.fifo timestamp=2019-12-21T03:10:27.055+0100
    2019-12-21T03:10:32.949+0100 [INFO]  client.driver_mgr.java: starting java task: driver=java driver_cfg="{Class: ClassPath: JarPath:test.jar JvmOpts:[-Xmx2048m -Xms256m] Args:[]}" args=[-Xmx2048m, -Xms256m, -jar, test.jar]
    2019-12-21T03:10:33.348+0100 [ERROR] client.driver_mgr.java: error receiving stream from Stats executor RPC, closing stream: alloc_id=b52c55bc-e248-602a-0a39-97d51508b04f driver=java task_name=java-task error="rpc error: code = Unavailable desc = transport is closing"
    2019-12-21T03:10:33.348+0100 [ERROR] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task: alloc_id=b52c55bc-e248-602a-0a39-97d51508b04f task=java-task error="rpc error: code = Canceled desc = grpc: the client connection is closing"
    2019-12-21T03:10:33.349+0100 [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=b52c55bc-e248-602a-0a39-97d51508b04f task=java-task reason="Restart within policy" delay=16.792958146s

It seems the error is related to gRPC. I don’t have Go installed on my system, but I don’t think Nomad needs it.
The test jar file is available at the link above and can be used to reproduce the error.
Could you please help me solve this?

The gRPC error you’re seeing there comes from the communication between Nomad and the Java task driver (which is internal to Nomad; we use gRPC for both internal and external plugins). So you’re right: you don’t need Go installed on the system at all.

There are a couple of things I’d check here:

  • Switch to debug-level logging. That usually provides more information.
  • Run nomad node status -verbose <node ID> and make sure the driver.java attributes are what you expect. For example, on one of my machines they look like this:

    driver.java                   = 1
    driver.java.runtime           = OpenJDK Runtime Environment (build 1.8.0_222-8u222-b10-1ubuntu1~16.04.1-b10)
    driver.java.version           = 1.8.0_222
    driver.java.vm                = OpenJDK 64-Bit Server VM (build 25.222-b10, mixed mode)
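For the debug-logging suggestion, here is a minimal sketch assuming the single agent configuration file from the question: add the top-level log_level option (outside the server and client blocks) and restart the agent. The same level can also be set on the command line with nomad agent -log-level=DEBUG.

```hcl
# Top-level agent option, placed next to datacenter and data_dir,
# not inside the server or client blocks.
# Levels increase in verbosity: WARN, INFO, DEBUG, TRACE (DEBUG chosen here).
log_level = "DEBUG"
```

DEBUG is usually enough to see what the executor and the JVM do right before the task exits; TRACE is noisier and mostly useful when DEBUG still shows nothing.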

I am getting a similar error:

    Feb 4 11:52:07 ip-172-31-18-170 nomad[30454]: 2020-02-04T00:52:07.649Z [ERROR] client.driver_mgr.exec: error receiving stream from Stats executor RPC, closing stream: alloc_id=326a1109-b644-5f8e-fccc-afdc2a30c2bb driver=exec task_name=driver error="rpc error: code = Unavailable desc = transport is closing"
    Feb 4 11:52:07 ip-172-31-18-170 nomad[30454]: 2020-02-04T00:52:07.649Z [ERROR] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task: alloc_id=326a1109-b644-5f8e-fccc-afdc2a30c2bb task=driver error="rpc error: code = Canceled desc = grpc: the client connection is closing"

I am trying to run Spark jobs using Nomad, but the driver kills the executor when it gets this error. Any suggestions on how to fix this?

I checked my Java driver attributes. They look similar to yours, just with a newer Java 8 update:

    driver.java         = 1
    driver.java.runtime = OpenJDK Runtime Environment (build 1.8.0_242-8u242-b08-0ubuntu3~16.04-b08)
    driver.java.version = 1.8.0_242
    driver.java.vm      = OpenJDK 64-Bit Server VM (build 25.242-b08, mixed mode)