I have a brand-new cluster. I can run Python jobs without issue using raw_exec, and I can also run a basic Java program using raw_exec, but I am not sure how to get the java driver to work. Or is that even needed? Can someone point me to a good doc on what needs to be done to get the java driver working, in both the Nomad server and client config?
Nomad Server Config
plugin "raw_exec" {
  config {
    enabled = true
    no_cgroups = true
  }
}
Client
plugin "raw_exec" {
  config {
    enabled = true
    no_cgroups = true
  }
  options {
    "driver.allowlist" = "exec,java,raw_exec"
  }
However, if I run a simple Java program with raw_exec it works fine, but if I switch from
driver = "raw_exec"
config {
  command = "/apps/bin/java"
  args = ["-jar", "/tmp/xxxxx.jar"]
}
to
driver = "java"
config {
  command = "/apps/bin/java"
  args = ["-jar", "/tmp/xxxxx.jar"]
}
it fails with this error when I plan and run it from the UI:
Constraint missing drivers filtered 1 node
How do we get the java driver to work? I even tried adding jar_path, jvm_options and artifact, but it still doesn't work.
Also, here is the output of nomad node status <client_id>:
nomad node status d3f254c1
ID = xxxxx
Name = xxx
Class = <none>
DC = xxx
Drain = false
Eligibility = eligible
Status = ready
CSI Controllers = <none>
CSI Drivers = <none>
Uptime = 164h55m51s
Host Volumes = <none>
Host Networks = <none>
CSI Volumes = <none>
Driver Status = raw_exec
See the java task driver client requirements. Can you share the part of the client configuration (nomad.hcl) where you enable the java driver?
Hi @sammy676776, if you show the output of nomad node status -verbose <id> there should be a Drivers section showing which drivers have or have not been detected.
If this is a Linux machine, then I suspect the problem is that your java install is not in the default chroot. You’ll need to add /apps/bin/java to the chroot_env on the Client before it can be detected.
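A minimal sketch of that suggestion, assuming Java lives at /apps/bin/java as in the question. As far as I know, setting chroot_env replaces Nomad's default chroot map rather than extending it, so the usual system paths are listed again here; the exact set of paths your tasks need is an assumption.

```hcl
client {
  enabled = true

  # Setting chroot_env overrides the default map, so the common
  # system paths are repeated (adjust to what exists on your hosts).
  chroot_env {
    "/bin"   = "/bin"
    "/etc"   = "/etc"
    "/lib"   = "/lib"
    "/lib64" = "/lib64"
    "/sbin"  = "/sbin"
    "/usr"   = "/usr"

    # Assumption: the Java location from the question, mapped to the
    # same path inside the task chroot.
    "/apps/bin/java" = "/apps/bin/java"
  }
}
```

The client agent has to be restarted for a changed chroot_env to take effect.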
Hi!
I wrote a Medium article a little while back about using the Java driver with Nomad. Check it out and see if it helps. I have a few full examples in there.
Here are some key points:
- for the Nomad client setup, you just need to install Java on it (I used OpenJDK11)
- jar is passed to the Java driver via jar_path
- you need to use the artifact stanza to obtain your .jar file
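Put together, those points look roughly like the job below. This is only a sketch: the job name, group name, datacenter, URL and memory values are placeholders, not taken from the article. It relies on the artifact being downloaded into the task's local/ directory, which is the default destination, so jar_path is given relative to the task directory.

```hcl
job "simple-java" {
  datacenters = ["dc1"] # placeholder datacenter name
  type        = "batch"

  group "app" {
    task "webservice" {
      driver = "java"

      # Fetch the jar; with no destination set it lands in local/.
      artifact {
        source = "https://example.com/simple.jar" # placeholder URL
      }

      config {
        # Path is relative to the task directory, not the host filesystem.
        jar_path    = "local/simple.jar"
        jvm_options = ["-Xmx256m", "-Xms64m"]
      }

      resources {
        memory = 300 # MiB; keep -Xmx below this
      }
    }
  }
}
```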
My Nomad client stanza to invoke Java:
client {
  enabled = true
  gc_max_allocs = 100
  gc_interval = "10m"
  server_join {
    retry_join = ["xxxxx:4647", "xxxxx:4647", "xxxx:4647"]
  }
  options {
    "driver.allowlist" = "exec,java,raw_exec"
  }
  chroot_env {
    "/bin" = "/bin"
    "/etc" = "/etc"
    "/lib" = "/lib"
    "/lib32" = "/lib32"
    "/lib64" = "/lib64"
    "/run/resolvconf" = "/run/resolvconf"
    "/sbin" = "/sbin"
    "/usr" = "/usr"
    "/apps/bin/java/" = "/apps/bin/java/"
  }
}
plugin "raw_exec" {
  config {
    enabled = true
    no_cgroups = true
  }
}
Please note that when Nomad comes up it does show this line:
[INFO] agent: detected plugin: name=java type=driver plugin_version=0.1.0
Output of nomad node status -verbose <nodeid> for the drivers, which shows that java is not healthy for some reason:
Driver    Detected  Healthy  Message  Time
exec      true      true     Healthy  2023-03-09T15:25:08-05:00
java      false     false    <none>   2023-03-09T15:25:08-05:00
raw_exec  true      true     Healthy  2023-03-09T15:25:08-05:00
So the java driver is still not available.
OK, fixed. Basically, JAVA_HOME and JAVA_PATH were not being picked up even after defining them in the unit files. Once I passed them on the line that invokes “/bin/nomad agent -config”, it started working.
Drivers
Driver    Detected  Healthy  Message  Time
exec      true      true     Healthy  2023-03-09T15:56:51-05:00
java      true      true     Healthy  2023-03-09T15:56:51-05:00
raw_exec  true      true     Healthy  2023-03-09T15:56:51-05:00
However, this brings back an old problem, which we had fixed by not running as root: when we run any task now on this client, I get
client.alloc_runner.task_runner: prestart failed: alloc_id=93b9290e-a652-8378-f042-86f6bb4f099d task=webservice error="prestart hook \"task_dir\" failed: Failed to mount shared directory for task: operation not permitted"
2023-03-09T15:59:40.060-0500 [INFO] client.alloc_runner.task_runner: not restarting task: alloc_id=93b9290e-a652-8378-f042-86f6bb4f099d task=webservice reason="Error was unrecoverable"
Which directory is it actually complaining about? Nomad is running as root, and the entire directory structure is also owned by root.
stat nomad_xxxx
Device: 29h/41d Inode: 6462200812 Links: 4
Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
The Nomad Client needs to run as root. The directory structure of a Nomad Client in a production environment should look like this:
[drwxr-xr-x root ] /opt/nomad
[drwx------ root ] ├── data
[drwx--x--x root ] │ ├── alloc
[drwx------ root ] │ └── client
[drwxr-xr-x root ] └── plugins
If you try to run the Client as non-root (which isn’t supported, but can be done with caveats and tweaks), then of course that structure would need different permissions.
Thanks @seth.hoenig. Right now I am running as root and the entire directory structure is owned by root. As soon as I did that, my java driver issue got resolved; however, I still run into this task_dir problem for every job.
client.alloc_runner.task_runner: prestart failed: alloc_id=93b9290e-a652-8378-f042-86f6bb4f099d task=webservice error="prestart hook \"task_dir\" failed: Failed to mount shared directory for task: operation not permitted"
2023-03-09T15:59:40.060-0500 [INFO] client.alloc_runner.task_runner: not restarting task: alloc_id=93b9290e-a652-8378-f042-86f6bb4f099d task=webservice reason="Error was unrecoverable"
@seth.hoenig Thank you. I am getting close: switching to data_dir = "/opt/nomad" made it work. Fingers crossed, let's see.
Thanks for all the help so far @seth.hoenig @Neutrollized @brucellino1. I hope my various issues will help someone in the future. I still have one more issue left with Java. I have a simple Java program that writes something to a file, so I know when the job is successful. However, raw_exec works and the java driver fails with Exit Code 127. Any ideas? Same jar file, same hosts, but different results with a different driver.
The java driver is loading fine on the node:
Drivers
Driver    Detected  Healthy  Message  Time
exec      true      true     Healthy  2023-03-09T19:01:22-05:00
java      true      true     Healthy  2023-03-09T19:01:22-05:00
raw_exec  true      true     Healthy  2023-03-09T19:01:22-05:00
If I use the java driver I get the following error and the job doesn’t complete. Any ideas?
nomad alloc status d43c54e1
ID = d43c54e1-f27a-d83a-d235-7cf2ba806337
Eval ID = f67d9585
Name = xxxxxx.cache[0]
Node ID = 92628112
Node Name = redacted
Job ID = xxxxxxx
Job Version = 0
Client Status = pending
Client Description = No tasks have started
Desired Status = run
Desired Description = <none>
Created = 1m ago
Modified = 15s ago
Task "webservice" is "pending"
Task Resources
CPU Memory Disk Addresses
0/100 MHz 0 B/300 MiB 300 MiB
Task Events:
Started At = 2023-03-11T21:01:59Z
Finished At = N/A
Total Restarts = 3
Last Restart = 2023-03-11T16:01:59-05:00
Recent Events:
Time Type Description
2023-03-11T16:01:59-05:00 Restarting Task restarting in 15.319076743s
2023-03-11T16:01:59-05:00 Terminated Exit Code: 127
2023-03-11T16:01:59-05:00 Started Task started by client
2023-03-11T16:01:38-05:00 Restarting Task restarting in 17.944166401s
2023-03-11T16:01:38-05:00 Terminated Exit Code: 127
2023-03-11T16:01:38-05:00 Started Task started by client
2023-03-11T16:01:36-05:00 Restarting Task restarting in 17.543176045s
2023-03-11T16:01:36-05:00 Terminated Exit Code: 127
2023-03-11T16:01:36-05:00 Started Task started by client
2023-03-11T16:01:33-05:00 Downloading Artifacts Client is downloading artifacts
Working raw_exec code
  driver = "raw_exec"
  config {
    command = "/apps/java/bin/java"
    args = ["-jar", "/tmp/simple.jar"]
  }
}
Failing java driver code
task "webservice" {
  driver = "java"
  config {
    jar_path = "/tmp/simple.jar"
    jvm_options = ["-Xmx2048m", "-Xms256m"]
  }
  artifact {
    source = "https://xxxxx:4444/simple.jar"
  }
}
To summarize: same hosts, same jar file; raw_exec works and the java driver fails with Exit Code 127.
Is there a way to make the node pick up the right Java when using the java driver in the job? There are multiple flavors of Java, and maybe we need to explicitly specify one in the job?
2023-03-13T12:42:33-04:00 Not Restarting Exceeded allowed attempts 3 in interval 24h0m0s and mode is "fail"
2023-03-13T12:42:33-04:00 Terminated Exit Code: 127
2023-03-13T12:42:33-04:00 Started Task started by client
2023-03-13T12:42:12-04:00 Restarting Task restarting in 18.411032403s
2023-03-13T12:42:12-04:00 Terminated Exit Code: 127
2023-03-13T12:42:12-04:00 Started Task started by client
2023-03-13T12:41:54-04:00 Restarting Task restarting in 15.690231165s
2023-03-13T12:41:54-04:00 Terminated Exit Code: 127
2023-03-13T12:41:54-04:00 Started Task started by client
2023-03-13T12:41:51-04:00 Restarting Task restarting in 17.910123115s
@sammy676776 those are Task Events, I’m asking about the logs generated by the task itself
I don't see any logs.
nomad alloc logs -job just hangs and if I try
nomad alloc logs -job
So I went to “/opt/nomad/alloc/e4021037-ae67-1a6c-67da-611c2e18154e/webservice/alloc/logs”
more webservice.stderr.0
/apps/jdk-11.0.18/bin/java: error while loading shared libraries: libjli.so: cannot open shared object file: No such file or directory
/apps/jdk-11.0.18/bin/java: error while loading shared libraries: libjli.so: cannot open shared object file: No such file or directory
Do you think the classpath is not being picked up?
task "webservice" {
  driver = "java"
  config {
    jar_path = "local/blah.jar"
    jvm_options = ["-Xmx2048m", "-Xms256m"]
    class_path = "/apps/jdk-11.0.18/"
  }
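The libjli.so message comes from the dynamic linker while the java launcher starts, before any Java classpath is consulted, so class_path is unlikely to change it. One plausible cause (an assumption, not confirmed anywhere in this thread) is that the task chroot contains the java binary but not the rest of the JDK tree with its lib directory. A sketch of a client chroot_env that maps the whole JDK directory seen in the error message:

```hcl
client {
  enabled = true

  chroot_env {
    # Common system paths; adjust to what exists on your hosts.
    "/bin"   = "/bin"
    "/etc"   = "/etc"
    "/lib"   = "/lib"
    "/lib64" = "/lib64"
    "/sbin"  = "/sbin"
    "/usr"   = "/usr"

    # Assumption: map the entire JDK install, not only bin/java,
    # so the launcher can find its own shared libraries.
    "/apps/jdk-11.0.18" = "/apps/jdk-11.0.18"
  }
}
```

If /apps/bin/java is a symlink into that JDK directory, both the link and its target would need to be reachable inside the chroot.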
Can you remove the /tmp in the jar_path field and just have the jar file name in there? i.e.
task "webservice" {
  driver = "java"
  config {
    jar_path = "simple.jar"
    jvm_options = ["-Xmx2048m", "-Xms256m"]
  }
  artifact {
    source = "https://xxxxx:4444/simple.jar"
    mode = "file"
  }
}
If you normally don’t need to pass in a classpath when you run it, then you shouldn’t need to specify it in your jobspec either.
Tried that too. Taking out “chroot” and putting it back in gives different errors. Current error:
Driver Failure failed to launch command with executor: rpc error: code = Unknown desc = file /apps/jdk-11.0.18/bin/java not found under path /opt/nomad/alloc/a23ac027-0c43-08bb-a02f-4fa4d0f3a5b7/raw
nomad node status nodeid shows the java driver is fine and loaded:
Drivers
Driver    Detected  Healthy  Message  Time
exec      true      true     Healthy  2023-03-14T14:59:30-04:00
java      true      true     Healthy  2023-03-14T14:59:30-04:00
raw_exec  true      true     Healthy  2023-03-14T14:59:30-04:00
chroot_env {
  "/bin" = "/bin"
  "/etc" = "/etc"
  "/lib" = "/lib"
  "/lib32" = "/lib32"
  "/lib64" = "/lib64"
  "/run/resolvconf" = "/run/resolvconf"
  "/sbin" = "/sbin"
  "/usr" = "/usr"
}
It is one thing or another with the java driver. The same job, with some changes, works with raw_exec. I am at the point of questioning whether there is any benefit to running with the java driver instead of raw_exec, which works and is probably more widely used than the others.
I am going to use raw_exec and give up on the java driver, as it is quite unstable in our environment and I have already spent so much time trying to set it up. I am happy to work with anyone from HashiCorp or other folks in this group who are willing to help fix this issue for other customers, as I can reproduce this java driver failure quite easily with a couple of simple jar files.