I have two tasks running in the same group. One is the frontend (mlflow ui) and the other is the backend (mlflow db).
Never mind the commented-out “--backend-store-uri” line: I open a bash shell in the “mlflow” task and try to connect to the database using the $NOMAD_ADDR_mlflow_db variable:
root@5ce82f892731:/# mlflow server --default-artifact-root /home --backend-store-uri postgresql://$NOMAD_ADDR_mlflow_db
2021/11/28 12:16:47 WARNING mlflow.store.db.utils: SQLAlchemy engine could not be created. The following exception is caught.
(psycopg2.OperationalError) connection to server at "127.0.0.1", port 24788 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
(Background on this error at: https://sqlalche.me/e/14/e3q8)
I checked that the PostgreSQL server is running, so why isn’t it accepting TCP connections?
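A minimal sketch for checking whether anything accepts TCP connections at that address, assuming Python is available in the container. `port_is_open` is a hypothetical helper, and the host and port are copied from the error message above:

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds, False otherwise."""
    try:
        # create_connection performs the TCP handshake; the context manager
        # closes the socket again as soon as the check is done.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Host and port taken from the psycopg2 error above (assumption: the
    # dynamic port has not changed since that message was logged).
    print(port_is_open("127.0.0.1", 24788))
```

This only tells whether a listener exists at that host and port; it says nothing about whether the listener is PostgreSQL or whether the credentials are valid.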
As I am a noob, I’d like to ask if this is even the best way to do this, or whether I should split the frontend and the backend into different groups.
mlflow.hcp
job "mlflow-test" {
  datacenters = ["dc1"] # default when running dev mode
  type = "service"

  group "mlflow_group" {
    count = 1

    network {
      port "mlflow_ui" {}
      port "mlflow_db" {}
    }

    task "mlflow" {
      driver = "docker"

      config {
        image = "bgalvao/nomad-mlflow"
        ports = ["mlflow_ui"]
        # entrypoint = ["bash"]
        entrypoint = ["mlflow", "server"]
        args = [
          "--host", "0.0.0.0",
          "-p", "${NOMAD_PORT_mlflow_ui}",
          # # "--backend-store-uri", "postgresql://postgres@${NOMAD_ADDR_mlflow_db}/postgres"
        ]
      }

      resources {
        cpu = 2000
        memory = 2000
      }
    }

    task "mlflow-db" {
      driver = "docker"

      config {
        image = "postgres" # https://hub.docker.com/_/postgres
        ports = ["mlflow_db"]
      }

      env {
        POSTGRES_PASSWORD = "use_vault"
        # psql --username=spec_user -d mlflow_db
        # postgresql://spec_user:use_vault@localhost:${NOMAD_PORT_mlflow_db}/mlflow_db
        # for debugging purposes
      }

      resources {
        cpu = 2000
        memory = 2000
      }

      lifecycle {
        hook = "prestart"
        # set sidecar = true
        # if you want the job to run for the duration of
        # the allocation
        sidecar = true
      }
    }
  }
}