Hi,
I’m running into unexpected behavior with shared volumes.
We have:
- A Nomad cluster (version 1.1.1) with 3 servers and N clients
- A GlusterFS cluster with 3 nodes (up to date)
- Jobs with 3 tasks (an HTTP API, a web server and a PostgreSQL database)
We mount the GlusterFS volume on each Nomad client in order to share it with the database container (and keep that workload stateful).
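For context, the mount on each client looks roughly like this (gluster-node-1 and my-gluster-volume are placeholder names, not our real ones):

# Mount the shared Gluster volume on a Nomad client (requires the GlusterFS
# client package); server and volume names are placeholders
mount -t glusterfs gluster-node-1:/my-gluster-volume /path/to/gluster/volume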
Clients are also configured to use this volume:
# Nomad client configuration
client {
  enabled           = true
  servers           = ["..."]
  node_class        = ""
  no_host_uuid      = false
  max_kill_timeout  = "30s"
  network_speed     = 0
  cpu_total_compute = 0

  gc_interval              = "1m"
  gc_disk_usage_threshold  = 80
  gc_inode_usage_threshold = 70
  gc_parallel_destroys     = 2

  reserved {
    cpu    = 0
    memory = 0
    disk   = 0
  }

  host_volume "my-data" {
    path      = "/path/to/gluster/volume"
    read_only = false
  }
}

plugin "docker" {
  config {
    endpoint = "unix:///var/run/docker.sock"

    volumes {
      enabled = true
    }
  }
}
// Block storage over GlusterFS
job "app" {
  datacenters = ["dc1"]
  type        = "service"

  // Create a group that contains the tasks
  group "app" {
    count = 1

    volume "app-volume-name" {
      type      = "host"
      read_only = false
      source    = "my-data"
    }

    //
    // Database task
    //
    task "database" {
      // Run as a prestart sidecar so the database starts before the other
      // tasks and keeps running for the lifetime of the allocation
      lifecycle {
        hook    = "prestart"
        sidecar = true
      }

      volume_mount {
        volume      = "app-volume-name"
        destination = "/path/to/mounting/point"
        read_only   = false
      }

      // The driver used to run the task
      driver       = "docker"
      kill_signal  = "SIGTERM"
      kill_timeout = "300s"

      // Driver configuration
      config {
        image      = "path.to/database/image:latest"
        force_pull = true

        port_map {
          db = 5432
        }

        args = [
          "postgres",
          "-D", "/path/to/data",
          "-c", "... Postgres tuning",
        ]
      }

      env {
        POSTGRES_DB       = "database"
        POSTGRES_USER     = "username"
        POSTGRES_PASSWORD = "password"
        PGDATA            = "/path/to/data"
      }

      resources {
        cpu    = 500 # 500 MHz
        memory = 256 # 256 MB

        network {
          mbits = 5

          // Ask Nomad to allocate a dynamic host port for the database
          port "db" {}
        }
      }

      // The service stanza is used by Consul's service discovery.
      // It ONLY DESCRIBES the service.
      service {
        name = "database-name"
        tags = [
          "global",
          "report",
          "database"
        ]
        address_mode = "host"
        port         = "db"

        check {
          name     = "alive"
          interval = "5s"
          type     = "tcp"
          timeout  = "5s"
          port     = "db"
        }
      }
    }

    //
    // Web server task
    //
    task "webserver" {
      ...
    }

    //
    // API task
    //
    task "api" {
      ...
    }
  }
}
Here is an example of what we’re experiencing; imagine we’re following a task named X. Note that we run in a cloud environment, so servers are frequently provisioned and returned.
The scenario:
STEP 1) A first scaling phase is performed: servers A, B and C are provisioned and added to the Nomad cluster. Task X lands on host B; everything works fine.
STEP 2) A second scaling phase adds servers D and E to the Nomad cluster. Task X moves to host D; everything still works fine.
STEP 3) A third scaling phase is performed and task X goes back to host B. The database no longer boots.
When the task goes back to its initial host, the database system fails to start and dumps the following logs:
2021-06-28 14:26:11.836 UTC [1] LOG: starting PostgreSQL 12.7 on x86_64-pc-linux-musl, compiled by gcc (Alpine 10.3.1_git20210424) 10.3.1 20210424, 64-bit
2021-06-28 14:26:11.836 UTC [1] LOG: listening on IPv4 address "0.0.0.0", port 5432
2021-06-28 14:26:11.836 UTC [1] LOG: listening on IPv6 address "::", port 5432
2021-06-28 14:26:11.868 UTC [1] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2021-06-28 14:26:12.117 UTC [21] LOG: database system was shut down at 2021-06-24 18:07:24 UTC
2021-06-28 14:26:12.131 UTC [21] LOG: invalid primary checkpoint record
2021-06-28 14:26:12.131 UTC [21] PANIC: could not locate a valid checkpoint record
2021-06-28 14:26:22.006 UTC [24] FATAL: the database system is starting up
2021-06-28 14:26:22.011 UTC [25] FATAL: the database system is starting up
2021-06-28 14:26:23.232 UTC [26] FATAL: the database system is starting up
2021-06-28 14:26:23.235 UTC [27] FATAL: the database system is starting up
2021-06-28 14:26:24.462 UTC [28] FATAL: the database system is starting up
2021-06-28 14:26:24.467 UTC [29] FATAL: the database system is starting up
2021-06-28 14:26:25.693 UTC [31] FATAL: the database system is starting up
2021-06-28 14:26:25.697 UTC [32] FATAL: the database system is starting up
2021-06-28 14:26:26.881 UTC [33] FATAL: the database system is starting up
2021-06-28 14:26:26.887 UTC [34] FATAL: the database system is starting up
2021-06-28 14:26:28.097 UTC [35] FATAL: the database system is starting up
2021-06-28 14:26:28.104 UTC [36] FATAL: the database system is starting up
2021-06-28 14:26:29.295 UTC [37] FATAL: the database system is starting up
2021-06-28 14:26:29.298 UTC [38] FATAL: the database system is starting up
2021-06-28 14:26:30.510 UTC [40] FATAL: the database system is starting up
[...]
2021-06-28 14:28:09.707 UTC [226] FATAL: the database system is starting up
2021-06-28 14:28:10.914 UTC [228] FATAL: the database system is starting up
2021-06-28 14:28:10.917 UTC [229] FATAL: the database system is starting up
2021-06-28 14:28:11.261 UTC [1] LOG: startup process (PID 21) was terminated by signal 6: Aborted
2021-06-28 14:28:11.261 UTC [1] LOG: aborting startup due to startup process failure
2021-06-28 14:28:11.311 UTC [1] LOG: database system is shut down
As we can see, the DBMS reports its last shutdown at 2021-06-24 18:07:24. That information is stale, because the DBMS has been restarted many times since then.
This date matches the moment task X left host B (STEP 2 of the scenario), so it seems that Nomad (or Docker) keeps “ghost” data on the host, and that data corrupts the database files.
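To double-check that the on-disk state really is stale, one thing we can run on host B is pg_controldata against the data directory on the shared volume, through the task’s own image (the image name and paths below are the placeholders from the job above, and this assumes pg_controldata is shipped in the image, as it is in the official postgres images):

# Hypothetical check on host B: point pg_controldata at the directory on the
# shared volume that contains PG_VERSION (adjust the inner path to match PGDATA)
docker run --rm \
  -v /path/to/gluster/volume:/mnt/check \
  path.to/database/image:latest \
  pg_controldata /mnt/check
# "Time of latest checkpoint" should show the real last shutdown, not 2021-06-24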
If I rebalance the job onto another node, everything works fine again…
Is there some volume cache on Nomad clients that I’m not aware of?
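For reference, these are the kinds of leftovers we can look for on host B the next time this happens (/opt/nomad/data below is an assumed default; the real path is whatever data_dir is set to in the client configuration):

ls -la /path/to/gluster/volume   # what the shared volume actually contains
sudo ls /opt/nomad/data/alloc    # allocation dirs from previous runs not yet GC'd
docker ps -a                     # stopped database containers still on the host
docker volume ls                 # Docker-managed volumes left behind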
Thank you in advance for your answers
Corentin