Is there a cache volume on Nomad clients?

Hi,
I'm experiencing unexpected behavior with shared volumes.
We have:

  • A Nomad cluster (version 1.1.1) with 3 masters and N clients
  • A GlusterFS cluster with 3 nodes (up to date)
  • Jobs with 3 tasks (an HTTP API, a webserver and a PostgreSQL database)
    We mount the GlusterFS volume on each Nomad client in order to share it with the database container and keep the workload stateful (a mount sketch follows the configuration below).
    Clients are also configured to expose this volume as a host volume:
# Nomad client configuration
client {
    enabled = true
    servers = ["..."]
  
    node_class = ""
    no_host_uuid = false
  
    max_kill_timeout = "30s"
    network_speed = 0
    cpu_total_compute = 0
    gc_interval = "1m"
    gc_disk_usage_threshold = 80
    gc_inode_usage_threshold = 70
    gc_parallel_destroys = 2
    reserved {
        cpu = 0
        memory = 0
        disk = 0
    }
    host_volume "my-data" {
        path = "/path/to/gluster/volume"
        read_only = false
    }
}
plugin "docker" {
    config {
        endpoint = "unix:///var/run/docker.sock"
        volumes {
            enabled = true
        }
    }
}
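
For completeness, the GlusterFS volume is mounted on each client host along these lines; the server name and volume name below are placeholders, not our real values:

# Hypothetical host-side mount (requires the glusterfs-fuse client)
mount -t glusterfs gluster-node-1:/my-gluster-volume /path/to/gluster/volume

# or the equivalent /etc/fstab entry:
# gluster-node-1:/my-gluster-volume  /path/to/gluster/volume  glusterfs  defaults,_netdev  0 0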
And the job specification:

// Block Storage over GlusterFS
job "app" {
    datacenters = ["dc1"]
    type = "service"
            
    // Create a group that contains tasks
    group "app" {
        count = 1
        volume "app-volume-name" {
            type = "host"
            read_only = false
            source = "my-data"
        }
        
        //
        // Database task
        //
        task "database"  {
            lifecycle {
                hook = "prestart"
                sidecar = true
            }
            volume_mount {
                volume = "app-volume-name"
                destination = "/path/to/mounting/point"
                read_only = false
            }
            // The driver used for task runner
            driver = "docker"
            kill_signal = "SIGTERM"
            kill_timeout = "300s"
            // Define driver config
            config {
                image = "path.to/database/image:latest"
                force_pull = true
                port_map {
                    db = 5432
                }
                args = [
                    "postgres",
                    "-D", "/path/to/data",
                    "-c", "... Postgres tuning",
                ]
            }
            env {
                POSTGRES_DB = "database"
                POSTGRES_USER = "username"
                POSTGRES_PASSWORD = "password"
                PGDATA = "/path/to/data"
            }
            resources {
                cpu    = 500 # 500 MHz
                memory = 256 # 256MB
                network {
                    mbits = 5
                    // Tell Nomad to open a dynamic host port for the db port of each container
                    port "db" {}
                }
            }
            // The service stanza is used by the Consul service-discovery feature.
            // It ONLY DESCRIBES the service.
            service {
                name = "database-name"
                tags = [
                    "global",
                    "report",
                    "database"
                ]
                address_mode = "host"
                port = "db"
                check {
                    name     = "alive"
                    interval = "5s"
                    type     = "tcp"
                    timeout  = "5s"
                    port     = "db"
                }
            }
        }
        //
        // Web server task
        //
        task "webserver" {
            ...
        }
        //
        // API task
        //
        task "api" {
            ...
        }
    }
}

Here is an example of what we're experiencing; imagine we're following a task named X. Please note we are in a cloud environment (we order and return servers frequently).
The scenario:
STEP 1) A first scaling is performed: servers A, B and C are ordered and added to the Nomad cluster. My task X lands on host B; everything works fine.
STEP 2) A second scaling phase is performed and servers D and E are added to the Nomad cluster. My task X moves to host D; everything works fine.
STEP 3) A third scaling phase is performed and my task X goes back to host B. My database no longer boots.
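
(We follow the task's placement with the standard CLI, roughly like this; the job name matches the example spec above and the allocation ID is a placeholder:)

# Show the job's current allocations and the node each one runs on
nomad job status app

# Inspect a single allocation, including the client node it was placed on
nomad alloc status <alloc-id>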
When my task goes back to its initial host, the database system fails to start and dumps the following logs:

2021-06-28 14:26:11.836 UTC [1] LOG:  starting PostgreSQL 12.7 on x86_64-pc-linux-musl, compiled by gcc (Alpine 10.3.1_git20210424) 10.3.1 20210424, 64-bit
2021-06-28 14:26:11.836 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2021-06-28 14:26:11.836 UTC [1] LOG:  listening on IPv6 address "::", port 5432
2021-06-28 14:26:11.868 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2021-06-28 14:26:12.117 UTC [21] LOG:  database system was shut down at 2021-06-24 18:07:24 UTC
2021-06-28 14:26:12.131 UTC [21] LOG:  invalid primary checkpoint record
2021-06-28 14:26:12.131 UTC [21] PANIC:  could not locate a valid checkpoint record
2021-06-28 14:26:22.006 UTC [24] FATAL:  the database system is starting up
2021-06-28 14:26:22.011 UTC [25] FATAL:  the database system is starting up
2021-06-28 14:26:23.232 UTC [26] FATAL:  the database system is starting up
2021-06-28 14:26:23.235 UTC [27] FATAL:  the database system is starting up
2021-06-28 14:26:24.462 UTC [28] FATAL:  the database system is starting up
2021-06-28 14:26:24.467 UTC [29] FATAL:  the database system is starting up
2021-06-28 14:26:25.693 UTC [31] FATAL:  the database system is starting up
2021-06-28 14:26:25.697 UTC [32] FATAL:  the database system is starting up
2021-06-28 14:26:26.881 UTC [33] FATAL:  the database system is starting up
2021-06-28 14:26:26.887 UTC [34] FATAL:  the database system is starting up
2021-06-28 14:26:28.097 UTC [35] FATAL:  the database system is starting up
2021-06-28 14:26:28.104 UTC [36] FATAL:  the database system is starting up
2021-06-28 14:26:29.295 UTC [37] FATAL:  the database system is starting up
2021-06-28 14:26:29.298 UTC [38] FATAL:  the database system is starting up
2021-06-28 14:26:30.510 UTC [40] FATAL:  the database system is starting up
[...]
2021-06-28 14:28:09.707 UTC [226] FATAL:  the database system is starting up
2021-06-28 14:28:10.914 UTC [228] FATAL:  the database system is starting up
2021-06-28 14:28:10.917 UTC [229] FATAL:  the database system is starting up
2021-06-28 14:28:11.261 UTC [1] LOG:  startup process (PID 21) was terminated by signal 6: Aborted
2021-06-28 14:28:11.261 UTC [1] LOG:  aborting startup due to startup process failure
2021-06-28 14:28:11.311 UTC [1] LOG:  database system is shut down

As we can see, the DBMS's last recorded shutdown is at 2021-06-24 18:07:24. This is invalid, because the DBMS has been restarted many times since then.

This date matches the date when my task X left server B (STEP 2 of the scenario); it seems that Nomad (or Docker) keeps "ghost" data on the host, which corrupts the file system.
If I rebalance my job onto another node, everything works fine again…
Is there a cache volume on Nomad clients that I don't know about?

Thank you in advance for your answers
Corentin

Hi @corentin.m :wave:

No, there's no cache volume in Nomad. There is the allocation directory, which is stored within Nomad's data_dir path, but, from what you described, the second time your task started on client B it should have been a new allocation.
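
Just as a rough illustration (the data_dir path below is an assumption and depends on your client configuration):

# One directory per allocation ID lives under <data_dir>/alloc on each client
ls /opt/nomad/data/alloc/

# You can also browse an allocation's directory through the Nomad API
nomad alloc fs <alloc-id>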

But you are using a host volume, so the data there will be persisted across allocations. I don't know enough about Postgres, but a quick Google for PANIC: could not locate a valid checkpoint record indicates that your last instance didn't shut down correctly and the checkpoint data was not persisted properly.

Could you check if your database is being shut down cleanly?
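
For example, something like this against the data directory on the host volume (just a sketch; pg_controldata has to match your PostgreSQL major version, and the path is the placeholder from your job spec):

# "shut down" indicates a clean stop; "in production" while the server is stopped suggests it was not stopped cleanly
pg_controldata /path/to/data | grep "Database cluster state"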

To remove Nomad from the picture, could you also try starting a Postgres server pointing to that data path outside Nomad? For example, running postgres -D /path/to/data directly from the host CLI.
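
Or, if you want to keep Docker in the picture but skip Nomad, roughly like this (image name and paths are the placeholders from your job spec; here the host volume is mounted straight onto PGDATA for simplicity):

docker run --rm \
  -v /path/to/gluster/volume:/path/to/data \
  -e PGDATA=/path/to/data \
  path.to/database/image:latest \
  postgres -D /path/to/data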

Hi @lgfa29 ,
First, thank you for taking a look at our issue.

An unclean shutdown was our first hypothesis, but booting PostgreSQL outside the Nomad layer works fine (without any recovery log entry). Furthermore, since we send SIGTERM, the PostgreSQL documentation says a smart shutdown is performed, and we've configured the Nomad job to avoid parallel updates so that the database filesystem is only ever accessed by one database server at a time.
In addition, when task X starts on another host, the DBMS boots without any problem. The problem only appears when task X returns to host B.
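
Regarding the parallel-update point, the job is restricted to a single instance with serial updates, roughly like this (a sketch, values are illustrative):

group "app" {
    count = 1
    update {
        max_parallel     = 1
        health_check     = "checks"
        min_healthy_time = "10s"
    }
}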

This is why we're looking for a clue at the Nomad (and Docker) level.

Thank you for checking that. If you are able to start Postgres outside Nomad pointing to the same data, then there might be something in the alloc dir.

Are you using sticky disk by any chance? And when the allocation returns to a host, could you check if it starts with the same ID?
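
(By sticky disk I mean the ephemeral_disk stanza, something like the sketch below, which is not in the job you posted:)

group "app" {
    ephemeral_disk {
        sticky  = true
        migrate = true
        size    = 300
    }
}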

No, sticky is not set (so it defaults to false), and I confirm that the allocation ID is not the same between restarts.

In addition, we've tried other PostgreSQL images, with the same result.
Then we booted the DBMS through Docker (without the Nomad layer) and it works.

Is there any data that is kept between allocations?

Thank you for your help

No, there is no data shared between allocations except for the host volume, so I suspect something in the volume may be causing the issue.

Did you point to the same path as the host volume? And was this after a Nomad allocation had completed?
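
One way to double-check which host path the container actually got is to inspect its mounts, assuming the stopped container still exists on the client (the Docker driver names containers after the task name and allocation ID):

# Shows the container's bind mounts, including the host-volume source path
docker inspect --format '{{ json .Mounts }}' <container-name-or-id>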