Missing driver with Docker TLS

Hello! :wave:

Today's problem: missing drivers…

Nomad v1.5.3 (single node)
Ubuntu 22.04
Docker version 23.0.3, build 3e7cbfd

I switched Docker to a TCP listener with TLS.

cat /etc/systemd/system/docker.service.d/override.conf

[Service]
ExecStart=
ExecStart=/usr/bin/dockerd -D --tlsverify --tlscacert=/etc/ssl/docker/docker-ca.pem --tlscert=/etc/ssl/docker/dc1-server-docker.pem --tlskey=/etc/ssl/docker/dc1-server-docker.key -H tcp://0.0.0.0:2376
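A quick sanity check (assuming curl is available; the cert paths are the same ones as in the override) confirms the daemon really answers with mutual TLS on 2376:

curl --cacert /etc/ssl/docker/docker-ca.pem \
  --cert /etc/ssl/docker/dc1-client-docker.pem \
  --key /etc/ssl/docker/dc1-client-docker.key \
  https://127.0.0.1:2376/version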

I configured docker.hcl:

plugin "docker" {
    config {
        endpoint = "tcp://127.0.0.1:2376"

        tls {
          cert = "/etc/ssl/docker/dc1-client-docker.pem"
          key  = "/etc/ssl/docker/dc1-client-docker.key"
          ca   = "/etc/ssl/docker/docker-ca.pem"
        }

        allow_privileged = false

        volumes {
            enabled = true
        }

        gc {
            image       = true
            image_delay = "1h"
            container   = true

            dangling_containers {
                enabled        = true
                dry_run        = false
                period         = "5m"
                creation_grace = "5m"
            }
        }
    }
}
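After editing docker.hcl I restart the agent and look at the driver fingerprint from Nomad's side, roughly like this (assuming a systemd-managed agent):

systemctl restart nomad
nomad node status -self -verbose | grep -i docker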

And the relevant part of nomad.hcl:

bind_addr = "0.0.0.0"
advertise {
    http = "127.0.0.1:4646"
    rpc = "127.0.0.1:4647"
    serf = "127.0.0.1:4648"
}
ports {
    http = 4646
    rpc = 4647
    serf = 4648
}

When I run a job:

    Constraint missing drivers filtered 1 node

And the debug log:

2023-04-05T12:20:01.973Z [DEBUG] http: request complete: method=GET path=/v1/job/mosquitto duration=1.128492ms
2023-04-05T12:20:01.996Z [DEBUG] worker: dequeued evaluation: worker_id=55dbf143-4263-c2ce-b37f-566c7e3fcecc eval_id=71cb85ad-975b-fe96-4cd3-26be41fa2c9e type=service namespace=default job_id=mosquitto node_id="" triggered_by=job-register
2023-04-05T12:20:01.997Z [DEBUG] http: request complete: method=GET path=/v1/job/mosquitto?index=18572 duration=33.846121442s
2023-04-05T12:20:01.997Z [DEBUG] worker.service_sched: reconciled current state with desired state: eval_id=71cb85ad-975b-fe96-4cd3-26be41fa2c9e job_id=mosquitto namespace=default worker_id=55dbf143-4263-c2ce-b37f-566c7e3fcecc
  results=
  | Total changes: (place 1) (destructive 0) (inplace 0) (stop 0) (disconnect 0) (reconnect 0)
  | Created Deployment: "514994ee-bf40-1840-dde0-b8db69b0626d"
  | Desired Changes for "mosquitto": (place 1) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 0) (canary 0)

2023-04-05T12:20:01.997Z [DEBUG] http: request complete: method=GET path=/v1/job/mosquitto/evaluations?index=18572 duration=33.853779796s
2023-04-05T12:20:01.997Z [DEBUG] http: request complete: method=POST path=/v1/job/mosquitto duration=12.310982ms
2023-04-05T12:20:02.006Z [DEBUG] worker: created evaluation: worker_id=55dbf143-4263-c2ce-b37f-566c7e3fcecc eval="<Eval \"1f6b20db-9f4e-3c66-50f7-b117074abdbc\" JobID: \"mosquitto\" Namespace: \"default\">"
2023-04-05T12:20:02.006Z [DEBUG] worker.service_sched: failed to place all allocations, blocked eval created: eval_id=71cb85ad-975b-fe96-4cd3-26be41fa2c9e job_id=mosquitto namespace=default worker_id=55dbf143-4263-c2ce-b37f-566c7e3fcecc blocked_eval_id=1f6b20db-9f4e-3c66-50f7-b117074abdbc
2023-04-05T12:20:02.015Z [DEBUG] worker: submitted plan for evaluation: worker_id=55dbf143-4263-c2ce-b37f-566c7e3fcecc eval_id=71cb85ad-975b-fe96-4cd3-26be41fa2c9e
2023-04-05T12:20:02.015Z [DEBUG] worker.service_sched: setting eval status: eval_id=71cb85ad-975b-fe96-4cd3-26be41fa2c9e job_id=mosquitto namespace=default worker_id=55dbf143-4263-c2ce-b37f-566c7e3fcecc status=complete
2023-04-05T12:20:02.015Z [DEBUG] http: request complete: method=GET path=/v1/job/mosquitto/deployment?index=18571 duration=35.859597217s
2023-04-05T12:20:02.019Z [DEBUG] http: request complete: method=GET path=/v1/job/mosquitto/evaluations?index=18573 duration="954.093µs"
2023-04-05T12:20:02.024Z [DEBUG] worker: updated evaluation: worker_id=55dbf143-4263-c2ce-b37f-566c7e3fcecc eval="<Eval \"71cb85ad-975b-fe96-4cd3-26be41fa2c9e\" JobID: \"mosquitto\" Namespace: \"default\">"
2023-04-05T12:20:02.024Z [DEBUG] worker: ack evaluation: worker_id=55dbf143-4263-c2ce-b37f-566c7e3fcecc eval_id=71cb85ad-975b-fe96-4cd3-26be41fa2c9e type=service namespace=default job_id=mosquitto node_id="" triggered_by=job-register
2023-04-05T12:20:02.024Z [DEBUG] http: request complete: method=GET path=/v1/job/mosquitto/summary?index=18572 duration=35.862729065s
2023-04-05T12:20:04.030Z [DEBUG] http: request complete: method=GET path=/v1/job/mosquitto/evaluations?index=18575 duration=1.699689ms
2023-04-05T12:20:07.187Z [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:48994
2023-04-05T12:20:07.606Z [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=3.099814ms

The INFO log from Nomad startup (it takes a very long time to start):

tail -f /var/log/nomad/nomad.log
2023-04-05T16:00:13.495Z [WARN]  nomad.raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
2023-04-05T16:00:13.495Z [INFO]  nomad.raft: entering candidate state: node="Node at 127.0.0.1:4647 [Candidate]" term=3
2023-04-05T16:00:13.518Z [INFO]  nomad.raft: election won: term=3 tally=1
2023-04-05T16:00:13.518Z [INFO]  nomad.raft: entering leader state: leader="Node at 127.0.0.1:4647 [Leader]"
2023-04-05T16:00:13.518Z [INFO]  nomad: cluster leadership acquired
2023-04-05T16:00:13.609Z [INFO]  nomad: eval broker status modified: paused=false
2023-04-05T16:00:13.609Z [INFO]  nomad: blocked evals status modified: paused=false
2023-04-05T16:00:21.936Z [INFO]  client.plugin: starting plugin manager: plugin-type=csi
2023-04-05T16:00:21.936Z [INFO]  client.plugin: starting plugin manager: plugin-type=driver
2023-04-05T16:00:21.936Z [INFO]  client.plugin: starting plugin manager: plugin-type=device


2023-04-05T16:01:11.937Z [WARN]  client.plugin: timeout waiting for plugin manager to be ready: plugin-type=driver
2023-04-05T16:01:11.938Z [INFO]  client: started client: node_id=fbc570b3-1c00-6fcd-5c97-7e67621b5784
2023-04-05T16:01:11.955Z [INFO]  client: node registration complete
2023-04-05T16:01:17.163Z [INFO]  client: node registration complete

I see: 2023-04-05T16:01:11.937Z [WARN] client.plugin: timeout waiting for plugin manager to be ready: plugin-type=driver

I can “talk” to the Docker daemon:

docker --tlsverify -H tcp://127.0.0.1:2376 --tlscacert /etc/ssl/docker/docker-ca.pem --tlscert /etc/ssl/docker/dc1-client-docker.pem --tlskey /etc/ssl/docker/dc1-client-docker.key ps
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES
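Even though the CLI works, I also want to watch what Nomad itself reports about the driver; something like this (debug level is fine on this box) streams the fingerprint lines live:

nomad monitor -log-level=DEBUG | grep -i driver_mgr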

I have two network interfaces, br0 and br1. I don't know if that causes any problems.

I deploy the HashiStack with my own role. It works without problems on my test VMs, but not on my production server.

Help! :ring_buoy: :sob:
Thanks!

Hi,
I tried to dig into this and found something weird:

2023-04-07T06:13:58.750Z [DEBUG] client.driver_mgr.docker.docker_logger.nomad: using TLS client connection to docker: driver=docker @module=docker_logger endpoint=tcp://127.0.0.1:2376 timestamp=2023-04-07T06:13:58.750Z
2023-04-07T06:15:37.880Z [DEBUG] client.driver_mgr.docker: error collecting stats from container: container_id=4268c6e1e78b0a1a6af5921236c2ef3966305d32ad722f48babb0e80b8f8a321 driver=docker error="context canceled"
2023-04-07T06:15:43.377Z [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=/opt/nomad/plugins
2023-04-07T06:15:43.378Z [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=/opt/nomad/plugins
2023-04-07T06:15:43.379Z [ERROR] agent.plugin_loader.docker: failed to list pause containers: plugin_dir=/opt/nomad/plugins error=<nil>
2023-04-07T06:15:43.391Z [ERROR] agent.plugin_loader.docker: failed to list pause containers: plugin_dir=/opt/nomad/plugins error=<nil>
2023-04-07T06:15:43.392Z [INFO]  agent: detected plugin: name=docker type=driver plugin_version=0.1.0
2023-04-07T06:16:08.299Z [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=/opt/nomad/plugins
2023-04-07T06:16:08.300Z [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=/opt/nomad/plugins
2023-04-07T06:16:08.300Z [ERROR] agent.plugin_loader.docker: failed to list pause containers: plugin_dir=/opt/nomad/plugins error=<nil>
2023-04-07T06:16:08.319Z [INFO]  agent: detected plugin: name=docker type=driver plugin_version=0.1.0
2023-04-07T06:16:08.325Z [ERROR] agent.plugin_loader.docker: failed to list pause containers: plugin_dir=/opt/nomad/plugins error=<nil>
2023-04-07T06:17:11.944Z [DEBUG] client.driver_mgr: detected drivers: drivers="map[:[docker] healthy:[exec] undetected:[qemu java raw_exec]]"
2023-04-07T06:21:21.945Z [DEBUG] client.driver_mgr.docker: could not connect to docker daemon: driver=docker endpoint=unix:///var/run/docker.sock error="Get \"http://unix.sock/version\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
2023-04-07T06:21:21.945Z [DEBUG] client.driver_mgr: initial driver fingerprint: driver=docker health=undetected description="Failed to connect to docker daemon"

First line:

2023-04-07T06:13:58.750Z [DEBUG] client.driver_mgr.docker.docker_logger.nomad: using TLS client connection to docker: driver=docker @module=docker_logger endpoint=tcp://127.0.0.1:2376 timestamp=2023-04-07T06:13:58.750Z

And the last:

2023-04-07T06:21:21.945Z [DEBUG] client.driver_mgr.docker: could not connect to docker daemon: driver=docker endpoint=unix:///var/run/docker.sock error="Get \"http://unix.sock/version\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"

Why is Nomad trying to connect to endpoint=unix:///var/run/docker.sock when the TCP endpoint is set in the configuration?

EDIT:
And this:

2023-04-07T06:43:01.799Z [DEBUG] client.driver_mgr.docker: could not connect to docker daemon: driver=docker endpoint=tcp://127.0.0.1:2376
  error=
  | API error (400): Client sent an HTTP request to an HTTPS server.

2023-04-07T06:43:01.799Z [DEBUG] client.driver_mgr: initial driver fingerprint: driver=docker health=undetected description="Failed to connect to docker daemon"
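For what it's worth, that error is exactly what you get when plain HTTP hits the TLS-only port, e.g.:

curl http://127.0.0.1:2376/version
# returns: Client sent an HTTP request to an HTTPS server.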

Thanks

This is the fallback behavior. See the endpoint documentation.

Perhaps the client is falling back to HTTP because you're specifying the wrong cert. For the plugin configuration, you're supposed to provide the server's TLS cert, not the client's.
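If it really is the environment-based fallback kicking in, one thing you could try (just a sketch on my side; the ca.pem/cert.pem/key.pem names under DOCKER_CERT_PATH are what the Docker client conventionally expects, so you may need copies or symlinks) is pointing the standard Docker variables at the TLS endpoint in Nomad's systemd unit:

mkdir -p /etc/systemd/system/nomad.service.d
cat > /etc/systemd/system/nomad.service.d/docker-env.conf <<'EOF'
[Service]
Environment=DOCKER_HOST=tcp://127.0.0.1:2376
Environment=DOCKER_TLS_VERIFY=1
Environment=DOCKER_CERT_PATH=/etc/ssl/docker
EOF
systemctl daemon-reload && systemctl restart nomad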

Hi,

Thanks @macmiranda, but it doesn't work with the server cert either.

plugin "docker" {
    config {
        endpoint = "tcp://127.0.0.1:2376"

        tls {
          cert = "/etc/ssl/docker/dc1-server-docker.pem"
          key  = "/etc/ssl/docker/dc1-server-docker.key"
          ca   = "/etc/ssl/docker/docker-ca.pem"
        }
[...]

It works from the CLI:

root@sandbox:~# docker --tlsverify -H tcp://127.0.0.1:2376 --tlscacert /etc/ssl/docker/docker-ca.pem --tlscert /etc/ssl/docker/dc1-server-docker.pem --tlskey /etc/ssl/docker/dc1-server-docker.key ps
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES

:disappointed_relieved:

Any logs you can share?

2023-04-10T12:03:30.765Z [INFO]  nomad: setting up raft bolt store: no_freelist_sync=false
2023-04-10T12:03:30.776Z [INFO]  nomad.raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:22a35a3e-85b4-5897-ea73-afd244fb8efd Address:127.0.0.1:4647}]"
2023-04-10T12:03:30.777Z [INFO]  nomad.raft: entering follower state: follower="Node at 127.0.0.1:4647 [Follower]" leader-address= leader-id=
2023-04-10T12:03:30.777Z [INFO]  nomad: serf: EventMemberJoin: sandbox.global 127.0.0.1
2023-04-10T12:03:30.777Z [INFO]  nomad: starting scheduling worker(s): num_workers=2 schedulers=["batch", "system", "_core", "service"]
2023-04-10T12:03:30.778Z [WARN]  nomad: serf: Failed to re-join any previously known node
2023-04-10T12:03:30.778Z [INFO]  nomad: started scheduling worker(s): num_workers=2 schedulers=["batch", "system", "_core", "service"]
2023-04-10T12:03:30.778Z [INFO]  nomad: adding server: server="sandbox.global (Addr: 127.0.0.1:4647) (DC: dc1)"
2023-04-10T12:03:30.779Z [WARN]  agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=/opt/nomad/plugins
2023-04-10T12:03:30.779Z [ERROR] agent.plugin_loader.docker: failed to list pause containers: plugin_dir=/opt/nomad/plugins error=<nil>
2023-04-10T12:03:30.780Z [ERROR] agent.plugin_loader.docker: failed to list pause containers: plugin_dir=/opt/nomad/plugins error=<nil>
2023-04-10T12:03:30.782Z [INFO]  agent: detected plugin: name=java type=driver plugin_version=0.1.0
2023-04-10T12:03:30.782Z [INFO]  agent: detected plugin: name=docker type=driver plugin_version=0.1.0
2023-04-10T12:03:30.782Z [INFO]  agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
2023-04-10T12:03:30.782Z [INFO]  agent: detected plugin: name=exec type=driver plugin_version=0.1.0
2023-04-10T12:03:30.782Z [INFO]  agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
2023-04-10T12:03:30.787Z [INFO]  client: using state directory: state_dir=/opt/nomad/client
2023-04-10T12:03:30.787Z [INFO]  client: using alloc directory: alloc_dir=/opt/nomad/alloc
2023-04-10T12:03:30.787Z [INFO]  client: using dynamic ports: min=20000 max=32000 reserved=""
2023-04-10T12:03:30.816Z [INFO]  client.fingerprint_mgr.cgroup: cgroups are available
2023-04-10T12:03:30.870Z [INFO]  client.fingerprint_mgr.consul: consul agent is available
2023-04-10T12:03:30.884Z [INFO]  nomad.vault: successfully renewed token: next_renewal=383h43m38.499969968s
2023-04-10T12:03:30.897Z [WARN]  client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbin/ethtool device=lo
2023-04-10T12:03:30.902Z [WARN]  client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbin/ethtool device=docker0
2023-04-10T12:03:31.035Z [INFO]  client.fingerprint_mgr.vault: Vault is available
2023-04-10T12:03:32.429Z [WARN]  nomad.raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
2023-04-10T12:03:32.430Z [INFO]  nomad.raft: entering candidate state: node="Node at 127.0.0.1:4647 [Candidate]" term=7
2023-04-10T12:03:32.454Z [INFO]  nomad.raft: election won: term=7 tally=1
2023-04-10T12:03:32.454Z [INFO]  nomad.raft: entering leader state: leader="Node at 127.0.0.1:4647 [Leader]"
2023-04-10T12:03:32.454Z [INFO]  nomad: cluster leadership acquired
2023-04-10T12:03:32.571Z [INFO]  nomad: eval broker status modified: paused=false
2023-04-10T12:03:32.571Z [INFO]  nomad: blocked evals status modified: paused=false
2023-04-10T12:03:41.040Z [INFO]  client.plugin: starting plugin manager: plugin-type=csi
2023-04-10T12:03:41.041Z [INFO]  client.plugin: starting plugin manager: plugin-type=driver
2023-04-10T12:03:41.041Z [INFO]  client.plugin: starting plugin manager: plugin-type=device
2023-04-10T12:04:31.043Z [WARN]  client.plugin: timeout waiting for plugin manager to be ready: plugin-type=driver
2023-04-10T12:04:31.050Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=12a9149c-6be5-ef01-152c-c09816036ea3 task=mosquitto type=Received msg="Task received by client" failed=false
2023-04-10T12:04:31.055Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=12a9149c-6be5-ef01-152c-c09816036ea3 task=connect-proxy-mqtt type=Received msg="Task received by client" failed=false
2023-04-10T12:04:31.069Z [INFO]  client: node registration complete
2023-04-10T12:04:39.242Z [INFO]  client: node registration complete

The two task-event lines are from tests with the Unix socket. But now, when I restart with the tcp:// endpoint, it doesn't work and I can't access the Nomad UI.
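To check whether the HTTP API is up at all while the UI is unreachable, I hit the agent health endpoint directly (nothing fancy):

curl -s http://127.0.0.1:4646/v1/agent/health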

I must say I’m starting to doubt the accuracy of the documentation.
If you're using mutual TLS authentication (which you are, based on your ExecStart command), the client should in fact provide its own cert and key (besides the CA cert).
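One way to double-check which cert is which (assuming openssl is available): a client cert should list “TLS Web Client Authentication” under Extended Key Usage, a server cert “TLS Web Server Authentication”:

openssl x509 -in /etc/ssl/docker/dc1-client-docker.pem -noout -text | grep -A1 "Extended Key Usage"
openssl x509 -in /etc/ssl/docker/dc1-server-docker.pem -noout -text | grep -A1 "Extended Key Usage"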

In any case, I just want to understand your motivation for using a TCP socket instead of a Unix one, since they're both running locally. Could you share some insight into that? Is the Nomad client machine shared with other services?

I would like to use a TCP connection to prepare for the next steps.

If it doesn't work locally, why would it work in a cluster?

Thanks

It depends on your architecture. If each Nomad client uses its own local docker daemon, I don’t really see the point of using mutual TLS authentication.

OK.

For the record, it works on arm64.

On amd64, it doesn't…

And my production machine is on amd64 :partying_face: