I am using the Grafana Alloy (formerly Grafana Agent Flow) Docker integration to scrape logs and metrics from running jobs.
This is done using discovery.docker in Alloy. Unfortunately, discovery.nomad targets can't be forwarded to Loki afterwards, so Docker discovery has to be used.
Somehow it gives me only one container target per running Nomad allocation, called nomad_init, and it doesn't pick up any of the other containers.
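Roughly, the relevant part of the setup looks like this (a sketch; component labels and the Loki endpoint are placeholders):

```alloy
// Discover containers through the Docker socket.
discovery.docker "containers" {
  host = "unix:///var/run/docker.sock"
}

// Tail logs of the discovered containers and forward them to Loki.
loki.source.docker "containers" {
  host       = "unix:///var/run/docker.sock"
  targets    = discovery.docker.containers.targets
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "http://loki.example.com/loki/api/v1/push" // placeholder
  }
}
```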
Running docker ps on my instance, I can see the following:
```
ee82ad8133fc   traefik:2.10.4                             "/entrypoint.sh trae…"   22 minutes ago   Up 22 minutes   traefik-a95143b6-2095-e994-1140-68b5d608efe5
2e5ae3207cd9   gcr.io/google_containers/pause-amd64:3.1   "/pause"                 22 minutes ago   Up 22 minutes   nomad_init_a95143b6-2095-e994-1140-68b5d608efe5
```
There is this pause-amd64 container still running under the nomad_init name.
So, two questions:

1. What is this pause container, and why is it still running?
2. Is there anything that could prevent me from reading containers from unix:///var/run/docker.sock with the Nomad Docker driver?
Since the group network.mode is bridge, Nomad creates the pause container to establish a shared network namespace for all tasks in the group. Setting the task-level network_mode to bridge would instead place the task in a different namespace; this prevents, for example, a task from communicating with its sidecar proxy in a service mesh deployment.
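As a sketch (group and task names are hypothetical), the relevant pieces of the job spec look like this:

```hcl
group "web" {
  # Group-level bridge networking: Nomad starts a pause container that
  # owns the shared network namespace for all tasks in the group.
  network {
    mode = "bridge"
  }

  task "traefik" {
    driver = "docker"

    config {
      image = "traefik:2.10.4"
      # Do not also set network_mode = "bridge" here; that would put the
      # task in a different namespace than the pause container.
    }
  }
}
```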
Docker containers do not have access to the /var/run/docker.sock socket on the host by default. Access to the Docker daemon on a host is a security issue; it is equivalent to giving root access. To give a container access to the Docker daemon running on the host, it is typical to mount the socket in with either a mount or a volume.
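With the Nomad Docker driver that would look roughly like this (a sketch; host volume mounts like this have to be allowed on the client via the driver's volumes { enabled = true } plugin option):

```hcl
task "alloy" {
  driver = "docker"

  config {
    image = "grafana/alloy:latest"

    # Bind-mount the host Docker socket into the container.
    volumes = [
      "/var/run/docker.sock:/var/run/docker.sock",
    ]
  }
}
```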
I am running Grafana Alloy as a service directly on the host, so it should have access.
I can call the Docker API myself through the socket and see all available containers, but somehow Alloy doesn't pick them up.
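For example, something like this (assuming a recent curl with --unix-socket support) lists all the running containers just fine:

```
curl --unix-socket /var/run/docker.sock http://localhost/containers/json
```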
Under the hood, Alloy reuses the Prometheus Docker service discovery mechanism.
I have also made sure to start Alloy only after all of the jobs and allocations are running.
If it is running on the host and has permission to access docker.sock, then it should be able to access everything docker.sock has to offer. Check the permissions of the process and the permission bits of docker.sock. I know nothing about Alloy, but if it has access, it has full access.
If there is nothing from the Nomad perspective that can interfere with that (like docker.sock being put in a different location by the driver), I will try to address the issue on the Alloy side.
My latest guess so far is that the issue is not in reading docker.sock; that part works fine.
The issue, though, is that networking is managed by Nomad, not by Docker. So my services are not visible from the Docker perspective, and discovery.docker targets can't really be reached by the metrics and logs scrapers.
So no matter what I try, when running jobs with Nomad and then using discovery.docker (or docker_sd from Prometheus), only the pause containers end up as targets, never the other containers running on the host.
Access is configured, and calling the Docker API on the Docker socket correctly lists all available containers.
I have rewritten the metrics to use discovery.nomad, and that part works correctly.
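The working metrics pipeline is roughly this (a sketch; component labels and the remote write URL are placeholders):

```alloy
// Discover services registered in Nomad (defaults to the local agent).
discovery.nomad "services" { }

// Scrape the discovered services and ship them via remote write.
prometheus.scrape "nomad_services" {
  targets    = discovery.nomad.services.targets
  forward_to = [prometheus.remote_write.default.receiver]
}

prometheus.remote_write "default" {
  endpoint {
    url = "http://mimir.example.com/api/v1/push" // placeholder
  }
}
```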
I am now using discovery.docker only for logs, but that still doesn't work.
So the real cause of the problem turned out to be the one discussed here, and I wasn't that far off in assuming it was related to networking.
Basically, for discovery.docker to work, every container needs to expose ports, even if you are only using it for logs that are available locally.
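In a Nomad job spec that means declaring ports for the task, roughly like this (a sketch with a hypothetical port label and number; the exact combination needed may depend on the network mode and driver version):

```hcl
group "web" {
  network {
    mode = "bridge"

    # Declare the port so it gets exposed; without an exposed port the
    # container does not show up as a discovery.docker target.
    port "http" {
      to = 8080
    }
  }

  task "app" {
    driver = "docker"

    config {
      image = "example/app:latest"  # placeholder image
      ports = ["http"]
    }
  }
}
```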