I am looking for a way to monitor tasks. What I really want is to raise an alarm whenever a container restarts. My stack is Prometheus + Alertmanager and I have everything integrated, but I can’t find the metrics/labels that tell me when a service was restarted. Before, I was monitoring the services using systemctl, but now it’s all containerized and orchestrated by Nomad. Any ideas?
Similar topic without response (How to monitor restarts with nomad_client_alloc_restarts)
This is interesting, I’m also interested in this use case. The point is that if the Docker container restarts on its own, I don’t think this creates a new allocation, so from Nomad’s viewpoint nothing has changed (if I’m not wrong).
The only way to get this info would be to extract it from Docker. This might be possible with a cAdvisor job, using the container_start_time_seconds metric to fire an alert when the container’s uptime is 0 or under 1 minute, let’s say.
I don’t know if there is a better way to achieve this, but I think this would do the trick. I will test it myself when I have some time.
By the way, I have just set up cAdvisor as a system job and I am successfully sending metrics to Grafana Cloud to monitor when the Docker containers are restarted. So this would be a suitable solution.
If anyone has a better solution, please let me know. I might open an issue asking to export these metrics in the next version of Nomad’s native telemetry.
I stopped the apache2 container (not the job), so Nomad noticed that the apache2 job was running but there was no container actually running, and it launched another container. That’s why it says 0 days for apache2; the previous value was 6 days, the same as the apache job. (This is my home lab with a Raspberry Pi.)
I just wanted to point out that the container_start_time_seconds metric stores the time the container started as a Unix epoch timestamp, not a duration.
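Since the metric is an epoch timestamp, the "under 1 minute" idea has to be written as a difference against the current time. A minimal sketch of a Prometheus alerting rule along those lines is below. The group name, alert name, severity label and the `name!=""` filter (meant to keep only named Docker containers) are my assumptions, not something tested in this thread, and the 60-second window only works if it is longer than your scrape and rule evaluation intervals.

```yaml
groups:
  - name: container-restarts          # hypothetical group name
    rules:
      - alert: ContainerRecentlyStarted   # hypothetical alert name
        # time() is "now" in epoch seconds, so the difference is the
        # container's uptime in seconds. Fires while uptime is under 60s.
        # The name!="" matcher is an assumption to skip unnamed cgroup series.
        expr: time() - container_start_time_seconds{name!=""} < 60
        labels:
          severity: warning           # assumed label, adapt to your Alertmanager routes
        annotations:
          summary: "Container {{ $labels.name }} started less than a minute ago"
```

Note that this also fires on a container's very first start, not only on restarts, so it may be noisy right after a deployment.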
This is the dashboard I’m using: Cadvisor exporter | Grafana Labs
I want to share how I send an alert when a container has restarted. I used the following PromQL: last_over_time(nomad_client_allocs_restart[5m]) > 0
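For anyone wiring this into Alertmanager, here is a rough sketch of how that expression could sit in a Prometheus rule file. Only the expression comes from the post above; the group name, alert name, severity label and summary text are placeholders I made up.

```yaml
groups:
  - name: nomad-restarts              # hypothetical group name
    rules:
      - alert: NomadTaskRestarted     # hypothetical alert name
        # Expression taken from the post above: fires when the most recent
        # sample in the last 5 minutes is greater than 0.
        expr: last_over_time(nomad_client_allocs_restart[5m]) > 0
        labels:
          severity: warning           # assumed label, adapt to your Alertmanager routes
        annotations:
          summary: "A Nomad allocation reported a task restart in the last 5 minutes"
```

The summary is kept generic on purpose, since the labels attached to this metric may differ between setups.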