Nomad cluster for smart home

Dear community,

Having to maintain a bare metal Kubernetes cluster at work, I am looking for an alternative to control my smart home. I would like a solution that lets me focus on the applications themselves, comes with less overhead to maintain and runs locally. Ideally, it works as simply as docker-compose.

In the past, I have been using docker-compose, which will be the basis for the new setup. Compose has served me well with the exception of resiliency (it runs on one server only). Since Docker Swarm does not seem to have much of a future, I have been looking for alternatives. Being new to Nomad, I am hoping that this is the right path.

I started this thread hoping that you can point me in the right direction and validate my assumptions.

Requirements

I would like some resiliency in my services, with two servers handling the requests so that if one goes down, the other can take over its services (short downtimes would be acceptable).

Since it is only a two-person home, the load will be quite low. It is more the reliability I am concerned about.

Some cluster services should be accessible from outside the home while other services should only be accessible from the internal home network.

Since the servers are running a smart home solution, they need access to the local network and its devices. Even access to Zigbee (via a USB stick plugged into one server) and Bluetooth is required.

Persistent storage to run (small) databases and store some data.

Infrastructure as code, version controlled. Ideally, GitOps driven.

All persistent data and databases backed up to Backblaze B2.

Monitoring, centralized logging and alerting.

HTTPS for all services through a reverse proxy.

Selective external access to services and the bare metal servers through some type of VPN.

All secrets are encrypted inside my configurations & git repositories.

Some software I intend to run on the cluster:

  • Home Assistant
  • Nextcloud
  • Node-RED
  • FreshRSS
  • Paperless-ng

How I am planning on achieving the setup

I have 5 Intel NUCs at my disposal, 3 of them configured with additional SSDs. A NAS provides storage over NFS.

From what I understand, the minimum number of machines I need for this setup is 5: three running Nomad, Consul and Vault in server mode and two running Nomad in client mode.

Using cloud-init, I would prepare the 5 hosts with Ubuntu Server 20.04 and give them IP addresses from my local DNS pool (statically assigned). Also, the servers will be controllable through SSH (key provided with cloud-init).

For easy setup, I am planning on using hashi-up. This should give me the base installation with Nomad, Consul and Vault up and running in HA mode. The remaining two machines would be set up as workers (Nomad clients).

The setup would be scripted with Task for simplicity.

So far, I feel this setup is straightforward. Now things become less clear…

How stable are the Nomad Packs? Should I rather go down this path for all the software available and only create my own jobs where needed?

Ingress & reverse proxy: I am thinking of following the guide Load Balancing with Traefik with input from this blog post. Or would Fabio be a simpler solution (I have never used it)? Traefik should be able to provide ssl certificates so that I do not need another piece of software.

DNS: initial setup would be done with Terraform on Cloudflare.
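For reference, a minimal sketch of what that Terraform could look like — the zone ID variable, record name and IP are placeholders, and depending on the Cloudflare provider version the record attribute may be called content instead of value:

terraform {
  required_providers {
    cloudflare = {
      source = "cloudflare/cloudflare"
    }
  }
}

provider "cloudflare" {
  api_token = var.cloudflare_api_token # hypothetical variable holding an API token
}

resource "cloudflare_record" "home" {
  zone_id = var.cloudflare_zone_id # hypothetical variable
  name    = "home"                 # placeholder subdomain
  type    = "A"
  value   = "203.0.113.10"         # placeholder public IP
  proxied = true
}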

Persistent storage: no idea yet. Ideally, I can have a solution that leverages the fast SSDs in my three NUCs and uses my NAS over NFS for slower, larger storage. Since I have no knowledge in this area, I prefer a simple setup.

Monitoring and alerting: Using Prometheus to Monitor Nomad Metrics. Not sure how to monitor my own services though. What configuration would be required?
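From what I read, the Nomad agents need their telemetry block enabled so Prometheus can scrape them — something along these lines in the agent config (the interval is just an example):

telemetry {
  collection_interval        = "5s"
  disable_hostname           = true
  prometheus_metrics         = true
  publish_allocation_metrics = true
  publish_node_metrics       = true
}

My own services would presumably expose their own /metrics endpoints and be scraped separately, e.g. via Consul service discovery.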

Not sure how to set up Grafana though. Any hints?

Logging: Logging on Nomad and log aggregation with Loki

Backups: looks like I would need to build something myself. Thinking of using restic like the gentleman in this repo: hydra/terraform/modules/restic at master · mr-karan/hydra · GitHub. This is what it does: hydra/foss-united-apr-2021.md at master · mr-karan/hydra · GitHub
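One idea would be to schedule restic as a periodic Nomad batch job pushing to B2. A rough sketch of what I have in mind — image, paths, bucket name and credentials are placeholders, it assumes the repository has already been initialised with restic init, and mounting a host path this way assumes the Docker driver has volumes enabled (or a host_volume is used instead):

job "restic-backup" {
  datacenters = ["dc1"]
  type        = "batch"

  periodic {
    cron             = "0 3 * * *" # every night at 03:00
    prohibit_overlap = true
  }

  group "backup" {
    task "restic" {
      driver = "docker"

      config {
        image   = "restic/restic:latest"
        command = "backup"
        args    = ["/data"]
        volumes = ["/srv/appdata:/data"] # placeholder host path with the app data
      }

      env {
        RESTIC_REPOSITORY = "b2:my-backup-bucket:nomad" # placeholder bucket
        RESTIC_PASSWORD   = "change-me"                 # better: template this in from Vault
        B2_ACCOUNT_ID     = "placeholder"
        B2_ACCOUNT_KEY    = "placeholder"
      }
    }
  }
}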

VPN: I am thinking of using Tailscale or Zerotier to connect securely to some services on my home cluster by running two proxy instances like the gentleman over here: hydra/foss-united-apr-2021.md at master · mr-karan/hydra · GitHub. I am not sure about the details though since networking is not my strong suit. Should be interesting…

USB: it looks like Nomad provides this access through a device plugin that is still in beta. No idea how to hook into Bluetooth, though.

For Kubernetes I found a great resource for setting up a cluster for home usage: https://k8s-at-home.com/. Is there anything like it for Nomad and the HashiCorp stack?

I am looking forward to your ideas and inputs.

4 Likes

The closest setup I found for what I am after is the one by perrymanuk. At first glance, this repo provides a lot of features to get me going in the right direction. Excited to give it a shot!

I believe they came out recently; however, since it’s essentially a templating engine, I don’t imagine you’d run into many issues with bugs.

Grafana should be able to run in-cluster - it just points to a data source like Prometheus.

I would personally look into WireGuard on the hosts - if you can set up SSH, you may find it’s a similarly simple and elegant solution.

For my home lab I have been using cloud-init as well to set up Nomad - here is my dead simple setup:

#cloud-config
system_info:
  default_user:
    name: myuseraccount
    sudo: ALL=(ALL) NOPASSWD:ALL
    groups: [docker]
groups:
  - docker
apt:
  sources:
    docker.list:
      source: deb [arch=amd64] https://download.docker.com/linux/ubuntu $RELEASE stable
      keyid: 9DC858229FC7DD38854AE2D88D81803C0EBFCD88
    hashicorp.list:
      source: deb [arch=amd64] https://apt.releases.hashicorp.com $RELEASE main
      keyid: E8A032E094D8EB4EA189D270DA418C88A3219F7B
package_update: true
package_upgrade: true
packages:
  - docker-ce
  - docker-ce-cli
  - consul
  - nomad
write_files:
- content: |
    datacenter = "dc1"
    data_dir = "/opt/nomad"
  path: /etc/nomad.d/nomad.hcl
- content: |
    client {
      enabled = true
    }
  path: /etc/nomad.d/client.hcl
- content: |
    server {
      enabled = true
      bootstrap_expect = 3
      server_join {
        retry_join = ["someipgoeshere:4648"]
      }
    }
    acl {
      enabled = true
    }
  path: /etc/nomad.d/server.hcl
- content: |
    [Unit]
    Description=Nomad
    Documentation=https://www.nomadproject.io/docs/
    Wants=network-online.target
    After=network-online.target

    # When using Nomad with Consul it is not necessary to start Consul first. These
    # lines start Consul before Nomad as an optimization to avoid Nomad logging
    # that Consul is unavailable at startup.
    #Wants=consul.service
    #After=consul.service

    [Service]

    # Nomad server should be run as the nomad user. Nomad clients
    # should be run as root
    User=root
    Group=root

    ExecReload=/bin/kill -HUP $MAINPID
    ExecStart=/usr/bin/nomad agent -config /etc/nomad.d
    KillMode=process
    KillSignal=SIGINT
    LimitNOFILE=65536
    LimitNPROC=infinity
    Restart=on-failure
    RestartSec=2

    ## Configure unit start rate limiting. Units which are started more than
    ## *burst* times within an *interval* time span are not permitted to start any
    ## more. Use `StartLimitIntervalSec` or `StartLimitInterval` (depending on
    ## systemd version) to configure the checking interval and `StartLimitBurst`
    ## to configure how many starts per interval are allowed. The values in the
    ## commented lines are defaults.

    # StartLimitBurst = 5

    ## StartLimitIntervalSec is used for systemd versions >= 230
    # StartLimitIntervalSec = 10s

    ## StartLimitInterval is used for systemd versions < 230
    # StartLimitInterval = 10s

    TasksMax=infinity
    OOMScoreAdjust=-1000

    [Install]
    WantedBy=multi-user.target
  path: /etc/systemd/system/nomad.service
runcmd:
  - [systemctl, enable, nomad]
  - [systemctl, start, nomad]
  - [systemctl, enable, docker]
  - [systemctl, start, docker]

This obviously doesn’t have the Vault / Consul features, but it can build a quick cluster in Multipass.

Hope this helps! Sorry I don’t have all the answers myself as I just started with Nomad.

Hi @davosian1,

With at least a few of these I might be able to help. A bit of quick context: I’m also a Nomad newcomer, and I’m also currently migrating a docker-compose based setup to Nomad.

It really doesn’t. :wink:
One point to look out for when coming from docker-compose: at least as far as I’m aware, Nomad doesn’t have any equivalent to docker-compose’s up and down commands. While Nomad can of course stop and start entire jobs, there is no concept of bringing down all the jobs you’ve currently got running, or of bringing up all the jobs in e.g. a certain directory. So you may need a bit of scripting if you need that.

One other thing missing from Nomad compared to docker-compose is service dependencies, in particular startup/shutdown dependencies similar to docker-compose’s depends_on config option. Instead, the approach is to simply restart a container until it stays up.

I think this may be problematic from a “getting the device to show up in your Docker container” point of view. But you said that you’ve already got it all running with docker-compose? If so, this might not actually be a problem?
The second thing which may seem problematic is how to tell jobs that need the Zigbee stick which nodes they are supposed to run on. That’s not a problem: Nomad supports setting arbitrary metadata in the Nomad client config, which can be used as a constraint on the job to decide where it’s allowed to run. See the client config.
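To illustrate the metadata idea with a minimal sketch (the meta key and job name are made up):

# Nomad client config on the node with the Zigbee stick
client {
  enabled = true

  meta {
    zigbee = "true"
  }
}

# In the job that needs the stick
job "zigbee2mqtt" {
  datacenters = ["dc1"]

  constraint {
    attribute = "${meta.zigbee}"
    value     = "true"
  }

  # ... groups and tasks ...
}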

This is doable in a number of ways. The simplest way is host volumes. These are just directories on the host which are configured as potential mounts in the Nomad client config and can then be used by any Nomad job. These can for example be NFS shares mounted on your hosts.
You can also go a lot fancier, as Nomad also supports the Container Storage Interface (CSI) specification, allowing use of Ceph and the like. At least for Ceph, I can confirm that the official Ceph CSI plugin works with Nomad and is stable. The configuration and handling of volumes was not at all what I expected, though, but that might be down to the Ceph CSI plugin rather than Nomad.
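Sticking with the simpler host volume route, a minimal sketch (volume name and paths are made up):

# Nomad client config
client {
  host_volume "nas-media" {
    path      = "/mnt/nas/media" # e.g. an NFS mount on the host
    read_only = false
  }
}

# In the job file
group "app" {
  volume "media" {
    type      = "host"
    source    = "nas-media"
    read_only = false
  }

  task "app" {
    driver = "docker"

    volume_mount {
      volume      = "media"
      destination = "/data"
    }

    # ... rest of the task ...
  }
}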

A GitOps-driven workflow should be possible. One of the idiomatic ways to get configuration files into Nomad jobs is via the artifact stanza, which also supports checking out a git repository and making its contents available to the job’s tasks.
Very important point: you cannot simply access random paths on the hosts running your Nomad clients. By default, you only have access to the allocation directory and to what you can download via the aforementioned artifact stanza. This is currently my biggest headache. For more details, have a look at my post on the matter here.
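For the artifact route, a rough sketch (repository URL and image are hypothetical; private repos would additionally need credentials):

task "freshrss" {
  driver = "docker"

  # Check out a config repo into the task's local/ directory before the task starts
  artifact {
    source      = "git::https://github.com/example/home-cluster-config"
    destination = "local/config"
  }

  config {
    image = "freshrss/freshrss:latest"
  }
}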

Short note: you only need one server of each, not three, unless you really want HA mode.
For a setup like this, I also wouldn’t “sacrifice” three dedicated machines just for the servers. I don’t think they produce that much load. While running Nomad servers and clients on the same machine is advised against and probably needs some careful configuration concerning port usage, why not run VMs on the machines and then co-locate server and client VMs? Dedicating three physical machines to the servers seems like a waste.

I’ve got Grafana, Loki, Fluentd and Prometheus running on my old docker-compose stack and plan to migrate them to Nomad. I don’t see a problem with that.
On the logging, I can already comment as I just finished my setup for that.

It looks as follows:

  • On each Nomad client, configure the Docker task driver in the client config with the Fluentd logging driver (see the config options here, and the sketch further below). This configures all Docker containers started by Nomad to use the Fluentd driver.
  • Set up a fluentbit job as a system job running on each node, listening on the port configured for the logging driver in the previous point.
  • Set up a local host_volume on each Nomad node, which gets mounted into the fluentbit job and into all other jobs which need to write into log files instead of writing to stdout.
  • In my setup, fluentbit doesn’t do much more than collecting the logs and forwarding them to a fluentd instance, which does the heavy lifting of parsing the log lines.
  • Depending on taste, you can now either forward the logs to Loki directly from the fluentbit jobs, or have a central fluentd job as an aggregator and log sorter.

Pro tip: manually set the local Docker logging driver for the fluentbit job - otherwise, you may end up in a reinforcing loop of recurring log lines which will blow up your Docker daemon’s memory. Ask me how I know :wink:
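A rough sketch of what this looks like per task (address, tag and images are placeholders); the same logging settings can also be set as a default on the client side, which is what the first bullet above describes:

task "myapp" {
  driver = "docker"

  config {
    image = "ghcr.io/example/myapp:latest" # placeholder image

    # Ship container stdout/stderr to the local fluentbit listener
    logging {
      type = "fluentd"
      config {
        fluentd-address = "localhost:24224" # must match the port the fluentbit job listens on
        tag             = "myapp"
      }
    }
  }
}

# And, per the pro tip, in the fluentbit job itself use a local log driver
# to avoid the feedback loop:
task "fluent-bit" {
  driver = "docker"

  config {
    image = "fluent/fluent-bit:latest"

    logging {
      type = "local"
    }
  }
}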

I’ve personally always felt uncomfortable with committing secrets into git - even encrypted. And you wrote previously that you already plan to set up Vault, so why not use that? I’ve got Vault running, and while it takes some setting up with tokens, logins, auth methods and so forth, it really does pay off. The integration between Nomad and Vault is really good.
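For what it’s worth, the Nomad side of the Vault integration can be as small as this sketch (policy name, image and secret path are made up, assuming a KV v2 engine):

task "app" {
  driver = "docker"

  config {
    image = "ghcr.io/example/app:latest" # placeholder image
  }

  vault {
    policies = ["smart-home"] # hypothetical Vault policy granting access to the secret below
  }

  # Render the secret into an env file that gets exported to the task
  template {
    data        = <<-EOT
      MQTT_PASSWORD={{ with secret "secret/data/mqtt" }}{{ .Data.data.password }}{{ end }}
    EOT
    destination = "secrets/app.env"
    env         = true
  }
}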

One point on storing secrets in your repo encrypted: You will have to think about decrypting them at some point. You’ve got several ways to solve that, depending on how you plan to handle your config files.
One potential way, if you want to do the decrypting with/inside Nomad jobs: The lifecycle stanza allows you to run tasks (containers) as “init tasks”. Those can mount a shared directory, download your config repo, decrypt the secrets on the shared mount, and then exit. Afterwards, the actual application can access the decrypted secrets.
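A bare-bones sketch of that pattern (repository URL, images and commands are placeholders; the actual decryption step, e.g. with sops or gpg, is left out):

group "app" {
  # Init-style task: runs before the main task and then exits
  task "prepare-config" {
    lifecycle {
      hook    = "prestart"
      sidecar = false
    }

    driver = "docker"

    # Download the (hypothetical) config repo into the task's local/ dir
    artifact {
      source      = "git::https://github.com/example/home-cluster-config"
      destination = "local/config"
    }

    config {
      image   = "alpine:3.15"
      command = "/bin/sh"
      # Placeholder: decrypt the checked-out config into the shared alloc/ directory
      args    = ["-c", "cp -r /local/config /alloc/config"]
    }
  }

  task "app" {
    driver = "docker"

    config {
      image = "ghcr.io/example/app:latest" # placeholder image
    }
  }
}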

As far as I understand (but perhaps something the Nomad team is better suited to comment on?) Nomad Pack is currently still considered a tech demo.

I’m currently in the process of setting up Traefik, where in my current production setup I’m running Nginx. Right now, it works well. I’ve got it running with a Let’s Encrypt cert and using the Consul catalog provider. This means I don’t have static configs for sites anymore. Instead, my Nomad jobs register themselves with Consul, and Traefik in turn periodically contacts Consul for a list of services. And because Consul only lists healthy services, this ought to also cover automatic failover, but I haven’t tried that yet.
Caution: Setting up Traefik is my current project - so I might simply not have hit the real roadblocks yet :wink:
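To give an idea of what that registration looks like from the job side, a sketch with made-up hostname, entrypoint and certresolver names (they have to match whatever the Traefik static config defines):

group "freshrss" {
  network {
    port "http" {
      to = 80
    }
  }

  service {
    name = "freshrss"
    port = "http"

    tags = [
      "traefik.enable=true",
      "traefik.http.routers.freshrss.rule=Host(`rss.example.com`)",
      "traefik.http.routers.freshrss.entrypoints=websecure",
      "traefik.http.routers.freshrss.tls.certresolver=letsencrypt",
    ]

    check {
      type     = "http"
      path     = "/"
      interval = "10s"
      timeout  = "2s"
    }
  }

  # ... tasks ...
}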

I’d use Grafana as a Prometheus frontend for visualization. The Prometheus + Grafana setup has served me well for both host and service metrics gathering. One nice point about using Prometheus: like Traefik, it supports service discovery via Consul. So it might be able to auto-detect your running apps if they all register themselves as services in Consul.
I can’t say how well this works, as I’ve not yet migrated my monitoring stack to Nomad, but at least for the base Prometheus + Grafana setup I don’t see any reason why it shouldn’t work.

And I think that completes my novel on Nomad :sweat_smile:

2 Likes

Thanks for sharing your thoughts on this, @brettpatricklarson :raised_hands:

Since I am new to Nomad, I might give it a go (at least for inspiration), as I most likely would not know any better myself :wink:

I was under the impression that it needs some storage, which complicates things, but once I have figured out how to go about it, installing Grafana should be no big hurdle.

Tailscale is based on WireGuard. The advantage would be an easier setup without messing with complicated firewall configs.

Thanks a lot for providing your setup. Will draw inspiration from it :blush:

Multipass: something I use for prototyping as well, especially because I am on an M1-based Mac and Vagrant is not well supported due to the lack of VirtualBox support for ARM systems.

Hi @mmeier86,

Thanks a lot for your novel. You should publish it :sweat_smile:

I am glad I can learn from your experience!

Interesting. Different philosophy, more like Kubernetes, I guess. However, since my workloads should run more or less all the time, I might be ok with it. Definitely something to keep in mind for deciding what goes in a job. I would probably use separate databases for different jobs to keep things independent of each other.

Having docker-compose tasks depend on each other: in Kubernetes there is the concept of readiness probes, in other words a pod can check whether another one is ready. Taking this to Nomad, I would be ok with containers restarting until another one is ready. Sounds a bit weird, but it is a solid concept.

I think I have figured it out for my use case. I really need two things: host network mode and Zigbee. The first is available in Nomad afaik, and for the latter I could use a Pi connected to my Zigbee controller which passes the data on to the cluster through zigbee2mqtt. No need for Bluetooth at the moment.

This would also allow me not to tie any jobs to a certain node giving me more resiliency.
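For the host network mode part, a minimal sketch of what that looks like with the Docker driver (image tag just as an example):

task "home-assistant" {
  driver = "docker"

  config {
    image        = "ghcr.io/home-assistant/home-assistant:stable"
    network_mode = "host" # gives the container direct access to the home LAN for discovery etc.
  }
}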

I will probably explore host volumes with NFS and a GlusterFS CSI based setup (see my next post - I have worked on this already). Ceph feels overly complex to me and is probably overkill. I simply would like to take advantage of the fast SSD storage in my NUCs since connecting to the NFS on my NAS is certainly way slower.

The setup you described, can it also sync the state of the git repo with the state in the cluster (both ways), like it can be done in Kubernetes with the help of controllers (e.g. Argo CD)? If not, it would not be the most important thing. Just trying to understand what I get with Nomad and where its limitations are. Since I am aiming for a less complex setup than Kubernetes, I am ok with a few trade-offs.

Thanks for the heads-up. I hope I find alternatives :blush:

Since my setup is not really a lab, I am striving for an HA setup so that one server can go down without taking everything with it. The setup will power our house and therefore the wife acceptance factor is critical. This is also one of the main reasons I am moving away from docker-compose. I am aware that my servers will be bored, but this is the only way I can think of to get the resiliency. Is it possible to run Nomad with HA while combining roles? In other words, can I run the agent in both server and client mode across 3 machines to get HA while not needing additional machines for workers? Then I could reduce my setup from 5 to 3 machines.
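For reference, what I have in mind would be an agent config along these lines on each of the three machines (a sketch only; datacenter, paths and the join address are placeholders):

# Hypothetical /etc/nomad.d/nomad.hcl for a combined server + client agent
datacenter = "dc1"
data_dir   = "/opt/nomad"

server {
  enabled          = true
  bootstrap_expect = 3

  server_join {
    retry_join = ["10.0.0.1:4648"] # placeholder address of one of the three machines
  }
}

client {
  enabled = true
}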

Nice! Now I know who to ask if I get stuck :wink: Thanks for outlining the details.

I probably will. I am just a newbie with Vault as well, so there will be some additional learning waiting for me :sweat_smile:

Glad to hear - will go down this path as well.

Ok, I started going down this rabbit hole and decided to document my journey: GitHub - davosian/home-cluster-v2: Cluster running apps and services for my smarthome

I decided on using a public cloud for initial testing, for one because others can easily replicate my setup and also because I am not ready to tear down my physical servers yet until I have a clear picture of where to go.

This is what I have accomplished so far:

  • Provision hardware
  • Provide storage
  • Install nomad, consul and vault
  • VPN access to the cluster
  • Initial configuration for Vault

I am currently working on the ingress setup with Traefik as reverse proxy and load balancer for my workloads. Then I will put a cloud load balancer in front to enable certificate handling (using Cloudflare for DNS).

Keeping my fingers crossed :crossed_fingers:

3 Likes

@davosian1 Just following up. How is your homelab going so far? I just discovered this thread but I’ve been running a hashi homelab for about 2 years on ARM with UPS. It’s been working great.

This is great to hear, @mister2d! I have my cluster running on Hetzner Cloud for the time being and am currently setting up an observability stack with Prometheus and friends. After this, I consider the cluster base setup as complete and will try to get my first real workloads into the cluster.

I have even been thinking about running Home Assistant in the cloud cluster and connecting it to my home network through Zerotier. I am not sure whether this will work well since the communication through the VPN will be a challenge for some integrations, but most of mine use MQTT and zigbee2mqtt, so I will only have a few bridges running in-house. The beauty of this would be the advantages a “private cloud” gives me, like low maintenance, scaling, backups etc. As a fallback plan, I would move the cluster onto my local hardware.

Do you have your repo published on GitHub for inspiration? I try to keep up with documenting my setup at the link in this thread.

2 Likes

Ah, great to hear. I do not have my repos published publicly. They’re currently housed in a self-hosted GitLab instance. I suppose it is time to also publish them to GitHub. I’ll get on that and keep you posted on this thread.

I used to run my Home Assistant stack on a VPS provider connected to a VLAN in my home network. With WireGuard that worked great. On the VPS side I put the WireGuard interface in the trusted firewalld zone and configured my home router as the other site’s endpoint.

I’ve since migrated Home Assistant in-house and added local object detection for my security cameras (Home Assistant, MQTT, Frigate, Deepstack, and Compreface). All of it scheduled with Nomad! :muscle:t5: Love this software.

1 Like

If you’re looking for another home lab for reference, I have most of my repos published publicly here: Home Lab · GitLab (some are hidden, but feel free to send me a DM if you have any questions about repos that may be hidden, as it’s likely that I do have something that you are looking for :wink: )

There are other services than just home automation there too so feel free to have a poke around.

I seem to have similar requirements in regards to having a node go down without taking all the services with it (as that is exactly what happened with my last Proxmox setup before moving to Nomad). For the most part, for services that don’t have an HA mode, I have CSI volumes provisioned using the democratic-csi package, allowing the job to run on different nodes.

There should also be a repo for a Traefik ingress which is hooked up with Consul Connect and Let’s Encrypt certs with automatic service discovery, which might be useful for some (from memory it’s the internal-proxy repo).

There was a mention of the USB plugin which is still in beta… I’ve been running it for quite some time to pass a ConBee 2 stick into deCONZ. I still need to get around to updating it a bit and hopefully getting it out of beta - I’ve not been keeping up with that one :sweat_smile:. There was also a mention of a Bluetooth plugin, which I’ve been looking into potentially making as I wanted to look into room occupancy for Home Assistant :slightly_smiling_face:, so if you’re still in need of that as a plugin, let me know and I can see if I can move it up on my long list of projects :smiley:

1 Like