I don't understand networking between services

Hello everyone,
I’m feeling really dumb, but I can’t make a simple scenario work. I spent hours in the docs and on the web trying to find a solution, but I give up!

I’m testing using a very simple scenario:
I want several WordPress Docker containers connecting to a single MySQL database. (Later I’ll integrate Traefik, but let’s not get ahead of ourselves.)

I don’t understand how to make those two independent jobs communicate with each other. I’m sure I’m getting confused by the different networking layers.

I’m running the following on macOS with consul agent -dev and nomad agent -dev.

Is sidecar the right tool for this?

Here are my 2 jobs:
Mysql:

job "mysql" {

    datacenters = ["dc1"]

    group "leader" {
        network {
          # Request a dynamic host port mapped to container port 3306
          port "mysql" { to = 3306 }
        }
        task "mysql" {

            driver = "docker"

            config {
                image = "mysql"
                ports = ["mysql"]
            }
            env {
                MYSQL_ROOT_PASSWORD = "root"
                MYSQL_DATABASE = "wordpress"
                MYSQL_USER = "wpuser"
                MYSQL_PASSWORD = "wppass"
            }
            service {
                name = "mysql"
                tags = ["global"]
                port = "mysql"

                check {
                    name = "mysql ping"
                    type = "tcp"
                    interval = "30s"
                    timeout = "2s"
                }
            }    
            resources {
                cpu    = 500 # MHz
                memory = 512 # MB
            }
        }
    }
}

Here is Wordpress:

job "wordpress" {

    datacenters = ["dc1"]


    group "wordpress" {

        network {
            port "http" { to = 80}
        }

        task "wordpress" {

            driver = "docker"

            config {
                image = "wordpress"
                ports = ["http"]
            }

            env {
                WORDPRESS_DB_HOST = "127.0.0.1:22100"
                WORDPRESS_DB_NAME = "wordpress"
                WORDPRESS_DB_USER = "wpuser"
                WORDPRESS_DB_PASSWORD = "wppass"
            }

            service {
                name = "wordpress"
                port = "http"

                check {
                    name     = "500 error check"
                    type     = "http"
                    protocol = "http"
                    path     = "/"
                    interval = "30s"
                    timeout  = "2s"
                }
            }

            resources {
                cpu    = 500 # MHz
                memory = 256 # MB
            }

        }
    }
}

Question #1: why is MySQL unreachable from the WordPress job, even with the address hardcoded?
I know the DB works, as I’m able to access it from my host at 127.0.0.1:22100.

Question #2: how do I make this dynamic, so that the MySQL port isn’t hardcoded?

I’ve set up DNS resolution according to the docs:

dig @localhost -p 8600 mysql.service.consul. SRV

; <<>> DiG 9.10.6 <<>> @localhost -p 8600 mysql.service.consul. SRV
; (2 servers found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 13424
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 3
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;mysql.service.consul.		IN	SRV

;; ANSWER SECTION:
mysql.service.consul.	0	IN	SRV	1 1 22100 7f000001.addr.dc1.consul.

;; ADDITIONAL SECTION:
7f000001.addr.dc1.consul. 0	IN	A	127.0.0.1
Patricks-MacBook-Pro.local.node.dc1.consul. 0 IN TXT "consul-network-segment="

;; Query time: 0 msec
;; SERVER: 127.0.0.1#8600(127.0.0.1)
;; WHEN: Tue May 18 11:27:46 EDT 2021
;; MSG SIZE  rcvd: 177

If you could kindly point me in the right direction on how to go about this, it’d be much appreciated!

Thank you


I was in the same situation 2 days ago :slight_smile:! So check the following guide.

I think you may need Linux, as mentioned in the docs: “Note: Nomad’s Connect integration requires Linux network namespaces. Nomad Connect will not run on Windows or macOS.”

There is an example that shows you how to configure the upstream:

job "countdash" {
  datacenters = ["dc1"]

  group "api" {
    network {
      mode = "bridge"
    }

    service {
      name = "count-api"
      port = "9001"

      connect {
        sidecar_service {}
      }
    }

    task "web" {
      driver = "docker"

      config {
        image = "hashicorpnomad/counter-api:v3"
      }
    }
  }

  group "dashboard" {
    network {
      mode = "bridge"

      port "http" {
        static = 9002
        to     = 9002
      }
    }

    service {
      name = "count-dashboard"
      port = "9002"

      connect {
        sidecar_service {
          proxy {
            upstreams {
              destination_name = "count-api"
              local_bind_port  = 8080
            }
          }
        }
      }
    }

    task "dashboard" {
      driver = "docker"

      env {
        COUNTING_SERVICE_URL = "http://${NOMAD_UPSTREAM_ADDR_count_api}"
      }

      config {
        image = "hashicorpnomad/counter-dashboard:v3"
      }
    }
  }
}

For an in-depth explanation:

Hi @patrick-leb :wave:

Sorry to hear you got stuck with Nomad. The networking part is done a little differently than people are probably used to, so it’s common to trip on some gotchas.

I will give you the short answer first to unblock you.


TL;DR

Since you are on macOS, the first thing you will need to do is start the Nomad and Consul agents bound to the right network interface. Check this link for more details: Frequently Asked Questions | Nomad | HashiCorp Developer.

Next, you will need to query Consul to retrieve the right IP and port for the mysql service. There are a few ways to do that, but using consul-template is probably the easiest, so update your wordpress.nomad job like this (remove the +/-; I added those just for the syntax highlighting):

job "wordpress" {
  # ...
  group "wordpress" {
    # ...
    task "wordpress" {
      # ...
      env {
-       WORDPRESS_DB_HOST     = "127.0.0.1:22100"
        WORDPRESS_DB_NAME     = "wordpress"
        WORDPRESS_DB_USER     = "wpuser"
        WORDPRESS_DB_PASSWORD = "wppass"
      }

+     template {
+       data        = <<EOF
+{{ range service "mysql" }}
+WORDPRESS_DB_HOST = "{{ .Address }}:{{ .Port }}"
+{{ end }}
+EOF
+       destination = "local/env"
+       env         = true
      }
  # ...
}

This will render a file with the address of the mysql service filled in. The file will then be loaded as environment variables.

This is needed because 127.0.0.1 inside your container is different from 127.0.0.1 outside of it, so you will need to use your computer’s IP.

That’s what I did above :slightly_smiling_face: By using Consul’s service discovery mechanism, you can dynamically render the right IP and port to reach your service.


Long answer

Nomad doesn’t actually do much in terms of networking. It’s a scheduler, so it will schedule your workload and assign ports (statically or dynamically) that you can use to access them using that host’s IP.
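
For example, the difference between the two port modes in a network block looks like this (a minimal sketch; the port labels are arbitrary):

network {
  # static: Nomad reserves exactly this host port on the node
  port "http" { static = 8080 }

  # dynamic: Nomad picks a random host port (like your 22100) and maps it
  # to port 3306 inside the container
  port "db" { to = 3306 }
}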

All tasks in the same group are guaranteed to be scheduled on the same host (that’s what an allocation is). Since they are running on the same host, they share the same local network, and Nomad can place them in the same network namespace. That’s why you can access one task from another using localhost, 127.0.0.1, or runtime environment variables.
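
To make that concrete, here’s a minimal sketch (the task names and images are just for illustration, and bridge mode requires Linux):

group "app" {
  network {
    mode = "bridge"   # tasks in this group share one network namespace
  }

  task "web" {
    driver = "docker"

    config {
      image = "nginx"   # placeholder app
    }

    env {
      # the sibling "cache" task below is reachable over loopback
      CACHE_ADDR = "127.0.0.1:6379"
    }
  }

  task "cache" {
    driver = "docker"

    config {
      image = "redis"   # listens on 6379 inside the shared namespace
    }
  }
}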

But for tasks in different groups (including tasks in different jobs), there’s no way to create this shared namespace, so you need some kind of catalog to store and query information about these dynamic <IP>:<port> assignments. That’s what Consul’s service discovery is used for.

When you create a service block in your job, Nomad automatically registers the <IP>:<port> information in Consul. You can then query it later using Consul’s DNS interface, Consul’s API, or the template block in a Nomad job (which I used in the example above).

Quick recap: Nomad uses direct <IP>:<port> assignment to expose tasks. You need a way to record all of these assignments, and that’s what Consul does.

To add a bit more to the confusion, there’s how nomad agent -dev and Docker Desktop work.

nomad agent -dev will bind to 127.0.0.1 by default, meaning that it will only be available on the host’s loopback network.

Docker Desktop runs a Linux VM to start your containers. It exposes the VM’s network to your host’s network, but the VM won’t be able to reach your host’s loopback addresses. Binding Nomad to 0.0.0.0 allows your Docker workloads to talk to each other (0.0.0.0 is the default when not using -dev).

The next CLI flag that you need is -network-interface. This tells Nomad which network interface to use when assigning ports to allocations, and it also determines the <IP> portion when registering services.

In -dev mode, Nomad will use the loopback interface for this but, as we’ve seen, that causes problems with Docker Desktop because services get registered with the IP 127.0.0.1. When another task reads this information from Consul, it won’t be able to actually connect to 127.0.0.1, since it’s in a different network namespace altogether (inside the container network). By using a non-loopback interface, we avoid this problem.
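
For reference, the non-dev equivalents can also live in the agent configuration file instead of CLI flags. A sketch (assuming en0 is your Mac’s primary interface; check ifconfig for yours):

# agent.hcl
bind_addr = "0.0.0.0"   # listen on all interfaces instead of just loopback

client {
  # interface whose IP is used for port assignment and service registration
  network_interface = "en0"
}

You’d then start the agent with nomad agent -config agent.hcl.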

Service discovery is just one way to handle this. Another one is what @Clivern mentioned, which is Consul Connect.

This mode requires a Linux host, so it won’t work for you, but the idea behind it is that Nomad will automatically deploy a sidecar proxy alongside your tasks. Consul will then automatically configure these proxies so they are able to communicate with each other.

So your mysql task will have one proxy, and your wordpress task will have another proxy that is pre-configured to reach the mysql proxy. Since the proxies are in the same allocation as your tasks, your app can access its proxy via localhost, so it all looks like a local network.
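
Translated to your jobs, the wordpress group would look roughly like this (a hedged sketch based on the countdash example above; untested, and again Linux-only). The mysql job would need the same mode = "bridge" network and a plain connect { sidecar_service {} } in its service block:

group "wordpress" {
  network {
    mode = "bridge"
    port "http" { to = 80 }
  }

  service {
    name = "wordpress"
    port = "http"

    connect {
      sidecar_service {
        proxy {
          upstreams {
            destination_name = "mysql"
            local_bind_port  = 3306   # the proxy listens here inside the alloc
          }
        }
      }
    }
  }

  task "wordpress" {
    driver = "docker"

    config {
      image = "wordpress"
      ports = ["http"]
    }

    env {
      # resolves to 127.0.0.1:3306, which the proxy forwards to mysql
      WORDPRESS_DB_HOST     = "${NOMAD_UPSTREAM_ADDR_mysql}"
      WORDPRESS_DB_NAME     = "wordpress"
      WORDPRESS_DB_USER     = "wpuser"
      WORDPRESS_DB_PASSWORD = "wppass"
    }
  }
}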


Sorry for the ginormous answer, but networking can indeed get a bit tricky :sweat_smile:

Hopefully this helped you get a better mental model of what’s going on, but feel free to ask any follow-up questions.


Hi @lgfa29 ,
First of all, I can’t thank you enough for taking the time to write such a long and in-depth explanation. It really did help me out a lot.

I was more used to the way Docker Swarm works, so the change of mental mapping is a bit tedious!

I tried the sidecar approach on Linux and it works fine. I haven’t tried your templating way yet, but I’m glad that option exists as well.

My question to you then is: why is it so complicated to connect services? Given that Consul knows the IP:PORT of a service and has an elegant way of exposing it via SRV queries on its DNS, why isn’t there a super easy way in Nomad to query it?

I’m pseudo coding here, but something like:

 WORDPRESS_DB_HOST     = "${consul.dns.addr(mysql.service.consul)}"

Or

 WORDPRESS_DB_HOST     = "${consul.dns.ip(mysql.service.consul)}"
 WORDPRESS_DB_PORT     = "${consul.dns.port(mysql.service.consul)}"

Unless I’m missing something (and I probably am) in the big picture. This would make things so much easier.

If you don’t mind me asking a follow-up question on networking:
If I have a private network, say 10.0.0.0/24, which I want to use to bind all my services except the public load balancer / TLS termination (in my case Traefik):

How do I configure my agents? Consul has no business being reachable from the internet, so I can bind it exclusively to the private network.

I’m guessing it’s a mix of Agent Configuration | Nomad by HashiCorp and Agent Configuration | Nomad by HashiCorp but I’m not sure how I’d go about things here.

Thanks again for your immense post and your patience :smile:

No problem, glad it helped and you were able to get it to work :slightly_smiling_face:

That’s a good question, and unfortunately there isn’t a good answer. I think the main issue in your case is the state of dev tooling.

Docker Desktop is running in a VM. They’ve done a terrific job of abstracting it away as much as possible, to the point that you hardly think about it. But it’s still there, and it can cause some issues.

nomad agent -dev also has its problems. It’s meant as a quick way to start a Nomad agent, and it’s guarded against exposure to a wider network by default. Very easy to get started with, but a troublemaker for anything slightly more complex. We are aware of these issues and are trying to improve them.

All of this to say that, in a more production-like environment, where Nomad and Consul are bound to a proper network interface, and Docker is likely running natively, you wouldn’t have this problem.

You would define two host networks:

client {
  host_network "private" { 
    cidr = "10.0.0.0/24"
  }

  host_network "public" { 
    cidr = "..."
  }
}

Then in your job, you can assign which network to use using the host_network attribute of the port:

job "wordpress" {
  # ...
  group "wordpress" {
    # ...
    network {
      port "http" {
        to           = 80
        host_network = "private"
      }
    }
    # ...
  }
}

At this point you don’t really need to worry about -bind and -network-interface, since you are already providing this information via the host_network configuration blocks.

We’re always here to help. Thank you for trying Nomad :grinning_face_with_smiling_eyes:


Hi @lgfa29, your answers are a breath of fresh air for me as well. I’m quite stuck on Nomad not finding my Docker daemon. I use Colima as a replacement for Docker Desktop. Do you have any guides on how to do this? I’m stuck at the getting started guide when trying to run nomad job run example.job, and it always errors about not finding Docker.

Hi @sawirricardo :wave:

I’m glad you found the answer helpful :slightly_smiling_face:

Can you provide more details on the problem you are having? Are you able to see Docker in green on your client details page?

@lgfa29 Great info, this and related topics really got me as well. (A typical challenge when setting up a production deployment is distinguishing between the (very) different docs.)

I was led to believe that a good way to handle a multi-node cluster’s internal communication is through a private WireGuard network which spans the nodes and devops hosts. I have set up Vault, Nomad, and Consul to do all cluster communication on this network.

This would make security measures regarding the cluster itself (like ACL configuration, etc.) less urgent, I hope, since that’s another layer of complexity, as long as the devops team is trusted and we only expose services through a WAF?

Would it be reasonable to choose the same network for the workloads (i.e. the host_network "private" mentioned above)? Wouldn’t this also mean that we could have upstream services on different hosts transparently? And how does this relate to HA use cases (where, as I understand it, Consul would indeed direct an LB to upstreams that might be running on other hosts)?

EDIT: there’s another related thing I don’t understand: if I declare something like:

    group "myservices" {
        count = 1

        network {
          mode = "bridge"
          port "myapi" {}
        }

        service {
            name = "mysvc"
            port = "myapi"
            tags = [ "mysvc" ]
    ...

I’d expect that NOMAD_ADDR_myapi points to an address:port in the bridge network, i.e.:

> networkctl | grep nomad
  4 wg_nomad     wireguard routable    configured
  5 nomad        bridge    routable    unmanaged

> networkctl status nomad | grep '^\s*Addr'
                       Address: 172.26.64.1

but instead the value is that of my hardware NIC (the public interface on the VPS) with a dynamic port value. This is problematic because it instructs my in-container service to listen on the wrong address, and surely it will also make Consul relay the wrong address values to other services?