How do I debug a networking problem?

I’m trying to run the Consul Connect example.

After I run the job, which claims to run successfully, I have a failure.

I’m not sure that is the actual problem; it might just be a symptom of a different problem.

When I run nomad job plan countdash.nomad without changing anything, I get a scheduler dry-run warning:

- WARNING: Failed to place all allocations.
  Task Group "api" (failed to place 1 allocation):
    * Resources exhausted on 1 nodes
    * Dimension "network: no addresses available for \"\" network" exhausted on 1 nodes

I don’t know how this is the case, as there’s literally only this job running.
How do I debug this? What do I look at to figure out why it’s not working? I don’t have ACLs enabled yet, and I’m using Consul to discover cluster members.

For reference, the entire job plan output:

> nomad job plan countdash.nomad
+/- Job: "countdash"
+/- Task Group: "api" (1 create/destroy update)
  +   Network {
      + MBits: "0"
      + Mode:  "bridge"
      + Dynamic Port {
        + HostNetwork: "default"
        + Label:       "connect-proxy-count-api"
        + To:          "-1"
        }
      + Dynamic Port {
        + HostNetwork: "default"
        + Label:       "svc_count-api_ck_01ccb6"
        + To:          "-1"
        }
      }
  -   Network {
      - MBits: "0"
      - Mode:  "bridge"
      - Dynamic Port {
        - HostNetwork: "default"
        - Label:       "connect-proxy-count-api"
        - To:          "-1"
        }
      - Dynamic Port {
        - HostNetwork: "default"
        - Label:       "svc_count-api_ck_e6a264"
        - To:          "-1"
        }
      }
  +/- Service {
        AddressMode:       "auto"
        EnableTagOverride: "false"
        Name:              "count-api"
        PortLabel:         "9001"
        TaskName:          ""
    +/- Check {
          AddressMode:            ""
          Command:                ""
          Expose:                 "true"
          FailuresBeforeCritical: "0"
          GRPCService:            ""
          GRPCUseTLS:             "false"
          InitialStatus:          ""
          Interval:               "10000000000"
          Method:                 ""
          Name:                   "api-health"
          Path:                   "/health"
      +/- PortLabel:              "svc_count-api_ck_e6a264" => "svc_count-api_ck_01ccb6"
          Protocol:               ""
          SuccessBeforePassing:   "0"
          TLSSkipVerify:          "false"
          TaskName:               ""
          Timeout:                "3000000000"
          Type:                   "http"
        }
    +/- ConsulConnect {
          Native: "false"
      +/- SidecarService {
            Port: ""
        +/- ConsulProxy {
          LocalServiceAddress: ""
          LocalServicePort:    "0"
            }
          }
        }
      }
      Task: "connect-proxy-count-api"
  +/- Task: "web" (forces in-place update)


    Task Group: "dashboard" (1 in-place update)
      Task: "connect-proxy-count-dashboard"
      Task: "dashboard"

Scheduler dry-run:
- WARNING: Failed to place all allocations.
  Task Group "api" (failed to place 1 allocation):
    * Resources exhausted on 1 nodes
    * Dimension "network: no addresses available for \"\" network" exhausted on 1 nodes

Job Modify Index: 11486
To submit the job with version verification run:

nomad job run -check-index 11486 countdash.nomad

When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.

Here is the job file itself. I have a constraint on it because I have Nomad running on an arm64 box, and these images can’t run on that architecture. I have also stopped Nomad and Consul on that box, just in case it was something weird; I still get the same warning, so I don’t believe it’s related.

job "countdash" {
  datacenters = ["discovery"]

  group "api" {
    network {
      mode = "bridge"
    }

    service {
      name = "count-api"
      port = "9001"

      connect {
        sidecar_service {}
      }

      check {
        //address_mode = "driver"
        expose   = true
        type     = "http"
        name     = "api-health"
        path     = "/health"
        interval = "10s"
        timeout  = "3s"
      }
    }

    task "web" {
      driver = "docker"
      constraint {
        attribute = "${attr.cpu.arch}"
        value = "amd64"
      }

      config {
        image = "hashicorpnomad/counter-api:v3"
      }
    }
  }

  group "dashboard" {
    network {
      mode = "bridge"

      port "http" {
        static = 9002
        to     = 9002
      }
    }

    service {
      name = "count-dashboard"
      port = "9002"

      connect {
        sidecar_service {
          proxy {
            upstreams {
              destination_name = "count-api"
              local_bind_port  = 8080
            }
          }
        }
      }
    }

    task "dashboard" {
      driver = "docker"
      constraint {
        attribute = "${attr.cpu.arch}"
        value = "amd64"
      }
      env {
        COUNTING_SERVICE_URL = "http://${NOMAD_UPSTREAM_ADDR_count_api}"
      }

      config {
        image = "hashicorpnomad/counter-dashboard:v3"
      }
    }
  }
}

Turns out, I’m an idiot.

I now have a functional Consul Connect example. I had forgotten to enable the gRPC port on Consul.
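
For anyone else who hits this, a minimal sketch of what I mean, assuming an HCL Consul agent config and the default gRPC port of 8502 (adjust to your own setup):

# Consul agent config (e.g. consul.hcl) -- sketch, not my exact file.
# Connect sidecar proxies talk to Consul over gRPC, which is disabled (-1) by default.
ports {
  grpc = 8502
}

# Connect itself also has to be enabled on the servers.
connect {
  enabled = true
}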

However, I do still have this warning:

- WARNING: Failed to place all allocations.
  Task Group "api" (failed to place 1 allocation):
    * Constraint "${attr.cpu.arch} = amd64": 1 nodes excluded by filter
    * Resources exhausted on 1 nodes
    * Dimension "network: no addresses available for \"\" network" exhausted on 1 nodes

And it wants to replace the network block on every nomad job plan; the auto-generated check port label (the svc_count-api_ck_... suffix in the diff above) changes each time. Perhaps because it’s dynamic?
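
If the changing label is what keeps forcing the network diff, one way to pin it might be to replace expose = true on the check with an explicit expose path bound to a named dynamic port. This is only a sketch from my reading of the docs (the api_health label is made up), not something I’ve verified:

group "api" {
  network {
    mode = "bridge"

    # Named dynamic port, so the label stays stable across plans.
    port "api_health" {
      to = -1
    }
  }

  service {
    name = "count-api"
    port = "9001"

    connect {
      sidecar_service {
        proxy {
          # Manual equivalent of expose = true on the check,
          # bound to the fixed "api_health" label.
          expose {
            path {
              path            = "/health"
              protocol        = "http"
              local_path_port = 9001
              listener_port   = "api_health"
            }
          }
        }
      }
    }

    check {
      type     = "http"
      name     = "api-health"
      path     = "/health"
      port     = "api_health"
      interval = "10s"
      timeout  = "3s"
    }
  }

  # ... task "web" unchanged ...
}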

Hi @BeepDog, just FYI, recent versions of Nomad can run Connect workloads on arm64, and the demo Docker images support that architecture starting with the :v3 tag.

I suspect you are seeing the network error because the Nomad Client wasn’t able to detect a usable default network interface.
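
If that turns out to be the case, one thing worth trying is pinning the interface explicitly in the client config instead of relying on auto-detection. A sketch, where eth0 is just a placeholder for whichever interface that client should actually use:

# Nomad client config -- sketch only; substitute your real interface name.
client {
  enabled = true

  # Fingerprint this interface for allocatable addresses instead of auto-detecting one.
  network_interface = "eth0"
}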

That’s cool that the demos work there.

However, I do have allocations running on these clients. I would expect nothing to work at all if Nomad couldn’t detect a usable default interface…