Nomad "Permission denied" error when trying to deploy example ingress-gateway

Hi

I’ve been trying to get the ingress gateway working. I had it working on a simple Vagrant cluster of machines with no ACLs, but after moving to DigitalOcean with ACLs enabled on Consul, it no longer works.

nomad = 1.1.2
consul = 1.10.0

> nomad operator raft list-peers
Node            ID                Address           State     Voter  RaftProtocol
nomad-3.global  10.106.0.10:4647  10.106.0.10:4647  follower  true   2
nomad-2.global  10.106.0.4:4647   10.106.0.4:4647   leader    true   2
nomad-1.global  10.106.0.7:4647   10.106.0.7:4647   follower  true   2
> consul operator raft list-peers
Node      ID                                    Address           State     Voter  RaftProtocol
consul-3  123869d6-ee08-be85-0a37-e07c084ea952  10.106.0.9:8300   follower  true   3
consul-1  475c3fbc-ff47-a4d5-561a-57fa0014da21  10.106.0.2:8300   leader    true   3
consul-2  b0ff6bfe-5b8d-b20a-37c3-9bc21f9bafd6  10.106.0.12:8300  follower  true   3

Consul has ACLs and mTLS enabled, with a unique token provided for each Nomad server/client node

> cat /etc/nomad.d/consul.hcl
consul {
  token     = "some token"
  ca_file   = "/opt/consul/tls/ca.crt"
}

and the permissions for that token are

acl = "write"

agent_prefix "" {
  policy = "read"
}

agent "the-host" {
  policy = "write"
}

node_prefix "" {
  policy = "read"
}

node "the-host" {
  policy = "write"
}

service_prefix "" {
  policy = "write"
}

key_prefix "" {
  policy = "read"
}

session "the-host" {
  policy = "write"
}

Here’s the job file I used

> cat uuid.hcl
job "ig-bridge-demo" {
  datacenters = ["dc1"]
  group "ingress-group" {

    network {
      mode = "bridge"

      port "api" {
        static = 8080
        to     = 8080
      }
    }

    service {
      name = "my-ingress-service"
      port = "8080"

      connect {
        gateway {
          proxy {
          }
          ingress {
            listener {
              port     = 8080
              protocol = "tcp"
              service {
                name = "uuid-api"
              }
            }
          }
        }
      }
    }
  }

  group "generator" {
    network {
      mode = "host"
      port "api" {}
    }

    service {
      name = "uuid-api"
      port = "${NOMAD_PORT_api}"

      connect {
        native = true
      }
    }

    task "generate" {
      driver = "docker"

      config {
        image        = "hashicorpnomad/uuid-api:v3"
        network_mode = "host"
      }

      env {
        BIND = "0.0.0.0"
        PORT = "${NOMAD_PORT_api}"
      }
    }
  }
}

I added the intention (consul intention create my-ingress-service uuid-api) to link them up, but I get the following error

> nomad job run -verbose uuid.hcl
Error submitting job: Unexpected response code: 500 (rpc error: Unexpected response code: 403 (rpc error making call: rpc error making call: Permission denied))

I even tried passing in the bootstrap ACL token via -consul-token and setting full wildcard intentions, but I get the same error

When I monitor the Nomad logs, the leader shows nothing interesting, and the machine the command is run on shows only a simple error message

LEADER > nomad monitor -log-level TRACE
2021-07-09T08:42:56.875Z [TRACE] nomad.job: job mutate results: mutator=canonicalize warnings=[] error=<nil>
2021-07-09T08:42:56.875Z [TRACE] nomad.job: job mutate results: mutator=connect warnings=[] error=<nil>
2021-07-09T08:42:56.875Z [TRACE] nomad.job: job mutate results: mutator=expose-check warnings=[] error=<nil>
2021-07-09T08:42:56.875Z [TRACE] nomad.job: job mutate results: mutator=constraints warnings=[] error=<nil>
2021-07-09T08:42:56.875Z [TRACE] nomad.job: job validate results: validator=connect warnings=[] error=<nil>
2021-07-09T08:42:56.875Z [TRACE] nomad.job: job validate results: validator=expose-check warnings=[] error=<nil>
2021-07-09T08:42:56.875Z [TRACE] nomad.job: job validate results: validator=validate warnings=[] error=<nil>
2021-07-09T08:42:56.875Z [TRACE] nomad.job: job validate results: validator=memory_oversubscription warnings=[] error=<nil>
FOLLOWER > nomad monitor -log-level TRACE
2021-07-09T08:42:56.882Z [ERROR] http: request failed: method=POST path=/v1/jobs error="rpc error: Unexpected response code: 403 (rpc error making call: rpc error making call: Permission denied)" code=500
2021-07-09T08:42:56.882Z [DEBUG] http: request complete: method=POST path=/v1/jobs duration=8.718807ms

I get the same error via the HTTP API

>  curl  -v --request POST --data @uuid.json \
      --cacert /opt/nomad/tls/ca.crt \
      --cert /opt/nomad/tls/agent.crt \
      --key /opt/nomad/tls/agent.key \
      "https://10.106.0.7:4646/v1/jobs"
Note: Unnecessary use of -X or --request, POST is already inferred.
*   Trying 10.106.0.7:4646...
* Connected to 10.106.0.7 (10.106.0.7) port 4646 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*  CAfile: /opt/nomad/tls/ca.crt
*  CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Request CERT (13):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Certificate (11):
* TLSv1.3 (OUT), TLS handshake, CERT verify (15):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server did not agree to a protocol
* Server certificate:
*  subject: CN=server.global.nomad
*  start date: Jul  8 10:42:08 2021 GMT
*  expire date: Aug  7 10:42:38 2021 GMT
*  subjectAltName: host "10.106.0.7" matched cert's IP address!
*  issuer: CN=global.nomad Intermediate Authority
*  SSL certificate verify ok.
> POST /v1/jobs HTTP/1.1
> Host: 10.106.0.7:4646
> User-Agent: curl/7.76.1
> Accept: */*
> Content-Length: 8019
> Content-Type: application/x-www-form-urlencoded
>
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* Mark bundle as not supporting multiuse
< HTTP/1.1 500 Internal Server Error
< Vary: Accept-Encoding
< Date: Fri, 09 Jul 2021 08:42:56 GMT
< Content-Length: 106
< Content-Type: text/plain; charset=utf-8
<
* Connection #0 to host 10.106.0.7 left intact
rpc error: Unexpected response code: 403 (rpc error making call: rpc error making call: Permission denied)

The certs/keys are generated from the PKI endpoint of a Vault cluster

I also deployed the countdash example Connect job and that works fine, so Connect itself seems to be working OK (Envoy 1.18.3)

Any ideas on what config I’m missing to get ingress working?

The consul logs are also pretty uninteresting

LEADER > consul monitor -log-level TRACE
2021-07-09T08:34:06.710Z [TRACE] agent.tlsutil: IncomingRPCConfig: version=4
2021-07-09T08:34:06.710Z [TRACE] agent.tlsutil: IncomingRPCConfig: version=4
2021-07-09T08:34:13.031Z [DEBUG] agent.server.memberlist.lan: memberlist: Stream connection from=10.106.0.11:46660
2021-07-09T08:34:22.955Z [DEBUG] agent: Skipping remote check since it is managed automatically: check=serfHealth
2021-07-09T08:34:22.955Z [DEBUG] agent: Node info in sync
2021-07-09T08:34:23.749Z [DEBUG] agent.server.memberlist.lan: memberlist: Initiating push/pull sync with: nomad-2 10.106.0.4:8301
2021-07-09T08:34:40.922Z [TRACE] agent.tlsutil: IncomingRPCConfig: version=4
2021-07-09T08:34:40.922Z [TRACE] agent.tlsutil: IncomingRPCConfig: version=4
2021-07-09T08:34:40.924Z [TRACE] agent.tlsutil: IncomingRPCConfig: version=4
2021-07-09T08:34:40.924Z [TRACE] agent.tlsutil: IncomingRPCConfig: version=4
FOLLOWER > consul monitor -log-level TRACE
2021-07-09T08:33:57.357Z [DEBUG] agent.server.memberlist.lan: memberlist: Stream connection from=10.106.0.3:55244
2021-07-09T08:34:11.460Z [DEBUG] agent.server.memberlist.lan: memberlist: Initiating push/pull sync with: vault-1 10.106.0.5:8301
2021-07-09T08:34:15.897Z [TRACE] agent.tlsutil: IncomingRPCConfig: version=4
2021-07-09T08:34:15.897Z [TRACE] agent.tlsutil: IncomingRPCConfig: version=4
2021-07-09T08:34:17.158Z [TRACE] agent.tlsutil: IncomingRPCConfig: version=4
2021-07-09T08:34:17.158Z [TRACE] agent.tlsutil: IncomingRPCConfig: version=4
2021-07-09T08:34:24.205Z [DEBUG] agent.server.memberlist.lan: memberlist: Stream connection from=10.106.0.10:45648
2021-07-09T08:34:29.098Z [TRACE] agent.tlsutil: IncomingRPCConfig: version=4
2021-07-09T08:34:29.098Z [TRACE] agent.tlsutil: IncomingRPCConfig: version=4
2021-07-09T08:34:33.028Z [TRACE] agent: ccResolverWrapper: sending update to cc: {[{10.106.0.2:8300 0 consul-1.dc1 <nil>}] <nil>}
2021-07-09T08:34:33.028Z [TRACE] agent: addrConn: tryUpdateAddrs curAddr: {10.106.0.2:8300 0 consul-1.dc1 <nil>}, addrs: [{10.106.0.2:8300 0 consul-1.dc1 <nil>}]
2021-07-09T08:34:33.028Z [TRACE] agent: addrConn: tryUpdateAddrs curAddrFound: true
2021-07-09T08:34:33.028Z [TRACE] agent: ccResolverWrapper: sending update to cc: {[{10.106.0.2:8300 0 consul-1.dc1 <nil>} {10.106.0.12:8300 0 consul-2.dc1 <nil>} {10.106.0.9:8300 0 consul-3.dc1 <nil>}] <nil>}
2021-07-09T08:34:33.028Z [TRACE] agent: addrConn: tryUpdateAddrs curAddr: {10.106.0.2:8300 0 consul-1.dc1 <nil>}, addrs: [{10.106.0.2:8300 0 consul-1.dc1 <nil>} {10.106.0.12:8300 0 consul-2.dc1 <nil>} {10.106.0.9:8300 0 consul-3.dc1 <nil>}]
2021-07-09T08:34:33.028Z [TRACE] agent: addrConn: tryUpdateAddrs curAddrFound: true
2021-07-09T08:34:33.030Z [DEBUG] agent.router.manager: Rebalanced servers, new active server: number_of_servers=3 active_server="consul-3.dc1 (Addr: tcp/10.106.0.9:8300) (DC: dc1)"
2021-07-09T08:34:33.792Z [DEBUG] agent.server.memberlist.wan: memberlist: Stream connection from=10.106.0.9:47762
2021-07-09T08:34:41.464Z [DEBUG] agent.server.memberlist.lan: memberlist: Initiating push/pull sync with: consul-3 10.106.0.9:8301

Forgot to mention that there are no ACLs set up yet on Nomad itself

Hi @docrozza, sorry for the trouble. I suspect this is a case where the Nomad server agent needs a Consul token with operator = "write" permissions. When creating an ingress gateway, Nomad automatically creates the required Consul Configuration Entry of type ingress-gateway for the gateway, and Consul ACLs require that operator permission to do so.
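
As a rough sketch of what that change could look like, assuming the policy posted earlier in the thread is the one attached to the Nomad server's Consul token: only the operator line is new here, and everything else is copied from that policy.

```hcl
# Consul ACL policy for the Nomad server agent's token (sketch).
# Assumption: this extends the policy posted earlier; only `operator` is new.

acl      = "write"
operator = "write" # lets Nomad write the ingress-gateway configuration entry

service_prefix "" {
  policy = "write"
}

# ... the agent, node, key and session rules from the original policy are unchanged ...
```

The existing rules stay as they were; the point is just that the token's policy gains operator = "write".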

Many thanks - your suspicion was exactly right!

I tweaked the policies to add the new operator = "write", restarted, and then the job ran perfectly

Even resolvectl query uuid-api.ingress.dc1.consul and curl -v http://uuid-api.ingress.dc1.consul:8080 then worked. It took me a while to figure out that I had to use the uuid-api name rather than the my-ingress-service name

Many thanks again - always nice to get something working on a Friday afternoon