Connect Confusion in Nomad

Hi, I have some questions and points of confusion around configuring Connect jobs in Nomad with Consul TLS enabled that I’m hoping someone can help with.

First, to keep this post short while still providing enough context, here is a gist with the Consul agent config, the Nomad client config, and a few example Nomad dashboard Connect jobs that use bridge-mode networking and provide Traefik ingress to the Connect job in different ways: gist link.

As mentioned in the comments in the gist, the HashiCorp application versions for these tests are:

Nomad 1.1.3
Consul 1.10.1 (patched for Consul issue 10714)

In the gist, the dashboard-works.hcl file successfully launches a Consul Connect job in Nomad and uses a Traefik instance running on another host to provide ingress to the dashboard Connect job.

Questions and points of confusion around this dashboard-works.hcl job:

  • Connect spawns an Envoy container for non-native Connect jobs, and by default Envoy tries to contact the Consul API on a 127.0.0.1 listener inside the Envoy container. Most documentation I’ve read about Nomad and Connect jobs doesn’t override the default address for the Consul listener, but I’ve been unable to bind a Consul API listener inside the container regardless of which combination of Consul addresses.http, bind_addr, or client_addr config I use. Shouldn’t I be able to do this so I don’t need to override the CONSUL_HTTP_ADDR and CONSUL_GRPC_ADDR params inside the container to point to the host IP instead, like I’m currently doing in the gist files (see the env sketch after this list)?

  • The only way this job currently works is by injecting a Consul root token into the Nomad agent config, which is obviously not desirable. I have read in some issues that injecting an env var of CONSUL_HTTP_TOKEN with sufficient privileges will also work (i.e., in the same env stanza as the CONSUL_HTTP_ADDR overrides mentioned above), but setting the Consul master token there does not allow the job to work. When the job fails because a Consul master token is not provided, various gRPC xDS errors are logged to the Consul agent and to the Envoy container’s stderr, respectively, as seen in this gist comment. So I’m trying to figure out what permissions are required to make this job work without injecting a Consul master token. I tried making the Consul default token’s policies as permissive as possible to determine which permission is needed, but it still throws errors even when allowing write on the "" prefix for the agent, event, key, node, query, service, and session rules (see the policy sketch after this list). Any help figuring out what ACL config is needed or being overlooked would be awesome!
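For reference, here is roughly how the env override from the first bullet looks in my jobs today. This is a minimal sketch of the countdash-style demo; the host IP, ports, and names are illustrative, not copied from the gist:

```hcl
group "dashboard" {
  network {
    mode = "bridge"
  }

  service {
    name = "count-dashboard"
    port = "9002"

    connect {
      sidecar_service {}

      # Point the sidecar at the host's Consul HTTP(S) and gRPC listeners
      # instead of the default 127.0.0.1 inside the bridge namespace.
      sidecar_task {
        env {
          CONSUL_HTTP_ADDR = "https://10.0.0.10:8501" # host IP, illustrative
          CONSUL_GRPC_ADDR = "10.0.0.10:8502"
        }
      }
    }
  }
}
```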
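And this is roughly the most permissive Consul ACL rule set I attached to the default token while testing; even with write on the empty prefix (which matches every name), the job still fails without a master token:

```hcl
# Consul ACL policy rules; "" as the prefix matches all names.
agent_prefix ""   { policy = "write" }
event_prefix ""   { policy = "write" }
key_prefix ""     { policy = "write" }
node_prefix ""    { policy = "write" }
query_prefix ""   { policy = "write" }
service_prefix "" { policy = "write" }
session_prefix "" { policy = "write" }
```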

Finally, after getting the basic dashboard Connect job working with the notes mentioned above, I wanted a Traefik ingress from a standalone host running Traefik as a systemd service to route traffic to the Connect dashboard job. I was able to get this working, as shown in the dashboard-works.hcl job, by creating an ingress Consul service tagged for Traefik that explicitly sets the HTTP router to use the dashboard Connect Consul service. This works, but I don’t think it is actually the way the Traefik/Nomad Connect integration is intended to work.
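Concretely, the working-but-awkward approach is along these lines; a sketch, with the hostname and router name made up for illustration:

```hcl
# Extra "ingress" service registered only so Traefik's Consul Catalog
# provider builds an HTTP router that targets the real Connect service.
service {
  name = "dashboard-ingress"
  tags = [
    "traefik.enable=true",
    "traefik.http.routers.dashboard.rule=Host(`dashboard.example.com`)",
    # Explicitly route to the Traefik service generated from the
    # count-dashboard Consul service, not to this ingress service.
    "traefik.http.routers.dashboard.service=count-dashboard",
  ]
}
```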

I think the intended way is what dashboard-direct.hcl shows, where the count-dashboard service simply sets its own Traefik Host tag. However, this results in a 404. Similarly, trying an embedded Traefik ingress doesn’t fully work either. More details are in the first comment of the gist. I’m confused about why these don’t seem to be working properly.
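For contrast, the direct form from dashboard-direct.hcl is essentially this (hostname again illustrative):

```hcl
# The Connect service tags itself for Traefik directly, with no
# intermediate ingress service.
service {
  name = "count-dashboard"
  port = "9002"
  tags = [
    "traefik.enable=true",
    "traefik.http.routers.dashboard.rule=Host(`dashboard.example.com`)",
  ]

  connect {
    sidecar_service {}
  }
}
```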

Comments on any of these issues or points of confusion are welcome, thanks!
John


Thanks to @blake for pointing out the issue and patch for the dashboard-embedded.hcl WebSockets problem, which is now solved after patching. One point of confusion gone! 🙂

I resolved most of the issues mentioned above with the following changes from the PR here:

  • Patched the Consul Envoy HTTP protocol upgrade issue (modified patch from Consul PR 9639).
  • Patched Consul 1.10.1 for the Connect listener issue (patch from Consul PR 10714).
  • Configured the Nomad clients to use Consul TLS for Connect, which eliminates passing TLS certs to the Envoy sidecar and also seemed to fix the dashboard-direct.hcl standalone-Traefik-ingress-to-dashboard problem.
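The Nomad client change amounts to roughly this consul stanza; a sketch, with the file paths and HTTPS port as illustrative assumptions:

```hcl
# Nomad client agent config: talk to the local Consul agent over TLS
# so the TLS setup can be shared with Connect sidecars.
consul {
  address   = "127.0.0.1:8501" # Consul HTTPS listener
  ssl       = true
  ca_file   = "/etc/consul.d/tls/consul-agent-ca.pem"
  cert_file = "/etc/consul.d/tls/client.pem"
  key_file  = "/etc/consul.d/tls/client-key.pem"
  share_ssl = true
}
```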

Now just one issue seems to remain:

  • We specifically avoid passing a Consul token into the Nomad clients because of Nomad issue 9813, and instead rely on the Consul default token, which is used when the Nomad consul token config is a blank string.
  • For Connect jobs, however, the Consul default token does not get used when generating envoy_bootstrap.json and injecting the corresponding Consul service identities, resulting in the "Error handling ADS delta stream" errors mentioned above.
  • If the Consul token is instead provided explicitly in the Nomad client config consul stanza, the Connect job works as expected (see the sketch after this list). This seems to break from the expected behavior of the Consul default token being used wherever needed when no token is set in the Nomad client config consul stanza.
  • A short-term workaround is to manually inject a token into the Envoy bootstrap configuration. An example of this is the dashboard-token-injection.hcl file, which is a bit hacky but works.
  • To avoid that workaround, I also tried injecting a Consul token into the environment with CONSUL_HTTP_TOKEN in several ways (at the job level, at the task level, and from within the Envoy sidecar env), but none of them worked: those env vars do not get used when the Consul mesh service identity (the si_token file) is created.
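For clarity, the client-side change from the third bullet that does make Connect jobs work, but that we are trying to avoid, is roughly just this (token value is a placeholder):

```hcl
# Nomad client agent config: setting a token here fixes the Connect
# jobs, but reintroduces the concerns from nomad issue 9813.
consul {
  address = "127.0.0.1:8501"
  ssl     = true
  token   = "REPLACE-WITH-SUFFICIENTLY-PRIVILEGED-TOKEN" # placeholder
}
```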