Nomad countdash tutorial disconnected troubleshooting

I have a newly configured Consul/Nomad cluster with a Nomad node next to it, running the Consul client in a Docker container. Things seem to be working, in that I can run a couple of fakeservice instances and configure them, using curl to access the respective ports per allocation. The Consul client is configured using auto_config and TLS, but has gRPC and HTTP open as well.

I am running the countdash demo project from Consul Service Mesh | Nomad | HashiCorp Developer. The job starts, but when I try to access the dashboard at :9002 it says it is disconnected.

I have been trying to look at the environment, but NOMAD_UPSTREAM_ADDR_count_api is opaque to me, and the mapping inside the container seems very magical. The troubleshooting section at the end of the tutorial has given no leads. How would I go about troubleshooting what’s going on and learning a bit more about how the service discovery parts actually work?

Hi @MWinther I suspect you may be running into Consul Connect sidecar proxies require additional configuration for gRPC-TLS listener · Issue #15360 · hashicorp/nomad · GitHub - where Consul made a backwards-incompatible change to how gRPC TLS connections are handled. On the Nomad side you’ll now need to set consul.grpc_ca_file and point grpc_address at the 8503 gRPC TLS port.
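As a sketch, the relevant part of the Nomad client config might look like this (the path and the 127.0.0.1:8503 address are placeholders for your environment, not values from your setup):

```hcl
consul {
  # Consul 1.14+ serves the Envoy xDS stream on a dedicated TLS port (8503 by default)
  grpc_address = "127.0.0.1:8503"

  # CA that signed the Consul agent's TLS certificate, so Nomad can verify the gRPC TLS listener
  grpc_ca_file = "/etc/consul.d/consul-agent-ca.pem"
}
```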

Hey @seth.hoenig and thanks for replying! It seems, from what I can tell from that ticket, that it should be solved using 1.5.0 of Nomad. Does that still require the consul.grpc_ca_file setting?

In the meantime, I have set grpc_address to port 8502 and reopened the non-TLS port on the Consul side, but I am still having no luck connecting the demo app. The logs from the Envoy Docker images seem to pick up the changes, though, so I am not clear on what else to look for. Any ideas?

Nomad 1.5 is what makes setting consul.grpc_ca_file possible - we had to add that in order to work with Consul 1.14+.

I probably wouldn’t try to make use of 8502 / non-TLS at this point if Consul is set up with TLS - it’s just more confusing and less secure.

Just to make sure, you’re running

Nomad 1.5.0+
Consul 1.14.1+

And your Nomad client config contains a consul block with, at a minimum,

consul {
  grpc_ca_file = "path/to/<consul-agent-ca>.pem"
  ca_file      = "path/to/<consul-agent-ca>.pem"
  cert_file    = "path/to/<client-consul-cert>.pem"
  key_file     = "path/to/<client-consul-cert-key>.pem"
  ssl          = true
  address      = ""
  grpc_address = ""
}

Also, do you have Consul ACLs enabled? If so you’ll need to set up ACL intentions to allow the services to communicate (or set default_policy = "allow").
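For the countdash demo specifically, an intention allowing the dashboard to reach the API could be expressed as a service-intentions config entry along these lines (service names taken from the tutorial; this is just one way to do it - `consul intention create` works too):

```hcl
Kind = "service-intentions"
Name = "count-api"

Sources = [
  {
    # Allow the dashboard service to connect to count-api through the mesh
    Name   = "count-dashboard"
    Action = "allow"
  }
]
```

Saved to a file, this would be applied with `consul config write <file>`.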

Hi @seth.hoenig , thanks for your quick reply!

So, when I tried setting grpc_ca_file to my actual CA earlier, certificate verification failed. I guess that is because, since I am using auto_config with Consul, I don’t get the complete certificate chain out to the client, and it has no way of verifying the intermediate step against the Consul-issued cert at hand.

Is there any way to request the Consul CA cert dynamically in this case? I rotate the Consul Connect CA quite often and haven’t gotten to the point where I use templates to provide the auto-rotation functionality on the clients just yet - or is it time for me to make a quick detour into the template and auto-rotation part of the equation first?
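For reference, the kind of template stanza I imagine for the rotation part - a sketch only, and if I understand correctly caRoots covers the Connect CA for Connect-native workloads, not the agent TLS CA that grpc_ca_file needs:

```hcl
template {
  # Render the current Connect CA root certificates; re-rendered on rotation
  data        = "{{ range caRoots }}{{ .RootCertPEM }}{{ end }}"
  destination = "local/connect-ca.pem"
  change_mode = "restart"
}
```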

A little bit more information after some further experimentation: I went back and tried to use my host certificate for the Nomad node, and received the following error after adding the CA, cert, and key to the consul section of the nomad.hcl config file.

consul {
  address      = ""
  grpc_address = ""
  ca_file      = "/etc/pki/ca-file.pem"
  grpc_ca_file = "/etc/pki/ca-file.pem"
  cert_file    = "/etc/pki/tls/certs/hostcert.pem"
  key_file     = "/etc/pki/tls/private/hostkey.pem"
}

This made the countdash demo deploy properly, but with no functionality; instead, the Docker logs of the Envoy proxy show the following error:

[2023-03-08 21:21:25.367][1][warning][config] [./source/common/config/grpc_stream.h:201] DeltaAggregatedResources gRPC config stream to local_agent closed since 53s ago: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: TLS error: 268435581:SSL routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED

After changing the Consul client config tls stanza as follows (that is, setting verify_server_hostname = false):

  defaults {
    verify_incoming = false
    verify_outgoing = true
    ca_file         = "/certs/ca_file.pem"
  }
  internal_rpc {
    verify_incoming        = false
    verify_server_hostname = false
  }

I instead get the error:

[2023-03-08 21:35:43.944][1][warning][config] [./source/common/config/grpc_stream.h:201] DeltaAggregatedResources gRPC config stream to local_agent closed since 67s ago: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: TLS error: 268435703:SSL routines:OPENSSL_internal:WRONG_VERSION_NUMBER
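For completeness, my current understanding of the Consul 1.14+ listener split, sketched as an agent ports block (the port numbers are the defaults, assumed here, not confirmed from my config) - WRONG_VERSION_NUMBER would fit a TLS client hitting the plaintext grpc port:

```hcl
ports {
  grpc     = 8502   # plaintext gRPC - a TLS handshake here fails with WRONG_VERSION_NUMBER
  grpc_tls = 8503   # TLS gRPC - where Nomad's grpc_address should point
}
```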

I am fine with having the certificate-name problem for now, this is still a lab environment, but I am at a loss as to what to do next. Any ideas?