Using Consul tracing with Nomad

Hi there,
I’m currently trying to get Consul tracing working with Nomad in our staging cluster.
Design:
We run otel-collector as a Nomad system job.
We create the Consul ConfigEntry via the API, which sets up Zipkin tracing pointed at the otel-collector Consul DNS entry.
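
Roughly, the collector job looks like this (a minimal sketch, not our exact spec; the job name, image tag, and collector pipeline config are illustrative, while the otel-collector-zipkin service name matches what the sidecars resolve below):

job "otel-collector" {
  type = "system" # one collector instance per Nomad client

  group "collector" {
    network {
      port "zipkin" {
        static = 9411 # Zipkin receiver the Envoy sidecars report to
      }
    }

    # Registered in Consul so the sidecars can resolve
    # otel-collector-zipkin.service.consul
    service {
      name = "otel-collector-zipkin"
      port = "zipkin"
    }

    task "otel-collector" {
      driver = "docker"

      config {
        image = "otel/opentelemetry-collector:0.60.0" # illustrative tag
        ports = ["zipkin"]
        args  = ["--config=/local/otel-config.yaml"]  # collector pipeline config omitted here
      }
    }
  }
}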

When I run this on my workstation, I get traces going to the otel-collector. When I run it in our staging cluster, which has 4+ Nomad agents, I can see that traces are generated in the Consul sidecar, but they don’t appear to be sent. If I exec into the Nomad alloc for the Consul sidecar, I can send traces via curl successfully, so it’s not a firewall issue.

I did see this in the Consul sidecar logs:

[2022-09-28 14:08:15.672][14][debug][pool] [source/common/conn_pool/conn_pool_base.cc:240] [C7066] using existing connection
[2022-09-28 14:08:15.672][14][debug][pool] [source/common/conn_pool/conn_pool_base.cc:177] [C7066] creating stream
[2022-09-28 14:08:15.672][14][debug][router] [source/common/router/upstream_request.cc:416] [C7067][S2835708221056279607] pool ready
[2022-09-28 14:08:15.673][14][debug][client] [source/common/http/codec_client.cc:129] [C7066] response complete
[2022-09-28 14:08:15.673][14][debug][router] [source/common/router/router.cc:1285] [C7067][S2835708221056279607] upstream headers complete: end_stream=true
[2022-09-28 14:08:15.673][14][debug][http] [source/common/http/conn_manager_impl.cc:1467] [C7067][S2835708221056279607] encoding headers via codec (end_stream=true):
':status', '304'
'last-modified', 'Tue, 27 Sep 2022 17:41:17 GMT'
'etag', '"202fca5:238:633335bd:0"'
'content-security-policy-report-only', 'img-src 'self'; script-src-elem 'self' https://accounts.google.com/gsi/client; frame-src https://accounts.google.com/gsi/; connect-src 'self' https://accounts.google.com/gsi/; frame-ancestors 'self'; form-action 'self';'
'accept-ranges', 'bytes'
'content-disposition', 'inline; filename="index.html"'
'content-type', 'text/html; charset=utf-8'
'date', 'Wed, 28 Sep 2022 14:08:15 GMT'
'x-envoy-upstream-service-time', '0'
'server', 'envoy'

[2022-09-28 14:08:15.673][14][debug][pool] [source/common/http/http1/conn_pool.cc:53] [C7066] response complete
[2022-09-28 14:08:15.673][14][debug][pool] [source/common/conn_pool/conn_pool_base.cc:205] [C7066] destroying stream: 0 remaining
[2022-09-28 14:08:15.744][1][debug][dns] [source/common/network/dns_impl.cc:270] dns resolution for otel-collector-zipkin.service.consul started
[2022-09-28 14:08:15.746][1][debug][dns] [source/common/network/dns_impl.cc:188] dns resolution for otel-collector-zipkin.service.consul completed with status 0
[2022-09-28 14:08:15.746][1][debug][upstream] [source/common/upstream/upstream_impl.cc:256] transport socket match, socket default selected for host with address 10.3.221.245:9411
[2022-09-28 14:08:15.746][1][debug][upstream] [source/common/upstream/upstream_impl.cc:256] transport socket match, socket default selected for host with address 10.3.145.128:9411
[2022-09-28 14:08:15.746][1][debug][upstream] [source/common/upstream/upstream_impl.cc:256] transport socket match, socket default selected for host with address 10.3.200.166:9411
[2022-09-28 14:08:15.746][1][debug][upstream] [source/common/upstream/strict_dns_cluster.cc:177] DNS refresh rate reset for otel-collector-zipkin.service.consul, refresh rate 5000 ms
[2022-09-28 14:08:16.493][14][debug][conn_handler] [source/server/active_tcp_listener.cc:140] [C7068] new connection from 10.3.177.106:56460
[2022-09-28 14:08:16.494][14][debug][connection] [source/common/network/connection_impl.cc:249] [C7068] closing socket: 0
[2022-09-28 14:08:16.494][14][debug][conn_handler] [source/server/active_stream_listener_base.cc:120] [C7068] adding to cleanup list
[2022-09-28 14:08:20.452][14][debug][connection] [source/common/network/connection_impl.cc:640] [C7066] remote close
[2022-09-28 14:08:20.452][14][debug][connection] [source/common/network/connection_impl.cc:249] [C7066] closing socket: 0
[2022-09-28 14:08:20.452][14][debug][client] [source/common/http/codec_client.cc:106] [C7066] disconnect. resetting 0 pending requests
[2022-09-28 14:08:20.452][14][debug][pool] [source/common/conn_pool/conn_pool_base.cc:429] [C7066] client disconnected, failure reason: 
[2022-09-28 14:08:20.452][14][debug][pool] [source/common/conn_pool/conn_pool_base.cc:396] invoking idle callbacks - is_draining_for_deletion_=false

I’m wondering if there’s an issue with the DNS entries being refreshed.

Consul ConfigEntry:

{"Config":{"envoy_extra_static_clusters_json":"{\"name\": \"zipkin\", \"type\": \"STRICT_DNS\", \"connect_timeout\": \"5s\", \"load_assignment\": {\"cluster_name\": \"zipkin\", \"endpoints\": [{\"lb_endpoints\": [{\"endpoint\": {\"address\": {\"socket_address\": {\"address\": \"otel-collector-zipkin.service.consul\", \"port_value\": 9411}}}}]}]}}","envoy_stats_flush_interval":"10s","envoy_tracing_json":"{\"http\": {\"name\": \"envoy.tracers.zipkin\", \"typedConfig\": {\"@type\": \"type.googleapis.com/envoy.config.trace.v3.ZipkinConfig\", \"collector_cluster\": \"zipkin\", \"collector_endpoint_version\": \"HTTP_JSON\", \"collector_endpoint\": \"/api/v2/spans\", \"shared_span_context\": false, \"trace_id_128bit\": true}}}","prometheus_bind_addr":"0.0.0.0:9102","protocol":"grpc"},"Expose":{},"MeshGateway":{},"TransparentProxy":{}}

Questions:

  1. Is there a way to configure Consul tracing through Nomad, instead of having to use the API? Specifically, I’d love to avoid the Consul DNS complexity and just use Nomad service discovery.
  2. Are there any known Consul DNS limitations I may be hitting? Alternatively, is there a way to configure DNS caching for the Consul sidecars? I already have the following set up, but it’s clearly not being respected:
dns_config {
  service_ttl {
    "*" = "30s"
  }
  node_ttl = "15s"
}
  3. Any other advice that I should be following to get this working?

Thank you for all the help

For anyone who stumbles upon this:

  1. You can set up Consul tracing through Nomad; see Configure Envoy Tracing in Consul Connect Jobs – HashiCorp Help Center. A rough sketch of the job-level config is below.
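
The short version (a sketch only; the app service name and port label are illustrative, and the proxy config values mirror the ConfigEntry above) is that the Envoy tracing settings can live directly in the connect sidecar’s proxy config block of the Nomad job, so no separate API call is needed:

service {
  name = "my-app" # illustrative app service name
  port = "http"

  connect {
    sidecar_service {
      proxy {
        # Opaque proxy config passed through to Consul/Envoy,
        # same keys as in the proxy-defaults ConfigEntry above.
        config {
          envoy_extra_static_clusters_json = <<EOF
{"name": "zipkin", "type": "STRICT_DNS", "connect_timeout": "5s",
 "load_assignment": {"cluster_name": "zipkin", "endpoints": [{"lb_endpoints": [{"endpoint":
 {"address": {"socket_address": {"address": "otel-collector-zipkin.service.consul", "port_value": 9411}}}}]}]}}
EOF

          envoy_tracing_json = <<EOF
{"http": {"name": "envoy.tracers.zipkin", "typedConfig": {"@type": "type.googleapis.com/envoy.config.trace.v3.ZipkinConfig",
 "collector_cluster": "zipkin", "collector_endpoint_version": "HTTP_JSON", "collector_endpoint": "/api/v2/spans",
 "shared_span_context": false, "trace_id_128bit": true}}}
EOF
        }
      }
    }
  }
}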

The actual reason the traces weren’t being sent was that someone had added a conditional that removed the export endpoint when it wasn’t running locally :person_facepalming:

That said, I’ve started removing anything that uses Consul DNS in favor of Consul/Nomad Service Discovery.
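
For example, a task that needs the collector can pick it up via Nomad native service discovery instead of Consul DNS. A sketch, assuming the collector registers a Nomad-native service named otel-collector-zipkin (provider = "nomad", Nomad 1.3+); the env var name and image are illustrative:

task "app" {
  driver = "docker"

  # Render the collector address from Nomad's own service catalog rather
  # than relying on otel-collector-zipkin.service.consul resolving everywhere.
  template {
    destination = "local/tracing.env"
    env         = true
    data        = <<EOF
{{ range nomadService "otel-collector-zipkin" -}}
ZIPKIN_ENDPOINT=http://{{ .Address }}:{{ .Port }}/api/v2/spans
{{- end }}
EOF
  }

  config {
    image = "example/app:latest" # illustrative
  }
}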