Hi there,
I’m currently trying to get Consul tracing working with Nomad in our staging cluster.
Design:
- We run the otel-collector as a system job.
- We create the Consul ConfigEntry via the API; this sets up Zipkin tracing pointing at the otel-collector's Consul DNS entry.
When I run this on my workstation, traces reach the otel-collector. When I run it in our staging cluster, which has 4+ Nomad agents, I can see that traces are generated in the Consul sidecar, but they don't appear to be sent. If I exec into the Nomad alloc for the Consul sidecar, I can send traces via curl successfully, so it's not a firewall issue.
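For reference, here's the manual check I run from inside the alloc, rewritten as a small Python sketch. The collector address comes from my setup; the span payload is a minimal hand-rolled Zipkin v2 example, not anything our services emit:

```python
import json
import time
import urllib.request

# Collector endpoint from my setup (Consul DNS name for the otel-collector's zipkin receiver).
COLLECTOR = "http://otel-collector-zipkin.service.consul:9411/api/v2/spans"


def minimal_span():
    """Build a single minimal Zipkin v2 span (128-bit trace id, matching trace_id_128bit)."""
    now_us = int(time.time() * 1_000_000)  # Zipkin v2 timestamps are epoch microseconds
    return [{
        "traceId": "0" * 31 + "1",   # 32 hex chars = 128-bit trace id
        "id": "0" * 15 + "1",        # 16 hex chars = 64-bit span id
        "name": "manual-test",
        "timestamp": now_us,
        "duration": 1000,
        "localEndpoint": {"serviceName": "curl-test"},
    }]


def send(spans, url=COLLECTOR):
    """POST spans to the collector; a 2xx status means it accepted them."""
    req = urllib.request.Request(
        url,
        data=json.dumps(spans).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    return urllib.request.urlopen(req).status
```

Running `send(minimal_span())` from inside the sidecar's alloc succeeds, which is why I ruled out a network/firewall problem.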
I did see this in the Consul sidecar logs:
[2022-09-28 14:08:15.672][14][debug][pool] [source/common/conn_pool/conn_pool_base.cc:240] [C7066] using existing connection
[2022-09-28 14:08:15.672][14][debug][pool] [source/common/conn_pool/conn_pool_base.cc:177] [C7066] creating stream
[2022-09-28 14:08:15.672][14][debug][router] [source/common/router/upstream_request.cc:416] [C7067][S2835708221056279607] pool ready
[2022-09-28 14:08:15.673][14][debug][client] [source/common/http/codec_client.cc:129] [C7066] response complete
[2022-09-28 14:08:15.673][14][debug][router] [source/common/router/router.cc:1285] [C7067][S2835708221056279607] upstream headers complete: end_stream=true
[2022-09-28 14:08:15.673][14][debug][http] [source/common/http/conn_manager_impl.cc:1467] [C7067][S2835708221056279607] encoding headers via codec (end_stream=true):
':status', '304'
'last-modified', 'Tue, 27 Sep 2022 17:41:17 GMT'
'etag', '"202fca5:238:633335bd:0"'
'content-security-policy-report-only', 'img-src 'self'; script-src-elem 'self' https://accounts.google.com/gsi/client; frame-src https://accounts.google.com/gsi/; connect-src 'self' https://accounts.google.com/gsi/; frame-ancestors 'self'; form-action 'self';'
'accept-ranges', 'bytes'
'content-disposition', 'inline; filename="index.html"'
'content-type', 'text/html; charset=utf-8'
'date', 'Wed, 28 Sep 2022 14:08:15 GMT'
'x-envoy-upstream-service-time', '0'
'server', 'envoy'
[2022-09-28 14:08:15.673][14][debug][pool] [source/common/http/http1/conn_pool.cc:53] [C7066] response complete
[2022-09-28 14:08:15.673][14][debug][pool] [source/common/conn_pool/conn_pool_base.cc:205] [C7066] destroying stream: 0 remaining
[2022-09-28 14:08:15.744][1][debug][dns] [source/common/network/dns_impl.cc:270] dns resolution for otel-collector-zipkin.service.consul started
[2022-09-28 14:08:15.746][1][debug][dns] [source/common/network/dns_impl.cc:188] dns resolution for otel-collector-zipkin.service.consul completed with status 0
[2022-09-28 14:08:15.746][1][debug][upstream] [source/common/upstream/upstream_impl.cc:256] transport socket match, socket default selected for host with address 10.3.221.245:9411
[2022-09-28 14:08:15.746][1][debug][upstream] [source/common/upstream/upstream_impl.cc:256] transport socket match, socket default selected for host with address 10.3.145.128:9411
[2022-09-28 14:08:15.746][1][debug][upstream] [source/common/upstream/upstream_impl.cc:256] transport socket match, socket default selected for host with address 10.3.200.166:9411
[2022-09-28 14:08:15.746][1][debug][upstream] [source/common/upstream/strict_dns_cluster.cc:177] DNS refresh rate reset for otel-collector-zipkin.service.consul, refresh rate 5000 ms
[2022-09-28 14:08:16.493][14][debug][conn_handler] [source/server/active_tcp_listener.cc:140] [C7068] new connection from 10.3.177.106:56460
[2022-09-28 14:08:16.494][14][debug][connection] [source/common/network/connection_impl.cc:249] [C7068] closing socket: 0
[2022-09-28 14:08:16.494][14][debug][conn_handler] [source/server/active_stream_listener_base.cc:120] [C7068] adding to cleanup list
[2022-09-28 14:08:20.452][14][debug][connection] [source/common/network/connection_impl.cc:640] [C7066] remote close
[2022-09-28 14:08:20.452][14][debug][connection] [source/common/network/connection_impl.cc:249] [C7066] closing socket: 0
[2022-09-28 14:08:20.452][14][debug][client] [source/common/http/codec_client.cc:106] [C7066] disconnect. resetting 0 pending requests
[2022-09-28 14:08:20.452][14][debug][pool] [source/common/conn_pool/conn_pool_base.cc:429] [C7066] client disconnected, failure reason:
[2022-09-28 14:08:20.452][14][debug][pool] [source/common/conn_pool/conn_pool_base.cc:396] invoking idle callbacks - is_draining_for_deletion_=false
I'm wondering if there's an issue with the DNS entries being refreshed.
Consul ConfigEntry:
{
  "Config": {
    "envoy_extra_static_clusters_json": "{\"name\": \"zipkin\", \"type\": \"STRICT_DNS\", \"connect_timeout\": \"5s\", \"load_assignment\": {\"cluster_name\": \"zipkin\", \"endpoints\": [{\"lb_endpoints\": [{\"endpoint\": {\"address\": {\"socket_address\": {\"address\": \"otel-collector-zipkin.service.consul\", \"port_value\": 9411}}}}]}]}}",
    "envoy_stats_flush_interval": "10s",
    "envoy_tracing_json": "{\"http\": {\"name\": \"envoy.tracers.zipkin\", \"typedConfig\": {\"@type\": \"type.googleapis.com/envoy.config.trace.v3.ZipkinConfig\", \"collector_cluster\": \"zipkin\", \"collector_endpoint_version\": \"HTTP_JSON\", \"collector_endpoint\": \"/api/v2/spans\", \"shared_span_context\": false, \"trace_id_128bit\": true}}}",
    "prometheus_bind_addr": "0.0.0.0:9102",
    "protocol": "grpc"
  },
  "Expose": {},
  "MeshGateway": {},
  "TransparentProxy": {}
}
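Since the envoy_* fields are JSON escaped into strings (and a cluster-name mismatch there would silently drop traces), here's roughly how I build, sanity-check, and push the entry. This is my own sketch, not an official client: it adds the Kind/Name envelope that Consul's /v1/config endpoint expects, and assumes a local agent on 127.0.0.1:8500:

```python
import json
import urllib.request

# proxy-defaults ConfigEntry; the envoy_* values must be JSON serialized into strings.
ENTRY = {
    "Kind": "proxy-defaults",
    "Name": "global",
    "Config": {
        "protocol": "grpc",
        "prometheus_bind_addr": "0.0.0.0:9102",
        "envoy_stats_flush_interval": "10s",
        "envoy_extra_static_clusters_json": json.dumps({
            "name": "zipkin",
            "type": "STRICT_DNS",
            "connect_timeout": "5s",
            "load_assignment": {
                "cluster_name": "zipkin",
                "endpoints": [{"lb_endpoints": [{"endpoint": {"address": {
                    "socket_address": {
                        "address": "otel-collector-zipkin.service.consul",
                        "port_value": 9411,
                    }}}}]}],
            },
        }),
        "envoy_tracing_json": json.dumps({
            "http": {
                "name": "envoy.tracers.zipkin",
                "typedConfig": {
                    "@type": "type.googleapis.com/envoy.config.trace.v3.ZipkinConfig",
                    "collector_cluster": "zipkin",
                    "collector_endpoint_version": "HTTP_JSON",
                    "collector_endpoint": "/api/v2/spans",
                    "shared_span_context": False,
                    "trace_id_128bit": True,
                },
            }
        }),
    },
}


def validate(entry):
    """Check the escaped envoy_* strings parse and reference the same cluster name."""
    cluster = json.loads(entry["Config"]["envoy_extra_static_clusters_json"])
    tracing = json.loads(entry["Config"]["envoy_tracing_json"])
    assert tracing["http"]["typedConfig"]["collector_cluster"] == cluster["name"]
    return cluster["name"]


def apply_entry(entry, addr="http://127.0.0.1:8500"):
    """PUT the entry to Consul's config entry endpoint."""
    req = urllib.request.Request(
        addr + "/v1/config",
        data=json.dumps(entry).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    return urllib.request.urlopen(req).read()
```

Calling `validate(ENTRY)` before `apply_entry(ENTRY)` catches the escaping/mismatch mistakes; the entry itself applies cleanly, so I don't think the config shape is the problem.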
Questions:
- Is there a way to configure Consul tracing through Nomad, instead of having to use the API? Specifically, I'd love to avoid the Consul DNS complexity and just use Nomad service discovery.
- Are there any known Consul DNS limitations I may be hitting? Alternatively, is there a way to configure DNS caching for the Consul sidecars? I already have the following set up, but clearly it's not being respected:
dns_config {
  service_ttl {
    "*" = "30s"
  }
  node_ttl = "15s"
}
- Any other advice that I should be following to get this working?
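For what it's worth, one thing I plan to try is setting Envoy's respect_dns_ttl on the static cluster, so the STRICT_DNS cluster honors the TTLs Consul DNS returns instead of its own 5s refresh timer (which matches the "refresh rate 5000 ms" line in the logs above). I haven't verified yet that this changes the behavior:

{
  "name": "zipkin",
  "type": "STRICT_DNS",
  "connect_timeout": "5s",
  "respect_dns_ttl": true,
  "load_assignment": {
    "cluster_name": "zipkin",
    "endpoints": [{"lb_endpoints": [{"endpoint": {"address": {"socket_address": {"address": "otel-collector-zipkin.service.consul", "port_value": 9411}}}}]}]
  }
}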
Thank you for all the help