@DerekStrickland after a little more persistence and digging, I’ve got it working! This blog post helped me out a lot:
I think I did something along the lines of this… Anything you can use?
Nomad:

```hcl
group {
  # blah, blah..

  network {
    mode = "bridge"
    port "metrics_envoy" { to = 9102 }
  }

  service {
    # blah, blah..

    meta {
      # Tag for prometheus scrape-targeting via consul (envoy)
      metrics_port_envoy = "${NOMAD_HOST_PORT_metrics_envoy}"
    }

    connect {
      sidecar_service {
        proxy {
          config {
            # Expose metrics for prometheus (envoy)
            envoy_prometheus_bind_addr = "0.0.0.0:9…
```
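In case it helps anyone else, here's a fuller (untested) sketch of that group stanza. The job/service/task names, application port, and image are placeholders, and I'm assuming the Envoy metrics port is 9102 to match the `metrics_envoy` port mapping above:

```hcl
job "example" {
  datacenters = ["dc1"]

  group "app" {
    network {
      mode = "bridge"
      # Host port that prometheus will scrape for envoy metrics
      port "metrics_envoy" { to = 9102 }
    }

    service {
      name = "app"  # placeholder service name
      port = "8080" # placeholder application port

      meta {
        # Advertised via consul so prometheus can relabel this into the scrape port
        metrics_port_envoy = "${NOMAD_HOST_PORT_metrics_envoy}"
      }

      connect {
        sidecar_service {
          proxy {
            config {
              # Have envoy expose its prometheus endpoint on the bridge network
              envoy_prometheus_bind_addr = "0.0.0.0:9102"
            }
          }
        }
      }
    }

    task "app" {
      driver = "docker"
      config {
        image = "example/app:latest" # placeholder image
      }
    }
  }
}
```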
Also, I was running Consul 1.10.1, but the fix for the issue below was backported to 1.10 in version 1.10.2. After upgrading my client, I was able to successfully scrape Envoy metrics from Prometheus!
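For the scrape side, here's a rough sketch of a prometheus job that discovers services via consul and rewrites the scrape address from that `metrics_port_envoy` meta field (the job name and consul address are assumptions for your setup):

```yaml
scrape_configs:
  - job_name: "envoy-sidecars"
    consul_sd_configs:
      - server: "127.0.0.1:8500"   # assumed local consul agent
    relabel_configs:
      # Only keep services that advertise an envoy metrics port
      - source_labels: ["__meta_consul_service_metadata_metrics_port_envoy"]
        regex: "(.+)"
        action: keep
      # Rewrite the scrape address to <host>:<metrics_port_envoy>
      - source_labels: ["__address__", "__meta_consul_service_metadata_metrics_port_envoy"]
        regex: "([^:]+)(?::\\d+)?;(\\d+)"
        replacement: "$1:$2"
        target_label: "__address__"
```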
opened 09:42PM - 01 Aug 21 UTC · closed 12:10AM - 10 Aug 21 UTC
Labels: type/bug · good first issue · theme/consul-nomad
#### Overview of the Issue
Running an envoy proxy with `consul connect envoy`… and specifying `-admin-bind` with an IP other than `127.0.0.1` breaks prometheus metrics, because the `self_admin` cluster does not receive the correct IP for the admin listener: it is always `127.0.0.1`, regardless of what the `consul connect envoy` command specified. This makes it impossible to bind the admin listener to an IP other than `127.0.0.1` and still correctly scrape prometheus metrics.
My guess is that this happens because the IP is [hard-coded into the bootstrap command](https://github.com/hashicorp/consul/blob/v1.10.1/command/connect/envoy/bootstrap_config.go#L604) and cannot be changed, regardless of what the admin bind flag was set to.
This problem was discovered due to a recent "bug fix" in Nomad that causes the admin listener for envoy sidecars to bind to `127.0.0.2` instead of `127.0.0.1`: https://github.com/hashicorp/nomad/pull/10883. This makes it impossible to use Nomad 1.1.3 and collect prometheus metrics from envoy.
#### Reproduction Steps
1. Start a local consul agent
```shell
consul agent -dev
```
2. In a second terminal, run the following:
```shell
/bin/cat <<"EOM" | consul config write -
Kind = "proxy-defaults"
Name = "global"
Config {
protocol = "http"
envoy_prometheus_bind_addr = "0.0.0.0:9114"
}
EOM
consul connect envoy \
-admin-bind=127.0.0.2:19002 \
-address=127.0.0.1:19001 \
-gateway=mesh \
-register
```
3. In a third terminal, get the listeners on the envoy proxy with `curl -s 127.0.0.2:19002/listeners`. This should show that a prometheus listener was registered, with output like the following:
```
envoy_prometheus_metrics_listener::0.0.0.0:9114
default:127.0.0.1:19001::127.0.0.1:19001
```
4. However, the upstream cluster for `self_admin` will have the wrong IP of `127.0.0.1`, not `127.0.0.2`. Running `curl -s 127.0.0.2:19002/clusters | grep self_admin | sort` confirms this with output like the following:
```
self_admin::127.0.0.1:19002::canary::false
self_admin::127.0.0.1:19002::cx_active::0
self_admin::127.0.0.1:19002::cx_connect_fail::0
self_admin::127.0.0.1:19002::cx_total::0
self_admin::127.0.0.1:19002::health_flags::healthy
self_admin::127.0.0.1:19002::hostname::
self_admin::127.0.0.1:19002::local_origin_success_rate::-1.0
self_admin::127.0.0.1:19002::priority::0
self_admin::127.0.0.1:19002::region::
self_admin::127.0.0.1:19002::rq_active::0
self_admin::127.0.0.1:19002::rq_error::0
self_admin::127.0.0.1:19002::rq_success::0
self_admin::127.0.0.1:19002::rq_timeout::0
self_admin::127.0.0.1:19002::rq_total::0
self_admin::127.0.0.1:19002::sub_zone::
self_admin::127.0.0.1:19002::success_rate::-1.0
self_admin::127.0.0.1:19002::weight::1
self_admin::127.0.0.1:19002::zone::
```
5. And consequently, curling the prometheus listener with `curl -s localhost:9114/metrics` results in a 503:
```
upstream connect error or disconnect/reset before headers. reset reason: connection failure
```
### Consul info for both Client and Server
<details>
<summary>Client info</summary>

```
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 1
	services = 1
build:
	prerelease =
	revision = db839f18
	version = 1.10.1
consul:
	acl = disabled
	bootstrap = false
	known_datacenters = 1
	leader = true
	leader_addr = 127.0.0.1:8300
	server = true
raft:
	applied_index = 77
	commit_index = 77
	fsm_pending = 0
	last_contact = 0
	last_log_index = 77
	last_log_term = 2
	last_snapshot_index = 0
	last_snapshot_term = 0
	latest_configuration = [{Suffrage:Voter ID:1c8a1e81-16d4-86a6-bd21-2af1a0a4de76 Address:127.0.0.1:8300}]
	latest_configuration_index = 0
	num_peers = 0
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 2
runtime:
	arch = amd64
	cpu_count = 8
	goroutines = 131
	max_procs = 8
	os = linux
	version = go1.16.6
serf_lan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 1
	event_time = 2
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 1
	members = 1
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 1
	members = 1
	query_queue = 0
	query_time = 1
```
</details>
### Operating system and Environment details
`envoy --version`
```
envoy version: 98c1c9e9a40804b93b074badad1cdf284b47d58b/1.18.3/clean-getenvoy-b76c773-envoy/RELEASE/BoringSSL
```
Thank you for considering taking another look at this. I still think a Learn guide would be helpful to Nomad users, though!
Good luck with Nomad!