Connection refused on sidecar proxy

Hello guys,

I’m facing an issue with a sidecar proxy in a cluster with TLS enabled. I’m trying to deploy a service that should reach a service outside the service mesh through a terminating gateway. I registered the external service in Consul, and then deployed a job containing both a terminating gateway service and the service I want to run with a sidecar proxy.
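For reference, an external service like this can be registered with a service definition along these lines and loaded with consul services register (just a sketch; the address and port are placeholders, not the real endpoint):

# external-sso.hcl (sketch; placeholder address and port)
service {
  name    = "sso"
  address = "sso.example.internal"
  port    = 443
}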

Job.hcl
job "sso" {  # job name assumed
  datacenters = ["dc1"]
  type = "service"

  group "gateway" {
    network {
      mode = "bridge"
    }

    service {
      name = "sso-gateway"

      connect {
        gateway {
          proxy {}

          terminating {
            service {
              name = "sso"
            }
          }
        }

        sidecar_task {
          config {
            image = "xxxxxxxxxxx/library/envoy"
          }
        }
      }
    }
  }

  group "testaccount1" {
    count = 1

    network {
      mode = "bridge"
      port "http" {
        to = 8080
        static = 8080
      }
    }

    service {
      name = "testaccount1"
      port = "http"
      provider = "consul"

      connect {
        sidecar_service {
          proxy {
            upstreams {
              destination_name = "sso"
              local_bind_port  = 443
            }
          }
        }

        sidecar_task {
          config {
            image = "xxxxxxxxxx/library/envoy"
          }
        }
      }
    }
    task "testaccount1" {
      driver = "docker"
      env {
      }
      config {
        image = "xxxxxxxxx/account"
        ports = ["http"]

        auth {
          username = "xxxxx"
          password = "xxxxx"
        }
      }
    }
  }
}

This snippet deploys the terminating gateway and my service with its sidecar proxy. However, Consul’s health check on that sidecar proxy fails with dial tcp 10.4.5.26:25299: connect: connection refused, and in the Envoy sidecar logs I can see this:

envoy logs
[2023-10-03 13:32:50.415][1][info][admin] [source/server/admin/admin.cc:66] admin address: 127.0.0.2:19001
[2023-10-03 13:32:50.416][1][info][config] [source/server/configuration_impl.cc:131] loading tracing configuration
[2023-10-03 13:32:50.416][1][info][config] [source/server/configuration_impl.cc:91] loading 0 static secret(s)
[2023-10-03 13:32:50.416][1][info][config] [source/server/configuration_impl.cc:97] loading 1 cluster(s)
[2023-10-03 13:32:50.467][1][info][config] [source/server/configuration_impl.cc:101] loading 0 listener(s)
[2023-10-03 13:32:50.467][1][info][config] [source/server/configuration_impl.cc:113] loading stats configuration
[2023-10-03 13:32:50.468][1][info][runtime] [source/common/runtime/runtime_impl.cc:463] RTDS has finished initialization
[2023-10-03 13:32:50.468][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:221] cm init: initializing cds
[2023-10-03 13:32:50.468][1][warning][main] [source/server/server.cc:802] there is no configured limit to the number of allowed active connections. Set a limit via the runtime key overload.global_downstream_max_connections
[2023-10-03 13:32:50.469][1][info][main] [source/server/server.cc:923] starting main dispatch loop
[2023-10-03 13:33:29.302][1][warning][config] [./source/common/config/grpc_stream.h:191] DeltaAggregatedResources gRPC config stream to local_agent closed since 38s ago: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: immediate connect error: No such file or directory
[2023-10-03 13:33:45.667][1][warning][config] [./source/common/config/grpc_stream.h:191] DeltaAggregatedResources gRPC config stream to local_agent closed since 55s ago: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: immediate connect error: No such file or directory
[2023-10-03 13:34:08.535][1][warning][config] [./source/common/config/grpc_stream.h:191] DeltaAggregatedResources gRPC config stream to local_agent closed since 78s ago: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: immediate connect error: No such file or directory
[2023-10-03 13:34:16.799][1][warning][config] [./source/common/config/grpc_stream.h:191] DeltaAggregatedResources gRPC config stream to local_agent closed since 86s ago: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: immediate connect error: No such file or directory
[2023-10-03 13:34:17.366][1][warning][config] [./source/common/config/grpc_stream.h:191] DeltaAggregatedResources gRPC config stream to local_agent closed since 86s ago: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: immediate connect error: No such file or directory

Those last log messages made me think that gRPC is not working as it should. TLS is enabled in Nomad and likewise in Consul.

nomad server config
datacenter = "dc1"
data_dir = "/opt/nomad/data"
bind_addr = "0.0.0.0"

server {
  enabled = true
  bootstrap_expect = 3
  encrypt = "xxxxxxxxxx"
}

tls {
  http = true
  rpc  = true

  ca_file   = "/etc/pki/nomad/nomad-agent-ca.pem"
  cert_file = "/etc/pki/nomad/global-server-nomad.pem"
  key_file  = "/etc/pki/nomad/global-server-nomad-key.pem"

  verify_server_hostname = true
  verify_https_client    = true
}

client {
  enabled = false
}

consul {
  address = "127.0.0.1:8501"
  token = "xxxxxxxxxxxxx"
  grpc_ca_file = "/etc/pki/consul/consul-agent-ca.pem"
  grpc_address = "127.0.0.1:8503"
  ca_file      = "/etc/pki/consul/consul-agent-ca.pem"
  cert_file    = "/etc/pki/consul/dc1-server-consul-1.pem"
  key_file     = "/etc/pki/consul/dc1-server-consul-1-key.pem"
  ssl          = true
}

acl {
  enabled = true
}
consul server config
data_dir = "/opt/consul"

node_name = "server2"

client_addr = "0.0.0.0"
bind_addr = "10.4.5.22"
advertise_addr = "10.4.5.22"

encrypt = "xxxxxxxxxxxxxxxxx"
encrypt_verify_incoming = true
encrypt_verify_outgoing = true

ui_config {
  enabled = true
}

rejoin_after_leave = true

verify_incoming = true
verify_outgoing = true
verify_server_hostname = true
ca_file = "/etc/pki/consul/consul-agent-ca.pem"
cert_file = "/etc/pki/consul/dc1-server-consul-1.pem"
key_file = "/etc/pki/consul/dc1-server-consul-1-key.pem"

ports = {
  https = 8501
  http = 8500
  grpc = 8502
  grpc_tls = 8503
  dns = -1
}

acl {
  enabled = true
  default_policy = "deny"
  tokens {
    default = "xxxxxxxxxxxxx"
  }
}


server = true
bootstrap_expect = 3

log_level = "DEBUG"
log_file = "/var/log/consul/"
log_rotate_max_files = 30
used versions
Nomad v1.6.2
BuildDate 2023-09-13T16:47:25Z
Revision 73e372ad94033db2ceaf53468b270a31544c23fd
Consul v1.16.2
Revision 68f81912
Build Date 2023-09-19T19:29:18Z

I’m not sure what could be wrong in my case.

Best Regards

Hi @Luke_b,

The error you are seeing occurs when Envoy is unable to find the socket file (/alloc/tmp/consul_grpc.sock) that is used to talk to the Consul xDS port.

Could you please verify whether this is the case by running this command?

# replace the alloc-id with the affected allocation id
$ nomad fs <alloc-id> alloc/tmp
Mode        Size  Modified Time              Name
Srwxrwxrwx  0 B   2023-10-06T14:12:15+11:00  consul_grpc.sock

In your case, you probably won’t find this .sock file, which would explain the error you are seeing. If this is the case, we must figure out why the socket file is missing.
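If the socket does turn out to be missing, one thing worth double-checking is the consul block on the Nomad client agents: with Consul serving gRPC over TLS on port 8503, the clients need the gRPC address and CA configured, similar to what you already posted for the servers. A sketch reusing your paths (the client certificate file names are placeholders):

consul {
  address      = "127.0.0.1:8501"
  ssl          = true
  grpc_address = "127.0.0.1:8503"
  grpc_ca_file = "/etc/pki/consul/consul-agent-ca.pem"
  ca_file      = "/etc/pki/consul/consul-agent-ca.pem"
  cert_file    = "/etc/pki/consul/dc1-client-consul-0.pem"     # placeholder
  key_file     = "/etc/pki/consul/dc1-client-consul-0-key.pem" # placeholder
  token        = "xxxxxxxxxxxxx"
}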

Is this just a one-off issue, or do you have other jobs that also run into the same issue?

Hi,

I’m really sorry for my late response.

When I tried to reproduce this issue in our unsecured cluster, I found that everything works as expected when the default Envoy image is pulled. I then checked which Envoy image our private Docker registry provides and spotted envoy:distroless. So you were on the right track to solving this issue.
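For anyone hitting the same thing: pointing the sidecar_task image (in both groups) at a full, non-distroless Envoy build fixed it for us, roughly like this (the registry path and tag are placeholders):

sidecar_task {
  config {
    # a standard Envoy build instead of the distroless variant
    image = "xxxxxxxxxxx/library/envoy:v1.27.0"  # placeholder tag
  }
}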

Thank you
