Agent.proxycfg error on consul agent

Hi team,
We have a Consul agent running as a sidecar container in our ECS system. It was working fine until recently, but now we are getting multiple errors from the agent.

We have 100+ services running. So far we see these errors on only a few of them, but I am afraid the problem will slowly spread to other services as well.

Errors:

[ERROR] agent.proxycfg: Failed to handle update from watch: kind=connect-proxy proxy=<service-name>-service-proxy service_id=<service-name>-proxy id=service-http-checks: error="error filling agent cache: Internal cache failure: service '' not in agent state"

agent.proxycfg: Failed to handle update from watch: kind=connect-proxy proxy=<service-name>-proxy service_id=<service-name>-service-proxy id=service-http-checks: error="error filling agent cache: Internal cache failure: service '' not in agent state"

agent.cache: handling error in Cache.Notify: cache-type=service-http-checks error="Internal cache failure: service '' not in agent state" index=0

[ERROR] agent.envoy: Error receiving new DeltaDiscoveryRequest; closing request channel: error="rpc error: code = Canceled desc = context canceled"

Could someone please help me understand this issue, and suggest a solution if you have one?

Hi all,

I’m joining this thread in the hope that someone can help us resolve this mystery, or at least suggest some directions for our investigation.
To add some details and background on when we see these errors:

We’re using Consul with Connect enabled; servers and agents are currently on version 1.12.9.

We launch services with multiple tasks/instances in ECS, with a dedicated Consul agent and Envoy for each task/application instance. Recently, some services started having frequent trouble launching instances (nothing regular - it happens more often for some services and less for others, even though all services use the same Consul registration setup).
Sometimes they launch without issues; sometimes they struggle and get recycled once or twice before they come up properly.

What’s interesting about those struggling instances is that, as Pawan shared above, the Consul agent’s logs show warnings and errors like:

[WARN]  agent.cache: handling error in Cache.Notify: cache-type=service-http-checks error="Internal cache failure: service '' not in agent state" index=0

[ERROR] agent.proxycfg: Failed to handle update from watch: kind=connect-proxy proxy=<service_name>-proxy service_id=<service_name>-proxy id=service-http-checks: error="error filling agent cache: Internal cache failure: service '' not in agent state"

What we also see in these cases: the Envoy proxy has trouble connecting to the Consul agent within the same task (hence localhost), getting “connection refused” errors like these:

Error connecting to Consul agent: Get "http://127.0.0.1:8500/v1/agent/self": dial tcp 127.0.0.1:8500: connect: connection refused

failed fetch proxy config from local agent: Get "http://127.0.0.1:8500/v1/agent/service/<service_name>-proxy": dial tcp 127.0.0.1:8500: connect: connection refused

In such cases the application also tries and fails to access Consul’s K/V store - again with the connection refused by the local Consul agent inside the task.

Eventually such a task dies and ECS relaunches it - which sometimes works on the first attempt, sometimes only after a few attempts. There seems to be no clear rule about it.
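One way to guard against the “connection refused” startup race described above is to gate dependent processes on the local agent’s HTTP API actually answering. A minimal sketch in Python, polling the agent’s `/v1/agent/self` endpoint - the URL, retry count, and delay here are illustrative assumptions, not values from this thread:

```python
# Hypothetical readiness gate for the task-local Consul agent: poll the
# agent's /v1/agent/self endpoint until it responds, so dependent
# containers (Envoy, the application) only start once the agent is
# actually serving its HTTP API.
import time
import urllib.error
import urllib.request


def wait_for_agent(url="http://127.0.0.1:8500/v1/agent/self",
                   retries=30, delay=1.0):
    """Return True once the agent answers with HTTP 200, False on timeout."""
    for _ in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # agent not up yet (connection refused, DNS, etc.)
        time.sleep(delay)
    return False
```

A container entrypoint wrapper could call `wait_for_agent()` and only exec the real process once it returns True.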

A wild guess - might we be hitting some performance issues or limits with Consul registering agents/services for our ECS tasks?

Since this kind of trouble is a new experience for us, we’re not sure what to make of those messages saying “not in agent state”.
It’s also unclear to me why the service name in the cache failure error is an empty string.

I’m not sure whether what we’re seeing in those logs is the cause of our troubles or rather a symptom of some other issue that I don’t clearly see yet.

From the logs, I see the agents connecting to the cluster properly and finding the server instances.
From the servers’ perspective I also see the agent connecting, with a join event logged - and, after a while, disconnecting with an agent leave event.

Would that mean our troubles are related to the task’s local Consul agent rather than to the servers?

Any suggestions are very welcome.

Kind regards
Jacek

A small update on the case - it turns out that the following messages also sometimes appear during “normal” operation, for tasks that are otherwise healthy:

[WARN]  agent.cache: handling error in Cache.Notify: cache-type=service-http-checks error="Internal cache failure: service '' not in agent state" index=0

[ERROR] agent.proxycfg: Failed to handle update from watch: kind=connect-proxy proxy=<service_name>-proxy service_id=<service_name>-proxy id=service-http-checks: error="error filling agent cache: Internal cache failure: service '' not in agent state"

So it may be that they’re not strongly related, as a cause, to the launch troubles - but they’re still something we’re looking to explain and understand.

Well - to conclude this a bit myself.
I’m still not sure why I see those cache- and proxycfg-related entries, which seem to be related to slow agent startup.
For now the problem has been worked around with updated container dependencies in ECS, by making sure the Consul agent has started properly and is responding before other containers try to use it.
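For reference, that kind of startup ordering can be expressed in the ECS task definition with a `healthCheck` on the agent container and `dependsOn` conditions on the others. A sketch under our assumptions - container names, the curl-based check command, and the timings are illustrative, not taken from our actual task definition:

```json
{
  "containerDefinitions": [
    {
      "name": "consul-agent",
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -sf http://127.0.0.1:8500/v1/agent/self || exit 1"],
        "interval": 10,
        "timeout": 5,
        "retries": 5,
        "startPeriod": 15
      }
    },
    {
      "name": "envoy-proxy",
      "dependsOn": [
        { "containerName": "consul-agent", "condition": "HEALTHY" }
      ]
    },
    {
      "name": "app",
      "dependsOn": [
        { "containerName": "consul-agent", "condition": "HEALTHY" },
        { "containerName": "envoy-proxy", "condition": "START" }
      ]
    }
  ]
}
```

With the `HEALTHY` condition, ECS holds back the dependent containers until the agent’s health check passes, instead of letting Envoy and the application race a still-starting agent.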