Hi all,
I’m joining this thread in the hope that someone can help us resolve this mystery, or at least give us some directions or suggestions for our investigation.
To add some details and background on when we see those errors…
We’re using Consul with Connect enabled; servers and agents are currently on version 1.12.9.
We launch services with multiple tasks/instances in ECS, with a dedicated agent and Envoy for each task/application instance. Recently, some services started having frequent trouble launching instances. There is no regular pattern: it happens more often with some services and less often with others, even though all services use the same Consul registration setup.
Sometimes they launch without issues; sometimes they struggle and are recycled once or twice before they launch properly.
What’s interesting about those struggling instances is what Pawan shared above: in the Consul agent’s logs we see warnings and errors like these:
[WARN] agent.cache: handling error in Cache.Notify: cache-type=service-http-checks error="Internal cache failure: service '' not in agent state" index=0
[ERROR] agent.proxycfg: Failed to handle update from watch: kind=connect-proxy proxy=<service_name>-proxy service_id=<service_name>-proxy id=service-http-checks: error="error filling agent cache: Internal cache failure: service '' not in agent state"
In these cases we also see the Envoy proxy failing to connect to the Consul agent within the same task (hence localhost), getting “connection refused” as shown below:
Error connecting to Consul agent: Get "http://127.0.0.1:8500/v1/agent/self": dial tcp 127.0.0.1:8500: connect: connection refused
failed fetch proxy config from local agent: Get "http://127.0.0.1:8500/v1/agent/service/<service_name>-proxy": dial tcp 127.0.0.1:8500: connect: connection refused
In such cases the application also tries and fails to access Consul’s K/V store, again with the connection refused by the Consul agent locally in the task.
Eventually such a task dies and ECS relaunches it, which sometimes works on the first attempt and sometimes only after a few attempts. There seems to be no clear rule about it.
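One way to tell whether the agent is merely slow to start listening or is gone altogether might be to poll its HTTP port from inside the task before Envoy and the application start. Below is a minimal sketch in Python, assuming the agent’s HTTP API is at 127.0.0.1:8500 as in the errors above; the function names `agent_is_up` and `wait_for_agent` are hypothetical and not part of any Consul tooling.

```python
import time
import urllib.error
import urllib.request


def agent_is_up(url, timeout=1.0):
    """Return True if something answers the HTTP request at all."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The agent answered with an error status (for example when ACLs
        # reject the request), so the port is open and the agent is running.
        return True
    except (urllib.error.URLError, OSError):
        # Covers "connection refused" while nothing is listening on the port.
        return False


def wait_for_agent(url, attempts=30, delay=1.0, probe=agent_is_up, sleep=time.sleep):
    """Poll until probe(url) succeeds.

    Returns the 1-based attempt number of the first success,
    or None if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        if probe(url):
            return attempt
        sleep(delay)
    return None
```

Calling `wait_for_agent("http://127.0.0.1:8500/v1/agent/self")` from the container entrypoint returns the attempt at which the agent first answered, or `None` if it never did. Logging that value per task launch could show whether the failing tasks are the ones where the agent comes up late or not at all.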
Wild guess: might we be hitting some performance issues or limits with Consul registering agents/services for our ECS tasks?
Since this kind of trouble is new to us, I’m not sure what to make of those messages saying “not in agent state”.
It’s also unclear to me why the service name in the cache failure error is an empty string.
I’m not sure whether what we’re seeing in those logs is the cause of our troubles or rather a result of some other issue that I can’t clearly see yet.
From the logs, I see the agents connecting to the cluster properly and finding server instances.
From the servers’ perspective, I also see the agent connecting, with a join event logged, and after a while disconnecting with an agent leave event.
Would that mean our troubles are more likely related to the task’s local Consul agent?
Any suggestions are very welcome.
Kind regards
Jacek