Session getting disconnected again and again

My connections to Boundary keep getting dropped, and I am seeing this error in the controller logs. Any ideas?

{"id":"uxpVjCWeGo","source":"","specversion":"1.0","type":"error","data":{"error":"server.(WorkerAuthRepositoryStorage).storeNodeInformation: db.Create: duplicate key value violates unique constraint \"worker_auth_authorized_pkey\": unique constraint violation: integrity violation: error #1002","error_fields":{"Code":1002,"Msg":"","Op":"server.(WorkerAuthRepositoryStorage).storeNodeInformation","Wrapped":{"Code":1002,"Msg":"","Op":"db.Create","Wrapped":{"Code":1002,"Msg":"duplicate key value violates unique constraint \"worker_auth_authorized_pkey\"","Op":"","Wrapped":{"Code":1002,"Msg":"unique constraint violation","Op":"","Wrapped":null}}}},"id":"e_JuAjlwWFY8","version":"v0.1","op":"server.(WorkerAuthRepositoryStorage).storeNodeInformation"},"datacontentype":"application/cloudevents","time":"2023-12-06T06:47:22.028584534Z"}

We’ve seen this before, although not in a recent version of Boundary – did you upgrade?

Can you try deleting the worker from the controller and re-authorizing it and see if that helps?

That helps, but after some time I start getting the same issue again. Just an FYI, I am using 0.13.0.

I believe you are actually hitting the bug fixed by Pull Request #3389 on hashicorp/boundary ("bug(workerAuth): allow duplicate workerAuth inserts if records match"), which shipped in 0.13.1 (you may have to remove and re-add the worker after upgrading).

Thanks for the reply @jeff
I upgraded to 0.14.3, but I am still facing an issue where my workers just stop working. As a workaround I am restarting the workers every hour.
Any help would be appreciated, thanks.

Is it the same error message?

The one you specified in the original post would only be an issue at registration or credential rotation time.

What do the logs from your workers look like?

Yup, I don’t see that error right now. In fact I don’t see any error logs at all right now, but for some reason it just stops working.

Do you have any firewalls that might be dropping persistent connections? I unfortunately don’t have a lot of advice to give if you’re not seeing any errors on either the controller or worker logs.

I don’t have firewalls on my servers. Can you suggest any steps I can take when it hangs that might give you some clue about the error?
Also, just an FYI, I am running Boundary on my Kubernetes setup with 3 workers and 3 controllers.

I’m not sure if Kube networking might play some part here, but if the worker totally wedges, you can try sending a SIGQUIT (on the console you can do this with Ctrl-\) and posting a link to the output, and we can see if there seems to be a deadlock somewhere.

This is the output in the logs that I got.

That looks like it’s partial, was that the full output? Did your terminal not have enough scrollback configured to hold the whole output perhaps?

I think this is the full output, I will check once more.

@jeff I think this will have full data

Thanks for your help :pray:

Sorry for the delay, I’ve been on PTO.

In the log I see a few things:

  • It appears that there is at least one connection that is in the process of being proxied; nothing seems to have interrupted it
  • At the same time, the worker is trying to drain connections, which is what would happen if the worker was in the process of being shut down, but is stalled there
  • There is another stall - a connection being made upstream that is stuck in the TLS handshaking process

I hate to ask you to upgrade again, but we just saw someone else with a few extremely similar behaviors, and it turned out to be due to some undocumented gRPC behavior that we worked around in Pull Request #4535 on hashicorp/boundary ("Fix issue with workers connecting in high latency conditions"; despite the title, very high latencies are not needed to see this, 25 ms or so is enough). I’m not 100% convinced it will help you, but it very well may; if you can upgrade to 0.15.3 and see if that helps, that would be great.

One other thing: we’ve identified an issue that could prevent the listener from being closed properly and are working on a fix; based on my analysis, that may also be related to what you’re seeing. Once 0.16.1 is out it may be relevant to your issue.