We are running Boundary in a production environment and have recently seen a higher frequency of the following issue:
Sep 28 22:17:44 p1lboundwork02.novalocal boundary[11891]: {"@level":"error","@message":"event.WriteError: unable to write error: status error grace period has expired, canceling all sessions on worker","@timestamp":"2022-09-28T22:17:44.122956Z"}
We believe this is causing silent disconnects for our users: on the client side there is no indication of a dropped connection and the port appears to stay bound by boundary, but clients throw errors as soon as they do anything more than a simple TCP ping. We initially suspected database resource throttling, but since upgrading to a release that includes the DB deadlock fix we no longer see the same resource contention in the Postgres database. Despite that, the frequency of "all sessions" being canceled has increased, roughly in line with the number of users.
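In case configuration context helps, below is a minimal sketch of the kind of controller stanza we are running (all names, URLs, and values are illustrative, not our literal config). We also believe, from what we can find in the docs, that the status grace period may be tunable via the BOUNDARY_STATUS_GRACE_PERIOD environment variable rather than an HCL field, but we are not certain that is the right knob here, so corrections are welcome.

```hcl
# Illustrative controller stanza; names and values are placeholders, not our real config.
controller {
  name        = "controller-1"
  description = "Production Boundary controller"

  database {
    # Postgres connection settings; relevant to the resource-contention theory mentioned above.
    url                  = "postgresql://boundary:REDACTED@postgres.example.internal:5432/boundary"
    max_open_connections = 5
  }
}

# Assumption on our part: we believe the status grace period is raised via the
# BOUNDARY_STATUS_GRACE_PERIOD environment variable on controllers/workers rather
# than a config file setting; please correct us if that is not the right setting.
```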
We are also seeing these errors somewhat regularly from our workers:
Sep 28 22:17:44 p1lboundwork02.novalocal boundary[11891]: {"@level":"error","@message":"encountered an error sending an error event","@timestamp":"2022-09-28T22:17:44.122908Z","error:":"event.(Eventer).retrySend: failed to send event: 2 errors occurred:\n\t* event.(Eventer).retrySend: event not written to enough sinks\n\t* context deadline exceeded\n\n"}
Sep 28 22:17:44 p1lboundwork02.novalocal boundary[11891]: {"@level":"error","@message":"event.WriteError: event.(Eventer).writeError: event.(Eventer).retrySend: failed to send event: 2 errors occurred:\n\t* event.(Eventer).retrySend: event not written to enough sinks\n\t* context deadline exceeded\n\n","@timestamp":"2022-09-28T22:17:44.122937Z"}
Sep 28 22:17:44 p1lboundwork02.novalocal boundary[11891]: 2022-09-28T22:17:44.123Z [ERROR] error event: id=e_Uy01YNq3t3 version=v0.1 op="worker.(Worker).sendWorkerStatus" info:msg="error making status request to controller" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
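Our reading of these is that the eventer cannot write events to enough of its configured sinks before its context deadline expires (for example if stderr/journald or a file sink is slow to accept writes), but we may be off base. For reference, here is a minimal sketch of an events stanza like the one on our workers (sink names and paths are illustrative, not verbatim):

```hcl
# Illustrative events stanza; names and paths are placeholders, not our real config.
events {
  audit_enabled        = false
  observations_enabled = true
  sysevents_enabled    = true

  # Events to stderr (picked up by journald on these hosts).
  sink "stderr" {
    name        = "all-events"
    description = "All events sent to stderr"
    event_types = ["*"]
    format      = "cloudevents-json"
  }

  # A second, file-based sink; "event not written to enough sinks" suggests
  # one or more of these writes is timing out.
  sink {
    name        = "file-sink"
    description = "All events sent to a local file"
    event_types = ["*"]
    format      = "cloudevents-json"
    file {
      path      = "/var/log/boundary"
      file_name = "worker-events.ndjson"
    }
  }
}
```

If that interpretation is right, checking disk latency and journald behavior on the workers would at least give us something concrete to measure.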
Help with any one of these errors, or pointing us in the right direction for cataloging resource constraints and other potential root causes, would be extremely appreciated.