Consistently "cancelling all sessions on worker"

We are running Boundary in a production environment and have recently seen a higher frequency of the following issue:

Sep 28 22:17:44 p1lboundwork02.novalocal boundary[11891]: {"@level":"error","@message":"event.WriteError: unable to write error: status error grace period has expired, canceling all sessions on worker","@timestamp":"2022-09-28T22:17:44.122956Z"}

We believe this is causing silent disconnects for our users: from the client there is no indication of a dropped connection and the port appears to stay bound by Boundary, but clients that do anything more than a simple TCP ping throw errors. We initially believed this was related to database resource throttling, but since upgrading to the release that includes the DB deadlock fix, we no longer see the same resource contention in the Postgres databases. The frequency of "all sessions" being canceled has nonetheless increased, in correspondence with the number of users.

We are also seeing these errors somewhat regularly from our workers:

Sep 28 22:17:44 p1lboundwork02.novalocal boundary[11891]: {"@level":"error","@message":"encountered an error sending an error event","@timestamp":"2022-09-28T22:17:44.122908Z","error:":"event.(Eventer).retrySend: failed to send event: 2 errors occurred:\n\t* event.(Eventer).retrySend: event not written to enough sinks\n\t* context deadline exceeded\n\n"}

Sep 28 22:17:44 p1lboundwork02.novalocal boundary[11891]: {"@level":"error","@message":"event.WriteError: event.(Eventer).writeError: event.(Eventer).retrySend: failed to send event: 2 errors occurred:\n\t* event.(Eventer).retrySend: event not written to enough sinks\n\t* context deadline exceeded\n\n","@timestamp":"2022-09-28T22:17:44.122937Z"}

Sep 28 22:17:44 p1lboundwork02.novalocal boundary[11891]: 2022-09-28T22:17:44.123Z [ERROR] error event: id=e_Uy01YNq3t3 version=v0.1 op="worker.(Worker).sendWorkerStatus" info:msg="error making status request to controller" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"

Help with any one of these errors or pointing us in the right direction to catalog resources and other potential root causes would be extremely appreciated.

The first error is caused when a worker is unable to successfully send a status update to a controller within the specified grace period. The grace period defaults to 15 seconds and can be overridden with the BOUNDARY_STATUS_GRACE_PERIOD environment variable. Boundary cancels all open sessions when it can't send worker status successfully because worker status is how sessions are administratively canceled: if a worker can't communicate with any controller, it "fails closed" and cancels all of its current sessions.
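As a sketch, raising the grace period on a worker looks like the following. BOUNDARY_STATUS_GRACE_PERIOD is the real variable name; the expected value format (a plain number of seconds is assumed here) should be confirmed against the docs for your Boundary version, and under systemd the setting belongs in the unit's `[Service]` section rather than an interactive export:

```shell
# Assumption: the value is interpreted as a number of seconds.
# Double the 15s default to 30s before starting the worker.
export BOUNDARY_STATUS_GRACE_PERIOD=30

# For a systemd-managed worker, put the equivalent in a drop-in, e.g.
# /etc/systemd/system/boundary.service.d/grace-period.conf:
#   [Service]
#   Environment=BOUNDARY_STATUS_GRACE_PERIOD=30
# then `systemctl daemon-reload && systemctl restart boundary`.

printenv BOUNDARY_STATUS_GRACE_PERIOD
```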

The event.(Eventer).retrySend errors occur when Boundary times out trying to write an event (the timeout is 3 seconds).
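Since the message complains that the event was "not written to enough sinks," it's worth reviewing the `events` stanza in the worker's configuration and making sure each configured sink can actually be written to quickly (e.g. stderr or a local disk that isn't under I/O pressure). As a rough sketch only (stanza and field names should be verified against the Boundary configuration docs for your version), an events stanza that sends everything to stderr might look like:

```hcl
events {
  audit_enabled        = true
  observations_enabled = true
  sysevents_enabled    = true

  sink "stderr" {
    name        = "all-events"
    description = "All events sent to stderr"
    event_types = ["*"]
    format      = "cloudevents-json"
  }
}
```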

It would appear that your workers can’t send status consistently and they are also struggling to write events.

We bumped BOUNDARY_STATUS_GRACE_PERIOD up and are still running into this issue even with the longer wait.

This env var is unfortunately misleading and doesn't do what Jim described. Work has been done to break this out into granular timing tweaks, which will land in the 0.11 series, possibly 0.11.1 but maybe 0.11.2.