Consistently "cancelling all sessions on worker"

We are running Boundary in a production environment and have recently seen a higher frequency of the following issue:

Sep 28 22:17:44 p1lboundwork02.novalocal boundary[11891]: {"@level":"error","@message":"event.WriteError: unable to write error: status error grace period has expired, canceling all sessions on worker","@timestamp":"2022-09-28T22:17:44.122956Z"}

We believe this is causing silent disconnects for our users: from the client there is no indication of a dropped connection and the port appears to stay bound by Boundary, but clients that do anything more than a simple TCP ping throw errors. We initially believed this was related to database resource throttling, but since upgrading to the release that includes the DB deadlock fix, we no longer see the same resource contention in the Postgres databases. The frequency of "all sessions" being canceled has nonetheless increased, in correspondence with the number of users.

We are also seeing these errors somewhat regularly from our workers:

Sep 28 22:17:44 p1lboundwork02.novalocal boundary[11891]: {"@level":"error","@message":"encountered an error sending an error event","@timestamp":"2022-09-28T22:17:44.122908Z","error:":"event.(Eventer).retrySend: failed to send event: 2 errors occurred:\n\t* event.(Eventer).retrySend: event not written to enough sinks\n\t* context deadline exceeded\n\n"}

Sep 28 22:17:44 p1lboundwork02.novalocal boundary[11891]: {"@level":"error","@message":"event.WriteError: event.(Eventer).writeError: event.(Eventer).retrySend: failed to send event: 2 errors occurred:\n\t* event.(Eventer).retrySend: event not written to enough sinks\n\t* context deadline exceeded\n\n","@timestamp":"2022-09-28T22:17:44.122937Z"}

Sep 28 22:17:44 p1lboundwork02.novalocal boundary[11891]: 2022-09-28T22:17:44.123Z [ERROR] error event: id=e_Uy01YNq3t3 version=v0.1 op="worker.(Worker).sendWorkerStatus" info:msg="error making status request to controller" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"

Help with any one of these errors or pointing us in the right direction to catalog resources and other potential root causes would be extremely appreciated.

The first error is caused when a worker is unable to successfully send a status update to a controller within the specified grace period. The grace period defaults to 15 seconds and can be overridden with the BOUNDARY_STATUS_GRACE_PERIOD environment variable. Boundary cancels all open sessions when it can't send worker status successfully because worker status is how sessions are administratively canceled: if a worker can't communicate with any controller, it "fails closed" and cancels all of its current sessions.
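As a sketch, raising the grace period on a worker looks like the following. BOUNDARY_STATUS_GRACE_PERIOD is the real variable name; the expected value format (a plain number of seconds is assumed here) should be confirmed against the docs for your Boundary version, and under systemd the setting belongs in the unit's `[Service]` section rather than an interactive export:

```shell
# Assumption: the value is interpreted as a number of seconds.
# Double the 15s default to 30s before starting the worker.
export BOUNDARY_STATUS_GRACE_PERIOD=30

# For a systemd-managed worker, put the equivalent in a drop-in, e.g.
# /etc/systemd/system/boundary.service.d/grace-period.conf:
#   [Service]
#   Environment=BOUNDARY_STATUS_GRACE_PERIOD=30
# then `systemctl daemon-reload && systemctl restart boundary`.

printenv BOUNDARY_STATUS_GRACE_PERIOD
```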

The event.(Eventer).retrySend errors occur when Boundary times out trying to write an event (the timeout is 3 seconds).
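Since the message complains that the event was "not written to enough sinks," it's worth reviewing the `events` stanza in the worker's configuration and making sure each configured sink can actually be written to quickly (e.g. stderr or a local disk that isn't under I/O pressure). As a rough sketch only (stanza and field names should be verified against the Boundary configuration docs for your version), an events stanza that sends everything to stderr might look like:

```hcl
events {
  audit_enabled        = true
  observations_enabled = true
  sysevents_enabled    = true

  sink "stderr" {
    name        = "all-events"
    description = "All events sent to stderr"
    event_types = ["*"]
    format      = "cloudevents-json"
  }
}
```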

It would appear that your workers can’t send status consistently and they are also struggling to write events.

We bumped BOUNDARY_STATUS_GRACE_PERIOD up and are still running into this issue even with the longer wait.

This env var is unfortunately misleading and doesn't do what Jim described. Work has been done to break this out into granular timing tweaks, which will land in the 0.11 series, possibly 0.11.1 but maybe 0.11.2.