[boundary 0.7.5] Sessions not being cleaned

Boundary Version: 0.7.5
Database: Amazon RDS PostgreSQL 11.13

We’ve been experiencing huge slowdowns in Boundary, with calls to the sessions API endpoints taking up to 90s :scream:
After looking at the Boundary database, we noticed that the session_list view contained 24k+ sessions in the active state and 25k+ sessions in the pending state.

Our only solution was to manually purge the sessions to return to normal performance:
DELETE FROM session WHERE create_time < NOW() - INTERVAL '1 days';

Is there a way to debug this and understand why old sessions are not being cleaned up? From what I see in the code, the controller should try to clean up terminated and expired sessions: boundary/repository_session.go at v0.7.5 · hashicorp/boundary · GitHub
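Before running a blanket DELETE like the one above, it may help to see which states the old rows are actually in. The following is only a minimal sketch: it assumes the rows have already been exported from the database as (state, create_time) tuples, and the function name and row shape are hypothetical, not part of Boundary.

```python
from collections import Counter
from datetime import datetime, timedelta


def stale_sessions_by_state(rows, now: datetime, max_age=timedelta(days=1)):
    """Count sessions older than max_age, grouped by state.

    rows: iterable of (state, create_time) tuples. This shape is an
    assumption for illustration, e.g. the result of a SELECT on the
    session data exported by hand.
    """
    cutoff = now - max_age
    # Keep only rows created before the cutoff, then tally them per state.
    return Counter(state for state, create_time in rows if create_time < cutoff)
```

A large count of old rows in the pending or active state would suggest that the sessions are never being transitioned, rather than simply never being deleted.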

I don’t have any debugging help to offer, but if this is easily reproducible, can you test again with 0.7.6? There have been some changes around session handling in that version. I don’t know if any of them would fix your issue, but it would be a useful data point either way.

Hi, thanks for reporting this. One thing to note: the cleanup job that you reference does not delete the sessions from the database, but rather transitions the sessions to the terminated state. It looks for:

// * sessions that have exhausted their connection limit and all their connections are closed.
// * sessions that are expired and all their connections are closed.
// * sessions that are canceling and all their connections are closed

That said, I would expect sessions that are older than 8 hours to be expired, unless the targets are configured with a longer session_max_seconds, and I would expect the expired sessions to be transitioned to the terminated state. Can you provide some additional details:

  • Can you confirm what the session_max_seconds is for the corresponding targets?
  • Are there any error logs from either the controllers or workers?
  • What are the session API calls that are being made? Are they from the Desktop Client or from the CLI?
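The three cleanup criteria quoted above can be sketched as a single predicate. This is an illustration of the rule as described in the comments, not Boundary's actual implementation: the Session fields and the should_terminate name are hypothetical, and a connection limit of -1 is assumed to mean "unlimited".

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Session:
    # Hypothetical fields for illustration; not Boundary's actual schema.
    state: str                  # e.g. "pending", "active", "canceling"
    expiration_time: datetime
    connection_limit: int       # assumed: -1 means unlimited
    connections_made: int
    open_connections: int


def should_terminate(s: Session, now: datetime) -> bool:
    """Return True if the session matches one of the three cleanup criteria."""
    if s.open_connections > 0:
        # Every criterion requires all connections to be closed.
        return False
    exhausted = s.connection_limit > 0 and s.connections_made >= s.connection_limit
    expired = now >= s.expiration_time
    canceling = s.state == "canceling"
    return exhausted or expired or canceling
```

Under this reading, a session that is past its expiration time but still has an open connection would not be picked up by the cleanup job, which is one thing worth checking for the stuck sessions.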

Hi. Sorry for the late follow-up.

@omkensey I will try to see if we can upgrade to 0.7.6. But I seem to remember that every time we upgrade, we have to ask users to generate new passwords :frowning:

@tmessi

  • I checked directly in the database, and session_max_seconds is set to 28800 (8 hours)

  • these are the error logs I could see on the controller

Apr 19 13:56:14 ip-172-31-13-19 boundary[23135]: 2022-04-19T13:56:14.410+0700 [ERROR] encountered an error sending an error event:
Apr 19 13:56:14 ip-172-31-13-19 boundary[23135]: error:=
Apr 19 13:56:14 ip-172-31-13-19 boundary[23135]: | 5 errors occurred:
Apr 19 13:56:14 ip-172-31-13-19 boundary[23135]: | * event.(Eventer).retrySend: event not written to enough sinks
Apr 19 13:56:14 ip-172-31-13-19 boundary[23135]: | * event.(Eventer).retrySend: event not written to enough sinks
Apr 19 13:56:14 ip-172-31-13-19 boundary[23135]: | * event.(Eventer).retrySend: event not written to enough sinks
Apr 19 13:56:14 ip-172-31-13-19 boundary[23135]: | * event.(Eventer).retrySend: event not written to enough sinks
Apr 19 13:56:14 ip-172-31-13-19 boundary[23135]: | * event.(Eventer).retrySend: reached max of 3: too many retries
Apr 19 13:56:14 ip-172-31-13-19 boundary[23135]: |
Apr 19 13:56:14 ip-172-31-13-19 boundary[23135]:
Apr 19 13:56:14 ip-172-31-13-19 boundary[23135]: 2022-04-19T13:56:14.410+0700 [ERROR] event.WriteError: event.(Eventer).writeError: 5 errors occurred:
Apr 19 13:56:14 ip-172-31-13-19 boundary[23135]: * event.(Eventer).retrySend: event not written to enough sinks
Apr 19 13:56:14 ip-172-31-13-19 boundary[23135]: * event.(Eventer).retrySend: event not written to enough sinks
Apr 19 13:56:14 ip-172-31-13-19 boundary[23135]: * event.(Eventer).retrySend: event not written to enough sinks
Apr 19 13:56:14 ip-172-31-13-19 boundary[23135]: * event.(Eventer).retrySend: event not written to enough sinks
Apr 19 13:56:14 ip-172-31-13-19 boundary[23135]: * event.(Eventer).retrySend: reached max of 3: too many retries
Apr 19 13:56:14 ip-172-31-13-19 boundary[23135]: 2022-04-19T13:56:14.410+0700 [ERROR] event.WriteError: unable to write error: rpc error: code = DeadlineExceeded desc = context deadline exceeded

  • the API calls being made and taking a very long time were to the sessions endpoints

Apr 19 15:34:32 ip-172-31-17-193 boundary[37271]: {"id":"OwYjUWmyIa","source":"https://hashicorp.com/boundary/boundary-controller-3","specversion":"1.0","type":"observation","data":{"latency-ms":90134.103818,"request_info":{"id":"gtraceid_EbSc0wROlGgnnYeICpvq","method":"GET","path":"/v1/sessions?filter=("%2Fitem%2Fuser_id"+%3D%3D+"u_1FGuqi0QLe")\u0026recursive=true\u0026scope_id=o_ijIHpRpl8t","public_id":"at_0svGug6K3X","client_ip":"xx.xxx.xxx.xxx"},"status":500,"stop":"2022-04-19T15:34:32.896815692+07:00","version":"v0.1"},"datacontentype":"application/cloudevents","time":"2022-04-19T15:34:32.897148626+07:00"}

Do you have any idea what could cause the issue? Or do you think we should just try to upgrade to the latest Boundary version?
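To find every slow call rather than a single sample, the observation events in the controller log can be filtered by latency. This is a rough sketch only: it assumes each event sits on one line after the syslog prefix and is valid JSON in the raw log (the line pasted above appears to have lost some quote escaping when posted). The field names are the ones visible in that event; the function name and threshold are hypothetical.

```python
import json


def slow_requests(lines, threshold_ms=1000.0):
    """Yield (latency_ms, method, path, status) for slow observation events."""
    for line in lines:
        start = line.find("{")  # skip the syslog prefix before the JSON body
        if start == -1:
            continue
        try:
            event = json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # not a JSON event line (e.g. a plain [ERROR] message)
        if event.get("type") != "observation":
            continue
        data = event.get("data", {})
        latency = data.get("latency-ms")
        if latency is None or latency < threshold_ms:
            continue
        req = data.get("request_info", {})
        yield latency, req.get("method"), req.get("path"), data.get("status")
```

Feeding it the lines of the controller log (for example, from an open file object) and sorting the results by latency would show whether only the recursive /v1/sessions listing is slow or whether other endpoints are affected too.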

I’ve been using version 0.8.1.1, and I see the same errors in the log. Cleaning the view really helps.

I’m upgrading to the latest stable version to see if it works better.