Worker unresponsive after some time

swapnil-raj · November 7, 2022, 4:50pm

I am running Boundary v0.11.0 on Kubernetes (k8s) with 3 controllers and 3 workers. When I start the Boundary controllers and workers, everything works fine, but after some time the workers become unresponsive.

controller config 
disable_mlock = true
controller {
    name = "boundary-controller"
    description = "Boundary Controller"
    public_cluster_addr = "boundary-controller-hl.boundary.svc.cluster.local"
    database {
        url = "postgres url"
    }
}
listener "tcp" {
    purpose = "api"
    address = "0.0.0.0"
    tls_disable = true
    public_addr = "boundary-controller-hl.boundary.svc.cluster.local"
}
listener "tcp" {
    address = "0.0.0.0"
    purpose = "cluster"
    tls_disable = true
    public_addr = "boundary-controller-hl.boundary.svc.cluster.local"
}
kms "aead" {
    purpose = "root"
    aead_type = "aes-gcm"
    key = "key"
    key_id = "global_root"
    public_addr = "boundary-controller-hl.boundary.svc.cluster.local"
}
kms "aead" {
    purpose = "worker-auth"
    aead_type = "aes-gcm"
    key = "key"
    key_id = "global_worker-auth"
}
kms "aead" {
    purpose = "recovery"
    aead_type = "aes-gcm"
    key = "key="
    key_id = "global_recovery"
}

worker config

disable_mlock = true
listener "tcp" {
    purpose = "proxy"
    address = "0.0.0.0"
    tls_disable = true
}
worker {
    name = "env://HOSTNAME"
    description = "Boundar k8s worker"
    initial_upstreams = ["boundary-controller-hl.boundary.svc.cluster.local"]
    public_addr = "public DNS"
    tags {
        region = ["k8s"]
    }
}
# Worker authorization KMS
# Use a production KMS such as AWS KMS for production installs
# This key is the same key used in the worker configuration
kms "aead" {
    purpose = "worker-auth"
    aead_type = "aes-gcm"
    key = "key"
    key_id = "global_worker-auth"
}

It generally works for a day and then stops working until I restart the pod.

I see these error logs before the worker stops working:

{"id":"3HyVunohBm","source":"https://hashicorp.com/boundary/boundary-worker-8qdn8/worker","specversion":"1.0","type":"error","data":{"error":"failed to read protobuf message: failed to get reader: failed to read frame header: EOF","error_fields":{},"id":"e_I9lbsVu4LP","version":"v0.1","op":"worker.(Worker).handleProxy","info":{"msg":"error reading handshake from client"}},"datacontentype":"application/cloudevents","time":"2022-11-07T15:37:06.905432729Z"}
{"id":"fl6DyubRiA","source":"https://hashicorp.com/boundary/boundary-worker-8qdn8/worker","specversion":"1.0","type":"error","data":{"error":"failed to close WebSocket: failed to write control frame opClose: WebSocket closed: failed to read frame header: EOF","error_fields":{},"id":"e_8WduHdtyWv","version":"v0.1","op":"worker.(Worker).handleProxy","info":{"msg":"error closing client connection"}},"datacontentype":"application/cloudevents","time":"2022-11-07T15:37:06.906030265Z"}

One more thing I noticed: while the worker is working, if I run
curl -v 0.0.0.0:9202, I immediately get

*   Trying 0.0.0.0:9202...
* Connected to 0.0.0.0 (127.0.0.1) port 9202 (#0)
> GET / HTTP/1.1
> Host: 0.0.0.0:9202
> User-Agent: curl/7.79.1
> Accept: */*
> 
* Empty reply from server
* Closing connection 0
curl: (52) Empty reply from server

But when it is not working, the server does not reply at all, and curl gets stuck at

*   Trying 127.0.0.1:9202...
* Connected to 127.0.0.1 (127.0.0.1) port 9202 (#0)
> GET / HTTP/1.1
> Host: 127.0.0.1:9202
> User-Agent: curl/7.79.1
> Accept: */*
>
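
The difference between the two curl runs above can be checked from a script, which may help catch the moment a worker stops answering. The following is only a rough sketch: the function names `classify_curl_exit` and `probe_worker` and the 5-second default timeout are made up for illustration. It relies on curl's documented exit codes: 52 for "Empty reply from server" (the healthy case shown above), 28 when `--max-time` expires (the stuck case), and 7 when the connection cannot be established.

```shell
#!/bin/sh
# Map a curl exit code to a coarse worker state.
#   0 or 52: the listener answered or closed the connection (healthy case)
#   28:      --max-time expired with no reply (the stuck case)
#   7:       the TCP connection could not be established
classify_curl_exit() {
  case "$1" in
    0|52) echo "responsive" ;;
    28)   echo "hung" ;;
    7)    echo "refused" ;;
    *)    echo "unknown" ;;
  esac
}

# probe_worker HOST PORT TIMEOUT_SECONDS (hypothetical helper name).
# Defaults match the proxy port used in the curl runs above.
probe_worker() {
  host="${1:-127.0.0.1}"
  port="${2:-9202}"
  timeout="${3:-5}"
  rc=0
  curl --silent --output /dev/null --max-time "$timeout" \
    "http://${host}:${port}/" || rc=$?
  classify_curl_exit "$rc"
}

# Example: probe_worker 127.0.0.1 9202 5
```

Run from inside the worker pod, `probe_worker` should print `responsive` while the worker behaves as in the first curl run and `hung` once it behaves as in the second.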

What can be the issue?

Thanks in advance.

jimlambrt · November 10, 2022, 4:01pm

I’m curious whether you could send the worker a SIGQUIT before restarting the pod next time and post the stack trace here.
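
For anyone following along, one possible way to do this on Kubernetes is sketched below. It rests on several assumptions that are not confirmed in this thread: that the boundary process runs as PID 1 in the worker container, that the image ships a `kill` command, that Boundary leaves the Go runtime's default SIGQUIT behavior in place (dump all goroutine stacks to stderr and exit), and that the container is restarted afterwards so the dump is in the previous container's logs. The helper name `dump_worker_stacks` and the `STACK_DUMP_WAIT` variable are invented for this example.

```shell
#!/bin/sh
# dump_worker_stacks NAMESPACE POD (hypothetical helper name).
dump_worker_stacks() {
  ns="$1"
  pod="$2"
  # Assumption: boundary is PID 1 in the container and `kill` is available.
  kubectl exec -n "$ns" "$pod" -- kill -QUIT 1
  # Give the container a moment to exit and restart (assumed behavior).
  sleep "${STACK_DUMP_WAIT:-5}"
  # The stack trace is written to stderr, so it should appear in the logs
  # of the container instance that just exited.
  kubectl logs -n "$ns" "$pod" --previous
}

# Example (namespace is a guess based on the service DNS name above):
#   dump_worker_stacks boundary boundary-worker-8qdn8 > worker-stacks.txt
```

If the process is not PID 1, the `kill` target would need to be looked up first, for example with `ps` inside the container.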
