Core: leadership lost, stopping active operation

sampathrsk · July 22, 2021, 12:04pm

Hi Team,

We have been running Vault in HA mode with DynamoDB as the storage backend in our AWS infrastructure for the last 3 years. We have upgraded Vault a couple of times and are currently on Vault 1.5.

For the last month we have been seeing the error below every 7-10 days:

Jul 21 10:26:00 ip-10-102-208-130 vault[22461]: 2021-07-21T10:26:00.813Z [INFO] expiration: revoked lease: lease_id=auth/oidc/oidc/callback/he5c38814ff8f6057168d489b4acd0eaf90f8835ab414d508299e67b1ee0e9ff1
Jul 22 05:00:15 ip-x.xx.xx.xx vault[22461]: 2021-07-22T05:00:15.803Z [WARN] core: leadership lost, stopping active operation
Jul 22 05:00:15 ip-x.xx.xx.xx vault[22461]: 2021-07-22T05:00:15.803Z [INFO] core: pre-seal teardown starting
Jul 22 05:00:16 ip-x.xx.xx.xx vault[22461]: 2021-07-22T05:00:16.304Z [INFO] rollback: stopping rollback manager
Jul 22 05:00:16 ip-x.xx.xx.xx vault[22461]: 2021-07-22T05:00:16.305Z [INFO] core: pre-seal teardown complete

If any action is performed on Vault during this period, we see the error below:
local node not active but active cluster node not found

Could you please advise? We haven’t made any changes to the setup since November 2020.
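To pin down the "every 7-10 days" pattern, the step-down events can be pulled out of the node logs and the gaps between them measured. This is a minimal sketch that assumes the log lines look like the excerpt above (journald prefix followed by Vault's own UTC timestamp); the function names are made up for illustration.

```python
import re
from datetime import datetime, timezone

# Vault's own ISO timestamp followed by the leadership-loss warning.
LOST_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})Z\s+\[WARN\]\s+core: leadership lost"
)

def leadership_loss_times(lines):
    """Return UTC datetimes of every 'leadership lost' warning in the lines."""
    times = []
    for line in lines:
        match = LOST_RE.search(line)
        if match:
            ts = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f")
            times.append(ts.replace(tzinfo=timezone.utc))
    return times

def gaps_in_days(times):
    """Days elapsed between consecutive events, oldest first."""
    ordered = sorted(times)
    return [(b - a).total_seconds() / 86400 for a, b in zip(ordered, ordered[1:])]
```

Feeding it the saved journal output from each node (for example `leadership_loss_times(open("vault.log"))`, where the file name is a placeholder) shows whether the events line up with a fixed schedule, such as the daily bulk update window or a rotation job.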

mikegreen · July 22, 2021, 4:00pm

Could be a number of things that cause it to seal: storage backend inaccessible, networking blip, overload, too many leases, etc.
What does your telemetry show for tokens/cpu/memory/leadership metrics?
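For the leadership and lease metrics, one option is to scrape Vault's `/v1/sys/metrics?format=prometheus` endpoint and filter for the relevant series. The sketch below only covers the parsing and filtering; the metric prefixes are written from memory of the telemetry docs and should be checked against what your version actually emits, and the endpoint needs a token (or unauthenticated metrics access enabled on the listener) plus Prometheus retention configured in the `telemetry` stanza.

```python
# Prefixes are assumptions based on the documented metric names
# (vault.core.leadership_lost, vault.core.step_down, vault.expire.num_leases);
# verify them against your own scrape output.
PREFIXES = (
    "vault_core_leadership_",
    "vault_core_step_down",
    "vault_expire_num_leases",
)

def parse_prometheus(text):
    """Parse Prometheus text exposition into {series_with_labels: float}.

    Assumes each sample line is 'series value' with no trailing timestamp.
    """
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        try:
            samples[series] = float(value)
        except ValueError:
            continue  # ignore lines that do not end in a number
    return samples

def select(samples, prefixes=PREFIXES):
    """Keep only the series whose name starts with one of the prefixes."""
    return {k: v for k, v in samples.items() if k.startswith(tuple(prefixes))}
```

Comparing a scrape taken shortly before the 05:00 window with one taken just after a step-down should show whether lease counts or leadership timings moved.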

sampathrsk · July 23, 2021, 6:58am

Hi Mike,

I have checked the CPU, memory and network on the Vault servers and all looks good; we don’t see any abnormal spikes or dips.

Similarly for DynamoDB, I have checked the read and write capacity units and throttles, and everything is in line with previous trends.

I can’t say it is overloaded, as we have an HA cluster and it scales up if required at both the EC2 server level and the DynamoDB capacity-unit level.
We have been updating over 3k KV secrets at a time every day during the same window for the past 2 years, and CPU and memory utilization is less than 15%.

Not sure if there is a blip between EC2 and DynamoDB, as we don’t see any issues with other services that use EC2 and DynamoDB.

Any suggestions, or any other metrics that I can check during the window to find what went wrong, mate?
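One cheap thing to record during the window is what each node reports on `/v1/sys/health`, since the default status codes distinguish active, standby and sealed. The sketch below is an assumption-laden probe, not an official tool: the node addresses are placeholders, and if every node reports standby at the same moment that would be consistent with the "active cluster node not found" error.

```python
import urllib.error
import urllib.request

# Default status codes returned by /v1/sys/health.
HEALTH_STATES = {
    200: "active",
    429: "standby",
    472: "dr-secondary",
    473: "performance-standby",
    501: "uninitialized",
    503: "sealed",
}

def classify(status_code):
    """Map a /v1/sys/health status code to a short state name."""
    return HEALTH_STATES.get(status_code, "unknown")

def probe(addr, timeout=2.0):
    """Query one node's health endpoint and return its state name."""
    url = addr.rstrip("/") + "/v1/sys/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify(resp.status)
    except urllib.error.HTTPError as err:
        # Non-2xx codes (429, 503, ...) arrive as HTTPError; the code is the signal.
        return classify(err.code)
    except (urllib.error.URLError, OSError):
        return "unreachable"
```

Running `probe("https://vault-node-1.example.internal:8200")` against each node every few seconds (the hostname is hypothetical) and logging the result with a timestamp gives a timeline to line up against the DynamoDB and EC2 graphs.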

mikegreen · July 23, 2021, 12:46pm

Check audit device performance and temporary throttling from AWS, and move the data to a test cluster and replicate with different loads… 3k KVs should update quickly, but it's a lot of IO for the audit log and network at one time…
Look at all the timer metrics for the relevant operations going on at the time, see what spikes, etc. (see Telemetry | Vault by HashiCorp).

How many tokens are in the system?

sampathrsk · July 23, 2021, 1:08pm

Hi Mike,

We don’t have audit devices enabled for this cluster.

We have 1929 tokens:

Key         Value
counters    map[service_tokens:map[total:1929]]

I tried to reproduce the issue by restoring a backup into a test cluster with the same version and configuration and running loads of 3k, 5k and 10k; no issues were observed during this test.
