Core: leadership lost, stopping active operation

Hi Team,

We have been running Vault in HA mode with DynamoDB as the storage backend in our AWS infrastructure for the last 3 years. We have upgraded Vault a couple of times and are currently running Vault 1.5.

For the last month we have been seeing the error below every 7-10 days:

Jul 21 10:26:00 ip-10-102-208-130 vault[22461]: 2021-07-21T10:26:00.813Z [INFO] expiration: revoked lease: lease_id=auth/oidc/oidc/callback/he5c38814ff8f6057168d489b4acd0eaf90f8835ab414d508299e67b1ee0e9ff1
Jul 22 05:00:15 ip-x.xx.xx.xx vault[22461]: 2021-07-22T05:00:15.803Z [WARN] core: leadership lost, stopping active operation
Jul 22 05:00:15 ip-x.xx.xx.xx vault[22461]: 2021-07-22T05:00:15.803Z [INFO] core: pre-seal teardown starting
Jul 22 05:00:16 ip-x.xx.xx.xx vault[22461]: 2021-07-22T05:00:16.304Z [INFO] rollback: stopping rollback manager
Jul 22 05:00:16 ip-x.xx.xx.xx vault[22461]: 2021-07-22T05:00:16.305Z [INFO] core: pre-seal teardown complete

During that period, any action performed on Vault fails with the error below:
local node not active but active cluster node not found

Could you please advise? We haven’t made any changes to the setup since November 2020.
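One way to check whether the leadership losses line up with a recurring window is to scan the Vault logs for the warning and look at the timestamps and the gaps between events. The following is only a minimal sketch: the function names `leadership_loss_times` and `gaps_in_days` are made up for illustration, and it assumes the log lines carry Vault's own ISO timestamp in the same format as the excerpt above.

```python
import re
from datetime import datetime

# Message text copied from the log excerpt above.
LEADERSHIP_LOST = "core: leadership lost, stopping active operation"
# Vault's own timestamp inside the line, e.g. 2021-07-22T05:00:15.803Z
TIMESTAMP = re.compile(r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})Z")


def leadership_loss_times(lines):
    """Return the timestamp of every 'leadership lost' warning in the given lines."""
    times = []
    for line in lines:
        if LEADERSHIP_LOST not in line:
            continue
        match = TIMESTAMP.search(line)
        if match:
            times.append(datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f"))
    return times


def gaps_in_days(times):
    """Return the gap, in days, between consecutive events."""
    ordered = sorted(times)
    return [(b - a).total_seconds() / 86400 for a, b in zip(ordered, ordered[1:])]


# Sample input: two lines taken from the excerpt above.
sample = [
    "Jul 22 05:00:15 ip-x.xx.xx.xx vault[22461]: 2021-07-22T05:00:15.803Z [WARN] core: leadership lost, stopping active operation",
    "Jul 22 05:00:15 ip-x.xx.xx.xx vault[22461]: 2021-07-22T05:00:15.803Z [INFO] core: pre-seal teardown starting",
]
events = leadership_loss_times(sample)
for event in events:
    print(event.isoformat(), "hour of day:", event.hour)
```

Feeding it the full log from each node would show whether the events cluster around the same hour, for example the daily secret update window, or are spread out at random.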

Could be a number of things that cause it to seal: storage backend inaccessible, a networking blip, an overloaded node, too many leases, etc.
What does your telemetry show for tokens/cpu/memory/leadership metrics?

Hi Mike,

I have checked the CPU, memory and network of the Vault servers. All look good and we don’t see any abnormal spikes or dips.

Similarly for DynamoDB, I have checked the read and write capacity units and throttles, and everything is in line with previous trends.

I wouldn’t call it overloaded, as we have an HA cluster and it scales up if required, both at the EC2 server level and in DynamoDB capacity units.
We have been updating over 3k KV secrets at once every day during the same window for the past 2 years, and CPU and memory utilization is below 15%.

Not sure whether there is a blip between EC2 and DynamoDB, as we don’t see any issues with other services that use EC2 and DynamoDB.

Any suggestions, or any other metrics that I can check during the window to find what went wrong, mate?

Check audit device performance and temporary throttling from AWS, and move the data to a test cluster and replicate with different loads… 3k KVs should update quickly, but it’s a lot of IO for the audit log and network at one time…
Look at all the timer metrics for the relevant operations going on at the time and see what spikes, etc… See: Telemetry | Vault by HashiCorp
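Alongside the timer metrics, it may help to record what each node reports about leadership during the window. The sketch below assumes the nodes expose Vault's `sys/leader` HTTP endpoint and uses its `ha_enabled`, `is_self` and `leader_address` fields. The helper names and the wording of the returned strings are hypothetical, and the last case is only a guess at the state behind the "active cluster node not found" error.

```python
import json
import urllib.request


def fetch_leader_status(vault_addr, timeout=5.0):
    """Fetch /v1/sys/leader from one node; vault_addr is that node's base URL."""
    url = f"{vault_addr}/v1/sys/leader"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def describe_leader_status(status):
    """Turn a sys/leader response (already parsed into a dict) into a short summary."""
    if not status.get("ha_enabled"):
        return "HA not enabled"
    if status.get("is_self"):
        return "this node is active"
    leader = status.get("leader_address")
    if leader:
        return f"standby, active node is {leader}"
    # No node claims leadership and no leader address is advertised.
    return "standby, no active node known"


# Example with a hand-written response, so no running Vault is needed.
example = {"ha_enabled": True, "is_self": False, "leader_address": ""}
print(describe_leader_status(example))
```

Calling `fetch_leader_status` against every node once a minute during the daily update window, and logging the summary with a timestamp, would show how long the cluster stays without an active node and which node takes over afterwards.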

How many tokens are in the system?

Hi Mike,

We don’t have audit devices enabled for this cluster.

We have 1929 tokens:

Key         Value
counters    map[service_tokens:map[total:1929]]

I tried to replicate the issue by taking a backup and running loads of 3k, 5k and 10k in a test cluster with the same version and configuration, and no issues were observed during the test.