Downtime when switching from Active to standby

Hello,

Currently I have a cluster which contains 2 Vault servers and 3 Consul servers. All of these servers run Ubuntu and are VMs on VMware. We are facing a problem: when we switch, for example, the first Vault node from Active to Standby, we get several seconds of downtime for requests: local node not active but active cluster node not found. After several seconds everything is fine again. Is it possible to eliminate this tiny downtime?

I am doing the graceful switch via: vault operator step-down

First, why are you doing a step-down in normal operation?

Second, reference architecture is 3 Vault x 3 Consul nodes - is there a reason you’re running only 2 Vault nodes?

Thirdly, how long does the step-down take and is it consistent?

I’d start by taking a look at the Consul logs and instances to see if they’re overloaded during this time, which would slow the lock changeover down.

Two Vault nodes and three Consul nodes seems to be OK. And if three nodes of Vault, then five nodes of Consul… Or the Learn platform is out of sync with the docs. :upside_down_face:

Yeah that seems out of date. I reference:

Specifically,

In OSS Vault the recommended number of instances is 3 in a cluster as any more would have limited value. In Vault Enterprise the recommended number is also 3 in a cluster, but more can be used if they were performance standbys to help with read-only workload


It sounds like the downtime you are seeing is the time taken for the standby to detect that the active server is no longer around, handle any leadership election, become active, and then update Consul with that detail (assuming you are using Consul for service discovery).

There will always be some level of delay for this (and in a production failure situation it could be a bit longer for the standby to detect that the active node has failed) but it should be pretty short.


Hi, I did the step-down procedure because I wanted to update that node and it was the active one.

About the architecture, I guess Wolfsrudel answered :slight_smile:

Step-down takes about 1-2 seconds, and as for consistency it is hard to say, since you don’t know whether a request comes in during that downtime window.

Hi,

Yes, the delay of this step is pretty short, but it is still there and we can’t allow any transaction to fail. So my original question was how to eliminate this tiny downtime period :slight_smile:

I don’t think you can.

There will always be some time needed for parts of the system to react to changes, so I think the goal of zero failed requests is unrealistic. For example, if you had a load balancer in front of a set of backends, you would only detect failures when a health check failed, or, for in-line load balancers, when a request fails. And even then there might be a delay to prevent a single random failure from removing an entire backend from the pool.

The same is true with Vault. Standby servers will only detect that the active node has failed either because a request to it has failed or because some form of heartbeat has timed out. Again, there will likely be some level of failed requests during this period (unless the usage of Vault is so low that several seconds of outage would be unlikely to be noticed).

Performance Standbys (an Enterprise feature) could help to some small degree, depending on your usage patterns. If a lot of your requests are read-only, spreading the load between multiple servers reduces the chance of a request ending up on the failed active node. With 10 servers (9 standbys plus the active), the chance of a request failing during the active node failure drops from 100% (without performance standbys all requests have to go to the active node, even if via a standby) to 10% (it fails only if you hit the failed active node, and succeeds on any of the other 9). However, you gain an extra failure mode: previously only the active node received traffic, but now all nodes do, so the failure of a replica is noticeable where previously it wasn’t.

Performance replicas wouldn’t help at all for write traffic - the active node being down would fail all such requests until one of the standbys took over.

You might be able to tweak settings around heartbeat frequency or timeout duration to make failure detection more sensitive, but that can be really dangerous if it causes failures to be detected when there are none (e.g. due to a single packet loss or delay). Switching the active node takes time and will result in request failures until a new leader is chosen, so you can get into the situation of a total outage if the system detects a failure before the new leader has started - causing another leadership election, triggering another failure detection, etc.

So in summary:

You’ll always have some short period without a valid active node (until the failure is detected and a new active node promoted) when failures happen. Performance replicas could reduce the impact during that period for read-only requests. However all write requests during that time (and possibly some read requests) will fail.

Getting below a few seconds of outage/instability during failures is actually very hard. Very quickly even small improvements become very difficult and expensive.


@Nekasas regarding this point by Stuart: what do you have your raft_multiplier set to in Consul?

Thank you so much for this answer!

Actually I don’t have any extra configuration regarding raft_multiplier, but thanks for this question: I checked the information about raft_multiplier and saw that the default value is 5 and the recommended value for production is 1, so I will try that.
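For reference, that setting goes in the performance stanza of each Consul server’s configuration. A minimal sketch, assuming an HCL config file (only the raft_multiplier value comes from the discussion above; the rest of the file is omitted):

performance {
  raft_multiplier = 1
}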


I am seeing a switching time of 30s. I am using a cluster of 1 active and 2 standby Vault nodes.
For testing, I am running a Python script which fetches a KV pair in a loop. After switching off a node, I am seeing a 30s delay before there is a response from Vault.

The script I am using:

import hvac
import pydash
from datetime import datetime

# <URL>, <TOKEN> and <KEY_NAME> are redacted placeholders
client = hvac.Client(url="<URL>", timeout=1)
client.token = "<TOKEN>"
client.secrets.kv.default_kv_version = 1

secret = {}  # extra data merged in below (redacted in the original post)

while True:
    try:
        # Read the same KV v1 secret in a tight loop
        existingSecret = client.secrets.kv.v1.read_secret(
            path='production/security-service',
            mount_point='kv'
        )
        ts = datetime.now().time()
        newSecret = pydash.merge(existingSecret['data'], secret)
        print("Now = ", ts, " secret = ", newSecret["<KEY_NAME>"])
    except Exception:
        print("Request failed")
Is there any way to reduce this downtime? I am planning to use Vault for encryption/decryption of DB values, so I can’t afford a downtime of 30 seconds.

Please help.
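One client-side workaround is to retry for a few seconds instead of failing on the first error, so a request that lands in the failover window gets reissued once a new active node is elected. A minimal sketch built on the same hvac client as above (the helper name, attempt count, and pause are arbitrary assumptions, not anything built into hvac):

import time
import hvac

client = hvac.Client(url="<URL>", token="<TOKEN>", timeout=1)

def read_kv_with_retry(path, mount_point="kv", attempts=5, pause=1.0):
    # Reissue the read a few times so a request that hits the failover
    # window succeeds once a new active node has been elected.
    last_exc = None
    for _ in range(attempts):
        try:
            return client.secrets.kv.v1.read_secret(path=path, mount_point=mount_point)
        except Exception as exc:
            last_exc = exc
            time.sleep(pause)
    raise last_exc

This doesn’t remove the downtime itself; it just keeps individual reads from failing as long as the outage is shorter than attempts * pause.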

I have a similar issue at work… 30 seconds is quite good, compared to the ~5 minutes it takes our production Vault.

It seems that the design of Vault is such that it has to do a lot of initial setup whenever a node transitions to active status, and this scales with the number of auth methods, secret engines, and namespaces you have.

Out of interest, how many of those do you have? We have hundreds of namespaces, thousands of auth methods and secret engines.