Downtime when switching from Active to standby

Hello,

Currently I have a cluster which contains 2 Vault servers and 3 Consul servers. All of these servers run Ubuntu and are VMs on VMware. We are facing a problem: when we switch, for example, the first Vault node from Active to Standby, we get several seconds of downtime for requests: local node not active but active cluster node not found. After several seconds everything is fine again. Is it possible to eliminate this tiny downtime?

I am doing the graceful switch via: vault operator step-down

First, why are you doing a step-down in normal operation?

Second, reference architecture is 3 Vault x 3 Consul nodes - is there a reason you’re running only 2 Vault nodes?

Thirdly, how long does the step-down take and is it consistent?

I’d start by taking a look at the Consul logs and instances to see if they’re overloaded during this time, which would slow the lock changeover down.

Two Vault nodes and three Consul nodes seems to be OK. And if three nodes of Vault, then five nodes of Consul… Or the Learn platform is out of sync with the docs. :upside_down_face:

Yeah that seems out of date. I reference:

Specifically,

In OSS Vault the recommended number of instances is 3 in a cluster as any more would have limited value. In Vault Enterprise the recommended number is also 3 in a cluster, but more can be used if they were performance standbys to help with read-only workload


It sounds like the downtime you are seeing is the time taken for the standby to detect that the active server is no longer around, handle any leadership election, become active, and then update Consul with that detail (assuming you are using Consul for service discovery).

There will always be some level of delay for this (and in a production failure situation it could be a bit longer for the standby to detect that the active node has failed) but it should be pretty short.


Hi, I did the step-down procedure because I wanted to update that node and it was the active one.

About the architecture, I guess Wolfsrudel answered :slight_smile:

Step-down takes about 1-2 seconds, and as for consistency it is hard to say, since you don’t know whether a request comes in during that downtime window.

Hi,

Yes, the delay of this step is pretty short, but it is still there and we can’t allow any transaction to fail. So my original question was how to eliminate this tiny downtime period :slight_smile:

I don’t think you can.

There will always be some time needed for parts of the system to react to changes, so I think the goal of zero failed requests is unrealistic. For example, if you had a load balancer in front of a set of backends, you would only detect failures when a health check failed, or, for in-line load balancers, when a request fails. And even then there might be a delay to prevent a single random failure from removing an entire backend from the pool.

The same is true with Vault. Standby servers will only detect that the active node has failed either because a request to it has failed or because some form of heartbeat has timed out. Again, there will likely be some level of failed requests during this period (unless the usage of Vault is so low that several seconds of outage would be unlikely to be noticed).

Performance Standbys (an Enterprise feature) could help to some small degree, depending on your usage patterns. If a lot of your requests are read-only, spreading the load between multiple servers reduces the chance of a request ending up on the failed active node. With 10 servers (9 standbys plus the active), the chance of a request failing during the active node failure drops from 100% (without performance standbys all requests have to go to the active node, even if via a standby) to 10% (it fails only if you hit the failed active node, and succeeds on any of the other 9). However, you gain an extra failure mode: previously only the active node received traffic, but now all nodes do, so the failure of a replica is noticeable where previously it wasn’t.

Performance replicas wouldn’t help at all for write traffic - the active node being down would fail all such requests until one of the standbys took over.

You might be able to tweak settings around heartbeat frequency or timeout duration to make failure detection more sensitive, but that can be really dangerous if it causes failures to be detected when there are none (e.g. due to a single packet loss or delay). Switching the active node takes time and will result in request failures until a new leader is chosen, so you can get into the situation of a total outage if the system detects a failure before the new leader has started - causing another leadership election, triggering another failure detection, etc.

So in summary:

You’ll always have some short period without a valid active node (until the failure is detected and a new active node promoted) when failures happen. Performance replicas could reduce the impact during that period for read-only requests. However all write requests during that time (and possibly some read requests) will fail.

Getting below a few seconds of outage/instability during failures is actually very hard. Very quickly even small improvements become very difficult and expensive.


@Nekasas regarding this point by Stuart: what do you have your raft_multiplier set to in Consul?

Thank you so much for this answer!

Actually I don’t have any extra configuration regarding raft_multiplier, but thanks for this question: I checked the information about raft_multiplier and saw that the default value is 5 and the recommended value for production is 1, so I will try that.
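For reference, that setting goes in the performance stanza of each Consul server’s configuration. A minimal sketch, assuming an HCL config file (only the raft_multiplier value comes from the discussion above; the rest of the file is omitted):

performance {
  raft_multiplier = 1
}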


I am seeing a switching time of 30s. I am using a cluster of 1 active and 2 standby Vault nodes.
For testing, I am running a Python script which fetches a KV pair in a loop. After switching off a node, I am seeing a 30s delay before there is a response from Vault.

The script I am using:

import hvac
import pydash
from datetime import datetime

# <URL>, <TOKEN> and <KEY_NAME> are redacted placeholders
client = hvac.Client(url="<URL>", timeout=1)
client.token = "<TOKEN>"
client.secrets.kv.default_kv_version = 1

secret = {}  # extra data merged in below (redacted in the original post)

while True:
    try:
        # Read the same KV v1 secret in a tight loop
        existingSecret = client.secrets.kv.v1.read_secret(
            path='production/security-service',
            mount_point='kv'
        )
        ts = datetime.now().time()
        newSecret = pydash.merge(existingSecret['data'], secret)
        print("Now = ", ts, " secret = ", newSecret["<KEY_NAME>"])
    except Exception:
        print("Request failed")
Is there any way to reduce this downtime? I am planning to use Vault for encryption/decryption of DB values, so I can’t afford a downtime of 30 seconds.

Please help.
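One client-side workaround is to retry for a few seconds instead of failing on the first error, so a request that lands in the failover window gets reissued once a new active node is elected. A minimal sketch built on the same hvac client as above (the helper name, attempt count, and pause are arbitrary assumptions, not anything built into hvac):

import time
import hvac

client = hvac.Client(url="<URL>", token="<TOKEN>", timeout=1)

def read_kv_with_retry(path, mount_point="kv", attempts=5, pause=1.0):
    # Reissue the read a few times so a request that hits the failover
    # window succeeds once a new active node has been elected.
    last_exc = None
    for _ in range(attempts):
        try:
            return client.secrets.kv.v1.read_secret(path=path, mount_point=mount_point)
        except Exception as exc:
            last_exc = exc
            time.sleep(pause)
    raise last_exc

This doesn’t remove the downtime itself; it just keeps individual reads from failing as long as the outage is shorter than attempts * pause.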

I have a similar issue at work… 30 seconds is quite good, compared to the ~5 minutes it takes our production Vault.

It seems that the design of Vault is such that it has to do a lot of initial setup whenever a node transitions to active status, and this scales with the number of auth methods, secret engines, and namespaces you have.

Out of interest, how many of those do you have? We have hundreds of namespaces, thousands of auth methods and secret engines.