Vault stuck in reloading state

Hi All,

We are running our main Vault instance with the Raft storage backend, auto-unsealed via the transit engine. We recently enabled an audit device and, as a trial, would like to rotate the audit logs every 7 days. We are using logrotate.

The problem is that the Vault service gets stuck in a reloading state. The log rotation itself succeeds, and new audit entries are written to the new file. Can someone explain why the service is unable to reload successfully?

Vault logrotate Config:

/var/log/vault/audit.log {
  daily
  rotate 7
  notifempty
  missingok
  compress
  delaycompress
  create 0600 vault vault
  postrotate
#     /usr/bin/systemctl reload vault 2> /dev/null || true
      /usr/bin/kill -HUP `pidof vault` >> /home/username/stderr.log 2>&1 || true
  endscript
}

To force a rotation for testing:

sudo logrotate -fv /etc/logrotate.d/vault

After that,

sudo systemctl status vault.service

Note: the reload fails both with systemctl reload and with the HUP signal.
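
For anyone reproducing this, systemd's own view of the unit can also be inspected directly (these are standard systemctl properties, nothing Vault-specific):

systemctl show vault.service --property=Type,KillMode,ActiveState,SubState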

Thank you.

That’s the right process. We’re doing the same thing without issue. I don’t think logrotate is the problem; something is going on with the reload itself.

Can you change your log level to DEBUG and just do the HUP to see if it gets stuck and what the errors in the system log are?
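
For example, something like this (systemctl kill delivers the signal through systemd rather than to a guessed PID; adjust the unit name if yours differs):

sudo systemctl kill -s HUP vault.service
journalctl -u vault.service -f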

I have enabled debug mode in the Vault config (log_level = "Debug").

Tried reloading with both systemctl and HUP; the same logs appear either way.

Check your service file to make sure it’s set to kill the process and not the control group.

This is the Vault Service file. Can you check the part which you mentioned?

It’s the KillMode line, which is already set to process. I’m at a loss; not sure why it isn’t reloading correctly. What version of Vault is this? What OS and version are you running on? Is there a firewall or any blocking tool installed?
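
For comparison, the reload-related lines from the stock packaged vault.service look like this (paths may differ on your install). Type=notify is the important one: it makes systemd wait for a readiness notification from the process after a reload starts:

[Service]
Type=notify
ExecStart=/usr/bin/vault server -config=/etc/vault.d/vault.hcl
ExecReload=/bin/kill --signal HUP $MAINPID
KillMode=process
KillSignal=SIGINT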

Yeah. Please see the Vault version and OS version below. There’s no firewall blocking. We’ve been trying to figure out what’s causing the issue, but no luck at all. There are some GitHub issues related to it, but they aren’t of much use.

Key                      Value
---                      -----
Recovery Seal Type       shamir
Initialized              true
Sealed                   false
Total Recovery Shares    5
Threshold                3
Version                  1.10.0
Storage Type             raft
Cluster Name             vault-cluster-570b3133
Cluster ID               74ecb77f-36e9-2aa0-586d-7697da410c4b
HA Enabled               true
HA Cluster               https://192.168.56.103:8201
HA Mode                  active
Active Since             2022-04-28T10:13:11.466732751Z
Raft Committed Index     535
Raft Applied Index       535

OS Type and Version:

No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04 LTS
Release:        20.04
Codename:       focal

This looks like a bug in the Vault code.

It notifies systemd that it has started reloading, but never sends the ready notification once the reload is complete.
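
For a Type=notify service, the sequence systemd expects on SIGHUP looks roughly like this. This is a minimal sketch using the coreos/go-systemd library, not Vault's actual code; reloadConfig is a hypothetical stand-in for the real reload work:

package main

import (
	"os"
	"os/signal"
	"syscall"

	"github.com/coreos/go-systemd/v22/daemon"
)

func main() {
	// Startup finished: the unit becomes "active (running)".
	daemon.SdNotify(false, daemon.SdNotifyReady)

	sighup := make(chan os.Signal, 1)
	signal.Notify(sighup, syscall.SIGHUP)
	for range sighup {
		// RELOADING=1: the unit enters the "reloading" state.
		daemon.SdNotify(false, daemon.SdNotifyReloading)

		reloadConfig() // hypothetical: re-read config, reopen audit logs, etc.

		// READY=1 again: without this the unit never leaves "reloading",
		// which is exactly the symptom in this thread.
		daemon.SdNotify(false, daemon.SdNotifyReady)
	}
}

func reloadConfig() { /* elided */ }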

Okay. Should I open a bug issue about this on GitHub, or are we missing something here? Could you please clarify? @maxb @aram

The first thing they’re going to tell you is to upgrade to the latest version. Second, run the hcdiag utility to collect system information; vault operator diagnose -config /etc/vault.d/vault.hcl is the other command to run.

With all of that put together, if it doesn’t tell you anything, then yeah, you can open an issue on GitHub.
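
With the service stopped, that would look something like this (hcdiag is a separate download from releases.hashicorp.com; the -vault flag scopes the collection to Vault):

sudo systemctl stop vault
sudo -u vault vault operator diagnose -config /etc/vault.d/vault.hcl
hcdiag -vault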

Hi @aram ,

Thanks for the suggestion. I have run the diagnose command, and it seems something is wrong there. We were previously using the filesystem backend and migrated to Raft storage; /opt/vault/data is the path the previous file backend was using as storage.

The vault.db file appears to have sufficient permissions, yet diagnose still reports permission denied. Any idea why?

Results:
[ failure ] Vault Diagnose
  [ warning ] Check Operating System
    [ warning ] Check Open File Limits: Open file limits are set to 1024
      These limits may be insufficient. We recommend raising the soft and hard limits to 1024768.
    [ success ] Check Disk Usage: / usage ok.
    [ warning ] Check Disk Usage: /snap/bare/5 is 100.00 percent full.
      It is recommended to have more than five percent of the partition free.
    [ warning ] Check Disk Usage: /snap/chromium/1967 is 100.00 percent full.
      It is recommended to have more than five percent of the partition free.
    [ warning ] Check Disk Usage: /snap/core18/2284 is 100.00 percent full.
      It is recommended to have more than five percent of the partition free.
    [ warning ] Check Disk Usage: /snap/core20/1405 is 100.00 percent full.
      It is recommended to have more than five percent of the partition free.
    [ warning ] Check Disk Usage: /snap/gtk-common-themes/1506 is 100.00 percent full.
      It is recommended to have more than five percent of the partition free.
    [ warning ] Check Disk Usage: /snap/gtk-common-themes/1519 is 100.00 percent full.
      It is recommended to have more than five percent of the partition free.
    [ warning ] Check Disk Usage: /snap/core18/2344 is 100.00 percent full.
      It is recommended to have more than five percent of the partition free.
    [ warning ] Check Disk Usage: /snap/gnome-3-38-2004/99 is 100.00 percent full.
      It is recommended to have more than five percent of the partition free.
    [ warning ] Check Disk Usage: /snap/snap-store/558 is 100.00 percent full.
      It is recommended to have more than five percent of the partition free.
    [ warning ] Check Disk Usage: /snap/snapd/15177 is 100.00 percent full.
      It is recommended to have more than five percent of the partition free.
    [ warning ] Check Disk Usage: /snap/snap-store/433 is 100.00 percent full.
      It is recommended to have more than five percent of the partition free.
    [ warning ] Check Disk Usage: /snap/gnome-3-34-1804/77 is 100.00 percent full.
      It is recommended to have more than five percent of the partition free.
    [ warning ] Check Disk Usage: /snap/gnome-3-34-1804/24 is 100.00 percent full.
      It is recommended to have more than five percent of the partition free.
    [ success ] Check Disk Usage: /home usage ok.
    [ warning ] Check Disk Usage: /snap/snapd/15534 is 100.00 percent full.
      It is recommended to have more than five percent of the partition free.
    [ warning ] Check Disk Usage: /snap/core20/1434 is 100.00 percent full.
      It is recommended to have more than five percent of the partition free.
    [ warning ] Check Disk Usage: /snap/chromium/1973 is 100.00 percent full.
      It is recommended to have more than five percent of the partition free.
  [ success ] Parse Configuration
  [ warning ] Check Telemetry: Telemetry is using default configuration
    By default only Prometheus and JSON metrics are available.  Ignore this warning if you are using telemetry or are using these metrics and are
    satisfied with the default retention time and gauge period.
  [ failure ] Check Storage: Diagnose could not initialize storage backend.
    [ failure ] Create Storage Backend: Error initializing storage of type raft: failed to create fsm: failed to open bolt file: open
      /opt/vault/raft/vault.db: permission denied

Did you run the diagnose as the vault user?
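
A quick way to compare what root and the vault user actually see (namei ships with util-linux and prints the permissions of every component along the path):

sudo -u vault ls -l /opt/vault/raft/
namei -l /opt/vault/raft/vault.db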

Yes. The vault command runs as the vault user only.

FYI, we were hitting this issue during the migration as well, so the migration was done using sudo vault operator migrate -config=migrate.hcl, which was successful.

vault operator migrate -config=migrate.hcl
Error migrating: error mounting 'storage_destination': failed to create fsm: failed to open bolt file: open /opt/vault/data/raft/vault.db: permission denied

Yes please - go straight ahead and open the bug - I can see it’s still present in the latest version.

Systemd is notified that the reload starts here:

and there is no “ready” notification sent afterwards.

Okay. Thank you for the inputs @maxb .

One more thing: we weren’t able to migrate Vault without sudo. Can you help figure out why?

Diagnose, too, only passes all the checks when run with sudo:
sudo vault operator diagnose -config /etc/vault.d/<configfile>

Both diagnose and migration are supposed to be run with the Vault service stopped.

I think the issue you’re seeing is potentially the files being locked by the running Vault service.
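
One way to rule that out, and to keep ownership correct at the same time (fuser is a standard tool; running the migration as the vault user means the files it creates stay readable by the service):

sudo systemctl stop vault
sudo fuser -v /opt/vault/data/raft/vault.db    # should print nothing once stopped
sudo -u vault vault operator migrate -config=migrate.hcl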

We made quite sure the Vault service was stopped before executing the command.

I am trying the migration again now; this is the output (please see the picture).

Migration Config:

storage_source "file" {
address = "192.168.56.103:8200"
path    = "/opt/vault/data"
}

storage_destination "raft" {
  path = "/home/solus/vault/data/raft/"
  node_id = "node_1"
}

cluster_addr = "http://192.168.56.103:8201"

Vault Config:

# Full configuration options can be found at https://www.vaultproject.io/docs/configuration

# Storage configuration
storage "raft" {
  path = "/home/solus/vault/data/raft"
  node_id = "node_1"
}

listener "tcp" {
  address = "192.168.56.103:8200"
  tls_disable = "true"
}

api_addr = "http://192.168.56.103:8200"
cluster_addr = "http://192.168.56.103:8201"
disable_mlock = true
ui = true

seal "transit" {
  address = "http://192.168.56.102:8200"
  disable_renewal = "false"
  key_name = "autounseal"
  mount_path = "transit/"
  token = "hvs.CAESILFRZgzoOQi32ijg3jCVYIeJ1AzWZ2FT40QS14iYer4PGh4KHGh2cy5VTWlVSjdGbFVoTlVBcEw0RFR1Sm40dXc"
  tls_skip_verify = "true"
}

log_level = "Debug"

I imagine there is a problem with the file permissions.

I note that the directory in your most recent post is different from the one where you were showing the file permissions earlier in this thread.

I have deleted the previous Vault and set up a new Vault from scratch to attempt the migration again without sudo. I just don’t get it: the files and directories are all owned by vault, yet the migration still isn’t working.

Vault Config file permissions

-rw-r--r-- 1 vault vault 684 Apr 29 16:22 /etc/vault.d/vault.hcl 

Current Storage backend permissions:

$ ls -l /opt/vault
total 12
drwxr-xr-x 5 vault vault 4096 Apr 29 16:20 data
drwx------ 2 vault vault 4096 Apr 12 20:07 tls
drwxr-xr-x 2 vault vault 4096 Apr 26 13:37 vault-audit

Migration directory permissions:

$ ls -lR vault/
vault/:
total 4
drwxr-xr-x 3 vault vault 4096 Apr 29 16:21 data

vault/data:
total 4
drwxr-xr-x 2 vault vault 4096 Apr 29 16:21 raft

vault/data/raft:
total 0
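
One more thing worth checking here: every directory component of the path must be traversable (execute bit) by the vault user. When the Raft directory lives under a home directory, /home/solus itself is often mode 750 or 700, which blocks the vault user even though everything below it is owned by vault. namei shows the whole chain:

namei -l /home/solus/vault/data/raft
sudo -u vault ls -l /home/solus/vault/data/raft/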