[SOLVED:] Vault (Linux i386) briefly joining Consul-backed HA cluster before timeout

Hi all,

I’m having an issue with Vault, and I’m slowly coming to the conclusion that it’s the i386 binary I’m running on Ubuntu 18.04.5 LTS. (It’s referenced as [i386].lan in the outputs below.)

Consul seems to be running fine on it:

$ consul members
Node            Address             Status  Type    Build  Protocol  DC    Segment
consul-s0       192.168.1.118:8301  alive   server  1.8.5  2         dc-1  <all>
consul-s1       192.168.1.100:8301  alive   server  1.8.5  2         dc-1  <all>
consul-s2       192.168.1.215:8301  alive   server  1.8.5  2         dc-1  <all>
[amd64].lan     192.168.1.181:8301  alive   client  1.8.5  2         dc-1  <default>
[i386].lan      192.168.1.130:8301  alive   client  1.8.5  2         dc-1  <default>
vault-s0        192.168.1.141:8301  alive   client  1.8.5  2         dc-1  <default>

All the other nodes are recent versions of 64-bit Ubuntu. (consul… and vault… are LXD containers bridged on to the same host.)

The Vault service runs for a while, and then times out; there doesn’t seem to be a pattern to the lines in the log, just before the timeout; I can even get the vault unsealed – if I’m quick! – and have it show up in the Consul UI briefly as “standby, initialized”.

(I’ve included some examples of the timeouts from my logs and the service file itself at the bottom of this post.)

Is it simply that I need to up the start time (e.g., with StartLimitIntervalSec, or some such) in the service file? I don’t understand what threshold isn’t being met, considering I can see an unseal Vault service in Consul, before systemd kills it.

Oh, and I did have an api_addr specified for a while, but that also didn’t make a difference (other than the warning that’s now thrown).

The uncommented bits of the Vault HCL file follow:

log_level = "debug"

telemetry {
  dogstatsd_addr   = "127.0.0.1:8125"
  disable_hostname = true
}

ui = true
disable_mlock = true

storage "consul" {
  address = "127.0.0.1:8500"
  path    = "vault/"
}

listener "tcp" {
  address = "127.0.0.1:8200"
  cluster_address = "127.0.0.1:8201"
  tls_disable = true
}

Anyway, any insights would be greatly appreciated!

JJ

Example 1 of the timeout:

Nov 12 16:26:07 [i386].lan vault[12749]: 2020-11-12T16:26:07.457Z [DEBUG] storage.cache: creating LRU cache: size=0
Nov 12 16:26:07 [i386].lan vault[12749]: 2020-11-12T16:26:07.469Z [DEBUG] cluster listener addresses synthesized: cluster_addresses=[127.0.0.1:8201]
Nov 12 16:26:09 [i386].lan vault[12749]: 2020-11-12T16:26:09.180Z [DEBUG] core: unseal key supplied: migrate=false
Nov 12 16:26:09 [i386].lan vault[12749]: 2020-11-12T16:26:09.211Z [DEBUG] core: cannot unseal, not enough keys: keys=1 threshold=3 nonce=b19ed47f-39ea-caa4-daea-27fd220b6e91
Nov 12 16:26:21 [i386].lan vault[12749]: 2020-11-12T16:26:21.083Z [DEBUG] core: unseal key supplied: migrate=false
Nov 12 16:26:34 [i386].lan vault[12749]: 2020-11-12T16:26:34.489Z [DEBUG] core: unseal key supplied: migrate=false
Nov 12 16:26:34 [i386].lan vault[12749]: 2020-11-12T16:26:34.489Z [DEBUG] core: cannot unseal, not enough keys: keys=2 threshold=3 nonce=b19ed47f-39ea-caa4-daea-27fd220b6e91
Nov 12 16:26:48 [i386].lan vault[12749]: 2020-11-12T16:26:48.005Z [DEBUG] core: unseal key supplied: migrate=false
Nov 12 16:26:48 [i386].lan vault[12749]: 2020-11-12T16:26:48.049Z [DEBUG] core: starting cluster listeners
Nov 12 16:26:48 [i386].lan vault[12749]: 2020-11-12T16:26:48.050Z [INFO]  core.cluster-listener.tcp: starting listener: listener_address=127.0.0.1:8201
Nov 12 16:26:48 [i386].lan vault[12749]: 2020-11-12T16:26:48.050Z [INFO]  core.cluster-listener: serving cluster requests: cluster_listen_address=127.0.0.1:8201
Nov 12 16:26:48 [i386].lan vault[12749]: 2020-11-12T16:26:48.063Z [INFO]  core: vault is unsealed
Nov 12 16:26:48 [i386].lan vault[12749]: 2020-11-12T16:26:48.063Z [WARN]  service_registration.consul: concurrent initalize state change notify dropped
Nov 12 16:26:48 [i386].lan vault[12749]: 2020-11-12T16:26:48.064Z [INFO]  core: entering standby mode
Nov 12 16:26:48 [i386].lan vault[12749]: 2020-11-12T16:26:48.123Z [DEBUG] core: parsing information for new active node: active_cluster_addr=https://[amd64].lan:8201 active_redirect_addr=http://[amd64].lan:8200
Nov 12 16:26:48 [i386].lan vault[12749]: 2020-11-12T16:26:48.161Z [DEBUG] core: refreshing forwarding connection
Nov 12 16:26:48 [i386].lan vault[12749]: 2020-11-12T16:26:48.162Z [DEBUG] core: clearing forwarding clients
Nov 12 16:26:48 [i386].lan vault[12749]: 2020-11-12T16:26:48.163Z [DEBUG] core: done clearing forwarding clients
Nov 12 16:26:48 [i386].lan vault[12749]: 2020-11-12T16:26:48.165Z [DEBUG] core: done refreshing forwarding connection
Nov 12 16:26:48 [i386].lan vault[12749]: 2020-11-12T16:26:48.167Z [DEBUG] core.cluster-listener: creating rpc dialer: alpn=req_fw_sb-act_v1 host=fw-c70c6e91-8923-0169-f99e-bd4cd573813e
Nov 12 16:26:48 [i386].lan vault[12749]: 2020-11-12T16:26:48.648Z [DEBUG] core.cluster-listener: performing client cert lookup
Nov 12 16:27:37 [i386].lan systemd[1]: vault.service: Start operation timed out. Terminating.
Nov 12 16:27:37 [i386].lan vault[12749]: ==> Vault shutdown triggered
Nov 12 16:27:37 [i386].lan vault[12749]: 2020-11-12T16:27:37.292Z [INFO]  service_registration.consul: shutting down consul backend
Nov 12 16:27:37 [i386].lan vault[12749]: 2020-11-12T16:27:37.293Z [DEBUG] core: shutdown called
Nov 12 16:27:37 [i386].lan vault[12749]: 2020-11-12T16:27:37.296Z [INFO]  core: marked as sealed
Nov 12 16:27:37 [i386].lan vault[12749]: 2020-11-12T16:27:37.297Z [DEBUG] core: clearing forwarding clients
Nov 12 16:27:37 [i386].lan vault[12749]: 2020-11-12T16:27:37.297Z [DEBUG] core: done clearing forwarding clients
Nov 12 16:27:37 [i386].lan vault[12749]: 2020-11-12T16:27:37.297Z [DEBUG] core: finished triggering standbyStopCh for runStandby
Nov 12 16:27:37 [i386].lan vault[12749]: 2020-11-12T16:27:37.297Z [DEBUG] core: shutting down periodic key rotation checker
Nov 12 16:27:37 [i386].lan vault[12749]: 2020-11-12T16:27:37.297Z [DEBUG] core: shutting down periodic leader refresh
Nov 12 16:27:37 [i386].lan vault[12749]: 2020-11-12T16:27:37.297Z [DEBUG] core: shutting down leader elections
Nov 12 16:27:37 [i386].lan vault[12749]: 2020-11-12T16:27:37.297Z [DEBUG] core: forwarding: stopping heartbeating
Nov 12 16:27:50 [i386].lan vault[12749]: 2020-11-12T16:27:50.053Z [DEBUG] core: runStandby done
Nov 12 16:27:50 [i386].lan vault[12749]: 2020-11-12T16:27:50.053Z [INFO]  core: stopping cluster listeners
Nov 12 16:27:50 [i386].lan vault[12749]: 2020-11-12T16:27:50.053Z [INFO]  core.cluster-listener: forwarding rpc listeners stopped
Nov 12 16:27:50 [i386].lan vault[12749]: 2020-11-12T16:27:50.088Z [INFO]  core.cluster-listener: rpc listeners successfully shut down
Nov 12 16:27:50 [i386].lan vault[12749]: 2020-11-12T16:27:50.088Z [INFO]  core: cluster listeners successfully shut down
Nov 12 16:27:50 [i386].lan vault[12749]: 2020-11-12T16:27:50.088Z [DEBUG] core: sealing barrier
Nov 12 16:27:50 [i386].lan vault[12749]: 2020-11-12T16:27:50.088Z [WARN]  service_registration.consul: concurrent sealed state change notify dropped
Nov 12 16:27:50 [i386].lan vault[12749]: 2020-11-12T16:27:50.088Z [INFO]  core: vault is sealed
Nov 12 16:27:50 [i386].lan systemd[1]: vault.service: Failed with result 'timeout'.
Nov 12 16:27:50 [i386].lan systemd[1]: Failed to start "HashiCorp Vault - A tool for managing secrets".
Nov 12 16:27:55 [i386].lan systemd[1]: vault.service: Service hold-off time over, scheduling restart.
Nov 12 16:27:55 [i386].lan systemd[1]: vault.service: Scheduled restart job, restart counter is at 32.

Example 2 of the timeout:

Nov 12 16:42:17 [i386].lan vault[13017]: 2020-11-12T16:42:17.702Z [DEBUG] storage.cache: creating LRU cache: size=0
Nov 12 16:42:17 [i386].lan vault[13017]: 2020-11-12T16:42:17.777Z [DEBUG] cluster listener addresses synthesized: cluster_addresses=[127.0.0.1:8201]
Nov 12 16:43:47 [i386].lan systemd[1]: vault.service: Start operation timed out. Terminating.
Nov 12 16:43:47 [i386].lan vault[13017]: ==> Vault shutdown triggered
Nov 12 16:43:47 [i386].lan vault[13017]: 2020-11-12T16:43:47.541Z [DEBUG] core: shutdown called
Nov 12 16:43:47 [i386].lan vault[13017]: 2020-11-12T16:43:47.541Z [INFO]  service_registration.consul: shutting down consul backend
Nov 12 16:43:47 [i386].lan systemd[1]: vault.service: Failed with result 'timeout'.
Nov 12 16:43:47 [i386].lan systemd[1]: Failed to start "HashiCorp Vault - A tool for managing secrets".
Nov 12 16:43:52 [i386].lan systemd[1]: vault.service: Service hold-off time over, scheduling restart.
Nov 12 16:43:52 [i386].lan systemd[1]: vault.service: Scheduled restart job, restart counter is at 42.
Nov 12 16:43:52 [i386].lan systemd[1]: Stopped "HashiCorp Vault - A tool for managing secrets".

The service itself looks like this. (Note that I started with the default from the Linux repo, and have commented out and added in some stuff, to no effect.)

$ cat /etc/systemd/system/vault.service 
[Unit]
Description="HashiCorp Vault - A tool for managing secrets"
Documentation=https://www.vaultproject.io/docs/
Requires=network-online.target
After=network-online.target
ConditionFileNotEmpty=/etc/vault.d/vault.hcl
#StartLimitIntervalSec=60
#StartLimitBurst=3

[Service]
Type=notify
User=vault
Group=vault
ProtectSystem=full
ProtectHome=read-only
PrivateTmp=yes
PrivateDevices=yes
SecureBits=keep-caps
AmbientCapabilities=CAP_IPC_LOCK
Capabilities=CAP_IPC_LOCK+ep
CapabilityBoundingSet=CAP_SYSLOG CAP_IPC_LOCK
NoNewPrivileges=yes
ExecStart=/usr/bin/vault server -config=/etc/vault.d/vault.hcl
ExecReload=/bin/kill --signal HUP $MAINPID
KillMode=process
KillSignal=SIGINT
Restart=on-failure
RestartSec=5
#TimeoutStopSec=30
#StartLimitInterval=60
#StartLimitIntervalSec=60
#StartLimitBurst=3
LimitNOFILE=65536
LimitMEMLOCK=infinity
StandardOutput=/var/logs/vault/output.log
StandardError=/var/logs/vault/error.log

[Install]
WantedBy=multi-user.target

And this is what I get once the restart fails and I check the status:

$ sudo systemctl status vault
● vault.service - "HashiCorp Vault - A tool for managing secrets"
   Loaded: loaded (/etc/systemd/system/vault.service; enabled; vendor preset: enabled)
   Active: activating (start) since Thu 2020-11-12 17:43:03 GMT; 48s ago
     Docs: https://www.vaultproject.io/docs/
 Main PID: 14178 (vault)
    Tasks: 7 (limit: 2250)
   CGroup: /system.slice/vault.service
           └─14178 /usr/bin/vault server -config=/etc/vault.d/vault.hcl

Nov 12 17:43:04 [i386].lan vault[14178]: 2020-11-12T17:43:04.158Z [DEBUG] storage.consul: config address set: address=127.0.0.1:8500
Nov 12 17:43:04 [i386].lan vault[14178]: 2020-11-12T17:43:04.177Z [DEBUG] service_registration.consul: config disable_registration set: disable_registration=fal
Nov 12 17:43:04 [i386].lan vault[14178]: 2020-11-12T17:43:04.178Z [DEBUG] service_registration.consul: config service set: service=vault
Nov 12 17:43:04 [i386].lan vault[14178]: 2020-11-12T17:43:04.178Z [DEBUG] service_registration.consul: config service_tags set: service_tags=
Nov 12 17:43:04 [i386].lan vault[14178]: 2020-11-12T17:43:04.178Z [DEBUG] service_registration.consul: config service_address set: service_address=<nil>
Nov 12 17:43:04 [i386].lan vault[14178]: 2020-11-12T17:43:04.178Z [DEBUG] service_registration.consul: config address set: address=127.0.0.1:8500
Nov 12 17:43:04 [i386].lan vault[14178]: 2020-11-12T17:43:04.178Z [WARN]  no `api_addr` value specified in config or in VAULT_API_ADDR; falling back to detectio
Nov 12 17:43:04 [i386].lan vault[14178]: 2020-11-12T17:43:04.193Z [DEBUG] core: set config: sanitized config={"api_addr":"","cache_size":0,"cluster_addr":"","cl
Nov 12 17:43:04 [i386].lan vault[14178]: 2020-11-12T17:43:04.193Z [DEBUG] storage.cache: creating LRU cache: size=0
Nov 12 17:43:04 [i386].lan vault[14178]: 2020-11-12T17:43:04.218Z [DEBUG] cluster listener addresses synthesized: cluster_addresses=[127.0.0.1:8201]

Ahem. So a kind person on Gitter asked me to run the binary on the command line vs as a service, and… It’s still running now. :man_facepalming:

In my defence, I’m using HashiCorp’s service file. Still, time to figure out why it’s killing my perfectly healthy Vault server.

For the benefit of any poor souls who land here, I think I’ll need to disable systemd’s timeout logic with TimeoutSec=infinity (which is not recommended), as per this StackExchange question.

The way I see it, any other value I put in there will just delay the inevitable killing of my Vault server; don’t even know where I’d begin to try and understand why systemd treats this service as forever starting vs running.

Just to close this out: I’ve been running two instances of Vault on Linux i386 binaries for close to a week now; no issues. The only change?

Commented out TimeStopSec in the default service file from the official repo, and added that TimeoutSec line above.