We are running Vault 1.4.2 and using it to create roles and credentials on AWS IAM. We have been encountering a “failed to commit WAL entry” error.
According to the CloudTrail logs, the AWS role and credentials are created successfully.
Our backing store is Consul 1.7.3, and its logs do not show any problems around the same time frame:
2021-10-05T18:07:43.676Z [INFO] agent.server.fsm: snapshot created: duration=31.71µs
2021-10-05T18:07:43.676Z [INFO] agent.server.raft: starting snapshot up to: index=1388953561
2021-10-05T18:07:43.676Z [INFO] snapshot: creating new snapshot: path=/mnt/consul/raft/snapshots/30455-1388953561-1633457263676.tmp
2021-10-05T18:08:01.125Z [WARN] snapshot: found temporary snapshot: name=30335-253014623-1595344187238.tmp
2021-10-05T18:08:01.125Z [INFO] snapshot: reaping snapshot: path=/mnt/consul/raft/snapshots/30455-1388917430-1633456865863
2021-10-05T18:08:01.527Z [INFO] agent.server.raft: compacting logs: from=1388925941 to=1388945178
2021-10-05T18:08:01.546Z [INFO] agent.server.raft: snapshot complete up to: index=1388953561
When looking into the “failed to commit WAL entry” error, I found the following code in the Vault source. What I do not understand is what it really means and how we can fix it.
// Remove the WAL entry, we succeeded! If we fail, we don't return
// the secret because it'll get rolled back anyways, so we have to return
// an error here.
if err := framework.DeleteWAL(ctx, s, walID); err != nil {
	return nil, fmt.Errorf("failed to commit WAL entry: %w", err)
}
Does anyone have any guidance into where I should be looking next?
A note: 1.4 has long been deprecated and is no longer supported. 1.6 is the oldest supported version and 1.8 is the current release. You’re looking at the 1.8 code base, so just a word of caution. I would strongly suggest upgrading.
It looks like Vault is trying to delete something from the Consul backend. “context cancelled” usually means “could not reach” or “timeout”. Something is wrong somewhere, but you need to do some more investigation (and possibly upgrade). Vault and Consul are both very sensitive to timeouts, so small changes in performance, environment, or network routing can cause issues that did not appear before.
Thank you for the reply. I will look into upgrading, and thank you for pointing out that I was looking at code from version 1.8. I went back and looked at the code for our version, and the same message exists.
I will dig further into Consul. Our Consul service runs on the same server as Vault, so thankfully we don’t have to worry about routing, but performance issues are definitely a concern here.
Consul (or Vault with Integrated Storage) is VERY susceptible to I/O limits. One second it’s fine and the next it’s throwing a hissy fit with odd storage and context errors. You have to keep a very close eye on I/O and on PUT times.
Not that I know of. Too many variables to account for. Keep a long history of what is good and when you get the errors so you can figure out what’s a good value for your environment.