Failed to commit WAL entry

We are running Vault 1.4.2 and using it to create AWS IAM roles and credentials. We keep encountering a “failed to commit WAL entry” error with the following cause:

Delete http://127.0.0.1:8500/v1/kv/vault/logical/a3842a67-8dc7-0cba-044e-1e020cc22f8a/wal/7c402cea-73d7-21c4-6bcc-6c9c7650626c: context canceled

The CloudTrail logs show that the AWS role and credentials are created successfully.

Our backing store is Consul 1.7.3 and its logs do not show any problems around the same time frame.

2021-10-05T18:07:43.676Z [INFO]  agent.server.fsm: snapshot created: duration=31.71µs
2021-10-05T18:07:43.676Z [INFO]  agent.server.raft: starting snapshot up to: index=1388953561
2021-10-05T18:07:43.676Z [INFO]  snapshot: creating new snapshot: path=/mnt/consul/raft/snapshots/30455-1388953561-1633457263676.tmp
2021-10-05T18:08:01.125Z [WARN]  snapshot: found temporary snapshot: name=30335-253014623-1595344187238.tmp
2021-10-05T18:08:01.125Z [INFO]  snapshot: reaping snapshot: path=/mnt/consul/raft/snapshots/30455-1388917430-1633456865863
2021-10-05T18:08:01.527Z [INFO]  agent.server.raft: compacting logs: from=1388925941 to=1388945178
2021-10-05T18:08:01.546Z [INFO]  agent.server.raft: snapshot complete up to: index=1388953561

When looking into the “failed to commit WAL entry” error, I found the following in the Vault source. What I do not understand is what it really means and how we can fix it.

// Remove the WAL entry, we succeeded! If we fail, we don't return
// the secret because it'll get rolled back anyways, so we have to return
// an error here.
if err := framework.DeleteWAL(ctx, s, walID); err != nil {
	return nil, fmt.Errorf("failed to commit WAL entry: %w", err)
}

Does anyone have any guidance into where I should be looking next?
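
One way to read the snippet above is as the last step of a write-ahead pattern: record the intent, perform the external action, then remove the record. The sketch below is not Vault's implementation. The walStore type, its failDelete switch, and createWithWAL are hypothetical names made up for illustration. It only shows why the external resource can exist (as CloudTrail shows) while the caller still gets an error.

```go
package main

import (
	"errors"
	"fmt"
)

// walStore is a hypothetical in-memory stand-in for the storage backend.
type walStore struct {
	entries    map[string]string
	failDelete bool // when true, remove simulates a backend failure
}

// put records a pending operation under id.
func (s *walStore) put(id, data string) { s.entries[id] = data }

// remove deletes the pending record, or fails if failDelete is set.
func (s *walStore) remove(id string) error {
	if s.failDelete {
		return errors.New("context canceled")
	}
	delete(s.entries, id)
	return nil
}

// createWithWAL records intent, runs the external action, then removes the record.
func createWithWAL(s *walStore, id string, create func() (string, error)) (string, error) {
	s.put(id, "pending create")

	secret, err := create()
	if err != nil {
		return "", err // the record stays behind so a rollback pass can clean up
	}

	if err := s.remove(id); err != nil {
		// The external resource exists, but the record is still present, so a
		// later rollback would remove it. Handing out the secret would be unsafe.
		return "", fmt.Errorf("failed to commit WAL entry: %w", err)
	}
	return secret, nil
}

func main() {
	s := &walStore{entries: map[string]string{}, failDelete: true}
	_, err := createWithWAL(s, "wal-1", func() (string, error) { return "access-key", nil })
	fmt.Println(err)            // failed to commit WAL entry: context canceled
	fmt.Println(len(s.entries)) // 1: the record is left for rollback
}
```

In this reading, the error is about the final delete against storage, not about the AWS call itself.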

A note: 1.4 has long been deprecated and is no longer supported. 1.6 is the oldest supported version and 1.8 is the current release. You’re looking at the 1.8 code base, so just a word of caution. I would strongly suggest upgrading.

It looks like Vault is trying to delete something from the Consul backend. “context canceled” usually means “could not reach” or “timeout”. Something is wrong somewhere, but you need to do more investigation (and possibly upgrade). Vault and Consul are both very sensitive to timeouts, so small changes in performance, environment, or network routing can cause issues that did not show up before.

Thank you for the reply. I will look into upgrading and thank you for pointing out I was looking at code from version 1.8. I went back and looked at the code for our version and the same message exists.

I will dig further into Consul. Our Consul service runs on the same server as Vault so, thankfully, we don’t have to worry about routing, but performance issues are definitely a concern here.

Consul (or Vault with Integrated Storage) is VERY susceptible to I/O limits. One second it’s fine and the next it’s throwing a hissy fit with odd storage and context errors. You have to keep a very close eye on I/O and PUT times.

What range of PUT times is considered good, and what range is bad? Is there any documentation that explains what to look for?

Not that I know of. Too many variables to account for. Keep a long history of what is good and when you get the errors so you can figure out what’s a good value for your environment.

That is what we have done. We were not capturing all the metrics necessary but that is changing now.

Thank you for your help.