Failed to commit WAL entry

michaelplemmons · October 5, 2021, 10:07pm

We are running Vault 1.4.2 with role and credential creation on AWS IAM, and we keep hitting a “failed to commit WAL entry” error. The underlying cause reported is:

Delete http://127.0.0.1:8500/v1/kv/vault/logical/a3842a67-8dc7-0cba-044e-1e020cc22f8a/wal/7c402cea-73d7-21c4-6bcc-6c9c7650626c: context canceled

The CloudTrail logs show that the AWS role and credentials are created successfully.

Our backing store is Consul 1.7.3 and its logs do not show any problems around the same time frame.

2021-10-05T18:07:43.676Z [INFO]  agent.server.fsm: snapshot created: duration=31.71µs
2021-10-05T18:07:43.676Z [INFO]  agent.server.raft: starting snapshot up to: index=1388953561
2021-10-05T18:07:43.676Z [INFO]  snapshot: creating new snapshot: path=/mnt/consul/raft/snapshots/30455-1388953561-1633457263676.tmp
2021-10-05T18:08:01.125Z [WARN]  snapshot: found temporary snapshot: name=30335-253014623-1595344187238.tmp
2021-10-05T18:08:01.125Z [INFO]  snapshot: reaping snapshot: path=/mnt/consul/raft/snapshots/30455-1388917430-1633456865863
2021-10-05T18:08:01.527Z [INFO]  agent.server.raft: compacting logs: from=1388925941 to=1388945178
2021-10-05T18:08:01.546Z [INFO]  agent.server.raft: snapshot complete up to: index=1388953561

When I looked into the commit failure, I found the following code. What I do not understand is what it really means and how we can fix it.

github.com

hashicorp/vault/blob/main/builtin/logical/aws/secret_access_keys.go#L400

	}

	// Create the keys
	keyResp, err := iamClient.CreateAccessKey(&iam.CreateAccessKeyInput{
		UserName: aws.String(username),
	})
	if err != nil {
		return logical.ErrorResponse("Error creating access keys: %s", err), awsutil.CheckAWSError(err)
	}

	// Remove the WAL entry, we succeeded! If we fail, we don't return
	// the secret because it'll get rolled back anyways, so we have to return
	// an error here.
	if err := framework.DeleteWAL(ctx, s, walID); err != nil {
		return nil, fmt.Errorf("failed to commit WAL entry: %w", err)
	}

	// Return the info!
	resp := b.Secret(secretAccessKeyType).Response(map[string]interface{}{
		"access_key":     *keyResp.AccessKey.AccessKeyId,
		"secret_key":     *keyResp.AccessKey.SecretAccessKey,


Does anyone have any guidance into where I should be looking next?

aram · October 6, 2021, 9:22am

A note: 1.4 has long been deprecated and is no longer supported. 1.6 is the oldest supported version and 1.8 is the current release. You’re looking at the 1.8 code base, so just a word of caution. I would highly suggest upgrading.

It looks like Vault is trying to delete something from the Consul backend. “context canceled” usually means “could not reach” or “timeout”. Something is wrong somewhere, but you need to do some more investigation (and possibly upgrade). Vault and Consul are both very sensitive to timeouts, so small changes to performance, environment, or network routing can cause issues that never showed up before.

mikeplem · October 6, 2021, 12:40pm

Thank you for the reply. I will look into upgrading and thank you for pointing out I was looking at code from version 1.8. I went back and looked at the code for our version and the same message exists.

I will dig further into Consul. Our Consul service runs on the same server as Vault so, thankfully, we don’t have to worry about routing, but performance issues are definitely a concern here.

aram · October 7, 2021, 9:14am

Consul (or Vault with Integrated Storage) is VERY susceptible to I/O limits. One second it’s fine and the next it’s throwing a hissy fit with odd storage and context errors. You have to keep a very close eye on the I/O and PUT times.

mikeplem · October 7, 2021, 12:04pm

What range of PUT times is considered good or bad? Is there any documentation that explains what to look for?

aram · October 7, 2021, 11:22pm

Not that I know of. Too many variables to account for. Keep a long history of what is good and when you get the errors so you can figure out what’s a good value for your environment.

mikeplem · October 8, 2021, 7:34pm

That is what we have done. We were not capturing all the metrics necessary but that is changing now.

Thank you for your help.
