"context canceled"

Hi,
We are running an architecture with 2 Vault servers and 3 Consul servers.
We recently had downtime on our service, so we are trying to understand what exactly happened.

Vault responded with 500 to incoming requests (around 10000 requests), but in the audit log we see only ~300 errors, all of them related to “failed to read … context canceled”.
From the Consul documentation it seems this error can be related to IOPS, but in our case we are far from the limit.
The issue resolved itself within 15 minutes.

  • None of the Vault/Consul processes failed.
  • No spike in traffic.
  • No apparent faults on the attached AWS disks.
  • We are running on AWS EC2 VMs and 2 ELBs.

Questions:

  1. What could cause this issue?
  2. If there was an issue with one of the Consul/Vault nodes, why was there no failover to the standby?
  3. Is there any way to avoid such an issue in the future?

Are you able to get the full error message? The “failed to read … context canceled” seems to be missing some data in the middle that could be useful for assisting with your issue.

However, my first thought would be to see if there was any disruption in the ability to write to your audit devices. If Vault is unable to write to all configured audit devices then it will not allow any operations to process. More info on that here: Audit Devices | Vault by HashiCorp

A few error logs:

  1. /var/vcap/data/sys/log/vault/vault_audit.log
    {
      "time": "2021-08-17T18:48:31.463624563Z",
      "type": "response",
      "request": {
        "id": "…",
        "operation": "update",
        "client_token": "…",
        "client_token_accessor": "…",
        "path": "auth/token/renew-self"
      },
      "response": {},
      "error": "1 error occurred:\n\t failed to read lease entry auth/token/create/h2b9ec5149c9397105c7bb97faeb6f80c8aa0a0c9c59915bbea4620f107e8f872: Get https://internal-consul-213558487.us-west-2.elb.amazonaws.com/v1/kv/vault/sys/expire/id/auth/token/create/h2b9ec5149c9397105c7bb97faeb6f80c8aa0a0c9c59915bbea4620f107e8f872: context canceled\n\n"
    }

  2. /var/vcap/data/sys/log/monit/vault.err.log:66717:2021-08-17T18:34:59.014Z [ERROR]
    core: failed to run existence check:
    error="existence check failed: Get https://internal-consul-238678873.us-west-2.elb.amazonaws.com/v1/kv/vault/logical/b831a89a-b02f-2085-df9b-cc44d5dc9b2d/81f37567-f14c-4289-817b-57b15ee24d2e/078221f7-da65-491c-9185-4d3f47442e9f/ee744fd9-bdfb-4a9b-bd6b-649b5adea0a2: context canceled"
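
To compare the ~300 audit errors with the ~10000 failed requests, it may help to tally the audit entries by request path. The following is only a rough sketch, assuming the file audit device writes one JSON object per line (the entry above is shown pretty-printed) and that the `error` and `request.path` fields look like the ones in that entry. The function name `count_canceled_errors` is made up for illustration.

```python
import json
from collections import Counter


def count_canceled_errors(lines):
    """Count audit entries whose error mentions 'context canceled', per request path."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if not isinstance(entry, dict):
            continue
        # Assumption: failed operations carry a top-level "error" string.
        error = entry.get("error") or ""
        if "context canceled" in error:
            request = entry.get("request") or {}
            counts[request.get("path", "<unknown>")] += 1
    return counts
```

It could be fed an open file handle for vault_audit.log. If the total stays near 300, most of the failed requests probably never reached the point where an audit entry with this error is written, which would be worth knowing before looking further.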

I checked the audit devices with:
curl --header "X-Vault-Token:…" https://127.0.0.1:8200/v1/sys/audit
The result is attached.
audit_log.txt (1010 Bytes)

It seems we are writing to a local disk.
Syslog then sends this data to Splunk, so in case of a network issue Vault should just continue writing to disk (unless there is an issue with the disk itself), right?

Yes, that is my understanding.

The errors look to be related to an issue communicating with your Consul storage backend. I checked the AWS Status page for any outages in us-west-2 EC2 or ELB (https://status.aws.amazon.com/) but didn’t see any documented outages around the time you had trouble.

Are you able to pull logs out of Consul for around the same time period to see if anything was happening within your storage environment?
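
To narrow down which time window and which Consul endpoint to look at, one option is to bucket the Vault error-log lines by minute and by the host named in the failing URL. This is a minimal sketch, assuming each log entry sits on a single line containing an ISO timestamp and the `Get https://…` URL, as in the vault.err.log excerpt earlier in the thread. The helper name `canceled_by_minute_and_host` is hypothetical.

```python
import re
from collections import Counter

# Timestamp truncated to the minute, e.g. "2021-08-17T18:34" from "2021-08-17T18:34:59.014Z".
TS_MINUTE = re.compile(r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}):\d{2}")
# Host part of the first URL on the line.
HOST = re.compile(r"https?://([^/\s\"]+)")


def canceled_by_minute_and_host(lines):
    """Count 'context canceled' log lines per (minute, host) pair."""
    counts = Counter()
    for line in lines:
        if "context canceled" not in line:
            continue
        ts = TS_MINUTE.search(line)
        host = HOST.search(line)
        if ts and host:
            counts[(ts.group(1), host.group(1))] += 1
    return counts
```

Sorting the resulting keys gives a per-minute timeline that can be set next to the Consul and ELB logs for the same period.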

Right, I also checked with AWS; they say there were no faults on their side.
There is also nothing relevant in the Consul logs.

Also, if there is an issue with one Consul node, I would expect a switch to another one. 15 minutes is a long time. Shouldn’t that happen?

I’m much less familiar with Consul than I am with Vault. Maybe one of the HashiCorp crew can offer some perspective on that?

Shouldn’t they respond here? Or should I ask somewhere else?

If you’re using Vault Enterprise then I’d suggest opening a ticket in the support portal as you’re guaranteed a response there. Otherwise one of the HashiCorp staff may happen upon this thread and respond as they have time.

We are using the free version for now.
Thanks, I’ll wait.