Failures to derive Vault token


We are facing an intermittent error, where jobs fail to start.
The job log shows that, during templating, Nomad could not fetch the Vault secret, and just before that we see “Could not derive Vault token: Context deadline exceeded”. The message appears 1 minute after the task is received by the client, so it does look like a timeout is being reached.
We are running a 3-node Nomad, Vault and Consul cluster; each of the 3 nodes runs a server instance of all 3 applications. The cluster and all clients are hosted on AWS in the same region and availability zone.
If we start the job again immediately after the error, it starts without issues. The failures appear to be random, but we usually see 2-3 per week out of 100-200 jobs started per week.

We see no network errors on the machines, and the Nomad and Vault logs contain no other relevant information.

Any idea on how we can start debugging this situation?

Hi @JoaoPPinto,

Context deadline exceeded

As you point out, this error is indicative of a network call not completing before the configured deadline.

What level of telemetry and log collection do you have available? I wonder if there is something that could help identify what is causing the issue.

How are the Nomad clients configured to talk to the servers? Directly by IP, using a LB, or something else?

Apart from those initial questions, debugging is tricky and would require looking through all the available data to attempt to find patterns or data points that provide correlation.

The Nomad client configuration could be modified to use longer timeouts and retry intervals. This might be worth trying, to see whether it at least helps work around the issue. The downside is that a failure could take longer to show itself, and therefore to resolve, so this is certainly something for dev clusters only.
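To make that trade-off concrete, here is a rough sketch of how the worst-case wait grows with more attempts and longer backoff. The parameter names and the doubling backoff are assumptions chosen for illustration; they are not Nomad's option names, and Nomad's actual retry behaviour may differ.

```python
def worst_case_seconds(attempts, per_attempt_timeout, initial_backoff, max_backoff):
    """Upper bound on the time before a persistent failure is finally reported.

    Assumes (hypothetically) that every attempt runs until its timeout and
    that the wait between attempts doubles each time, capped at max_backoff.
    """
    total = 0.0
    backoff = initial_backoff
    for attempt in range(attempts):
        total += per_attempt_timeout          # the attempt itself times out
        if attempt < attempts - 1:            # no wait after the final attempt
            total += min(backoff, max_backoff)
            backoff *= 2
    return total


# Example: one 60 s attempt versus five 60 s attempts with a 5 s to 15 s backoff.
single = worst_case_seconds(1, 60, 5, 15)
retried = worst_case_seconds(5, 60, 5, 15)
print(f"single attempt: {single:.0f}s, five attempts: {retried:.0f}s")
```

With these illustrative numbers a genuinely broken dependency is reported after several minutes instead of one, which is why the change is better suited to a dev cluster.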

jrasell and the Nomad team

Hello @jrasell,

While investigating another intermittent issue, I believe we came across both the problem and the solution.

We have a Python helper script that generates Nomad and Consul tokens through Vault, and occasionally it would fail to generate tokens with the error:

URL: GET https://vault.<redacted>/v1/nomad/creds/<redacted>
Code: 500. Errors:

* 1 error occurred:
       * Put "https://<redacted>.elb.<region>": dial tcp i/o timeout

The error seems similar enough.
Our Nomad, Vault and Consul server nodes sit behind a Network Load Balancer, with a target group for each service.
The Nomad server nodes are configured to contact Vault through the Network Load Balancer address. Additionally, the Vault roles that create Nomad tokens are configured to contact Vault through the Network Load Balancer address as well. The Vault and Nomad servers are configured to contact the local Consul agent running on each node.

We ran a quick test in which we opened telnet connections to the Nomad target group from an instance within the Load Balancer and from one outside of it.
Connections initiated from EC2 instances within the Load Balancer would occasionally hang and never respond, while connections initiated from EC2 instances outside the Load Balancer were always established.
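For anyone who wants to repeat this check without telnet, a minimal Python sketch of the same idea follows. It opens a number of TCP connections in a row and records which ones fail or time out. The function names are made up for this example, and the host and port in the usage comment are placeholders, not our real addresses.

```python
import socket
import time


def probe_tcp(host, port, attempts=20, timeout=3.0, pause=0.0):
    """Open `attempts` TCP connections in sequence.

    Returns a list of (ok, elapsed_seconds, error) tuples, one per attempt.
    A hung connection shows up as a failure after roughly `timeout` seconds.
    """
    results = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            # create_connection raises OSError (including timeouts) on failure
            with socket.create_connection((host, port), timeout=timeout):
                results.append((True, time.monotonic() - start, None))
        except OSError as exc:
            results.append((False, time.monotonic() - start, repr(exc)))
        if pause:
            time.sleep(pause)
    return results


def summarize(results):
    """Count failures and report the slowest attempt."""
    failures = [r for r in results if not r[0]]
    slowest = max((r[1] for r in results), default=0.0)
    return {"attempts": len(results), "failures": len(failures), "slowest_s": slowest}


# Usage (placeholder address; 4646 is Nomad's default HTTP port):
# print(summarize(probe_tcp("nlb.example.internal", 4646, attempts=50)))
```

Running it once from an instance within the Load Balancer and once from an instance outside it should reproduce the difference described above, if the same issue is present.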

Upon a bit of searching through the Load Balancer documentation, we came across the following page:

Deactivating said option meant the simple test done above started passing as well.
We have not yet encountered the error in Nomad, but the option was only deactivated a day ago.

I suspect deactivating this option has no ill effects on either Nomad or Vault, apart from no longer being able to tell the source IP of a given request.