Unable to get a CSI volume registered

Hi,

so I wanted to get my hands dirty with the CSI volumes business and came across a fairly typical roadblock in our company: the internal corporate proxy. In a nutshell, all my AWS account traffic is routed via the on-prem network, border control is enforced, and then we're off into the www.

I was able to successfully deploy the AWS EBS CSI controller and node containers following the official guide. The IAM policy is in place, and all that config stuff is taken care of…

  • Nomad version is 1.0.0
  • Both containers are running on the same machine

Logs from the node container:

I1210 12:55:34.132139 1 driver.go:68] Driver: ebs.csi.aws.com Version: v0.8.0
W1210 12:55:37.364275 1 metadata.go:136] Failed to parse the outpost arn:
I1210 12:55:37.364753 1 mount_linux.go:153] Detected OS without systemd
I1210 12:55:37.365614 1 driver.go:138] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
I1210 12:55:40.161131 1 node.go:367] NodeGetInfo: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized: XXX_sizecache:0}
I1210 12:55:40.162960 1 node.go:351] NodeGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized: XXX_sizecache:0}
I1210 12:56:10.164083 1 node.go:351] NodeGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized: XXX_sizecache:0}
I1210 12:56:40.165448 1 node.go:351] NodeGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized: XXX_sizecache:0}
I1210 12:57:10.166583 1 node.go:351] NodeGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized: XXX_sizecache:0}
I1210 12:57:40.167928 1 node.go:351] NodeGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized: XXX_sizecache:0}
I1210 12:58:10.168914 1 node.go:351] NodeGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized: XXX_sizecache:0}
I1210 12:58:40.170301 1 node.go:351] NodeGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized: XXX_sizecache:0}

Logs from the controller container:

I1210 12:54:58.651043 1 driver.go:68] Driver: ebs.csi.aws.com Version: v0.8.0
W1210 12:55:01.896980 1 metadata.go:136] Failed to parse the outpost arn:
I1210 12:55:01.897598 1 driver.go:138] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
I1210 12:55:04.688972 1 controller.go:334] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized: XXX_sizecache:0}
I1210 12:55:34.690141 1 controller.go:334] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized: XXX_sizecache:0}
I1210 12:56:04.691342 1 controller.go:334] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized: XXX_sizecache:0}
I1210 12:56:34.693708 1 controller.go:334] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized: XXX_sizecache:0}

So the next step would be to register the volume. And here's where things get hairy. The command times out, with the controller showing different log entries:

driver.go:115] GRPC error: rpc error: code = Internal desc = Could not get volume with ID "vol-06ec063b1287cb0cd": RequestCanceled: request context canceled
caused by: context canceled

or

driver.go:115] GRPC error: rpc error: code = Internal desc = Could not get volume with ID "vol-06ec063b1287cb0cd": RequestCanceled: request context canceled
caused by: context deadline exceeded

or

driver.go:115] GRPC error: rpc error: code = Internal desc = Could not get volume with ID "vol-06ec063b1287cb0cd": RequestError: send request failed
caused by: Post "https://ec2.eu-central-1.amazonaws.com/": dial tcp 54.239.55.102:443: i/o timeout

Well, the last error especially caught my attention.

I guess I have a couple of questions at this point:

  • is the startup log output from both of the containers okay/normal?
  • in general, which resources does Nomad (server/client) need access to in an AWS context (the metadata endpoint being one of them)?
  • who makes the request that is blocked / what does the request flow look like when I register a volume?

To follow up on this more generally: does it make sense to include a section somewhere in the docs that outlines the external dependencies when using CSI volumes on AWS/Azure/GCP? Something along the lines of "make sure the Nomad clients have access to the metadata endpoint to get the AZ information" or "gcp.io needs to be accessible because the Envoys aren't pulled from Docker Hub".

I'm just thinking out loud here; I don't even know if this is something you can come up with, given that every setup out there is different…

Cheers

Hi @bfqrst! I’m going to answer your questions slightly out of order here :grinning:

  • is the startup log output from both of the containers okay/normal?

Yup, that looks fine!

  • who makes the request that is blocked / what does the request flow look like when I register a volume?

From a high-level view, Nomad takes the request for a volume, finds a running plugin that can handle that request, and then hands off all the communication with the storage provider to the plugin. Nomad doesn't really "know" anything about how the storage volumes are provisioned until they're mounted by the Node plugin. We've got a more detailed description of this in Storage Plugins | Nomad | HashiCorp Developer.
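To make that a bit more concrete: the nomad volume register step is just you handing Nomad a volume spec file; the Nomad servers then ask the controller plugin to validate (and later attach) the volume, and it's the controller plugin that talks to the EC2 API. A minimal sketch of such a spec on Nomad 1.0.x, using the pre-1.1 syntax (the id/name and external_id are placeholders, and plugin_id has to match whatever id you gave the plugin in its job):

# volume.hcl, registered with: nomad volume register volume.hcl
type            = "csi"
id              = "mysql"                   # placeholder
name            = "mysql"                   # placeholder
external_id     = "vol-0123456789abcdef0"   # the pre-created EBS volume
plugin_id       = "aws-ebs0"                # must match the csi_plugin id in the plugin job
access_mode     = "single-node-writer"
attachment_mode = "file-system"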

  • in general, which resources does Nomad (server/client) need access to in an AWS context (the metadata endpoint being one of them)?

Ok, so now that we understand the overall workflow… you’re running into an unfortunate problem with CSI plugins in general, which is that most of them don’t specify what resources they need and it’s up to experimentation to figure it out.

We have a collection of demo CSI jobs in ./demo/csi, but that doesn't include the AWS plugins yet. I have these in our E2E test suite under ./e2e/csi.

So for the AWS EBS plugin, the plugin needs credentials to attach the EBS volume. You can inject those credentials via AWS_ACCESS_KEY_ID/AWS_SECRET_KEY, or give the host an IAM role and make sure the container has access to the AWS metadata endpoint.
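If you go the static-credentials route, that's just an env block on the plugin task. A rough, untested sketch (image/args elided, use whatever your existing plugin job already has; the key values are obviously placeholders):

task "plugin" {
  driver = "docker"

  config {
    # image and args exactly as in your existing aws-ebs-csi-driver job
  }

  # not needed if the host has an IAM instance role and the container
  # can reach the metadata endpoint
  env {
    AWS_ACCESS_KEY_ID     = "AKIA..."   # placeholder
    AWS_SECRET_ACCESS_KEY = "..."       # placeholder
  }

  csi_plugin {
    id        = "aws-ebs0"
    type      = "controller"
    mount_dir = "/csi"
  }
}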

Here’s a slightly-redacted version of the IAM role and policy we use for our E2E test cluster so that it can use the AWS EBS and AWS EFS plugins. I’ll make sure this lands in that demos folder once I add the E2E example there.

# This role is the one used by the example test cluster and is attached as an
# instance role.
resource "aws_iam_role" "nomad_example_cluster" {
  description        = "IAM role for example clusters"
  name               = "nomad_example_cluster"
  path               = "/"
  assume_role_policy = data.aws_iam_policy_document.assume_role_nomad_example_cluster.json

  tags = {
    source = "github.com/hashicorp/<redacted>"
  }
}

data "aws_iam_policy_document" "assume_role_nomad_example_cluster" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ec2.amazonaws.com"]
    }
  }
}

# allow the role to be attached to an AWS instance so that the instance can
# make its own AWS API calls
resource "aws_iam_instance_profile" "nomad_example_cluster" {
  name = "nomad_example_cluster"
  role = aws_iam_role.nomad_example_cluster.name
}

# attach the policy to the instance role
resource "aws_iam_role_policy" "nomad_example_cluster" {
  name   = "nomad_example_cluster"
  role   = aws_iam_role.nomad_example_cluster.id
  policy = data.aws_iam_policy_document.nomad_example_cluster.json
}

# This policy allows this instance to autodiscover the rest of the cluster
# and use CSI volumes.
data "aws_iam_policy_document" "nomad_example_cluster" {

  statement {
    effect = "Allow"

    actions = [
      "ec2:DescribeInstances",
      "ec2:DescribeTags",
      "ec2:DescribeVolume*",
      "ec2:AttachVolume",
      "ec2:DetachVolume",
      "autoscaling:DescribeAutoScalingGroups",
    ]
    resources = ["*"]
  }

  statement {
    effect = "Allow"

    actions = [
      "kms:Encrypt",
      "kms:Decrypt",
      "kms:DescribeKey",
    ]

    resources = [
      aws_kms_key.example.arn
    ]
  }

}
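
The one piece not shown here is attaching that instance profile to the Nomad client instances. A hypothetical fragment (the AMI variable and instance type are placeholders for whatever you already use):

resource "aws_instance" "nomad_client" {
  ami                  = var.nomad_client_ami   # placeholder
  instance_type        = "t3.medium"            # placeholder
  iam_instance_profile = aws_iam_instance_profile.nomad_example_cluster.name

  # ... networking, user_data, etc. as in your existing client config ...
}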


Thanks for clarifying, @tgross, and sorry for my late answer. Life, work, and pandemics sorta happening along the way…

I took another stab at the whole AWS EBS CSI situation and actually got it working (with Nomad 1.0.5, more on that later). As I suspected, the EBS plugin container couldn't get to the metadata endpoint to query the region and presumably some other stuff (more on that later). After setting AWS_REGION and the complete gang of [HTTP,HTTPS,SOCKS]_PROXY vars, I was finally able to reliably register and deregister premade volumes and attach them to workloads. The saying goes that it's always DNS? In our shop it's that AND the corporate proxy!
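For the record, the env block I ended up adding to the plugin tasks looks roughly like this (the proxy address is a placeholder for our corporate proxy; the NO_PROXY line is the usual companion so that local and metadata traffic skips the proxy, adjust to taste):

env {
  AWS_REGION  = "eu-central-1"
  HTTP_PROXY  = "http://proxy.example.corp:3128"   # placeholder
  HTTPS_PROXY = "http://proxy.example.corp:3128"   # placeholder
  NO_PROXY    = "169.254.169.254,localhost,127.0.0.1"
}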

Fast forward to a couple of days ago, where Nomad 1.1.0 went GA.

  • I deregistered all the volumes
  • I removed all the AWS CSI plugins
  • I upgraded all of Nomad to 1.1.0
  • The AWS EBS CSI plugin went 1.0 GA as well

Because I was reliably able to register / deregister volumes, the plan was/is to deploy the controller plugin with a count of 2 and deploy the node plugin as a system job to all of Nomad's workers… Spoiler: it went sideways :wink:
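
For context, this is roughly the layout I was aiming for, as a skeleton (two separate job files in practice; the datacenter name is a placeholder, and image/args follow the official plugin jobs):

job "plugin-aws-ebs-controller" {
  datacenters = ["dc1"]   # placeholder

  group "controller" {
    count = 2

    task "plugin" {
      driver = "docker"

      config {
        image = "amazon/aws-ebs-csi-driver:v1.0.0"
        args  = ["controller", "--endpoint=unix://csi/csi.sock", "--logtostderr", "--v=5"]
      }

      csi_plugin {
        id        = "aws-ebs0"
        type      = "controller"
        mount_dir = "/csi"
      }
    }
  }
}

job "plugin-aws-ebs-nodes" {
  datacenters = ["dc1"]   # placeholder
  type        = "system"  # one node plugin per Nomad client

  group "nodes" {
    task "plugin" {
      driver = "docker"

      config {
        # note: ended up falling back to v0.10.x here, see observation 2 below
        image      = "amazon/aws-ebs-csi-driver:v1.0.0"
        args       = ["node", "--endpoint=unix://csi/csi.sock", "--logtostderr", "--v=5"]
        privileged = true
      }

      csi_plugin {
        id        = "aws-ebs0"
        type      = "node"
        mount_dir = "/csi"
      }
    }
  }
}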

A couple of observations:

  1. While the Nomad 1.1.0 changelog reflects the need for a capability block with the attachment_mode and access_mode fields set, the learning guide still uses the old syntax, which is rejected by Nomad 1.1.0. The Nomad E2E tests offer some good insights as to how to configure this, so that's out of the way…
  2. The AWS EBS CSI 1.0 plugin will start just fine in controller mode, but will panic if started in node mode. Here too, it seems to have something to do with it not being able to read metadata. I used the 0.10 version for node mode, which started as expected.
  3. With an updated volume declaration (shown below), 2 controller plugin copies, and the node plugin copies as a system job, I'm now unable to register volumes again, with an Error registering volume: Unexpected response code: 500 (rpc error: rpc error: controller validate volume: CSI.ControllerValidateVolume: unknown volume attachment mode: ) error. Strangely enough, when repeating the command, the error is the same in meaning but uses slightly different wording: Error registering volume: Unexpected response code: 500 (rpc error: controller validate volume: CSI.ControllerValidateVolume: unknown volume attachment mode: ). Third version: Error registering volume: Unexpected response code: 500 (rpc error: controller validate volume: rpc error: controller validate volume: CSI.ControllerValidateVolume: unknown volume attachment mode: )
  4. In contrast to the errors I had previously, no error log output comes from the plugins themselves this time. The error seems to come from Nomad itself.
type            = "csi"
id              = "gitea"
name            = "gitea"
plugin_id       = "aws-ebs0"
external_id     = "vol-redacted"

capability {
  access_mode     = "single-node-writer"
  attachment_mode = "file-system"
}

capability {
  access_mode     = "single-node-writer"
  attachment_mode = "block-device"
}

parameters {
  type = "gp2"
}

Does it make sense to run the e2e tests against the 1.0 GA plugin version?

Anyway, I’d love to hear your take on this.

Cheers

EDIT: It would appear that this registration problem actually is a bug, per the issue tracker…

…meanwhile on the AWS-EBS-CSI-DRIVER repo…

It appears as if support for http_proxy and no_proxy landed in v0.10, as per this pull request. So for all of you behind corporate proxies, v0.10.x and beyond is your version. This might explain the trouble I described in my original post here. Or, put differently, it explains why it worked half a year later with 1.0.0.

So what remains is this issue here, regarding the registration of the volume.

Updating this thread to better reflect the progress on this… So Nomad 1.1.1 landed, which solved the volume registration issue I had. The AWS EBS CSI 1.1.0 driver also landed, which solved the other issue I had.

So basically I have two jobs running with attached CSI volumes. So far so good. What remains? Two things as of right now:

  1. When you reboot the node that hosts the CSI plugin, Nomad doesn't seem to be able to bring it back! There is a weird visual glitch where it says 1/0 available. So basically it comes back, but some sort of validation fails. I might have even seen some work around this type of issue… You need to stop and restart the CSI job to be able to use it.
  2. Updating a job that has CSI volumes attached fails a couple of times until it eventually becomes healthy. I need to observe this one more…

Anyway, this is it for now…