Unable to get a CSI volume registered

Hi,

so I wanted to get my hands dirty with CSI volumes and came across a roadblock that is quite typical for our company: the internal corporate proxy. In a nutshell, all traffic from my AWS account is routed through the on-prem network, border control is enforced, and then we’re off into the www.

I was able to successfully deploy the AWS EBS CSI controller and node containers following the official guide. The IAM policy is in place, and all that config stuff is taken care of…

  • Nomad version is 1.0.0
  • Both containers are running on the same machine (plugin jobs sketched below)
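
For completeness, the controller job is roughly the standard one (the node job is identical apart from the args and type = "node" in the csi_plugin block); the image tag matches the v0.8.0 in the logs, and the plugin ID aws-ebs0 is just my chosen name:

job "plugin-aws-ebs-controller" {
  datacenters = ["dc1"]

  group "controller" {
    task "plugin" {
      driver = "docker"

      config {
        image = "amazon/aws-ebs-csi-driver:v0.8.0"

        args = [
          "controller",
          "--endpoint=unix://csi/csi.sock",
          "--logtostderr",
          "--v=5",
        ]
      }

      # registers this task with Nomad as a CSI controller plugin
      csi_plugin {
        id        = "aws-ebs0"
        type      = "controller"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}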

Logs from the node container:

I1210 12:55:34.132139 1 driver.go:68] Driver: ebs.csi.aws.com Version: v0.8.0
W1210 12:55:37.364275 1 metadata.go:136] Failed to parse the outpost arn:
I1210 12:55:37.364753 1 mount_linux.go:153] Detected OS without systemd
I1210 12:55:37.365614 1 driver.go:138] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
I1210 12:55:40.161131 1 node.go:367] NodeGetInfo: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized: XXX_sizecache:0}
I1210 12:55:40.162960 1 node.go:351] NodeGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized: XXX_sizecache:0}
I1210 12:56:10.164083 1 node.go:351] NodeGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized: XXX_sizecache:0}
I1210 12:56:40.165448 1 node.go:351] NodeGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized: XXX_sizecache:0}
I1210 12:57:10.166583 1 node.go:351] NodeGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized: XXX_sizecache:0}
I1210 12:57:40.167928 1 node.go:351] NodeGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized: XXX_sizecache:0}
I1210 12:58:10.168914 1 node.go:351] NodeGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized: XXX_sizecache:0}
I1210 12:58:40.170301 1 node.go:351] NodeGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized: XXX_sizecache:0}

Logs from the controller container:

I1210 12:54:58.651043 1 driver.go:68] Driver: ebs.csi.aws.com Version: v0.8.0
W1210 12:55:01.896980 1 metadata.go:136] Failed to parse the outpost arn:
I1210 12:55:01.897598 1 driver.go:138] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
I1210 12:55:04.688972 1 controller.go:334] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized: XXX_sizecache:0}
I1210 12:55:34.690141 1 controller.go:334] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized: XXX_sizecache:0}
I1210 12:56:04.691342 1 controller.go:334] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized: XXX_sizecache:0}
I1210 12:56:34.693708 1 controller.go:334] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized: XXX_sizecache:0}

So the next step would be to register the volume. And here’s where things get hairy.
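The volume spec I’m trying to register looks roughly like this (the id and name are placeholders; the external ID is my EBS volume, and the plugin ID matches the plugin jobs above):

# volume.hcl
type        = "csi"
id          = "ebs-vol0"
name        = "ebs-vol0"
external_id = "vol-06ec063b1287cb0cd"
plugin_id   = "aws-ebs0"

access_mode     = "single-node-writer"
attachment_mode = "file-system"

Running nomad volume register volume.hcl times out, with the controller showing different log entries: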

driver.go:115] GRPC error: rpc error: code = Internal desc = Could not get volume with ID "vol-06ec063b1287cb0cd": RequestCanceled: request context canceled
caused by: context canceled

or

driver.go:115] GRPC error: rpc error: code = Internal desc = Could not get volume with ID "vol-06ec063b1287cb0cd": RequestCanceled: request context canceled
caused by: context deadline exceeded

or

driver.go:115] GRPC error: rpc error: code = Internal desc = Could not get volume with ID "vol-06ec063b1287cb0cd": RequestError: send request failed
caused by: Post "https://ec2.eu-central-1.amazonaws.com/": dial tcp 54.239.55.102:443: i/o timeout

The last error in particular caught my attention.

I guess I have a couple of questions at this point:

  • is the startup log output from both of the containers okay/normal?
  • in general, which resources does Nomad (server/client) need access to in an AWS context (the metadata endpoint being one of them)?
  • who makes the request that is blocked, and what is the request flow when I register a volume?

As a general follow-up: would it make sense to include a section somewhere in the docs that outlines the external dependencies when one uses CSI volumes on AWS/Azure/GCP? For example: make sure the Nomad clients have access to the metadata endpoint in order to get the AZ information, or that gcp.io needs to be reachable because the Envoys aren’t pulled from Docker Hub.

I’m just thinking out loud here; I don’t even know if this is something you can come up with, given that every setup out there is different…

Cheers

Hi @bfqrst! I’m going to answer your questions slightly out of order here :grinning:

  • is the startup log output from both of the containers okay/normal?

Yup, that looks fine!

  • who makes the request that is blocked, and what is the request flow when I register a volume?

From a high-level view, Nomad takes the request for a volume, finds a running plugin that can handle that request, and then hands off all the communication with the storage provider to the plugin. Nomad doesn’t really “know” anything about how the storage volumes are provisioned until they’re mounted by the Node plugin. We’ve got a more detailed description of this in https://www.nomadproject.io/docs/internals/plugins/csi
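
Applied to your case: nomad volume register goes from the CLI to the Nomad servers, which hand the request to the controller plugin, and it’s the controller plugin itself (not the Nomad client or server binary) that makes the EC2 API call you can see timing out in your controller logs. A quick sanity check that Nomad has fingerprinted both plugins as healthy before you register anything (using the plugin ID from your job sketch above):

nomad plugin status aws-ebs0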

  • in general, which resources does Nomad (server/client) need access to in an AWS context (the metadata endpoint being one of them)?

Ok, so now that we understand the overall workflow… you’re running into an unfortunate problem with CSI plugins in general, which is that most of them don’t specify what resources they need and it’s up to experimentation to figure it out.

We have a collection of demo CSI jobs in ./demo/csi, but that doesn’t include the AWS plugins yet. I have those in our E2E test suite under ./e2e/csi.

So for the AWS EBS plugin: the plugin needs credentials to attach the EBS volume. You can inject those credentials via AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY, or give the host an IAM role and make sure the container has access to the AWS metadata endpoint.
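
If you go the environment-variable route, a minimal sketch of an env block on the plugin task might look like the following. The proxy settings are an assumption for a corporate-proxy setup like yours (the AWS SDK’s default HTTP client should honor them), and all the values here are placeholders:

env {
  # static credentials, as an alternative to an instance role
  AWS_ACCESS_KEY_ID     = "<access key id>"
  AWS_SECRET_ACCESS_KEY = "<secret access key>"
  AWS_REGION            = "eu-central-1"

  # assumption: route the plugin's outbound EC2 calls through the
  # corporate proxy, but keep the instance metadata endpoint direct
  HTTPS_PROXY = "http://proxy.example.corp:3128"
  NO_PROXY    = "169.254.169.254"
}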

Here’s a slightly redacted version of the IAM role and policy we use for our E2E test cluster so that it can use the AWS EBS and AWS EFS plugins. I’ll make sure this lands in that demo folder once I add the E2E example there.

# This role is the one used by the example test cluster and is attached as an
# instance role.
resource "aws_iam_role" "nomad_example_cluster" {
  description        = "IAM role for example clusters"
  name               = "nomad_example_cluster"
  path               = "/"
  assume_role_policy = data.aws_iam_policy_document.assume_role_nomad_example_cluster.json

  tags = {
    source = "github.com/hashicorp/<redacted>"
  }
}

data "aws_iam_policy_document" "assume_role_nomad_example_cluster" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ec2.amazonaws.com"]
    }
  }
}

# allow the role to be attached to an AWS instance so that the instance can
# make its own AWS API calls
resource "aws_iam_instance_profile" "nomad_example_cluster" {
  name = "nomad_example_cluster"
  role = aws_iam_role.nomad_example_cluster.name
}

# attach the policy to the instance role
resource "aws_iam_role_policy" "nomad_example_cluster" {
  name   = "nomad_example_cluster"
  role   = aws_iam_role.nomad_example_cluster.id
  policy = data.aws_iam_policy_document.nomad_example_cluster.json
}

# This policy allows this instance to autodiscover the rest of the cluster
# and use CSI volumes.
data "aws_iam_policy_document" "nomad_example_cluster" {

  statement {
    effect = "Allow"

    actions = [
      "ec2:DescribeInstances",
      "ec2:DescribeTags",
      "ec2:DescribeVolume*",
      "ec2:AttachVolume",
      "ec2:DetachVolume",
      "autoscaling:DescribeAutoScalingGroups",
    ]
    resources = ["*"]
  }

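  # assumption: the KMS permissions below are only needed if the volumes
  # are encrypted with this customer-managed key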
  statement {
    effect    = "Allow"

    actions = [
      "kms:Encrypt",
      "kms:Decrypt",
      "kms:DescribeKey",
    ]

    resources = [
      aws_kms_key.example.arn
    ]
  }

}