Hi everyone,
I am facing issues with the aws_eks_node_group resource. We implemented it nearly a year ago, and it worked well until now. Our last successful deployment was on June 3, 2024.
Now the EC2 instances fail to start. This might be related to a problem creating the EBS volume. I trimmed the config down to a minimum that still reproduces the issue.
Here are my Terraform version and the providers in use:
terraform -version
Terraform v1.9.1
on linux_amd64
+ provider registry.terraform.io/hashicorp/aws v5.57.0
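Since the last successful deployment was on June 3, 2024, it may help to rule out a Terraform or provider upgrade as the cause by pinning the versions explicitly. A minimal sketch, assuming the provider listed above is hashicorp/aws; if the versions used for the June deployment are known, those would be the ones to pin instead:

```hcl
terraform {
  # Exact pins so every run uses the same Terraform and provider binaries.
  required_version = "= 1.9.1"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "= 5.57.0"
    }
  }
}
```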
Errors:
Terraform apply error:
Error: waiting for EKS Node Group (debug-ng-ws:eks-debug-ng-ws-ng-private) create: unexpected state 'CREATE_FAILED', wanted target 'ACTIVE'. last error: eks-eks-debug-ng-ws-ng-private-26c84b42-80de-d6a8-2232-bc830b0a6f0f: AsgInstanceLaunchFailures: Instance became unhealthy while waiting for instance to be in InService state. Termination Reason: Client.InvalidKMSKey.InvalidState: The KMS key provided is in an incorrect state
i-0369e57d7ef547cd3, i-0430ea232f7bdf688, i-0566cf2c9faec50c6, i-08690ad964551bbeb, i-0b52af5dd5674724d: NodeCreationFailure: Instances failed to join the kubernetes cluster
  with aws_eks_node_group.eks_ng_public,
  on main.tf line 120, in resource "aws_eks_node_group" "eks_ng_public":
 120: resource "aws_eks_node_group" "eks_ng_public" {
The EC2 instance shows the following error:
State transition reason: Client.InternalError
State transition message: Client.InvalidKMSKey.InvalidState: The KMS key provided is in an incorrect state
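The message names a KMS key even though the launch template configures none. One setting worth ruling out (an assumption, not something the output above confirms): if EBS encryption by default is enabled for the account in eu-central-1, EC2 encrypts every new volume with the account's default KMS key, so a disabled or pending-deletion default key would break every launch. A read-only sketch that surfaces those settings; the data source and output names are my own choice:

```hcl
# Read-only diagnostics; nothing here changes account settings.

# Is EBS encryption by default enabled in this account and region?
data "aws_ebs_encryption_by_default" "current" {}

# Which KMS key would EC2 use for default encryption?
data "aws_ebs_default_kms_key" "current" {}

# Look up that key to see its state (Enabled, Disabled, PendingDeletion, ...).
# This lookup errors out if the caller may not describe the key (for example
# a key in another account) or if the key no longer exists.
data "aws_kms_key" "ebs_default" {
  key_id = data.aws_ebs_default_kms_key.current.key_arn
}

output "ebs_encryption_by_default" {
  value = data.aws_ebs_encryption_by_default.current.enabled
}

output "ebs_default_kms_key_arn" {
  value = data.aws_ebs_default_kms_key.current.key_arn
}

output "ebs_default_kms_key_state" {
  value = data.aws_kms_key.ebs_default.key_state
}
```

If `ebs_encryption_by_default` is true and the key state is anything other than `Enabled`, that would be consistent with `Client.InvalidKMSKey.InvalidState`.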
This puzzled me, because I don't use a KMS key for this EBS volume. Explicitly specifying a KMS key for the EBS volume did not change anything either. I then looked for the volumes themselves, but could not find them, even after setting delete_on_termination = false.
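When a customer managed key is specified explicitly, its key policy also has to let the Auto Scaling service-linked role use it, because the node group's instances are launched by Auto Scaling (the CloudTrail event below shows AWSServiceRoleForAutoScaling as the caller). A sketch of such a key, modelled on the key policy AWS documents for Auto Scaling with encrypted volumes; the resource names, local name and description are hypothetical:

```hcl
data "aws_caller_identity" "current" {}

locals {
  # Service-linked role that calls RunInstances for the node group's ASG.
  asg_service_role_arn = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:role/aws-service-role/autoscaling.amazonaws.com/AWSServiceRoleForAutoScaling"
}

data "aws_iam_policy_document" "ebs_key" {
  # Keep the account in control of the key (default key policy statement).
  statement {
    sid       = "EnableIAMUserPermissions"
    effect    = "Allow"
    actions   = ["kms:*"]
    resources = ["*"]

    principals {
      type        = "AWS"
      identifiers = ["arn:aws:iam::${data.aws_caller_identity.current.account_id}:root"]
    }
  }

  # Allow the Auto Scaling service-linked role to use the key for EBS volumes.
  statement {
    sid    = "AllowAutoScalingUseOfKey"
    effect = "Allow"
    actions = [
      "kms:Encrypt",
      "kms:Decrypt",
      "kms:ReEncrypt*",
      "kms:GenerateDataKey*",
      "kms:DescribeKey",
    ]
    resources = ["*"]

    principals {
      type        = "AWS"
      identifiers = [local.asg_service_role_arn]
    }
  }

  # Allow the role to create grants, but only for AWS resources such as EBS.
  statement {
    sid       = "AllowAutoScalingGrantsForAWSResources"
    effect    = "Allow"
    actions   = ["kms:CreateGrant"]
    resources = ["*"]

    principals {
      type        = "AWS"
      identifiers = [local.asg_service_role_arn]
    }

    condition {
      test     = "Bool"
      variable = "kms:GrantIsForAWSResource"
      values   = ["true"]
    }
  }
}

resource "aws_kms_key" "ebs" {
  description             = "EBS key for EKS node volumes (hypothetical)"
  deletion_window_in_days = 7
  policy                  = data.aws_iam_policy_document.ebs_key.json
}
```

The launch template's `ebs` block would then set `encrypted = true` and `kms_key_id = aws_kms_key.ebs.arn`.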
This is where I got stuck. I assume the EBS volume cannot be created during startup, but I can't find any message saying that creating it failed. I went through the CloudTrail events but haven't found anything wrong there yet (I may have to dig deeper). I also shouldn't have hit any quotas: in parallel, I created a plain EC2 instance with EBS storage using your aws_instance resource, and that worked.
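To make that quota check a closer comparison, the control instance could use the same EKS-optimized AMI and root volume settings as the node group. One difference remains by design: here RunInstances is called with your own credentials, whereas the node group's instances are launched by the Auto Scaling service-linked role. If this instance starts while the node group keeps failing, that would suggest looking at how that role may use the key rather than at quotas. A sketch; the resource name and tag value are hypothetical, and `<SUBNET_1>` is the placeholder from the config below:

```hcl
# Hypothetical control instance: same EKS-optimized AMI and root volume
# settings as the node group, but launched directly by Terraform instead of
# by EKS/Auto Scaling.
data "aws_ssm_parameter" "eks_ami_image_id" {
  name = "/aws/service/eks/optimized-ami/1.30/amazon-linux-2/recommended/image_id"
}

resource "aws_instance" "kms_control" {
  ami           = nonsensitive(data.aws_ssm_parameter.eks_ami_image_id.value)
  instance_type = "t3.medium"
  subnet_id     = "<SUBNET_1>"

  root_block_device {
    volume_size           = 40
    volume_type           = "gp3"
    delete_on_termination = true
  }

  tags = {
    Name = "${local.cluster_name}-kms-control-instance"
  }
}
```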
Terraform config:
locals {
  cluster_name = terraform.workspace
}

resource "aws_eks_cluster" "example" {
  name     = local.cluster_name
  role_arn = aws_iam_role.example.arn
  version  = "1.30"

  vpc_config {
    subnet_ids = ["<SUBNET_1>", "<SUBNET_2>"]
  }

  # Ensure that IAM Role permissions are created before and deleted after EKS Cluster handling.
  # Otherwise, EKS will not be able to properly delete EKS managed EC2 infrastructure such as Security Groups.
  depends_on = [
    aws_iam_role_policy_attachment.example-AmazonEKSClusterPolicy,
    aws_iam_role_policy_attachment.example-AmazonEKSVPCResourceController,
  ]
}

output "endpoint" {
  value = aws_eks_cluster.example.endpoint
}

data "aws_iam_policy_document" "assume_role" {
  statement {
    effect = "Allow"

    principals {
      type        = "Service"
      identifiers = ["eks.amazonaws.com"]
    }

    actions = ["sts:AssumeRole"]
  }
}

resource "aws_iam_role" "example" {
  name               = "eks-cluster-example"
  assume_role_policy = data.aws_iam_policy_document.assume_role.json
}

resource "aws_iam_role_policy_attachment" "example-AmazonEKSClusterPolicy" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKSClusterPolicy"
  role       = aws_iam_role.example.name
}

# Optionally, enable Security Groups for Pods
# Reference: https://docs.aws.amazon.com/eks/latest/userguide/security-groups-for-pods.html
resource "aws_iam_role_policy_attachment" "example-AmazonEKSVPCResourceController" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKSVPCResourceController"
  role       = aws_iam_role.example.name
}

resource "aws_iam_role_policy_attachment" "eks-AmazonEKSWorkerNodePolicy" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy"
  role       = aws_iam_role.eks_nodegroup_role.name
}

resource "aws_iam_role_policy_attachment" "eks-AmazonEKS_CNI_Policy" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy"
  role       = aws_iam_role.eks_nodegroup_role.name
}

resource "aws_iam_role_policy_attachment" "eks-AmazonEC2ContainerRegistryReadOnly" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"
  role       = aws_iam_role.eks_nodegroup_role.name
}

data "aws_iam_policy_document" "eks_nodegroup_role" {
  statement {
    effect = "Allow"
    actions = [
      "sts:AssumeRole"
    ]

    principals {
      type = "Service"
      identifiers = [
        "ec2.amazonaws.com"
      ]
    }
  }
}

resource "aws_iam_role" "eks_nodegroup_role" {
  name               = "eks-${local.cluster_name}-nodegroup-role"
  assume_role_policy = data.aws_iam_policy_document.eks_nodegroup_role.json
}

data "aws_ssm_parameter" "eks_ami_release_version" {
  name = "/aws/service/eks/optimized-ami/1.30/amazon-linux-2/recommended/release_version"
}

resource "aws_launch_template" "eks_nodegroup_unblu_node" {
  name_prefix            = "${local.cluster_name}-eks-nodegroup-"
  instance_type          = "t3.medium"
  update_default_version = true

  block_device_mappings {
    device_name = "/dev/xvda"

    ebs {
      volume_size           = 40
      delete_on_termination = false
      volume_type           = "gp3"
    }
  }

  tag_specifications {
    resource_type = "instance"
    tags = {
      Name = "${local.cluster_name}-eks-nodegroup-instance"
    }
  }
}

resource "aws_eks_node_group" "eks_ng_public" {
  # Reference the cluster resource (rather than local.cluster_name) so that
  # Terraform creates the node group only after the cluster exists.
  cluster_name    = aws_eks_cluster.example.name
  node_group_name = "eks-${local.cluster_name}-ng-private"
  node_role_arn   = aws_iam_role.eks_nodegroup_role.arn
  subnet_ids      = ["<SUBNET_1>", "<SUBNET_2>"]
  release_version = nonsensitive(data.aws_ssm_parameter.eks_ami_release_version.value)
  ami_type        = "AL2_x86_64"
  capacity_type   = "ON_DEMAND"

  launch_template {
    id      = aws_launch_template.eks_nodegroup_unblu_node.id
    version = aws_launch_template.eks_nodegroup_unblu_node.default_version
  }

  scaling_config {
    desired_size = 1
    min_size     = 1
    max_size     = 2
  }

  update_config {
    max_unavailable = 1
  }

  depends_on = [
    aws_iam_role_policy_attachment.eks-AmazonEKSWorkerNodePolicy,
    aws_iam_role_policy_attachment.eks-AmazonEKS_CNI_Policy,
    aws_iam_role_policy_attachment.eks-AmazonEC2ContainerRegistryReadOnly,
  ]
}
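To double-check whether any root volume was created at all, the volumes could be queried by the `eks:cluster-name` tag that the EKS-generated launch template (further down) applies to volumes. A read-only sketch; the data source and output names are hypothetical:

```hcl
# Find EBS volumes carrying the tag that the EKS-generated launch template
# puts on node volumes.
data "aws_ebs_volumes" "nodegroup" {
  tags = {
    "eks:cluster-name" = local.cluster_name
  }
}

output "nodegroup_volume_ids" {
  value = data.aws_ebs_volumes.nodegroup.ids
}
```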
Auto Scaling group config in AWS:
{
"AutoScalingGroupName": "eks-eks-debug-ng-ws-ng-private-26c84b42-80de-d6a8-2232-bc830b0a6f0f",
"AutoScalingGroupARN": "arn:aws:autoscaling:eu-central-1:<acc_id>:autoScalingGroup:1bd32e46-b892-41ac-b685-b8f296b8d44b:autoScalingGroupName/eks-eks-debug-ng-ws-ng-private-26c84b42-80de-d6a8-2232-bc830b0a6f0f",
"MixedInstancesPolicy": {
"LaunchTemplate": {
"LaunchTemplateSpecification": {
"LaunchTemplateId": "lt-07b30ca1a5f8d78ec",
"LaunchTemplateName": "eks-26c84b42-80de-d6a8-2232-bc830b0a6f0f",
"Version": "1"
},
"Overrides": []
},
"InstancesDistribution": {
"OnDemandAllocationStrategy": "prioritized",
"OnDemandBaseCapacity": 0,
"OnDemandPercentageAboveBaseCapacity": 100,
"SpotAllocationStrategy": "lowest-price",
"SpotInstancePools": 2
}
},
"MinSize": 1,
"MaxSize": 2,
"DesiredCapacity": 1,
"DefaultCooldown": 300,
"AvailabilityZones": [
"eu-central-1a",
"eu-central-1b"
],
"LoadBalancerNames": [],
"TargetGroupARNs": [],
"HealthCheckType": "EC2",
"HealthCheckGracePeriod": 15,
"Instances": [],
"CreatedTime": "2024-07-09T08:04:19.473000+00:00",
"SuspendedProcesses": [],
"VPCZoneIdentifier": "subnet-0d6292d46fdd74b72,subnet-046165e315ef7acb7",
"EnabledMetrics": [
{
"Metric": "GroupTotalInstances",
"Granularity": "1Minute"
},
{
"Metric": "GroupMaxSize",
"Granularity": "1Minute"
},
{
"Metric": "GroupDesiredCapacity",
"Granularity": "1Minute"
},
{
"Metric": "GroupInServiceCapacity",
"Granularity": "1Minute"
},
{
"Metric": "GroupMinSize",
"Granularity": "1Minute"
},
{
"Metric": "GroupStandbyCapacity",
"Granularity": "1Minute"
},
{
"Metric": "GroupTerminatingCapacity",
"Granularity": "1Minute"
},
{
"Metric": "WarmPoolWarmedCapacity",
"Granularity": "1Minute"
},
{
"Metric": "GroupAndWarmPoolTotalCapacity",
"Granularity": "1Minute"
},
{
"Metric": "WarmPoolTotalCapacity",
"Granularity": "1Minute"
},
{
"Metric": "GroupInServiceInstances",
"Granularity": "1Minute"
},
{
"Metric": "GroupStandbyInstances",
"Granularity": "1Minute"
},
{
"Metric": "WarmPoolTerminatingCapacity",
"Granularity": "1Minute"
},
{
"Metric": "WarmPoolPendingCapacity",
"Granularity": "1Minute"
},
{
"Metric": "GroupTotalCapacity",
"Granularity": "1Minute"
},
{
"Metric": "GroupPendingInstances",
"Granularity": "1Minute"
},
{
"Metric": "GroupPendingCapacity",
"Granularity": "1Minute"
},
{
"Metric": "WarmPoolMinSize",
"Granularity": "1Minute"
},
{
"Metric": "GroupTerminatingInstances",
"Granularity": "1Minute"
},
{
"Metric": "GroupAndWarmPoolDesiredCapacity",
"Granularity": "1Minute"
},
{
"Metric": "WarmPoolDesiredCapacity",
"Granularity": "1Minute"
}
],
"Tags": [
{
"ResourceId": "eks-eks-debug-ng-ws-ng-private-26c84b42-80de-d6a8-2232-bc830b0a6f0f",
"ResourceType": "auto-scaling-group",
"Key": "eks:cluster-name",
"Value": "debug-ng-ws",
"PropagateAtLaunch": true
},
{
"ResourceId": "eks-eks-debug-ng-ws-ng-private-26c84b42-80de-d6a8-2232-bc830b0a6f0f",
"ResourceType": "auto-scaling-group",
"Key": "eks:nodegroup-name",
"Value": "eks-debug-ng-ws-ng-private",
"PropagateAtLaunch": true
},
{
"ResourceId": "eks-eks-debug-ng-ws-ng-private-26c84b42-80de-d6a8-2232-bc830b0a6f0f",
"ResourceType": "auto-scaling-group",
"Key": "k8s.io/cluster-autoscaler/debug-ng-ws",
"Value": "owned",
"PropagateAtLaunch": true
},
{
"ResourceId": "eks-eks-debug-ng-ws-ng-private-26c84b42-80de-d6a8-2232-bc830b0a6f0f",
"ResourceType": "auto-scaling-group",
"Key": "k8s.io/cluster-autoscaler/enabled",
"Value": "true",
"PropagateAtLaunch": true
},
{
"ResourceId": "eks-eks-debug-ng-ws-ng-private-26c84b42-80de-d6a8-2232-bc830b0a6f0f",
"ResourceType": "auto-scaling-group",
"Key": "kubernetes.io/cluster/debug-ng-ws",
"Value": "owned",
"PropagateAtLaunch": true
}
],
"TerminationPolicies": [
"AllocationStrategy",
"OldestLaunchTemplate",
"OldestInstance"
],
"NewInstancesProtectedFromScaleIn": false,
"ServiceLinkedRoleARN": "arn:aws:iam::<acc_id>:role/aws-service-role/autoscaling.amazonaws.com/AWSServiceRoleForAutoScaling",
"CapacityRebalance": true,
"TrafficSources": []
}
RunInstances event issued by Auto Scaling (from CloudTrail):
{
"eventVersion": "1.09",
"userIdentity": {
"type": "AssumedRole",
"principalId": "xx:AutoScaling",
"arn": "arn:aws:sts::<account_id>:assumed-role/AWSServiceRoleForAutoScaling/AutoScaling",
"accountId": "<account_id>",
"sessionContext": {
"sessionIssuer": {
"type": "Role",
"principalId": "xx",
"arn": "arn:aws:iam::<account_id>:role/aws-service-role/autoscaling.amazonaws.com/AWSServiceRoleForAutoScaling",
"accountId": "<account_id>",
"userName": "AWSServiceRoleForAutoScaling"
},
"attributes": {
"creationDate": "2024-07-09T08:05:43Z",
"mfaAuthenticated": "false"
}
},
"invokedBy": "autoscaling.amazonaws.com"
},
"eventTime": "2024-07-09T08:05:45Z",
"eventSource": "ec2.amazonaws.com",
"eventName": "RunInstances",
"awsRegion": "eu-central-1",
"sourceIPAddress": "autoscaling.amazonaws.com",
"userAgent": "autoscaling.amazonaws.com",
"requestParameters": {
"instancesSet": {
"items": [
{
"minCount": 1,
"maxCount": 1
}
]
},
"instanceType": "t3.medium",
"blockDeviceMapping": {},
"availabilityZone": "eu-central-1b",
"monitoring": {
"enabled": false
},
"subnetId": "<>",
"disableApiTermination": false,
"disableApiStop": false,
"clientToken": "<>",
"tagSpecificationSet": {
"items": [
{
"resourceType": "instance",
"tags": [
{
"key": "aws:autoscaling:groupName",
"value": "eks-eks-debug-ng-ws-ng-private-26c84b42-80de-d6a8-2232-bc830b0a6f0f"
},
{
"key": "k8s.io/cluster-autoscaler/debug-ng-ws",
"value": "owned"
},
{
"key": "eks:cluster-name",
"value": "debug-ng-ws"
},
{
"key": "eks:nodegroup-name",
"value": "eks-debug-ng-ws-ng-private"
},
{
"key": "k8s.io/cluster-autoscaler/enabled",
"value": "true"
},
{
"key": "kubernetes.io/cluster/debug-ng-ws",
"value": "owned"
},
{
"key": "aws:ec2:fleet-id",
"value": "fleet-a43dbc0f-30ae-441e-0eba-a520ab8fcd00"
}
]
}
]
},
"launchTemplate": {
"launchTemplateId": "lt-07b30ca1a5f8d78ec",
"version": "1"
}
},
"responseElements": {
"requestId": "ef8fc7af-0592-4ef4-b3f0-a97bf4969d44",
"reservationId": "r-0c158e431062c416e",
"ownerId": "<account_id>",
"groupSet": {},
"instancesSet": {
"items": [
{
"instanceId": "i-0369e57d7ef547cd3",
"imageId": "ami-01cef8fa4f67335f1",
"currentInstanceBootMode": "legacy-bios",
"instanceState": {
"code": 0,
"name": "pending"
},
"privateDnsName": "ip-10-219-67-34.eu-central-1.compute.internal",
"amiLaunchIndex": 0,
"productCodes": {},
"instanceType": "t3.medium",
"launchTime": 1720512344000,
"placement": {
"availabilityZone": "eu-central-1b",
"tenancy": "default"
},
"monitoring": {
"state": "disabled"
},
"subnetId": "subnet-0d6292d46fdd74b72",
"vpcId": "vpc-0ba99d1beb2336a47",
"privateIpAddress": "<>",
"stateReason": {
"code": "pending",
"message": "pending"
},
"architecture": "x86_64",
"rootDeviceType": "ebs",
"rootDeviceName": "/dev/xvda",
"blockDeviceMapping": {},
"virtualizationType": "hvm",
"hypervisor": "xen",
"tagSet": {
"items": [
{
"key": "aws:autoscaling:groupName",
"value": "eks-eks-debug-ng-ws-ng-private-26c84b42-80de-d6a8-2232-bc830b0a6f0f"
},
{
"key": "k8s.io/cluster-autoscaler/debug-ng-ws",
"value": "owned"
},
{
"key": "kubernetes.io/cluster/debug-ng-ws",
"value": "owned"
},
{
"key": "aws:ec2launchtemplate:version",
"value": "1"
},
{
"key": "eks:cluster-name",
"value": "debug-ng-ws"
},
{
"key": "Name",
"value": "debug-ng-ws-eks-nodegroup-instance"
},
{
"key": "k8s.io/cluster-autoscaler/enabled",
"value": "true"
},
{
"key": "eks:nodegroup-name",
"value": "eks-debug-ng-ws-ng-private"
},
{
"key": "aws:ec2:fleet-id",
"value": "fleet-a43dbc0f-30ae-441e-0eba-a520ab8fcd00"
},
{
"key": "aws:ec2launchtemplate:id",
"value": "lt-07b30ca1a5f8d78ec"
}
]
},
"clientToken": "fleet-a43dbc0f-30ae-441e-0eba-a520ab8fcd00-0",
"groupSet": {
"items": [
{
"groupId": "sg-011e1d01e3f803100",
"groupName": "eks-cluster-sg-debug-ng-ws-1645544310"
}
]
},
"sourceDestCheck": true,
"networkInterfaceSet": {
"items": [
{
"networkInterfaceId": "eni-0f0e7c74e2c881539",
"subnetId": "<>",
"vpcId": "vpc-xx",
"ownerId": "<account_id>",
"status": "in-use",
"macAddress": "06:2f:35:ac:26:41",
"privateIpAddress": "10.219.67.34",
"privateDnsName": "ip-10-219-67-34.eu-central-1.compute.internal",
"sourceDestCheck": true,
"interfaceType": "interface",
"groupSet": {
"items": [
{
"groupId": "sg-011e1d01e3f803100",
"groupName": "eks-cluster-sg-debug-ng-ws-1645544310"
}
]
},
"attachment": {
"attachmentId": "eni-attach-0d5f7017b37945c11",
"deviceIndex": 0,
"networkCardIndex": 0,
"status": "attaching",
"attachTime": 1720512344000,
"deleteOnTermination": true
},
"privateIpAddressesSet": {
"item": [
{
"privateIpAddress": "10.219.67.34",
"privateDnsName": "ip-10-219-67-34.eu-central-1.compute.internal",
"primary": true
}
]
},
"ipv6AddressesSet": {},
"tagSet": {}
}
]
},
"iamInstanceProfile": {
"arn": "arn:aws:iam::<>:instance-profile/eks-26c84b42-80de-d6a8-2232-bc830b0a6f0f",
"id": "AIPASTEBMOGY2ZQRWZWFM"
},
"ebsOptimized": false,
"enaSupport": true,
"cpuOptions": {
"coreCount": 1,
"threadsPerCore": 2
},
"capacityReservationSpecification": {
"capacityReservationPreference": "open"
},
"enclaveOptions": {
"enabled": false
},
"metadataOptions": {
"state": "pending",
"httpTokens": "optional",
"httpPutResponseHopLimit": 2,
"httpEndpoint": "enabled",
"httpProtocolIpv4": "enabled",
"httpProtocolIpv6": "disabled",
"instanceMetadataTags": "disabled"
},
"maintenanceOptions": {
"autoRecovery": "default"
},
"privateDnsNameOptions": {
"hostnameType": "ip-name",
"enableResourceNameDnsARecord": false,
"enableResourceNameDnsAAAARecord": false
}
}
]
},
"requesterId": "031357697034"
},
"requestID": "ef8fc7af-0592-4ef4-b3f0-a97bf4969d44",
"eventID": "2645cfa4-64d3-4448-8402-6ddaef6fb02a",
"readOnly": false,
"eventType": "AwsApiCall",
"managementEvent": true,
"recipientAccountId": "<>",
"eventCategory": "Management"
}
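The event shows that RunInstances is issued by the AWSServiceRoleForAutoScaling role (`invokedBy: autoscaling.amazonaws.com`), not by the Terraform user. If the volume ends up encrypted with a customer managed key, for example the account's default EBS key, that role needs permission to use the key, and AWS documents granting it through a KMS grant. A Terraform sketch of that grant; `var.ebs_kms_key_arn` and the grant name are hypothetical. A grant does not help if the key itself is disabled or pending deletion; in that case the key would have to be re-enabled or replaced.

```hcl
# Hypothetical input: ARN of the customer managed key used for the node
# volumes (for example the account's default EBS key).
variable "ebs_kms_key_arn" {
  type = string
}

# The service-linked role Auto Scaling uses to call RunInstances.
data "aws_iam_role" "autoscaling_slr" {
  name = "AWSServiceRoleForAutoScaling"
}

# Let that role use the key for creating encrypted EBS volumes.
resource "aws_kms_grant" "autoscaling_ebs" {
  name              = "autoscaling-ebs-volumes"
  key_id            = var.ebs_kms_key_arn
  grantee_principal = data.aws_iam_role.autoscaling_slr.arn

  operations = [
    "Encrypt",
    "Decrypt",
    "ReEncryptFrom",
    "ReEncryptTo",
    "GenerateDataKey",
    "GenerateDataKeyWithoutPlaintext",
    "DescribeKey",
    "CreateGrant",
  ]
}
```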
Launch template generated by EKS (the one the Auto Scaling group uses):
{
"LaunchTemplateVersions": [
{
"LaunchTemplateId": "lt-07b30ca1a5f8d78ec",
"LaunchTemplateName": "eks-26c84b42-80de-d6a8-2232-bc830b0a6f0f",
"VersionNumber": 1,
"CreateTime": "2024-07-09T08:03:46+00:00",
"CreatedBy": "arn:aws:sts::<acc_id>:assumed-role/AWSServiceRoleForAmazonEKSNodegroup/EKS",
"DefaultVersion": true,
"LaunchTemplateData": {
"IamInstanceProfile": {
"Name": "eks-26c84b42-80de-d6a8-2232-bc830b0a6f0f"
},
"BlockDeviceMappings": [
{
"DeviceName": "/dev/xvda",
"Ebs": {
"DeleteOnTermination": true,
"VolumeSize": 40,
"VolumeType": "gp3"
}
}
],
"ImageId": "ami-01cef8fa4f67335f1",
"InstanceType": "t3.medium",
"UserData": "<udata>",
"TagSpecifications": [
{
"ResourceType": "volume",
"Tags": [
{
"Key": "eks:nodegroup-name",
"Value": "eks-debug-ng-ws-ng-private"
},
{
"Key": "eks:cluster-name",
"Value": "debug-ng-ws"
}
]
},
{
"ResourceType": "instance",
"Tags": [
{
"Key": "eks:nodegroup-name",
"Value": "eks-debug-ng-ws-ng-private"
},
{
"Key": "Name",
"Value": "debug-ng-ws-eks-nodegroup-instance"
},
{
"Key": "eks:cluster-name",
"Value": "debug-ng-ws"
}
]
}
],
"SecurityGroupIds": [
"sg-011e1d01e3f803100"
],
"MetadataOptions": {
"HttpPutResponseHopLimit": 2,
"InstanceMetadataTags": "disabled"
}
}
}
]
}
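This generated launch template sets neither `Encrypted` nor `KmsKeyId`, and the AMI's root snapshot (below) is unencrypted, so if the root volume is encrypted at all, the key can only come from the account's EBS encryption-by-default setting. If that default turns out to point at an unusable key, one option is to point it back at the AWS managed EBS key. This is an account- and region-wide change that affects every new EBS volume, and the sketch assumes the `alias/aws/ebs` key already exists in eu-central-1; the data source and resource names are hypothetical:

```hcl
# WARNING: account- and region-wide setting; it affects every new EBS volume
# created in this region, not just the node group.

# Assumes the AWS managed EBS key already exists in the region.
data "aws_kms_alias" "aws_ebs" {
  name = "alias/aws/ebs"
}

resource "aws_ebs_default_kms_key" "this" {
  key_arn = data.aws_kms_alias.aws_ebs.target_key_arn
}
```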
AMI used:
aws ec2 describe-images --region eu-central-1 --image-ids ami-01cef8fa4f67335f1
{
"Images": [
{
"Architecture": "x86_64",
"CreationDate": "2024-06-25T19:33:01.000Z",
"ImageId": "ami-01cef8fa4f67335f1",
"ImageLocation": "amazon/amazon-eks-node-1.30-v20240625",
"ImageType": "machine",
"Public": true,
"OwnerId": "602401143452",
"PlatformDetails": "Linux/UNIX",
"UsageOperation": "RunInstances",
"State": "available",
"BlockDeviceMappings": [
{
"DeviceName": "/dev/xvda",
"Ebs": {
"DeleteOnTermination": true,
"SnapshotId": "snap-053b6879c5e2d06db",
"VolumeSize": 20,
"VolumeType": "gp2",
"Encrypted": false
}
}
],
"Description": "EKS Kubernetes Worker AMI with AmazonLinux2 image, (k8s: 1.30.0, containerd: 1.7.*)",
"EnaSupport": true,
"Hypervisor": "xen",
"ImageOwnerAlias": "amazon",
"Name": "amazon-eks-node-1.30-v20240625",
"RootDeviceName": "/dev/xvda",
"RootDeviceType": "ebs",
"SriovNetSupport": "simple",
"VirtualizationType": "hvm",
"DeprecationTime": "2026-06-25T19:33:01.000Z"
}
]
}
I would be glad if someone has an idea.