Unable to create gpu vm in AKS default node pool using Terraform

mark25913 · October 30, 2022, 10:29am

My target:
Use Terraform to create a GPU node pool with a GPU VM (Standard_NC6s_v3).

My Terraform result:
It seemed to work: the GitLab CI/CD pipeline succeeded (terraform validate, plan, and apply all passed). However, the node only showed up in the VMSS of the Azure node resource group, not as a node in the AKS cluster (node count=1, Provisioning state: successful, Power state: Running 0/0 nodes ready). When I checked the cluster through the k8s API (kubectl get nodes), there was no GPU node in the GPU node pool.

My gpu test:
Because there was no error message in the GitLab CI/CD pipeline, I used an extreme example to find out why the GPU node failed. I tried to create a GPU node in the AKS default node pool using Terraform, and the Azure portal showed a different result (node count=1, Provisioning state: Failed, Power state: Running 0/0 nodes ready). This time, when I checked the cluster events, I found an error message:

Reason: InvalidDiskCapacity, 
Message: invalid capacity 0 on image filesystem

I also ran kubectl get nodes to check the node info, but there was still no node in my cluster (the GPU VM only showed up in the VMSS of the Azure node resource group).

I then used the az CLI with the same configuration to test creating this GPU node in AKS, and the node was created in the default node pool successfully! (As another test, I used the az CLI to create a new GPU node pool and added a VM of the same GPU type, Standard_NC6s_v3, to it. That also succeeded.)

The az CLI command:

az aks create --resource-group gpu-default-node-pool \
    --name gpu-default-node-pool-cluster \
    --kubernetes-version 1.24.3 \
    --auto-upgrade-channel patch \
    --vm-set-type VirtualMachineScaleSets \
    --os-sku Ubuntu \
    --node-count 1 \
    --min-count 1 \
    --max-count 2 \
    --nodepool-name gputest \
    --enable-cluster-autoscaler \
    --node-vm-size Standard_NC6s_v3 \
    --node-osdisk-type Ephemeral \
    --node-osdisk-size 336 \
    --network-plugin kubenet

My AKS Terraform code:

resource "azurerm_resource_group" "rg" {
  name     = "gpu-default-node-pool"
  location = "japaneast"
}

resource "azurerm_kubernetes_cluster" "k8s" {
  name                      = "gpu-default-node-pool-cluster"
  location                  = azurerm_resource_group.rg.location
  resource_group_name       = azurerm_resource_group.rg.name
  kubernetes_version        = "1.24.3"
  automatic_channel_upgrade = "patch"
  sku_tier                  = var.sku_tier

  default_node_pool {
    type            = "VirtualMachineScaleSets"
    name            = "gputest"
    vm_size         = "Standard_NC6s_v3"
    os_sku          = "Ubuntu"
    os_disk_type    = "Ephemeral"
    os_disk_size_gb = 336

    enable_auto_scaling = true
    min_count           = 1
    max_count           = 2
  }

  identity {
    type = "SystemAssigned"
  }

  network_profile {
    network_plugin = "kubenet"
  }
}
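
For comparison, a separate GPU node pool (the original target above) could be declared along these lines. This is only a sketch, assuming the azurerm 3.x provider and the cluster resource shown above. It is not the exact code from the pipeline, and the resource label "gpu" and the pool name "gpupool" are hypothetical.

```hcl
# Hypothetical sketch: a separate GPU node pool attached to the cluster above.
# The resource label "gpu" and the pool name "gpupool" are illustrative only.
resource "azurerm_kubernetes_cluster_node_pool" "gpu" {
  name                  = "gpupool"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.k8s.id
  vm_size               = "Standard_NC6s_v3"
  os_sku                = "Ubuntu"
  os_disk_type          = "Ephemeral"
  os_disk_size_gb       = 336

  # Same autoscaler bounds as the default node pool in the cluster resource.
  enable_auto_scaling = true
  min_count           = 1
  max_count           = 2
}
```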

My question:

Why can the az CLI create the GPU node in the AKS cluster successfully, while Terraform cannot with the same parameters?
I am not sure whether I am missing other required parameters in the Terraform resource azurerm_kubernetes_cluster.

Thanks for your reply.
