My target:
Use Terraform to create a GPU node pool with a GPU VM size (Standard_NC6s_v3).
My Terraform result:
The GitLab CI/CD pipeline appeared to succeed (terraform validate, plan and apply all passed). However, the node showed up only in the VMSS of the Azure node group, not as a node in the AKS cluster (node count=1, Provisioning state: successful, Power state: Running 0/0 nodes ready). When I checked the cluster with kubectl get nodes, there was no GPU node in the GPU node pool.
My GPU test:
Because there was no error message in the GitLab CI/CD pipeline, I used an extreme case to find out why the GPU node failed. I tried to create a GPU node in the AKS default node pool using Terraform, and this time the Azure portal showed a different result (node count=1, Provisioning state: Failed, Power state: Running 0/0 nodes ready). When I checked the cluster events, I found this error message:
Reason: InvalidDiskCapacity,
Message: invalid capacity 0 on image filesystem
I also ran kubectl get nodes to check the node info, but there was still no node in my cluster (the GPU VM showed up only in the VMSS of the Azure node group).
I then used the az CLI with the same configuration to test creating the GPU node in AKS, and the node was created in the default node pool successfully! (As another test, I then used the az CLI to create a new GPU node pool and add a VM of the same GPU size, Standard_NC6s_v3, to it. That also succeeded!)
The az CLI command:
az aks create --resource-group gpu-default-node-pool \
  --name gpu-default-node-pool-cluster \
  --kubernetes-version 1.24.3 \
  --auto-upgrade-channel patch \
  --vm-set-type VirtualMachineScaleSets \
  --os-sku Ubuntu \
  --node-count 1 \
  --min-count 1 \
  --max-count 2 \
  --nodepool-name gputest \
  --enable-cluster-autoscaler \
  --node-vm-size Standard_NC6s_v3 \
  --node-osdisk-type Ephemeral \
  --node-osdisk-size 336 \
  --network-plugin kubenet
My AKS Terraform code:
resource "azurerm_resource_group" "rg" {
  name     = "gpu-default-node-pool"
  location = "japaneast"
}

resource "azurerm_kubernetes_cluster" "k8s" {
  name                      = "gpu-default-node-pool-cluster"
  location                  = azurerm_resource_group.rg.location
  resource_group_name       = azurerm_resource_group.rg.name
  kubernetes_version        = "1.24.3"
  automatic_channel_upgrade = "patch"
  sku_tier                  = var.sku_tier

  default_node_pool {
    type                = "VirtualMachineScaleSets"
    name                = "gputest"
    vm_size             = "Standard_NC6s_v3"
    os_sku              = "Ubuntu"
    os_disk_type        = "Ephemeral"
    os_disk_size_gb     = 336
    enable_auto_scaling = true
    min_count           = 1
    max_count           = 2
  }

  identity {
    type = "SystemAssigned"
  }

  network_profile {
    network_plugin = "kubenet"
  }
}
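For reference, the separate GPU node pool described in my target would look roughly like the sketch below. This is only an illustration, not the exact code from my pipeline: the resource label "gpu" and the pool name "gpupool" are hypothetical, and the argument names assume the same azurerm 3.x provider syntax as the cluster resource above.

```hcl
# Sketch only: a separate GPU node pool attached to the cluster defined above.
# The resource label "gpu" and the pool name "gpupool" are hypothetical.
resource "azurerm_kubernetes_cluster_node_pool" "gpu" {
  name                  = "gpupool"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.k8s.id

  # Same VM size and OS disk settings as the default node pool test.
  vm_size         = "Standard_NC6s_v3"
  os_sku          = "Ubuntu"
  os_disk_type    = "Ephemeral"
  os_disk_size_gb = 336

  # Same autoscaler bounds as in the az CLI command.
  enable_auto_scaling = true
  min_count           = 1
  max_count           = 2
}
```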
My questions:
- Why can the az CLI create the GPU node in the AKS cluster successfully, while Terraform cannot with the same parameters?
- Am I missing other required parameters in the Terraform resource azurerm_kubernetes_cluster?
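One possible way to check whether the parameters really are the same is to read a cluster created with the az CLI back into Terraform and print its node pool and network settings. This is a minimal sketch under assumptions: the cluster and resource group names are hypothetical placeholders, and it relies on the azurerm_kubernetes_cluster data source exporting agent_pool_profile and network_profile.

```hcl
# Sketch only: read back a cluster that was created with the az CLI so its
# settings can be compared with the Terraform configuration.
# The cluster and resource group names are hypothetical placeholders.
data "azurerm_kubernetes_cluster" "cli_created" {
  name                = "cli-created-cluster"
  resource_group_name = "cli-created-rg"
}

# Node pool settings as Azure reports them for the CLI-created cluster.
output "cli_agent_pool_profile" {
  value = data.azurerm_kubernetes_cluster.cli_created.agent_pool_profile
}

# Network settings as Azure reports them for the CLI-created cluster.
output "cli_network_profile" {
  value = data.azurerm_kubernetes_cluster.cli_created.network_profile
}
```

Any value in these outputs that differs from the Terraform-created cluster would point to a default that the az CLI sets and the Terraform resource does not.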
Thanks for your reply.