Posted July 19, 2024

[HEADING=1]Pre-Job Health Checks on AKS: A Guide to Stable AI Workloads[/HEADING]

[HEADING=1]Introduction[/HEADING]

In the realm of AI workloads, ensuring the health and stability of compute nodes is critical. Training large AI models often spans months and relies on advanced AI supercomputers equipped with high-end GPUs such as NVIDIA A100 or H100, interconnected via InfiniBand for efficient communication. These training workloads are complex and tightly coupled, with frequent updates and communication handled by NCCL collectives. That complexity also brings fragility: a single failure, such as a dropped GPU or an InfiniBand link flap, can terminate the job and force a restart from the last checkpoint.

In traditional HPC schedulers such as SLURM, job prologs execute scripts before the main job begins, and customers often use them to run health checks before launching their workloads. Kubernetes offers an equivalent mechanism: init containers, which run before the main application container in a pod and are therefore a natural place to perform pre-job checks.

Ensuring healthy nodes has been a challenge on Azure for both traditional HPC and AI workloads. To address it, we maintain a standard set of tests for GPU/IB VMs on Azure, published in the azurehpc-health-checks repository on GitHub. These health checks are now included in our Azure HPC images, and they can automatically run on node startup for CycleCloud with SLURM or as a pre-job health check on Azure Machine Learning. They are also distributed as a container, aznhc-nv, available on the Microsoft Artifact Registry. Despite these advancements, we do not yet have a published solution for running these health checks on Azure Kubernetes Service (AKS).
This blog post remedies that gap by providing a step-by-step guide on how to run pre-job health checks on AKS, ensuring your AI/HPC workloads start on healthy nodes and run smoothly and efficiently from the start.

[HEADING=1]Prerequisites[/HEADING]

- AKS cluster: you should have an AKS cluster set up.
- kubectl: ensure kubectl is installed and configured to interact with your AKS cluster.
- Docker: have Docker installed to build the Docker image.
- Azure Container Registry (ACR): set up an ACR to store the Docker image.

Note: this guide specifically targets the H100 GPU VMs on Azure (Standard_ND96isr_H100_v5). The health check config file will need adaptation for other VM types.

[HEADING=1]Step 1: Build the Docker Image[/HEADING]

First, create the necessary files for your Docker image.

Dockerfile

[CODE]
FROM mcr.microsoft.com/aznhc/aznhc-nv:latest

# Install kubectl so the health check script can annotate and taint nodes
RUN cd /usr/local/bin \
    && curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl" \
    && chmod +x kubectl

COPY ndv5.conf /azure-nhc/conf/aznhc.conf
COPY run-healthcheck.sh /azure-nhc/run-healthcheck.sh
RUN chmod +x /azure-nhc/run-healthcheck.sh

ENTRYPOINT ["/azure-nhc/run-healthcheck.sh"]
[/CODE]

ndv5.conf

[CODE]
#######################################################################
###
### Hardware checks
###
* || check_hw_cpuinfo 2 96 96
* || check_hw_physmem 1915071MB 1915071MB 5%
* || check_hw_swap 0kB 0kB 3%
* || check_hw_ib 400 mlx5_0:1
* || check_hw_ib 400 mlx5_1:1
* || check_hw_ib 400 mlx5_2:1
* || check_hw_ib 400 mlx5_3:1
* || check_hw_ib 400 mlx5_4:1
* || check_hw_ib 400 mlx5_5:1
* || check_hw_ib 400 mlx5_6:1
* || check_hw_ib 400 mlx5_7:1
* || check_hw_eth lo
* || check_hw_eth eth0
* || check_hw_topology

#######################################################################
####
#### GPU checks
####
* || check_gpu_count 8
* || check_nvsmi_healthmon
* || check_gpu_xid
* || check_gpu_bw 52 350
* || check_gpu_ecc 20000000 10000
* || check_gpu_clock_throttling
* || check_nccl_allreduce 460.0 1 /azure-nhc/topofiles/ndv5-topo.xml 16G
* || check_nvlink_status

#######################################################################
####
#### Additional IB checks
####
* || check_ib_bw_gdr 380
* || check_ib_link_flapping 6
[/CODE]

run-healthcheck.sh

[CODE]
#!/bin/bash

CONF_FILE=/azure-nhc/conf/aznhc.conf
LOG_FILE=/azure-nhc/aznhc.log

# Run the node health checks with a 300-second timeout
nhc DETACHED_MODE=0 CONFFILE=$CONF_FILE LOGFILE=$LOG_FILE TIMEOUT=300

# Annotate node with test results
kubectl annotate node "$NODE_NAME" aznhc-results="$(<$LOG_FILE)" --overwrite

# If any check failed, taint the node and fail the init container
if grep -q "ERROR: nhc: Health check failed:" "$LOG_FILE"; then
    kubectl taint nodes "$NODE_NAME" aznhc=failed:NoExecute
    exit 1
fi
[/CODE]

Build and push your Docker image:

[CODE]
export ACR_NAME=<your-acr-name>
docker build -t $ACR_NAME.azurecr.io/aks-healthcheck:latest .
docker push $ACR_NAME.azurecr.io/aks-healthcheck:latest
[/CODE]

[HEADING=1]Step 2: Create Service Account and Role Bindings[/HEADING]

Create a [iCODE]serviceaccount.yaml[/iCODE] file to define the necessary Kubernetes service account and role bindings.

serviceaccount.yaml

[CODE]
apiVersion: v1
kind: ServiceAccount
metadata:
  name: aksnhc-sa
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: aksnhc-role
rules:
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: aksnhc-rolebinding
subjects:
  - kind: ServiceAccount
    name: aksnhc-sa
    namespace: default
roleRef:
  kind: ClusterRole
  name: aksnhc-role
  apiGroup: rbac.authorization.k8s.io
[/CODE]

Apply the configuration:

[CODE]
kubectl apply -f serviceaccount.yaml
[/CODE]

[HEADING=1]Step 3: Running the Job[/HEADING]

Create a [iCODE]healthcheck-job.yaml[/iCODE] file to define a Kubernetes Job that executes health checks as an init container. This approach can be applied to both standard and Volcano-scheduled Jobs. If the init container fails its health checks, the node will be tainted with the [iCODE]aznhc=failed:NoExecute[/iCODE] taint.
This prevents new workloads from being scheduled on the node and triggers the eviction of the current Job, forcing it to restart on a healthy node.

healthcheck-job.yaml

[CODE]
apiVersion: batch/v1
kind: Job
metadata:
  name: aks-healthcheck-job
spec:
  completions: $NUM_NODES
  parallelism: $NUM_NODES
  completionMode: Indexed
  ttlSecondsAfterFinished: 300
  template:
    spec:
      serviceAccountName: aksnhc-sa
      initContainers:
        - name: healthcheck
          image: $ACR_NAME.azurecr.io/aks-healthcheck:latest
          imagePullPolicy: Always
          securityContext:
            capabilities:
              add: ["IPC_LOCK"]
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          volumeMounts:
            - mountPath: /dev/shm
              name: shmem
            - mountPath: /azure-nhc/syslog
              name: syslog-volume
              readOnly: true
          resources:
            requests:
              nvidia.com/gpu: 8
              nvidia.com/mlnxnics: 8
            limits:
              nvidia.com/gpu: 8
              nvidia.com/mlnxnics: 8
      containers:
        - name: main
          image: busybox
          command: ['sh', '-c', 'echo "run torchrun or workload here..."']
          securityContext:
            capabilities:
              add: ["IPC_LOCK"]
          volumeMounts:
            - mountPath: /dev/shm
              name: shmem
          resources:
            requests:
              nvidia.com/gpu: 8
              nvidia.com/mlnxnics: 8
            limits:
              nvidia.com/gpu: 8
              nvidia.com/mlnxnics: 8
      restartPolicy: Never
      volumes:
        - name: shmem
          emptyDir:
            medium: Memory
            sizeLimit: 128Gi
        - name: syslog-volume
          hostPath:
            path: /var/log/syslog
            type: File
[/CODE]

Apply the job configuration:

[CODE]
export ACR_NAME=<your-acr-name>
export NUM_NODES=<number-of-nodes>
envsubst < healthcheck-job.yaml | kubectl apply -f -
[/CODE]

[HEADING=1]Step 4: Cleaning Up[/HEADING]

To clean up the resources created for the health checks, you can delete the job and the service account resources:

[CODE]
export ACR_NAME=<your-acr-name>
export NUM_NODES=<number-of-nodes>
envsubst < healthcheck-job.yaml | kubectl delete -f -
kubectl delete -f serviceaccount.yaml
[/CODE]

[HEADING=1]Conclusion[/HEADING]

By following these steps, you can effectively run health checks as an init container on your AKS nodes.
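As a quick sanity check of the tainting logic described above, you can exercise the grep gate from run-healthcheck.sh locally, with no cluster required. The sketch below feeds a fabricated NHC log line (the log content is made up for illustration) through the same string match the init container uses:

```shell
#!/bin/bash
# Simulate the pass/fail gating from run-healthcheck.sh with a fake log.
# The log content below is fabricated for illustration only.
LOG_FILE=$(mktemp)
cat > "$LOG_FILE" <<'EOF'
Node Health Check starting.
ERROR: nhc: Health check failed: check_gpu_count: found 7 GPUs, expected 8
EOF

# Same string match the real script uses before tainting the node
if grep -q "ERROR: nhc: Health check failed:" "$LOG_FILE"; then
    RESULT=unhealthy   # real script: kubectl taint + exit 1
else
    RESULT=healthy     # real script: init container exits 0, job proceeds
fi
echo "$RESULT"
rm -f "$LOG_FILE"
```

On a real cluster, a node flagged this way keeps its aznhc=failed:NoExecute taint until you remove it, for example with kubectl taint nodes <node-name> aznhc=failed:NoExecute- (the trailing "-" deletes the taint) after the hardware issue has been resolved.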
This ensures your nodes meet the required health standards before your application pods are scheduled, improving the reliability and performance of your applications.

[HEADING=1]Further Reading[/HEADING]

- Deployment scripts for AKS with AI examples
- GPU node health checks integrated into Azure Kubernetes Service via Node Problem Detector