
[HEADING=1]Pre-Job Health Checks on AKS: A Guide to Stable AI Workloads[/HEADING]

 

 

 

[HEADING=1]Introduction[/HEADING]

 

 

 

For AI workloads, ensuring the health and stability of compute nodes is critical. Training a large AI model often spans months and runs on AI supercomputers equipped with high-end GPUs such as NVIDIA A100s or H100s, interconnected over InfiniBand for efficient communication. These training jobs are complex and tightly coupled, exchanging frequent updates through NCCL collective communication. That tight coupling is also a weakness: a single failure, such as a dropped GPU or an InfiniBand link flap, can terminate the entire job and force a restart from the last checkpoint.

 

 

 

In traditional HPC schedulers such as SLURM, job prologs run scripts before the main job begins, and customers commonly use them to perform health checks before launching their workloads. Kubernetes offers an analogous mechanism: init containers, which execute before the main application container in a pod and are therefore a natural place to run pre-job health checks.

 

Ensuring healthy nodes has long been a challenge on Azure for both traditional HPC and AI workloads. To address it, we publish a standard set of tests for GPU/IB VMs on Azure in the azurehpc-health-checks repository on GitHub. These health checks are now included in our Azure HPC images, run automatically on node startup in CycleCloud with SLURM, and can run as a pre-job health check on Azure Machine Learning. They are also distributed as a container, aznhc-nv, available on the Microsoft Artifact Registry.
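
For reference, this is the same image that the Dockerfile in Step 1 builds on, and it can be pulled directly:

docker pull mcr.microsoft.com/aznhc/aznhc-nv:latest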

 

 

 

Despite these advancements, we have not yet published a solution for running these health checks on Azure Kubernetes Service (AKS). This blog post fills that gap with a step-by-step guide to running pre-job health checks on AKS, ensuring your AI/HPC workloads run smoothly and efficiently from the start.

 

 

 

[HEADING=1]Prerequisites[/HEADING]

 

 

 

  1. AKS Cluster: You should have an AKS cluster set up.
  2. kubectl: Ensure kubectl is installed and configured to interact with your AKS cluster.
  3. Docker: Have Docker installed to build the Docker image.
  4. Azure Container Registry (ACR): Set up an ACR to store the Docker image.

 

Note: this guide specifically targets the H100 GPU VMs on Azure (Standard_ND96isr_H100_v5). The health check config file will need to be adapted for other VM types.
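
As a quick sanity check, you can confirm that kubectl is pointed at the right cluster and that the GPU nodes are visible (the resource group and cluster names below are placeholders):

# Fetch credentials for your AKS cluster
az aks get-credentials --resource-group <your-resource-group> --name <your-aks-cluster>

# Confirm the GPU node pool is up and Ready
kubectl get nodes -o wide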

 

 

 

 

[HEADING=1]Step 1: Build the Docker Image[/HEADING]

 

 

 

First, create the necessary files for your Docker image.

 

 

 

Dockerfile

 

# Start from the published Azure NHC (node health check) image
FROM mcr.microsoft.com/aznhc/aznhc-nv:latest

# Install kubectl so the health check script can annotate and taint nodes
RUN cd /usr/local/bin \
    && curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl" \
    && chmod +x kubectl

# Use the ND H100 v5 health check configuration
COPY ndv5.conf /azure-nhc/conf/aznhc.conf

# The entry point runs the checks and reports results back to the cluster
COPY run-healthcheck.sh /azure-nhc/run-healthcheck.sh
RUN chmod +x /azure-nhc/run-healthcheck.sh
ENTRYPOINT ["/azure-nhc/run-healthcheck.sh"]

 

 

 

ndv5.conf

 

#######################################################################
###
### Hardware checks
###
* || check_hw_cpuinfo 2 96 96
* || check_hw_physmem 1915071MB 1915071MB 5%
* || check_hw_swap 0kB 0kB 3%
* || check_hw_ib 400 mlx5_0:1
* || check_hw_ib 400 mlx5_1:1
* || check_hw_ib 400 mlx5_2:1
* || check_hw_ib 400 mlx5_3:1
* || check_hw_ib 400 mlx5_4:1
* || check_hw_ib 400 mlx5_5:1
* || check_hw_ib 400 mlx5_6:1
* || check_hw_ib 400 mlx5_7:1
* || check_hw_eth lo
* || check_hw_eth eth0
* || check_hw_topology

#######################################################################
####
#### GPU checks
####
* || check_gpu_count 8
* || check_nvsmi_healthmon
* || check_gpu_xid
* || check_gpu_bw 52 350
* || check_gpu_ecc 20000000 10000
* || check_gpu_clock_throttling
* || check_nccl_allreduce 460.0 1 /azure-nhc/topofiles/ndv5-topo.xml 16G
* || check_nvlink_status


#######################################################################
####
#### Additional IB checks
####
* || check_ib_bw_gdr 380
* || check_ib_link_flapping 6
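
The thresholds above (96 cores, eight 400 Gb/s InfiniBand links, eight GPUs, and so on) correspond to the Standard_ND96isr_H100_v5 SKU. If you are adapting the config to another VM type, a quick way to read the actual values off a node is sketched below; exact device names will differ by SKU:

# CPU count and memory, for check_hw_cpuinfo / check_hw_physmem
nproc
free -m

# InfiniBand devices and link rates, for the check_hw_ib lines
ibstat | grep -E "CA '|Rate:"

# GPU count, for check_gpu_count
nvidia-smi --list-gpus | wc -l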

 

 

 

run-healthcheck.sh

 

#!/bin/bash

CONF_FILE=/azure-nhc/conf/aznhc.conf
LOG_FILE=/azure-nhc/aznhc.log

# Run the node health checks (NHC) with the ND H100 v5 config
nhc DETACHED_MODE=0 CONFFILE=$CONF_FILE LOGFILE=$LOG_FILE TIMEOUT=300

# Annotate the node with the test results so they can be inspected later
kubectl annotate node "$NODE_NAME" aznhc-results="$(<$LOG_FILE)" --overwrite

# On failure, taint the node so new pods avoid it and the current job is evicted
if grep -q "ERROR:  nhc:  Health check failed:" "$LOG_FILE"; then
    kubectl taint nodes "$NODE_NAME" aznhc=failed:NoExecute --overwrite
    exit 1
fi
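
After a run, the node annotation and taint written by this script can be inspected with standard kubectl commands, for example:

# Read the health check log stored in the node annotation
kubectl get node <node-name> -o jsonpath='{.metadata.annotations.aznhc-results}'

# Check whether the node was tainted as failed
kubectl describe node <node-name> | grep -A1 Taints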

 

 

 

Build and push your Docker image:

 

export ACR_NAME=<your-acr-name>
docker build -t $ACR_NAME.azurecr.io/aks-healthcheck:latest .
docker push $ACR_NAME.azurecr.io/aks-healthcheck:latest
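
If you have not already authenticated to the registry and granted your cluster pull access, something like the following should work (the resource group and cluster names are placeholders):

# Authenticate Docker to the registry
az acr login --name $ACR_NAME

# Allow the AKS cluster to pull images from the registry
az aks update --resource-group <your-resource-group> --name <your-aks-cluster> --attach-acr $ACR_NAME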

 

 

 

[HEADING=1]Step 2: Create Service Account and Role Bindings[/HEADING]

 

 

 

Create a [iCODE]serviceaccount.yaml[/iCODE] file to define the necessary Kubernetes service account and role bindings.

 

 

 

serviceaccount.yaml

 

apiVersion: v1
kind: ServiceAccount
metadata:
  name: aksnhc-sa
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: aksnhc-role
rules:
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: aksnhc-rolebinding
subjects:
- kind: ServiceAccount
  name: aksnhc-sa
  namespace: default
roleRef:
  kind: ClusterRole
  name: aksnhc-role
  apiGroup: rbac.authorization.k8s.io

 

 

 

Apply the configuration:

 

kubectl apply -f serviceaccount.yaml
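
You can verify that the binding works by asking the API server whether the service account may patch nodes:

# Should print "yes" once the ClusterRole and binding are in place
kubectl auth can-i patch nodes --as=system:serviceaccount:default:aksnhc-sa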

 

 

 

[HEADING=1]Step 3: Running the Job[/HEADING]

 

 

 

Create a [iCODE]healthcheck-job.yaml[/iCODE] file to define a Kubernetes Job that executes health checks as an init container. This approach can be applied to both standard and Volcano-scheduled Jobs. If the init container fails its health checks, the node will be tainted with the [iCODE]aznhc=failed:NoExecute[/iCODE] taint. This prevents new workloads from being scheduled on the node and triggers the eviction of the current Job, forcing it to restart on a healthy node.
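
Note that the taint is not removed automatically. Once a flagged node has been repaired or replaced, clear the taint manually so the node can accept workloads again (the trailing "-" deletes the taint):

kubectl taint nodes <node-name> aznhc=failed:NoExecute-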

 

 

 

healthcheck-job.yaml

 

apiVersion: batch/v1
kind: Job
metadata:
  name: aks-healthcheck-job
spec:
  completions: $NUM_NODES
  parallelism: $NUM_NODES
  completionMode: Indexed
  ttlSecondsAfterFinished: 300
  template:
    spec:
      serviceAccountName: aksnhc-sa
      initContainers:
        - name: healthcheck
          image: $ACR_NAME.azurecr.io/aks-healthcheck:latest
          imagePullPolicy: Always
          securityContext:
            capabilities:
              add: ["IPC_LOCK"]
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          volumeMounts:
            - mountPath: /dev/shm
              name: shmem
            - mountPath: /azure-nhc/syslog
              name: syslog-volume
              readOnly: true
          resources:
            requests:
              nvidia.com/gpu: 8
              nvidia.com/mlnxnics: 8
            limits:
              nvidia.com/gpu: 8
              nvidia.com/mlnxnics: 8
      containers:
        - name: main
          image: busybox
          command: ['sh', '-c', 'echo "run torchrun or workload here..."']
          securityContext:
            capabilities:
              add: ["IPC_LOCK"]
          volumeMounts:
            - mountPath: /dev/shm
              name: shmem
          resources:
            requests:
              nvidia.com/gpu: 8
              nvidia.com/mlnxnics: 8
            limits:
              nvidia.com/gpu: 8
              nvidia.com/mlnxnics: 8
      restartPolicy: Never
      volumes:
        - name: shmem
          emptyDir:
            medium: Memory
            sizeLimit: 128Gi
        - name: syslog-volume
          hostPath:
            path: /var/log/syslog
            type: File

 

 

 

Apply the job configuration:

 

export ACR_NAME=<your-acr-name>
export NUM_NODES=<number-of-nodes>
envsubst < healthcheck-job.yaml | kubectl apply -f -
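
You can then watch the health check pods and, if a check fails, pull the init container's log. For example:

# Watch the job's pods; they show Init:0/1 while the health checks run
kubectl get pods -l job-name=aks-healthcheck-job -w

# Inspect the health check output of a specific pod
kubectl logs <pod-name> -c healthcheck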

 

 

 

[HEADING=1]Step 4: Cleaning Up[/HEADING]

 

 

 

To clean up the resources created for the health checks, you can delete the job and the service account resources:

 

 

 

export ACR_NAME=<your-acr-name>
export NUM_NODES=<number-of-nodes>
envsubst < healthcheck-job.yaml | kubectl delete -f -
kubectl delete -f serviceaccount.yaml
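
If any nodes were tainted by a failed run, you may also want to clear the taint after resolving the underlying issue. A best-effort sweep across all nodes (it may report a harmless error for nodes that were never tainted):

kubectl taint nodes --all aznhc=failed:NoExecute-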

 

 

 

[HEADING=1]Conclusion[/HEADING]

 

 

 

By following these steps, you can effectively run health checks as an init container on your AKS nodes. This ensures your nodes meet the required health standards before your application pods are scheduled, improving the reliability and performance of your applications.

 

 

 
