GPU node health checks integrated into Azure Kubernetes Service via node-problem-detector

CormacGarvey

Introduction


Large AI model training can take months to complete on very large AI supercomputers. These AI supercomputers consist of many high-end GPUs (e.g. NVIDIA A100 or H100) connected with InfiniBand. The Azure NDv5 (ND H100 v5) has 8 H100 GPUs per node, connected to each other by NVLink 4, and each GPU has a 400 Gbps InfiniBand link that enables it to communicate with all the other GPUs in the AI supercomputer.

AI model training workloads are tightly coupled: at regular intervals all the gradients need to be exchanged using NCCL collective communication. If any of the GPUs or InfiniBand links fail (e.g. a dropped GPU, an InfiniBand link flap), the complete job can terminate and have to be restarted from the last checkpoint. It is therefore imperative to identify unhealthy nodes and InfiniBand fabric so they are excluded from the set of nodes used in the training job.

The AzureHPC node health check repository provides a suite of recommended node health checks for all Azure specialized SKUs (including GPU SKUs). In this blog post we will show how to integrate a few of the GPU node health checks into AKS (Azure Kubernetes Service) in such a way that

  • GPU node health checks are run at regular intervals.
  • Nodes which fail any of the GPU tests will be automatically cordoned off (to prevent any jobs being scheduled on them) and optionally drained (all pods removed from the node).

We will be leveraging the node-problem-detector (NPD) to run the GPU node health checks and draino to cordon/drain any nodes that fail them.






GPU node health check integration into NPD


NPD is commonly used in Kubernetes environments to run various cluster health checks and report any issues to the Kubernetes API server via events and node conditions. The cluster can then take some action depending on how serious the condition is (e.g. for some permanent conditions, the node may be cordoned off and drained). We will leverage the NPD custom plugin monitor to run the GPU node health check scripts.



Note: GPU count, GPU NVLink, GPU XID and GPU ECC health checks are included (other GPU node health checks can also be easily added).
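To give a concrete idea of what one of these checks looks like, here is a minimal sketch of a GPU count check written against the NPD custom plugin protocol (exit 0 = OK, exit 1 = problem detected; stdout becomes the condition message). The expected count and script layout are illustrative assumptions; the actual scripts in the AzureHPC node health repository may differ.

#!/bin/bash
# Hypothetical sketch of check_gpu_count.sh following the NPD custom plugin
# protocol: exit 0 = OK, exit 1 = problem; stdout is used as the message.
EXPECTED_GPU_COUNT=8   # assumption: 8-GPU NDmv4/NDv5 node
actual=$(nvidia-smi --list-gpus | wc -l)
if [ "$actual" -ne "$EXPECTED_GPU_COUNT" ]; then
  echo "Expected $EXPECTED_GPU_COUNT GPUs, found $actual"
  exit 1
fi
echo "All $EXPECTED_GPU_COUNT GPUs present"
exit 0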



Get the NPD GitHub repository

git clone https://github.com/kubernetes/node-problem-detector.git



Edit the NPD Makefile (get modified file here)

  • Build for linux_amd64 only (not ARM)

LINUX_PLATFORMS=linux_amd64

DOCKER_PLATFORMS=linux/amd64

  • Provide a unique tag

TAG?=$(VERSION)_<UNIQUE NUMBER>

  • Change registry to Azure ACR

REGISTRY?=<YOUR ACR>.azurecr.io/k8s-staging-npd

  • Change the BASEIMAGE

BASEIMAGE:=nvcr.io/nvidia/pytorch:23.03-py3



Edit NPD Dockerfile (get modified file here)

  • Change base container

FROM nvcr.io/nvidia/pytorch:23.03-py3 as builder-base

  • Install golang in container

COPY go1.22.4.linux-amd64.tar.gz .

RUN rm -rf /usr/local/go && tar -C /usr/local -xzf go1.22.4.linux-amd64.tar.gz

  • Remove unnecessary ARM packages

#RUN clean-install util-linux bash libsystemd-dev

  • Edit entrypoint

ENTRYPOINT ["/node-problem-detector", "--config.custom-plugin-monitor=/config/custom-plugin-gpu-count.json"]



Note: You can get the golang tarball here, go1.22.4.linux-amd64.tar.gz



Build NPD without the SystemLogMonitor and SystemStatsMonitor. AKS runs its own NPD that handles general monitoring; we only want our NPD to run the GPU node tests.

BUILD_TAGS="disable_system_log_monitor disable_system_stats_monitor" make 2>&1 | tee make.out



Push the container image to ACR

make push 2>&1 | tee make_push.out



You could add all the GPU node health check plugins and scripts to the NPD container, but it’s much more flexible to use a k8s configMap to inject them directly into the container at runtime.
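The plugin files defined in the ConfigMap are projected into the NPD pod roughly as in the sketch below (a sketch only, following the conventions of the upstream NPD deployment yaml; just one key/path pair is shown here, the full list is configured in the steps that follow):

volumeMounts:
- name: config
  mountPath: /config
  readOnly: true

volumes:
- name: config
  configMap:
    name: node-problem-detector-config
    defaultMode: 0777
    items:
    - key: custom-plugin-gpu-count.json
      path: custom-plugin-gpu-count.json
    - key: check_gpu_count.sh
      path: plugin/check_gpu_count.sh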



Edit deployment/node-problem-detector-config.yaml to add the GPU custom plugin configuration (JSON files) and GPU health check scripts (bash scripts) to the ConfigMap. (get modified file here)



Note: You can control how frequently the tests are run via parameters (e.g. the invoke interval) in the custom plugin JSON files.
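For example, a custom plugin monitor config such as custom-plugin-gpu-count.json might look roughly like the sketch below; invoke_interval controls how often the check runs. The source, reasons and messages here are illustrative assumptions and may differ from the modified files linked above.

{
  "plugin": "custom",
  "pluginConfig": {
    "invoke_interval": "600s",
    "timeout": "60s",
    "max_output_length": 80,
    "concurrency": 1
  },
  "source": "gpu-count-custom-plugin-monitor",
  "conditions": [
    {
      "type": "GpuCount",
      "reason": "GpuCountOk",
      "message": "All GPUs are present"
    }
  ],
  "rules": [
    {
      "type": "permanent",
      "condition": "GpuCount",
      "reason": "GpuCountMismatch",
      "path": "/config/plugin/check_gpu_count.sh",
      "timeout": "60s"
    }
  ]
}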



Edit deployment/node-problem-detector.yaml. (get modified file here)



  • NPD command line

- --config.custom-plugin-monitor=/config/custom-plugin-gpu-count.json,/config/custom-plugin-gpu-nvlink.json,/config/custom-plugin-gpu-xid.json,/config/custom-plugin-gpu-ecc.json

  • Which image/container to use

image: <YOUR ACR>.azurecr.io/k8s-staging-npd/node-problem-detector:<YOUR TAG>

  • Container limits

cpu: 240m

memory: 2048Mi

  • Bash script permissions

defaultMode: 0777

  • Which files to inject into the container.

- key: kernel-monitor.json
  path: kernel-monitor.json
- key: docker-monitor.json
  path: docker-monitor.json
- key: custom-plugin-monitor.json
  path: custom-plugin-monitor.json
- key: check_ntp.sh
  path: plugin/check_ntp.sh
- key: custom-plugin-gpu-count.json
  path: custom-plugin-gpu-count.json
- key: check_gpu_count.sh
  path: plugin/check_gpu_count.sh
- key: custom-plugin-gpu-nvlink.json
  path: custom-plugin-gpu-nvlink.json
- key: check_gpu_nvlink.sh
  path: plugin/check_gpu_nvlink.sh
- key: custom-plugin-gpu-xid.json
  path: custom-plugin-gpu-xid.json
- key: check_gpu_xid.sh
  path: plugin/check_gpu_xid.sh



Note: I have shown how to integrate 4 GPU node health checks; other GPU health checks can easily be added.

Note: You will probably need to adjust the container limits (cpu/memory) depending on how many GPU tests you run and which ones.



Draino set-up


The draino set-up is straightforward: we just need to tell draino which GPU node health check events/conditions to act on (i.e. which nodes to cordon/drain).



Get the draino repository

git clone https://github.com/planetlabs/draino.git



Build and push draino image/container to your ACR

docker build -t <YOUR ACR>.azurecr.io/draino .

docker push <YOUR ACR>.azurecr.io/draino




Edit the draino manifest yaml file (get modified file here)

  • Add correct service account permission/rules so draino can access the k8s service

rules:
- apiGroups: ['']
  resources: [events]
  verbs: [create, patch, update]
- apiGroups: ['']
  resources: [nodes]
  verbs: [get, watch, list, update, patch]
- apiGroups: ['']
  resources: [nodes/status]
  verbs: [patch, watch, list, update]
- apiGroups: ['']
  resources: [endpoints]
  verbs: [get, watch, list, create, patch, update]
- apiGroups: ['']
  resources: [pods]
  verbs: [get, watch, list]
- apiGroups: ['']
  resources: [pods/eviction]
  verbs: [create]
- apiGroups:
  - extensions
  - apps
  resources: [daemonsets]
  verbs: [get, watch, list]

  • Draino command line (Only cordon GPU nodes with these GPU conditions)

command: [/draino, --skip-drain, --node-label=accelerator=nvidia, GpuCount, GpuNvlink, GpuXid, GpuEcc]

  • Select the correct image/container

image: <YOUR ACR>.azurecr.io/draino:latest



Testing NPD+Draino GPU health checks

Prerequisites


You have a working AKS cluster. In this test we will be using an NDmv4 nodepool (see here for how to deploy an NDmv4 AKS nodepool).
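If you still need to add the GPU nodepool, a command along these lines can be used. This is a sketch only: the resource group, cluster name, nodepool name, node count and VM size are placeholders/assumptions, and the accelerator=nvidia label is added so that it matches the draino --node-label filter used above.

az aks nodepool add \
  --resource-group <YOUR RESOURCE GROUP> \
  --cluster-name <YOUR AKS CLUSTER> \
  --name ndmv4 \
  --node-count 2 \
  --node-vm-size Standard_ND96amsr_A100_v4 \
  --labels accelerator=nvidia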



Deploy NPD+GPU health checks

kubectl apply -f rbac.yaml

kubectl apply -f node-problem-detector-config.yaml

kubectl apply -f node-problem-detector.yaml

Note: You should see the node-problem-detector daemonset running on the NDmv4 nodes.
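A quick way to check (the namespace depends on where your yaml deploys it):

kubectl get daemonset -A | grep node-problem-detector

kubectl get pods -A -o wide | grep node-problem-detector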



Deploy the customized draino deployment with support for the GPU node health checks

kubectl apply -f manifest.yml



Note: You should see the draino deployment.
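A quick way to confirm (the namespace depends on the manifest):

kubectl get deployments -A | grep draino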






Verify that the GPU node health checks are running (check the NDmv4 node description and look at the node events/conditions).
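For example (the node name is a placeholder):

kubectl describe node <NDmv4 NODE NAME>

kubectl get node <NDmv4 NODE NAME> -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\n"}{end}'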


You can see the GpuNvlink, GpuXid and GpuCount conditions reporting normal status.



Now, to simulate a GPU node health check failure, we will drop one of the NDmv4 GPUs.

nvidia-smi -i 00000001:00:00.0 -pm 0

nvidia-smi drain -p 0001:00:00.0 -m 1




Note: nvidia-smi will now report 7 GPUs (instead of the expected 8).
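To confirm from the node itself (the count should now be 7 on an 8-GPU node):

nvidia-smi --list-gpus | wc -l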



Check the NDmv4 node events/conditions (via the node description). It shows that the GPU count test has failed and that the node has been automatically cordoned by draino (i.e. no new pods can be scheduled on this node).
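You can confirm the cordon with kubectl (the cordoned node shows a SchedulingDisabled status), and once the GPU issue has been remediated the node can be returned to service with uncordon; the node name is a placeholder:

kubectl get nodes

kubectl uncordon <NDmv4 NODE NAME>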




Some additional considerations


NPD is set to run periodically and can overlap with a customer's job. The timing and type of GPU node health checks you run may affect how well the customer's job performs. One possible strategy is to perform thorough node health checks on an empty cluster from time to time, and to run only essential, low-impact GPU node health checks at regular intervals.



Conclusion


Fully automated GPU-specific health checks integrated into AKS that

  • identify unhealthy GPU nodes
  • cordon those nodes

help to improve the reliability of large AI supercomputers running training jobs. In this blog post we showed how to integrate GPU-specific health checks into NPD and then have draino watch for the specific GPU failure conditions and take action (e.g. cordon/drain the node).
