Comprehensive Nvidia GPU Monitoring for Azure N-Series VMs Using Telegraf with Azure Monitor

vinilv · Sep 30, 2024

In today’s AI and HPC landscapes, GPU monitoring has become essential due to the complexity and high resource demands of these workloads. Effective monitoring ensures that GPUs are utilized optimally, preventing both underutilization and overutilization, which can negatively impact performance and drive up costs. By identifying bottlenecks such as memory limitations or thermal throttling, GPU monitoring allows for performance optimization, enabling smoother workflows. In cloud environments like Azure, where GPU resources can be costly, monitoring plays a key role in managing expenses by tracking usage patterns and facilitating efficient resource allocation. Additionally, monitoring helps with capacity planning, scaling workloads, and forecasting, ensuring that resources are properly allocated for future needs.

While Azure Monitor provides robust tools for tracking CPU, memory, storage, and network usage, at the time of writing it does not natively collect GPU metrics for Azure N-series VMs. Tracking GPU performance requires additional configuration with a third-party agent such as Telegraf.

Telegraf is an open-source, lightweight agent developed by InfluxData, designed to collect, process, and send metrics and event data from various systems, applications, and services. It supports a wide range of input plugins, allowing it to gather data from sources like system stats, databases, and APIs. Telegraf can then send this data to different destinations, such as InfluxDB, Azure Monitor, or other time-series databases and monitoring platforms. Its flexibility and low resource footprint make it well suited to monitoring infrastructure and applications in real time, especially in cloud environments.

In this blog, we will explore how to configure Telegraf to send GPU monitoring metrics to Azure Monitor. This comprehensive guide will cover all the necessary steps to enable GPU monitoring, ensuring you can track and optimize GPU performance in Azure effectively.

Step 1: Configure Azure to Accept GPU Metrics from Telegraf Agents on a VM or VMSS

1. Register the `microsoft.insights` resource provider in your Azure subscription. Refer to: Resource providers and resource types - Azure Resource Manager | Microsoft Learn

2. Enable a managed identity so that the Azure VM or Azure VMSS can authenticate to Azure Monitor. This example uses a managed identity for authentication. You can also use a user-assigned managed identity or a service principal. Refer to: telegraf/plugins/outputs/azure_monitor at release-1.15 · influxdata/telegraf (github.com)
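Both prerequisites can also be handled from the Azure CLI instead of the portal. The following is a minimal sketch, assuming the Azure CLI is installed and signed in (`az login`) and that a system-assigned identity is what you want. The names `my-gpu-rg` and `my-gpu-vm` are placeholders for illustration, not values from this setup.

```shell
#!/usr/bin/env bash
# Hypothetical placeholders: replace with your own resource group and VM name.
RESOURCE_GROUP="my-gpu-rg"
VM_NAME="my-gpu-vm"
PROVIDER_NAMESPACE="microsoft.insights"

if command -v az >/dev/null 2>&1; then
  # 1. Register the resource provider, then print its registration state.
  az provider register --namespace "$PROVIDER_NAMESPACE"
  az provider show --namespace "$PROVIDER_NAMESPACE" \
    --query registrationState --output tsv

  # 2. Enable a system-assigned managed identity on the VM (assumption:
  #    system-assigned rather than user-assigned).
  az vm identity assign --resource-group "$RESOURCE_GROUP" --name "$VM_NAME"

  # For a scale set, the equivalent command is:
  # az vmss identity assign --resource-group "$RESOURCE_GROUP" --name <vmss-name>
else
  echo "Azure CLI (az) not found; install it or use the Azure portal instead."
fi
```

Provider registration can take a few minutes, so the state may read `Registering` before it becomes `Registered`.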

Step 2: Set Up the Telegraf Agent Inside the VM or VMSS to Send Data to Azure Monitor

In this example, I will be using an Azure Standard_ND96asr_v4 VM with the Ubuntu-HPC 2204 image to configure the environment for both VM and VMSS. The Ubuntu-HPC 2204 image comes pre-installed with NVIDIA GPU drivers and CUDA. If you choose to use a different image, make sure to install the necessary GPU drivers and the CUDA toolkit.

Download and execute the `gpumon-setup.sh` script to install the Telegraf agent on Ubuntu 22.04. This script will also configure the NVIDIA SMI input plugin and set up the Telegraf configuration to send data to Azure Monitor.

Run the following commands:

Code:

wget -q https://raw.githubusercontent.com/vinil-v/gpu-monitoring/refs/heads/main/scripts/gpumon-setup.sh -O gpumon-setup.sh
chmod +x gpumon-setup.sh
./gpumon-setup.sh

Test the Telegraf configuration by executing the following command:

sudo telegraf --config /etc/telegraf/telegraf.conf --test
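If the full test output is noisy, it may help to check the GPU path on its own. The sketch below assumes the setup script left Telegraf installed with its configuration at the path used above. It first confirms that `nvidia-smi` responds, then runs a single collection pass restricted to the `nvidia_smi` input.

```shell
#!/usr/bin/env bash
CONFIG="/etc/telegraf/telegraf.conf"   # same path as the test command above

# The nvidia_smi input shells out to nvidia-smi, so confirm it responds first.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --list-gpus
else
  echo "nvidia-smi not found; install the NVIDIA driver before testing Telegraf."
fi

# Run one collection pass for the nvidia_smi input only and print the result.
if command -v telegraf >/dev/null 2>&1 && [ -r "$CONFIG" ]; then
  sudo telegraf --config "$CONFIG" --test --input-filter nvidia_smi
else
  echo "telegraf or $CONFIG is not available; run gpumon-setup.sh first."
fi
```

Note that `--test` only gathers metrics and prints them; it does not send anything to the configured outputs, so a clean run here does not by itself confirm that data is reaching Azure Monitor.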

Step 3: Creating Dashboards in Azure Monitor to Check NVIDIA GPU Usage

Telegraf includes an output plugin specifically designed for Azure Monitor, enabling users to send custom metrics directly to the platform. Azure Monitor stores metrics at one-minute resolution, so the Telegraf output plugin aggregates metrics into one-minute buckets and sends them to Azure Monitor at each flush interval. Metrics from each input plugin are written to a separate Azure Monitor namespace, prefixed with "Telegraf/" by default for easy identification.
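To make the pairing of input and output plugins concrete, here is a rough sketch of what a minimal Telegraf configuration for this scenario might contain. It is illustrative only and is not the file that `gpumon-setup.sh` writes. The interval values are assumptions, and the sample is written to a temporary file so that it cannot overwrite a working configuration.

```shell
#!/usr/bin/env bash
# Illustrative only: this is NOT the configuration generated by gpumon-setup.sh.
SAMPLE_CONF="$(mktemp)"

cat > "$SAMPLE_CONF" <<'EOF'
[agent]
  interval = "10s"        # how often inputs are sampled (assumed value)
  flush_interval = "60s"  # how often buffered metrics go to outputs (assumed value)

# Collect per-GPU metrics by invoking nvidia-smi.
[[inputs.nvidia_smi]]
  bin_path = "/usr/bin/nvidia-smi"
  timeout = "5s"

# Send aggregated metrics to Azure Monitor as custom metrics.
# On an Azure VM, the plugin can discover the region and resource ID
# from the instance metadata service, so they are omitted here.
[[outputs.azure_monitor]]
  namespace_prefix = "Telegraf/"  # plugin default, shown for clarity
  timeout = "20s"
EOF

echo "Sample configuration written to $SAMPLE_CONF"
```

Compare this against `/etc/telegraf/telegraf.conf` on your VM to see which options the setup script actually chose.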

To visualize NVIDIA GPU usage, navigate to the Metrics section in the Azure portal. Select the VM as the scope, and then set Metric Namespace to `telegraf/nvidia-smi`. From there, you can select various metrics to view NVIDIA GPU utilization. You can also apply filters and splits for a more detailed analysis of the data.
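The same metrics can also be inspected from the command line, which is handy for confirming that data is arriving before you build charts. This is a sketch under several assumptions: the Azure CLI is installed and signed in, `my-gpu-rg` and `my-gpu-vm` are placeholder names, and `utilization_gpu` is the metric name (it is a field reported by the Telegraf `nvidia_smi` input, but verify the exact name with the definitions listing).

```shell
#!/usr/bin/env bash
RESOURCE_GROUP="my-gpu-rg"               # hypothetical placeholder
VM_NAME="my-gpu-vm"                      # hypothetical placeholder
METRIC_NAMESPACE="telegraf/nvidia-smi"   # namespace as named in the text above
METRIC_NAME="utilization_gpu"            # assumed; confirm with list-definitions

if command -v az >/dev/null 2>&1; then
  # Resolve the full resource ID of the VM.
  RESOURCE_ID="$(az vm show --resource-group "$RESOURCE_GROUP" \
    --name "$VM_NAME" --query id --output tsv)"

  # List the metrics available in the GPU namespace.
  az monitor metrics list-definitions --resource "$RESOURCE_ID" \
    --namespace "$METRIC_NAMESPACE" --output table

  # Fetch one-minute averages for a single metric.
  az monitor metrics list --resource "$RESOURCE_ID" \
    --namespace "$METRIC_NAMESPACE" --metric "$METRIC_NAME" \
    --interval PT1M --aggregation Average --output table
else
  echo "Azure CLI (az) not found; use the Metrics blade in the portal instead."
fi
```

If the namespace does not show up, `az monitor metrics list-namespaces --resource "$RESOURCE_ID"` lists the namespaces Azure Monitor currently knows about for that resource.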

You can create GPU monitoring dashboards for both VM and VMSS. Below are some sample charts to consider.

Bonus: Simulating GPU usage using a sample training program.

If you're testing and lack a program to simulate GPU usage, I have a solution for you! I've created a script that runs a multi-GPU distributed training model. This script will install the Anaconda software and set up the environment needed for executing the distributed training model using TensorFlow. By running this script, you can effectively simulate GPU usage, allowing you to verify the monitoring metrics you’ve set up.

To get started, run the following commands:

Code:

wget -q https://raw.githubusercontent.com/vinil-v/gpu-monitoring/refs/heads/main/scripts/gpu_test_program.sh -O gpu_test_program.sh
chmod +x gpu_test_program.sh
./gpu_test_program.sh
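While the training job runs, it can be useful to compare what Azure Monitor shows with a local reading. The following sketch, run in a second terminal on the same VM, takes a few samples directly from `nvidia-smi`. The sample count and interval are arbitrary choices for illustration.

```shell
#!/usr/bin/env bash
SAMPLES=3            # number of readings to take (arbitrary)
INTERVAL_SECONDS=5   # pause between readings (arbitrary)

if command -v nvidia-smi >/dev/null 2>&1; then
  for _ in $(seq "$SAMPLES"); do
    # One CSV row per GPU: index, GPU utilization, memory used, temperature.
    nvidia-smi \
      --query-gpu=index,utilization.gpu,memory.used,temperature.gpu \
      --format=csv,noheader
    sleep "$INTERVAL_SECONDS"
  done
else
  echo "nvidia-smi not found on this machine."
fi
```

Because Azure Monitor shows one-minute aggregates, expect the portal values to be smoother than these instantaneous readings.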

I hope you find this blog post helpful. With the right tools and insights, you can unlock the full potential of your GPU resources. Happy reading!

Reference:

Ubuntu-based HPC and AI Image

NDasrA100_v4 sizes series

Telegraf - Azure Monitor Output Plugin

Telegraf - Nvidia System Management Interface (SMI) Input Plugin
