By Shantanu Deepak Patankar, Software Engineer Intern, and Hugo Affaticati, Technical Program Manager 2
Inefficient inference optimization can lead to skyrocketing costs for customers, making it crucial to establish clear performance benchmarking numbers. This blog sets the standard for expected performance, helping customers make informed decisions that maximize efficiency and minimize expenses with the new Azure ND H200 v5-series.
We evaluated the inference performance of the new Azure ND H200 v5-series for Small Language Models (SLMs) and Large Language Models (LLMs). The ND H200 v5-series, powered by eight NVIDIA H200 Tensor Core GPUs, offers a 76% increase in high-bandwidth memory (HBM) capacity over the NVIDIA H100 Tensor Core GPUs of the ND H100 v5-series. We compared three models: Phi-3 medium (14B parameters), Mistral v0.1 (7B parameters), and Llama 3.1 (8B, 70B, and 405B parameters) to set performance standards and empower Azure customers to optimize their workloads for time or resources.
Model Architecture
Achieving optimal performance requires a clear understanding of where time is spent during the inference workload, enabling effective optimization. The first critical step is to carefully examine the parameters that directly impact performance. For the models discussed, and more broadly, these key parameters include input sequence length, output sequence length, batch size, and tensor parallelism. In this article, we measured the impact of these variables using two essential metrics: throughput and first token latency.
The inference process can be categorized into three primary components: pure computation phases (e.g., local GEMMs), pure communication phases (e.g., all-reduce), and attention phases. Analyzing the Llama 3 8B model on the new ND H200 v5 virtual machine revealed that computation consistently accounts for at least 50% and up to 85% of total inference time. Communication time ranges from 10% to 25%, scaling as the number of GPUs increases from 2 to 8. In contrast, attention mechanisms consistently represent less than 10% of the total time spent, as shown in Table 1. This article aims to guide customers in striking the right balance between computation and communication when selecting their AI inference architecture, based on whether time efficiency or cost-effectiveness is their primary goal.
| Tensor Parallelism | Computation (% of time spent) | Communication (% of time spent) | Attention (% of time spent) |
|---|---|---|---|
| 1 GPU | 83.3 | 0 | 9.2 |
| 2 GPUs | 70.7 | 10.8 | 7.4 |
| 4 GPUs | 56.7 | 24.7 | 6.1 |
| 8 GPUs | 57.2 | 25.1 | 8.2 |
Table 1: Breakdown of time spent per mechanism for LLAMA 3 8B inference on the ND H200 v5 virtual machine, with an input sequence length of 1024, output sequence length of 128, and batch size of 32.
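For customers who want to produce a similar breakdown for their own workload, one rough approach is to bucket GPU kernel time with the PyTorch profiler. This is only a sketch built on simple kernel-name heuristics, not the exact methodology behind Table 1, and the substrings used to classify kernels are assumptions that may need adjusting for a given model or runtime.

```python
import torch
from torch.profiler import profile, ProfilerActivity

def time_breakdown(model, inputs):
    """Roughly bucket GPU kernel time into computation, communication, and attention."""
    with profile(activities=[ProfilerActivity.CUDA]) as prof:
        with torch.no_grad():
            model(**inputs)

    buckets = {"computation": 0.0, "communication": 0.0, "attention": 0.0, "other": 0.0}
    for evt in prof.key_averages():
        name = evt.key.lower()
        t = evt.cuda_time_total  # accumulated GPU time in microseconds
        if "nccl" in name or "allreduce" in name or "all_reduce" in name:
            buckets["communication"] += t   # e.g., tensor-parallel all-reduce
        elif "attention" in name or "flash" in name:
            buckets["attention"] += t
        elif "gemm" in name or "matmul" in name:
            buckets["computation"] += t     # local GEMMs
        else:
            buckets["other"] += t

    total = sum(buckets.values()) or 1.0
    return {k: round(100.0 * v / total, 1) for k, v in buckets.items()}
```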
Resource optimization
Since most of the inference time is spent on computation, the GPU computational speed has a tremendous impact on the overall performance. Understanding the memory requirements ensures better GPU usage. The two main factors influencing GPU memory consumption are the model weights and the key-value cache.
Model Weights: the memory occupied by the model weights depends on the number of parameters and the quantization of the model. The memory required can be calculated using the formula:
Memory used (in GB) = number of parameters (in billions) × precision (in bits) / 8
For example, the model weights of a LLAMA 3 8B model at FP8 precision would require 8 GB of memory (8 billion parameters × 8 bits / 8 = 8 GB).
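As a quick sanity check, this formula is easy to script; a minimal sketch, where the inputs simply mirror the examples used in this article:

```python
def weight_memory_gb(num_params_billions: float, precision_bits: int) -> float:
    """Model-weight memory in GB: parameters (in billions) x precision (in bits) / 8."""
    return num_params_billions * precision_bits / 8

print(weight_memory_gb(8, 8))    # LLAMA 3 8B at FP8  -> 8.0 GB
print(weight_memory_gb(8, 16))   # LLAMA 3 8B at FP16 -> 16.0 GB
```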
Key-Value Cache: since the attention score of each token depends only on the preceding tokens, the model caches the key and value matrices to avoid recomputing attention values for every token in the sequence. Storing both keys and values accounts for the factor of 2 in the equation below.
Size of KV cache (in bytes) = batch size × sequence length × 2 × number of layers × (number of heads × dimension of head) × precision (in bits) / 8
For example, the key-value cache of a LLAMA 3 8B model at FP8 precision, with an input length of 1024 and an output length of 128, would require 0.5 GB of memory for a batch size of 1 (1 × (1024 + 128) sequence length × 2 × 32 layers × 4096 (heads × head dimension) × 8 bits / 8 = 0.5 GB).
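The same calculation can be scripted; a minimal sketch, where the layer count, head count, and head dimension are illustrative assumptions to be replaced with the exact configuration of your model:

```python
def kv_cache_gb(batch_size: int, seq_len: int, num_layers: int,
                num_heads: int, head_dim: int, precision_bits: int) -> float:
    """KV-cache size in GB; the factor of 2 stores both the key and the value matrices."""
    size_bytes = (batch_size * seq_len * 2 * num_layers
                  * num_heads * head_dim * precision_bits / 8)
    return size_bytes / 1e9

# 1024 input tokens + 128 output tokens, batch size 1, FP8 cache (assumed values).
per_sequence_gb = kv_cache_gb(batch_size=1, seq_len=1024 + 128, num_layers=32,
                              num_heads=32, head_dim=128, precision_bits=8)
```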
By using these two quantities, customers can accurately estimate the maximum batch size that the virtual machines can accommodate for their model, thereby optimizing resource utilization. The available GPU memory is calculated by subtracting the weight memory from the total GPU memory when the system is idle. The maximum batch size is then determined by dividing the available memory by the size of the KV cache required for a batch size of one. Table 2 provides several examples of these theoretical batch sizes. This approach not only simplifies the process but also helps customers avoid the trial-and-error method, which can lead to higher GPU consumption and increased costs.
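Combining the two estimates gives the theoretical maximum batch size directly; a minimal sketch using the per-GPU memory, weight memory, and per-sequence KV-cache figures from Table 2 below:

```python
def max_batch_size(gpu_memory_gb: float, weight_memory_gb: float,
                   kv_cache_per_sequence_gb: float) -> int:
    """Memory left after loading the weights, divided by the KV cache of one sequence."""
    available_gb = gpu_memory_gb - weight_memory_gb
    return int(available_gb // kv_cache_per_sequence_gb)

print(max_batch_size(140, 16, 0.60))   # LLAMA 3 8B -> 206
print(max_batch_size(140, 14, 0.60))   # Mistral 7B -> 210
```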
| Model | ND H200 v5 memory per GPU (in GB) | Number of parameters (in billions) | Weight memory (in GB) | Available memory (in GB) | KV cache size (in GB) | Max batch size |
|---|---|---|---|---|---|---|
| LLAMA 3 | 140 | 8 | 16 | 124 | 0.60 | 206 |
| Mistral | 140 | 7 | 14 | 126 | 0.60 | 210 |
| Phi-3 medium | 140 | 14 | 28 | 115.8 | 0.94 | 123 |
Table 2: Theoretical maximum batch size for inference with various language models (LLAMA 3 8B, Mistral, Phi-3 medium) on the ND H200 v5 virtual machine with sequence length 1152 and FP8.
Empirical results closely match these theoretical limits. Figure 1 below highlights the maximum batch size that saturates a single NVIDIA H200 Tensor Core GPU, and then scales across up to all eight GPUs of the latest ND H200 v5 virtual machine, together with the corresponding throughput. By optimizing the batch size, customers can extract extra performance from each GPU, fully utilizing available resources. This ensures that every virtual machine operates at its peak capacity, maximizing performance while minimizing cost.
Figure 1: Experimental maximum batch size as a function of tensor parallelism (TP) for inference with LLAMA 3 8B on the ND H200 v5 virtual machine with total sequence length 1152.
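A back-of-the-envelope extension of the Table 2 calculation suggests why the measured maximum batch size grows with tensor parallelism: both the weights and the KV cache are split across GPUs, so each GPU has more memory left for additional sequences. The sketch below assumes a perfectly even shard per rank and ignores activation memory and runtime overheads, so it approximates the trend in Figure 1 rather than predicting its exact values.

```python
def max_batch_size_tp(num_gpus: int, gpu_memory_gb: float,
                      total_weight_gb: float, kv_per_sequence_gb: float) -> int:
    """Rough upper bound assuming weights and KV cache shard evenly across TP ranks."""
    weight_per_gpu = total_weight_gb / num_gpus
    kv_per_gpu_per_sequence = kv_per_sequence_gb / num_gpus
    return int((gpu_memory_gb - weight_per_gpu) // kv_per_gpu_per_sequence)

for tp in (1, 2, 4, 8):   # the tensor-parallel degrees shown in Figure 1
    print(tp, max_batch_size_tp(tp, 140, 16, 0.60))
```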
Time optimization
For some specific workloads, time is of the essence. While increasing the batch size can enhance throughput and maximize resource utilization, it also leads to higher latency. By measuring both the latency and the throughput of the inference workload, the optimal balance can be determined. For instance, when running models like Llama 3 and Mistral on a single GPU of the latest ND H200 v5 virtual machine, a batch size of 32 delivers the highest throughput-to-latency ratio, as shown in Figure 2. The optimal batch size is specific to the customer’s workload, as highlighted by the Phi-3 model, which achieves its highest ratio at a batch size of 64 with a single GPU. When scaling to two GPUs, the optimal batch size increases to 64, as illustrated in Figure 3. Although this approach may not fully utilize the available memory, it achieves the lowest possible latency for inference, making it ideal for time-sensitive applications.
Figure 2: Experimental optimal throughput-to-latency balance as a function of batch size for inference with LLAMA 3, Phi-3, and Mistral on a single GPU of the ND H200 v5 virtual machine with total sequence length 1152, FP8, and TP 1.
Figure 3: Experimental optimal throughput-to-latency balance as a function of batch size for inference with LLAMA 3, Phi-3, and Mistral on two GPUs of the ND H200 v5 virtual machine with total sequence length 1152, FP8, and TP 2.
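Once latency and throughput have been measured across batch sizes, the best-balance point in Figures 2 and 3 can be picked mechanically; a minimal sketch, where the measurements dictionary holds placeholder values rather than benchmark results:

```python
# batch size -> (throughput in tokens/s, first-token latency in ms); placeholder values.
measurements = {
    8:  (1000.0, 40.0),
    16: (1800.0, 55.0),
    32: (3000.0, 80.0),
    64: (3600.0, 140.0),
}

# The optimal batch size maximizes the throughput-to-latency ratio.
best_batch = max(measurements, key=lambda b: measurements[b][0] / measurements[b][1])
print(best_batch)
```

Sweeping batch size this way against your own measurements makes the time-versus-resource trade-off explicit for a given workload, mirroring the analysis behind Figures 2 and 3.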