Guest HugoAffaticati
Posted October 24, 2022

By Hugo Affaticati (Technical Program Manager) and Jon Shelley (Principal TPM Manager)

Useful resources:
NeMo Megatron from NVIDIA: NVIDIA NeMo Megatron
Container from NVIDIA: NVIDIA NGC

Below are the full results obtained with NVIDIA NeMo Megatron and Azure NDm A100 v4-series virtual machines (VMs), together with a discussion of the parameters. NVIDIA NeMo Megatron is an end-to-end framework for training and deploying large language models (LLMs) with millions to billions of parameters.

Full results:

All the results were obtained with the 22.06-hotfix container and the BF16 data type on the GPT-3 architecture. Benchmark performance is based on the time taken per step to train the model once the steady state is reached. We performed training runs for LLM models ranging from 126 million to 530 billion parameters, using 1 to 175 NDm A100 v4 virtual machines. Each NDm A100 v4 VM is powered by eight NVIDIA A100 80GB Tensor Core GPUs and NVIDIA Mellanox HDR InfiniBand networking to scale out to multiple nodes for distributed training of LLM models. Find the detailed results below.

Model   Nodes   TP   PP   MBS   Training time/step (seconds)
126M    1       1    1    4     0.9
126M    2       1    1    4     0.5
126M    4       1    1    4     0.2
126M    8       1    1    4     0.1
5B      2       2    1    2     37.4
5B      10      2    1    2     7.7
5B      20      2    1    2     4.3
20B     36      2    4    1     9.3
40B     16      4    4    1     36.1
40B     32      4    4    1     18.8
175B    32      8    8    2     79.9
175B    96      8    8    2     29.1
175B    128     8    8    2     22.8
530B    105     8    35   1     88.2
530B    140     8    35   1     67.4
530B    175     8    35   1     55.7

Influence of the parameters:

Number of nodes

Increasing the number of nodes reduces the training time per global step, provided the run uses the full capacity of the nodes. Under that condition, Azure achieves excellent scaling. As shown in the graph below (Figure 1), with the 530B model, the speedup for 140 nodes is 98.1% of linear scaling relative to 105 nodes, and for 175 nodes it is 95.0%.

Figure 1 – Speedup in training time per step for the NeMo Megatron GPT-3 architecture with the 530B model

Tensor model parallel size (TP)

Since the models for NeMo Megatron are too large to fit in the memory of a single GPU, they are split across multiple GPUs and nodes following tensor (intra-layer) and pipeline (inter-layer) model parallelism. Tensor model parallelism enables this by partitioning the individual transformer layers over several devices. The TP and PP parameters are correlated and can be tuned for optimal performance. From the table below, we can conclude that the higher the TP, the slower the global step. While one could be tempted to decrease the value of TP, doing so forces an increase of the PP parameter, since the model must still fit across all the nodes.

Model   Nodes   TP   PP   MBS   Training time/step (seconds)
40B     32      8    4    1     21.3
40B     32      4    4    1     18.8

Figure 2 – Influence of the TP parameter on the training time

Pipeline model parallel size (PP)

Similarly to tensor model parallelism, we studied the influence of pipeline model parallelism (the PP parameter). Pipeline model parallelism partitions the model layers into stages and spreads those stages over multiple GPUs. From the table below with the 126M model, training takes 1.5 times longer if we double the PP and 2.5 times longer if we quadruple the PP.

Model   Nodes   TP   PP   MBS   Training time/step (seconds)
126M    4       1    1    4     0.2
126M    4       1    2    4     0.3
126M    4       1    4    4     0.5

Figure 3 – Influence of the PP parameter on the training time

As shown above, decreasing both the TP and PP parameters reduces the training time. From Figure 4 below, where the value of TP*PP is held constant, one can again conclude that the higher the TP, the slower the global step. A small sketch of the resulting GPU layout follows.
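To make the TP/PP tradeoff concrete, here is a minimal Python sketch (not part of the benchmark scripts; the helper name parallel_layout and the 8-GPUs-per-node assumption are ours) that computes how a cluster is split between model parallelism (TP * PP) and data parallelism for a given configuration. It illustrates why, for a fixed node count, lowering TP either lowers the total model-parallel group size or must be compensated by a higher PP.

```python
# Hypothetical helper, not from the NeMo Megatron codebase: given a cluster and
# parallelism settings, show how the GPUs split between model parallelism
# (TP * PP) and data parallelism (DP).

GPUS_PER_NODE = 8  # each NDm A100 v4 VM has eight A100 80GB GPUs


def parallel_layout(nodes: int, tp: int, pp: int) -> dict:
    total_gpus = nodes * GPUS_PER_NODE
    model_parallel = tp * pp            # GPUs holding one replica of the model
    if total_gpus % model_parallel:
        raise ValueError("TP * PP must divide the total GPU count")
    dp = total_gpus // model_parallel   # number of model replicas training in parallel
    return {"total_gpus": total_gpus,
            "model_parallel": model_parallel,
            "data_parallel": dp}


# Example: the two 40B configurations from the results table above.
# Both use 32 nodes; lowering TP from 8 to 4 while keeping PP = 4 halves the
# model-parallel group and doubles the number of data-parallel replicas.
print(parallel_layout(nodes=32, tp=8, pp=4))  # model_parallel=32, data_parallel=8
print(parallel_layout(nodes=32, tp=4, pp=4))  # model_parallel=16, data_parallel=16
```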
Thus, it is recommended to favor a lower TP when running the models. The NDm A100 v4-series VMs on Azure allow you to reach low values of both PP and TP across all models, which corresponds to faster global steps.

Model   Nodes   TP   PP   MBS   Training time/step (seconds)
20B     36      8    1    1     15.8
20B     36      2    4    1     9.3

Figure 4 – Influence of the TP and PP parameters on the training time with TP*PP held constant

Mini batch size (MBS)

Simply put, the higher the batch size, the faster the global step. The highest value for MBS is set by the memory limit of the GPU (80GB for NDm A100 v4-series VMs on Azure).

Recreate the results in Azure

To learn more about how to recreate the results, please see the following link.
A quick start guide to benchmarking LLM models in Azure: NVIDIA NeMo Megatron - Steps
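If you rerun the benchmark, a quick way to sanity-check your own numbers against the scaling figures quoted earlier is to compare the measured speedup in training time per step with the ideal linear speedup. The sketch below is illustrative only (the scaling_efficiency helper is ours, not part of the benchmark suite) and uses the published 530B values from the results table; substitute your own measurements.

```python
# Illustrative sketch: scaling efficiency computed from training time per step,
# using the published 530B results as an example.

def scaling_efficiency(base_nodes: int, base_time: float,
                       nodes: int, time: float) -> float:
    """Measured speedup divided by the ideal (linear) speedup, in percent."""
    measured_speedup = base_time / time
    ideal_speedup = nodes / base_nodes
    return 100.0 * measured_speedup / ideal_speedup


# 530B model, training time per step taken from the results table above.
print(round(scaling_efficiency(105, 88.2, 140, 67.4), 1))  # ~98.1
print(round(scaling_efficiency(105, 88.2, 175, 55.7), 1))  # ~95.0
```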