Today, researchers and developers often dedicate an entire GPU to a workload, even when only a fraction of the GPU's compute power is needed. The NVIDIA A100, A30, and H100 Tensor Core GPUs address this with a feature called Multi-Instance GPU (MIG). MIG partitions the GPU into as many as seven instances, each with its own dedicated compute, memory, and bandwidth. This enables multiple users to run their workloads on the same GPU, maximizing per-GPU utilization and boosting user productivity.

In this blog, we will guide you through the process of creating a SLURM cluster and integrating NVIDIA's Multi-Instance GPU (MIG) feature to efficiently schedule GPU-accelerated jobs. We will cover the installation and configuration of SLURM, as well as the setup of MIG on NVIDIA GPUs.

Overview:

SLURM (Simple Linux Utility for Resource Management) is an open-source job scheduler used by many of the world's supercomputers and HPC (High-Performance Computing) clusters. It allocates resources such as CPUs, memory, and GPUs to users and their jobs, ensuring efficient use of available hardware. SLURM provides robust workload management capabilities, including job queuing, prioritization, scheduling, and monitoring.

MIG (Multi-Instance GPU) is a feature introduced by NVIDIA for its A100, A30, and H100 Tensor Core GPUs that allows a single physical GPU to be partitioned into multiple independent GPU instances. Each MIG instance operates with dedicated memory, cache, and compute cores, enabling multiple users or applications to share a single GPU securely and efficiently. This capability enhances resource utilization and provides a level of flexibility and isolation not previously possible with traditional GPUs.

Advantages of Using NVIDIA MIG (Multi-Instance GPU):

Improved Resource Utilization
- Maximizes GPU Usage: MIG allows you to run multiple smaller workloads on a single GPU, ensuring that the GPU's resources are fully utilized. This is especially useful for applications that do not need the full capacity of a GPU.
- Cost Efficiency: By enabling multiple instances on a single GPU, organizations can achieve better cost efficiency, reducing the need to purchase additional GPUs.

Workload Isolation
- Security and Stability: Each GPU instance is fully isolated, ensuring that workloads do not interfere with each other. This is critical for multi-tenant environments where different users or applications might run on the same physical hardware.
- Predictable Performance: Isolation ensures consistent and predictable performance for each instance, avoiding resource contention issues.

Scalability and Flexibility
- Adaptability: MIG allows dynamic partitioning of GPU resources, making it easy to scale workloads up or down based on demand. You can allocate just the right amount of resources needed for different tasks.
- Multi-Tenant Support: Ideal for cloud service providers and data centers that host services for multiple customers, each requiring different levels of GPU resources.

Simplified Management
- Administrative Control: Administrators can use NVIDIA tools to easily configure, manage, and monitor the GPU instances, including allocating specific memory and compute resources to each instance.
- Automated Management: Tools and software can automate the allocation and management of GPU resources, reducing administrative overhead.
Enhanced Performance for Diverse Workloads
- Support for Various Applications: MIG supports a wide range of applications, from AI inference and training to data analytics and virtual desktops, making it versatile for different types of computational workloads.
- Optimized Performance: By running multiple instances optimized for specific tasks, you can achieve better overall performance than by running all tasks on a single monolithic GPU.

Better Utilization in Shared Environments
- Educational and Research Institutions: In environments where GPUs are shared among students or researchers, MIG allows multiple users to access GPU resources simultaneously without impacting each other's work.
- Development and Testing: Developers can use MIG to test and develop applications in an environment that simulates multi-GPU setups without requiring multiple physical GPUs.

By leveraging the power of NVIDIA's MIG feature within a SLURM-managed cluster, you can significantly enhance the efficiency and productivity of your GPU-accelerated workloads. Join us as we delve into the steps for setting up this powerful combination and unlock the full potential of your computational resources.

Prerequisites

Scheduler:
- Size: Standard D4s v5 (4 vCPUs, 16 GiB memory)
- Image: Ubuntu-HPC 2204 - Gen2 (Ubuntu 22.04)
- Scheduling software: Slurm 23.02.7-1

Execute VM:
- Size: Standard NC40ads H100 v5 (40 vCPUs, 320 GiB memory)
- Image: Ubuntu-HPC 2204 - Gen2 (Ubuntu 22.04). The image contains the NVIDIA GPU driver. It is recommended to install the latest NVIDIA GPU driver; the minimum versions are:
  - H100: CUDA 12 and NVIDIA driver R525 (>= 525.53) or later
  - A100/A30: CUDA 11 and NVIDIA driver R450 (>= 450.80.02) or later
- Scheduling software: Slurm 23.02.7-1

Slurm Scheduler setup:

Step 1: First, create users for the Munge and SLURM services so that each service runs under its own unprivileged account.

```
groupadd -g 11101 munge
useradd -u 11101 -g 11101 -s /bin/false -M munge
groupadd -g 11100 slurm
useradd -u 11100 -g 11100 -s /bin/false -M slurm
```

Step 2: Set up the NFS server on the scheduler. NFS will be used to share configuration files across the cluster.

```
apt install nfs-kernel-server -y
mkdir -p /sched /shared/home
echo "/sched *(rw,sync,no_root_squash)" >> /etc/exports
echo "/shared *(rw,sync,no_root_squash)" >> /etc/exports
systemctl restart nfs-server
systemctl enable nfs-server.service
showmount -e
```

Step 3: Install and configure Munge. Munge is used for authentication across the SLURM cluster.

```
apt install -y munge
dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
cp /etc/munge/munge.key /sched/
chown munge:munge /sched/munge.key
chmod 400 /sched/munge.key
systemctl restart munge
systemctl enable munge
```
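Optionally, you can sanity-check Munge before moving on. A quick local round trip (generate a credential and decode it) confirms the daemon is running and the key is usable:

```
# Create a credential and decode it locally; a successful decode means munged is healthy.
munge -n | unmunge
```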
Step 4: Install and configure SLURM on the scheduler. Install the SLURM scheduler daemon and set up the directories for SLURM.

```
apt install slurm-slurmctld -y
mkdir -p /etc/slurm /var/spool/slurmctld /var/log/slurmctld
chown slurm:slurm /etc/slurm /var/spool/slurmctld /var/log/slurmctld
```

Create the `slurm.conf` file. Alternatively, you can generate the file using the Slurm configurator tool.

```
cat <<EOF > /sched/slurm.conf
MpiDefault=none
ProctrackType=proctrack/cgroup
ReturnToService=2
PropagateResourceLimits=ALL
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/affinity,task/cgroup
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
GresTypes=gpu
ClusterName=mycluster
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=debug
SlurmctldLogFile=/var/log/slurmctld/slurmctld.log
SlurmctldParameters=idle_on_node_suspend
SlurmdDebug=debug
SlurmdLogFile=/var/log/slurmd/slurmd.log
PrivateData=cloud
TreeWidth=65533
ResumeTimeout=1800
SuspendTimeout=600
SuspendTime=300
SchedulerParameters=max_switch_wait=24:00:00
Include accounting.conf
Include partitions.conf
EOF
echo "SlurmctldHost=$(hostname -s)" >> /sched/slurm.conf
```

Create `cgroup.conf` for Slurm. This command creates a configuration file named `cgroup.conf` in the `/sched` directory with the settings Slurm uses for cgroup resource management.

```
cat <<EOF > /sched/cgroup.conf
CgroupAutomount=no
ConstrainCores=yes
ConstrainRamSpace=yes
ConstrainDevices=yes
EOF
```

Configure the accounting storage type for Slurm:

```
echo "AccountingStorageType=accounting_storage/none" >> /sched/accounting.conf
```

Change the ownership of the configuration files:

```
chown slurm:slurm /sched/*.conf
```

Create symbolic links for the configuration files:

```
ln -s /sched/slurm.conf /etc/slurm/slurm.conf
ln -s /sched/cgroup.conf /etc/slurm/cgroup.conf
ln -s /sched/accounting.conf /etc/slurm/accounting.conf
```

Configure the Execute VM

1. Check and enable the NVIDIA GPU driver and MIG mode. More details on NVIDIA MIG can be found in the NVIDIA MIG documentation. Ensure the GPU driver is installed; the Ubuntu-HPC 2204 image includes the NVIDIA GPU driver, but if your image does not, install it before proceeding. Here are the commands to enable NVIDIA GPU MIG mode:

```
root@h100vm:~# nvidia-smi -pm 1
Enabled persistence mode for GPU 00000001:00:00.0.
All done.
root@h100vm:~# nvidia-smi -mig 1
Enabled MIG Mode for GPU 00000001:00:00.0
All done.
```
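Before creating instances, it is worth confirming that MIG mode is actually in effect; on some systems the change remains pending until the GPU is reset or the VM is rebooted. A quick check:

```
# Both values should read "Enabled"; if the pending value differs, reset the GPU or reboot.
nvidia-smi --query-gpu=mig.mode.current,mig.mode.pending --format=csv
```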
2. Check the supported profiles and create the MIG partitions. The following command lists the MIG profiles supported by the NVIDIA H100 GPU:

```
root@h100vm:~# nvidia-smi mig -lgip
+-----------------------------------------------------------------------------+
| GPU instance profiles:                                                      |
| GPU   Name             ID    Instances   Memory     P2P    SM    DEC   ENC  |
|                              Free/Total   GiB              CE    JPEG  OFA  |
|=============================================================================|
|   0  MIG 1g.12gb       19     7/7        10.75      No     16     1     0   |
|                                                             1     1     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 1g.12gb+me    20     1/1        10.75      No     16     1     0   |
|                                                             1     1     1   |
+-----------------------------------------------------------------------------+
|   0  MIG 1g.24gb       15     4/4        21.62      No     26     1     0   |
|                                                             1     1     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 2g.24gb       14     3/3        21.62      No     32     2     0   |
|                                                             2     2     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 3g.47gb        9     2/2        46.38      No     60     3     0   |
|                                                             3     3     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 4g.47gb        5     1/1        46.38      No     64     4     0   |
|                                                             4     4     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 7g.94gb        0     1/1        93.12      No    132     7     0   |
|                                                             8     7     1   |
+-----------------------------------------------------------------------------+
```

Create the MIG partitions using the following command. In this example, we create 4 MIG partitions using the 1g.24gb profile (profile ID 15):

```
root@h100vm:~# nvidia-smi mig -cgi 15,15,15,15 -C
Successfully created GPU instance ID  6 on GPU  0 using profile MIG 1g.24gb (ID 15)
Successfully created compute instance ID  0 on GPU  0 GPU instance ID  6 using profile MIG 1g.24gb (ID  7)
Successfully created GPU instance ID  5 on GPU  0 using profile MIG 1g.24gb (ID 15)
Successfully created compute instance ID  0 on GPU  0 GPU instance ID  5 using profile MIG 1g.24gb (ID  7)
Successfully created GPU instance ID  3 on GPU  0 using profile MIG 1g.24gb (ID 15)
Successfully created compute instance ID  0 on GPU  0 GPU instance ID  3 using profile MIG 1g.24gb (ID  7)
Successfully created GPU instance ID  4 on GPU  0 using profile MIG 1g.24gb (ID 15)
Successfully created compute instance ID  0 on GPU  0 GPU instance ID  4 using profile MIG 1g.24gb (ID  7)
```
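One caveat worth noting: MIG mode itself persists across reboots, but the GPU instances and compute instances do not, so the partitions must be recreated after each boot. The sketch below is one way to automate this with a systemd oneshot unit; the unit name and service ordering are illustrative assumptions, and the profile IDs should match your own layout:

```
# Hypothetical unit to replay the MIG layout at boot (adjust profile IDs as needed).
cat <<'EOF' > /etc/systemd/system/mig-setup.service
[Unit]
Description=Recreate MIG partitions at boot
After=nvidia-persistenced.service

[Service]
Type=oneshot
# Recreate four 1g.24gb GPU instances plus their compute instances (profile ID 15).
ExecStart=/usr/bin/nvidia-smi mig -cgi 15,15,15,15 -C

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable mig-setup.service
```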
Running `nvidia-smi` now shows the four MIG devices:

```
root@h100vm:~# nvidia-smi
Fri Jul  5 06:32:39 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 NVL                On  | 00000001:00:00.0 Off |                   On |
| N/A   38C    P0              61W / 400W |    51MiB /  95830MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  0    3   0   0  |              12MiB / 22144MiB  | 26      0 |  1   0    1    0    1 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    4   0   1  |              12MiB / 22144MiB  | 26      0 |  1   0    1    0    1 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    5   0   2  |              12MiB / 22144MiB  | 26      0 |  1   0    1    0    1 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    6   0   3  |              12MiB / 22144MiB  | 26      0 |  1   0    1    0    1 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
```

3. Create the Munge and SLURM users on the execute VM.

```
groupadd -g 11101 munge
useradd -u 11101 -g 11101 -s /bin/false -M munge
groupadd -g 11100 slurm
useradd -u 11100 -g 11100 -s /bin/false -M slurm
```

4. Mount the NFS shares from the scheduler (use the scheduler's IP address).

```
mkdir /shared /sched
mount <scheduler ip>:/sched /sched
mount <scheduler ip>:/shared /shared
```

5. Install and configure Munge.

```
apt install munge -y
cp /sched/munge.key /etc/munge/
chown munge:munge /etc/munge/munge.key
chmod 400 /etc/munge/munge.key
systemctl restart munge.service
```

6. Install and configure SLURM on the execute VM.

```
apt install slurm-slurmd -y
mkdir -p /etc/slurm /var/spool/slurmd /var/log/slurmd
chown slurm:slurm /etc/slurm /var/spool/slurmd /var/log/slurmd
ln -s /sched/slurm.conf /etc/slurm/slurm.conf
ln -s /sched/cgroup.conf /etc/slurm/cgroup.conf
ln -s /sched/accounting.conf /etc/slurm/accounting.conf
```

7. Create the GRES configuration for MIG. The following steps show how to build and run the MIG discovery program, using a single H100 system as an example.

```
git clone https://gitlab.com/nvidia/hpc/slurm-mig-discovery.git
cd slurm-mig-discovery
gcc -g -o mig -I/usr/local/cuda/include -I/usr/cuda/include mig.c -lnvidia-ml
./mig
```

8. Check the generated GRES configuration file.

```
root@h100vm:~/slurm-mig-discovery# cat gres.conf
# GPU 0 MIG 0 /proc/driver/nvidia/capabilities/gpu0/mig/gi3/access
Name=gpu Type=1g.22gb File=/dev/nvidia-caps/nvidia-cap30
# GPU 0 MIG 1 /proc/driver/nvidia/capabilities/gpu0/mig/gi4/access
Name=gpu Type=1g.22gb File=/dev/nvidia-caps/nvidia-cap39
# GPU 0 MIG 2 /proc/driver/nvidia/capabilities/gpu0/mig/gi5/access
Name=gpu Type=1g.22gb File=/dev/nvidia-caps/nvidia-cap48
# GPU 0 MIG 3 /proc/driver/nvidia/capabilities/gpu0/mig/gi6/access
Name=gpu Type=1g.22gb File=/dev/nvidia-caps/nvidia-cap57
```

Note that the discovery tool derives the GRES type name from the instance's usable memory (22144 MiB, roughly 22 GB), which is why the partitions created from the `1g.24gb` profile appear here as `1g.22gb`. This is the type name we will reference in the Slurm configuration below.
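Optionally, confirm that the device files referenced in `gres.conf` exist; each MIG compute instance is exposed as a capability device under `/dev/nvidia-caps`:

```
# The nvidia-cap* files listed here should match the File= entries in gres.conf.
ls -l /dev/nvidia-caps/
```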
9. Copy the generated configuration files to the central location.

```
cp gres.conf cgroup_allowed_devices_file.conf /sched/
chown slurm:slurm /sched/cgroup_allowed_devices_file.conf
chown slurm:slurm /sched/gres.conf
```

10. Create symlinks in the Slurm configuration directory.

```
ln -s /sched/cgroup_allowed_devices_file.conf /etc/slurm/cgroup_allowed_devices_file.conf
ln -s /sched/gres.conf /etc/slurm/gres.conf
```

11. Create the Slurm partitions file. This command creates a configuration file named `partitions.conf` in the `/sched` directory. It defines:
- A default GPU partition named `gpu` containing the node `h100vm`.
- The node `h100vm` with 40 CPUs: 1 board, 1 socket per board, 40 cores per socket, and 1 thread per core.
- A real memory of 322243 MB.
- GPU resources declared as 4 instances of the `1g.22gb` GRES type (`Gres=gpu:1g.22gb:4`).

```
cat << 'EOF' > /sched/partitions.conf
PartitionName=gpu Nodes=h100vm Default=YES MaxTime=INFINITE State=UP
NodeName=h100vm CPUs=40 Boards=1 SocketsPerBoard=1 CoresPerSocket=40 ThreadsPerCore=1 RealMemory=322243 Gres=gpu:1g.22gb:4
EOF
```

12. Set the permissions on `partitions.conf` and create a symlink in the Slurm configuration directory.

```
chown slurm:slurm /sched/partitions.conf
ln -s /sched/partitions.conf /etc/slurm/partitions.conf
```

Finalize and Start the SLURM Services

On the scheduler:

```
ln -s /sched/partitions.conf /etc/slurm/partitions.conf
ln -s /sched/cgroup_allowed_devices_file.conf /etc/slurm/cgroup_allowed_devices_file.conf
ln -s /sched/gres.conf /etc/slurm/gres.conf
systemctl restart slurmctld
systemctl enable slurmctld
```

On the execute VM:

```
systemctl restart slurmd
systemctl enable slurmd
```

Run the `sinfo` command on the scheduler VM to verify the Slurm configuration.

```
root@scheduler:~# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
gpu*         up   infinite      1   idle h100vm
```

Testing the job and functionality

1. To submit a job, first create a test user. In this example, we'll create a test user named `vinil` for testing purposes. Start by creating the user on the scheduler and then on the execute VM. We have set up an NFS server to share the `/shared` directory, which serves as the centralized home directory for the user.

```
# On the Scheduler VM
useradd -m -d /shared/home/vinil -u 20001 vinil
# On the Execute VM
useradd -d /shared/home/vinil -u 20001 vinil
```

On the Scheduler VM:

2. We will use a CIFAR-10 training model to run tests on the 4 MIG instances we created. We will set up an Anaconda environment to run the CIFAR-10 job, install the TensorFlow GPU machine learning libraries, and run 4 jobs simultaneously on a single node using Slurm to demonstrate the capabilities of MIG partitions and GPU workload scheduling on them.

```
# Download and install the Anaconda software.
curl -O https://repo.anaconda.com/archive/Anaconda3-2024.06-1-Linux-x86_64.sh
chmod +x Anaconda3-2024.06-1-Linux-x86_64.sh
sh Anaconda3-2024.06-1-Linux-x86_64.sh -b
```

3. Create a Conda environment named `mlprog` and install the TensorFlow GPU libraries.

```
# Set the PATH and create a conda environment called mlprog.
export PATH=$PATH:/shared/home/vinil/anaconda3/bin
/shared/home/vinil/anaconda3/bin/conda init
source ~/.bashrc
/shared/home/vinil/anaconda3/bin/conda create -n mlprog tensorflow-gpu -y
```

4. The following command downloads the `cifar10.py` script, which contains CIFAR-10 image classification machine learning code written using TensorFlow.

```
# Download the CIFAR-10 code.
wget https://raw.githubusercontent.com/vinil-v/slurm-mig-setup/main/test_job_setup/cifar10.py
```
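Before queuing the full training runs, an optional smoke test can confirm that a job constrained to a single MIG slice actually sees a GPU from TensorFlow (this assumes the `mlprog` environment is active, as set up in step 3):

```
# Request one MIG slice and list the GPUs TensorFlow can see; expect one logical GPU.
srun --partition=gpu --gres=gpu:1g.22gb:1 \
  python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
```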
5. Create a job submission script named `mljob.sh` to run the job on a GPU using the Slurm scheduler. This script submits a job named `MLjob` to the GPU partition (`--partition=gpu`), allocates 10 tasks (`--ntasks=10`), and requests one MIG slice (`--gres=gpu:1g.22gb:1`). It sets up the environment by adding Conda to the PATH and activating the `mlprog` Conda environment before executing the `cifar10.py` script to perform CIFAR-10 image classification using TensorFlow.

```
#!/bin/sh
#SBATCH --job-name=MLjob
#SBATCH --partition=gpu
#SBATCH --ntasks=10
#SBATCH --gres=gpu:1g.22gb:1
export PATH=$PATH:/shared/home/vinil/anaconda3/bin
source /shared/home/vinil/anaconda3/bin/activate mlprog
python cifar10.py
```

6. Submit the job using the `sbatch` command, launching 4 instances of the same `mljob.sh` script. This fully utilizes all 4 MIG partitions available on the node. After submission, use the `squeue` command to check the status; you will observe all 4 jobs in the Running state.

```
(mlprog) vinil@scheduler:~$ sbatch mljob.sh
Submitted batch job 7
(mlprog) vinil@scheduler:~$ sbatch mljob.sh
Submitted batch job 8
(mlprog) vinil@scheduler:~$ sbatch mljob.sh
Submitted batch job 9
(mlprog) vinil@scheduler:~$ sbatch mljob.sh
Submitted batch job 10
(mlprog) vinil@scheduler:~$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 7       gpu    MLjob    vinil  R       0:05      1 h100vm
                 8       gpu    MLjob    vinil  R       0:01      1 h100vm
                 9       gpu    MLjob    vinil  R       0:01      1 h100vm
                10       gpu    MLjob    vinil  R       0:01      1 h100vm
```
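You can also verify the allocation from the scheduler side; `scontrol` reports how many of the node's MIG slices are configured and how many are currently in use:

```
# AllocTRES should show gres/gpu=4 while all four jobs are running.
scontrol show node h100vm | grep -i -E "gres|tres"
```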
7. Log in to the execute VM and run the `nvidia-smi` command. You will observe that all 4 MIG GPU partitions are allocated to the jobs and are busy running them.

```
azureuser@h100vm:~$ nvidia-smi
Fri Jul  5 07:32:50 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 NVL                On  | 00000001:00:00.0 Off |                   On |
| N/A   43C    P0              90W / 400W | 83393MiB /  95830MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  0    3   0   0  |           20846MiB / 22144MiB  | 26      0 |  1   0    1    0    1 |
|                  |               2MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    4   0   1  |           20846MiB / 22144MiB  | 26      0 |  1   0    1    0    1 |
|                  |               2MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    5   0   2  |           20850MiB / 22144MiB  | 26      0 |  1   0    1    0    1 |
|                  |               2MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    6   0   3  |           20850MiB / 22144MiB  | 26      0 |  1   0    1    0    1 |
|                  |               2MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0    3    0      11813      C   python                                    20826MiB |
|    0    4    0      11836      C   python                                    20826MiB |
|    0    5    0      11838      C   python                                    20830MiB |
|    0    6    0      11834      C   python                                    20830MiB |
+---------------------------------------------------------------------------------------+
azureuser@h100vm:~$
```
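If you later need to return the GPU to its default (non-MIG) state, tear the layout down in reverse order. A minimal sketch follows; stop any running jobs first, since these commands fail while the slices are in use:

```
nvidia-smi mig -dci   # destroy all compute instances
nvidia-smi mig -dgi   # destroy all GPU instances
nvidia-smi -mig 0     # disable MIG mode on the GPU
```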
Conclusion:

You have now successfully set up a SLURM cluster with NVIDIA MIG integration. This setup allows you to efficiently schedule and manage GPU jobs, ensuring optimal utilization of resources. With SLURM and MIG, you can achieve high performance and scalability for your computational tasks. Happy computing!