Posted February 2, 2023

Profiling can be performed on Azure HPC VMs with various tools. Today we are going to look at MPI application profiling with AMD uProf, a tool for performance and system analysis. AMD uProf can gather time-based and instruction-based profiles, as well as trace and visualize MPI processes and threads.

Profiling is the analysis of an application's execution via the measurement of system metrics. When profiling, hardware counters can be sampled to verify the occurrence and frequency of certain hardware events, for example L1 cache misses or CPU cycles per instruction. Profiling can also include time-based measurements that relate to specific instructions in an application's call stack.

Benefit? The insight gained from profiling an application's runtime can be used to improve latency, throughput, and scalability on HPC systems.

Tool Modes of Operation

AMD uProf is cross-platform and provides both a CLI and a GUI. The CLI can generate a CSV report that can be analyzed later. For this experiment, we will use the CLI to gather metrics and then switch over to the GUI to analyze and visualize the data.

Environment setup

We will be profiling a WRF simulation.

- VM SKU: HBv3-series - Azure Virtual Machines | Microsoft Learn
- OS: AlmaLinux 8.6 (Ubuntu or CentOS also work)
- AMD uProf installed and prerequisites met.
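One prerequisite worth checking up front (a sketch, not from the post): on Linux, CLI sampling typically goes through the kernel's perf_event subsystem, so a restrictive `kernel.perf_event_paranoid` setting can block collection for unprivileged users. The helper name and the threshold below are assumptions for illustration.

```shell
# check_uprof_prereq: hypothetical helper that reports the current
# perf_event_paranoid level and suggests a relaxed setting if needed.
check_uprof_prereq() {
    paranoid_file=/proc/sys/kernel/perf_event_paranoid
    if [ -r "$paranoid_file" ]; then
        level=$(cat "$paranoid_file")
        echo "perf_event_paranoid=$level"
        # Values above 1 usually restrict unprivileged sampling (assumption).
        if [ "$level" -gt 1 ]; then
            echo "hint: sudo sysctl -w kernel.perf_event_paranoid=1"
        fi
    else
        echo "perf_event support not detected"
    fi
}

check_uprof_prereq
```

Refer to section 3.1.1.1 of the AMD uProf documentation for the authoritative prerequisite list.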
(Please refer to the documentation, section 3.1.1.1, for prerequisites.)
- MPI flavor: HPC-X 2.13 (OpenMPI)

Collecting and Generating Reports with CLI

Collect profile data:

```
PROF_CMD="AMDuProfCLI collect --config tbp -g --mpi --output-dir <output dir>"
mpirun <MPI flags> $PROF_CMD ./application.out
```

Collect MPI trace:

```
PROF_CMD="$AMD_PERF_DIR/AMDuProfCLI collect --trace mpi=openmpi,full -g --mpi --output-dir <output dir>"
mpirun <MPI flags> $PROF_CMD ./application.out
```

Generate report:

```
AMDuProfCLI report -g --detail --input-dir <output dir/prof dir>
```

Note: To avoid latency due to processing too much data, consider the following:
- Collect profile data from only a few ranks, or specify a short duration. An MPI config file can help designate which ranks are profiled.
- Perform the MPI trace separately from profile data collection.

Other options: The --config flag can be used to provide a config file that designates which events should be sampled. Example config files are provided within the AMD uProf directory. The --config flag also accepts predefined arguments which capture a preset list of metrics. Use the following to list them:

```
./AMDuProfCLI info --list collect-configs
```

Alternatively, the --events flag can be used to pick out performance monitoring unit (PMU) events to collect. Multiple --events flags can be used when collecting more than one PMU event. To view the list of PMU events, use the following:

```
./AMDuProfCLI info --list pmu-events
```

Profiling and Tracing WRF

First, we will collect WRF profiling data. We want to specify which types of events to collect; we will use the predefined config options to collect a time-based profile and a CPI (cycles per instruction) profile. To keep things neat, we will set an environment variable holding the AMD uProf executable and its arguments:

```
PROF_CMD="./AMDuProfCLI collect --config tbp --config cpi -g --mpi --output-dir <output dir>"
```

The -g option enables call stack sampling, and --mpi is used to specify profiling of MPI applications.
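As a concrete sketch of the command assembly above, with the placeholders filled in: the output directory `/tmp/profout` and the 4-rank run are assumptions for illustration, not values from the post.

```shell
# Sketch only: substitute hypothetical values into the collect/report
# templates shown above.
OUTPUT_DIR=/tmp/profout   # hypothetical scratch directory
PROF_CMD="AMDuProfCLI collect --config tbp --config cpi -g --mpi --output-dir $OUTPUT_DIR"

# The profiler command wraps the application inside the mpirun invocation:
LAUNCH="mpirun -np 4 $PROF_CMD ./application.out"   # 4 ranks: an assumption
echo "$LAUNCH"

# After the run completes, the matching report command would be:
REPORT="AMDuProfCLI report -g --detail --input-dir $OUTPUT_DIR"
echo "$REPORT"
```

Note that `mpirun` launches `AMDuProfCLI`, which in turn launches the application; each profiled rank therefore runs under its own collector instance.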
(Refer to the documentation, section 6.4, for a complete list of options and descriptions.)

Now that we have the environment variable, we can place it in our MPI command. For this, I used an MPI config file to specify the ranks I want to profile.

mpi_config.txt:

```
-np 1 $PROF_CMD ./wrf.exe
-np 119 ./wrf.exe
```

mpirun command:

```
mpirun --allow-run-as-root $PIN_PROCESSOR_LIST --rank-by slot -mca coll ^hcoll -x LD_LIBRARY_PATH -x PATH -x PWD --app mpi_config.txt
```

Note: The $PIN_PROCESSOR_LIST variable is a string like "--bind-to cpulist:ordered --cpu-set 0,1,2,3,4,5,6,7,8", which ensures proper pinning of all the cores.

When the WRF simulation is launched, it will notify that profiling has begun and completed. Now we are ready to generate the report:

```
~/AMDuProf_Linux_x64_4.0.341/bin/AMDuProfCLI report -g --detail --input-dir …/profout/AMDuProf-wrf-Custom_MPI
```

This will generate a report CSV, a CPU database, and a uProf session file.

Collecting trace data follows the same steps, with the exception of the CLI options used and the use of all MPI ranks. We can now compress the output directory containing the report data and transfer it to a commodity machine where we can use the AMD uProf GUI.

Visualization and Analysis Overview

Please refer to the AMD uProf documentation (sections 5 and 5.5) for how to import profile data and use the AMD uProf GUI.

- Summary: Hot Spots view
- Analyze: Function Hotspots
- Analyze: Metrics - shows a similar call stack, with filtering by process, thread, and module
- Analyze: Flame Graph
- Analyze: Call Graph
- Sources - maps function calls to assembly instructions for a detailed view
- Trace: MPI flat profile and time chart

Support for hardware counters

Hardware counter visibility inside a VM is exposed through the Virtual Performance Monitoring Unit (VPMU). VPMU is enabled on the following VM SKUs:

- HC
- HBv2
- HBv3
- HBv4/HX

References

Download: AMD μProf - AMD
Documentation: 57368_uProf_UG (amd.com)
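The compress-and-transfer step described above can be as simple as the sketch below; `profout` stands in for the real collect output directory, which will have a name like the AMDuProf-wrf-* directory shown earlier.

```shell
# Sketch: bundle the profiling output for transfer to the machine running
# the AMD uProf GUI. "profout" is a placeholder directory name.
RESULT_DIR=profout
mkdir -p "$RESULT_DIR"              # stand-in for the real collect output
tar -czf profout.tar.gz "$RESULT_DIR"
```

The archive can then be copied with scp (or similar) and extracted on the workstation before importing the session into the GUI.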