
Profiling can be performed on Azure HPC VMs with various tools. Today we are going to look at MPI application profiling with AMD uProf.

 

AMD uProf is a tool for performance and system analysis. It can gather time-based and instruction-based profiles, as well as trace and visualize MPI processes and threads.

 

 

 

Profiling is the analysis of an application’s execution via the measurement of system metrics. When profiling, hardware counters can be sampled to verify the occurrence and frequency of certain hardware events; for example, L1 cache misses or CPU cycles per instruction (CPI). Profiling can also include time-based measurements that relate to specific instructions in an application call stack.

 

Benefit

 

The insight from profiling an application’s runtime can be used to improve latency, throughput, and scalability on HPC systems.

 

Tool

 

 

 

Modes of operation

 

AMD uProf is cross-platform and provides both a CLI and a GUI. The CLI can generate a CSV report that can be analyzed later. For this experiment, we will use the CLI to gather metrics and then switch over to the GUI to analyze and visualize the data.

 

Environment setup
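
As a minimal sketch, assuming the AMD uProf Linux tarball has been downloaded and extracted into the home directory (version 4.0.341, the one used later in this post), the environment can be set up as follows; the AMD_PERF_DIR variable matches the one used in the trace command further down:

# Point at the extracted AMD uProf install (path and version are assumptions).
export AMD_PERF_DIR=$HOME/AMDuProf_Linux_x64_4.0.341/bin

# Optionally put the CLI on PATH for convenience.
export PATH=$AMD_PERF_DIR:$PATH

# Sanity check: list the predefined collection configs.
AMDuProfCLI info --list collect-configs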

 

 

Collecting and Generating Reports with CLI

 


 

Collect profile data:

 

PROF_CMD="AMDuProfCLI collect --config tbp -g --mpi --output-dir <output dir>

 

mpirun <MPI flags> $PROF_CMD ./application.out

 

Collect MPI trace:

 

PROF_CMD="$AMD_PERF_DIR/AMDuProfCLI collect --trace mpi=openmpi,full -g --mpi --output-dir < output dir >

 

mpirun <MPI flags> $PROF_CMD ./application.out

 

Generate report:

 

AMDuProfCLI report -g --detail --input-dir <output dir/prof dir>

 

Note:

 

To avoid latency due to processing too much data, refer to the following:

 

  • Collect profile data from only a few ranks, or specify a short duration.
  • An MPI config file can help designate which ranks to profile.
  • Perform MPI tracing separately from profile data collection.

 

Other options:

 

The --config flag can be used to provide a config file that designates which events should be sampled. Example config files are provided within the AMD uProf directory. The --config flag also accepts predefined arguments that capture a preset list of metrics. Use the following to list them:

 

./AMDuProfCLI info --list collect-configs

 

Alternatively, the --events flag can be used to pick out performance monitoring unit (PMU) events to collect. Multiple event flags can be used when collecting more than one PMU event; a sketch follows the listing command below. To view the list of PMU events, use the following:

 

./AMDuProfCLI info --list pmu-events
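
As a sketch of combining multiple events, with placeholder event names to be substituted from the pmu-events listing above:

# <pmu-event-1> and <pmu-event-2> are placeholders; pick real names from the listing.
PROF_CMD="./AMDuProfCLI collect --events <pmu-event-1> --events <pmu-event-2> -g --mpi --output-dir <output dir>"

mpirun <MPI flags> $PROF_CMD ./application.out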

 

Profiling and Tracing WRF

 

  1. First, we will collect WRF profiling data.

    1. We want to specify what type of events to collect. We will use the predefined config options to collect a time-based profile and a CPI (cycles per instruction) profile.
    2. To keep things neat, we will set an env variable with a string of the AMD uProf executable and arguments:

 

PROF_CMD="./AMDuProfCLI collect --config tbp --config cpi -g --mpi --output-dir <o/p dir>

 

    3. The -g option enables call stack sampling, and --mpi is used to specify profiling of MPI apps. (Refer to the documentation, section 6.4, for a complete list of options and descriptions.)
    4. Now that we have a nice env variable, we can place it in our MPI command. For this, I used an MPI config file to specify the ranks I want to profile.

 

 

 

mpi_config.txt:

 

-np 1 $PROF_CMD ./wrf.exe

 

-np 119 ./wrf.exe

 

 

 

mpirun command:

 

mpirun --allow-run-as-root $PIN_PROCESSOR_LIST --rank-by slot -mca coll ^hcoll -x LD_LIBRARY_PATH -x PATH -x PWD --app mpi_config.txt

 

 

 

Note: The $PIN_PROCESSOR_LIST variable is a string like this:

 

"--bind-to cpulist:ordered --cpu-set 0,1,2,3,4,5,6,7,8" which ensures proper pinning of all the cores.
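
Putting these pieces together, a minimal end-to-end launch sketch might look like the following; the output directory, rank counts, and pinning list are illustrative values based on the snippets above:

#!/bin/bash
# Profile rank 0 only; the remaining 119 ranks run unprofiled.
PROF_CMD="./AMDuProfCLI collect --config tbp --config cpi -g --mpi --output-dir ./profout"

# Write the MPI config file; $PROF_CMD is expanded when the file is written.
cat > mpi_config.txt <<EOF
-np 1 $PROF_CMD ./wrf.exe
-np 119 ./wrf.exe
EOF

# Pin ranks to cores (illustrative core list).
PIN_PROCESSOR_LIST="--bind-to cpulist:ordered --cpu-set 0,1,2,3,4,5,6,7,8"

mpirun --allow-run-as-root $PIN_PROCESSOR_LIST --rank-by slot -mca coll ^hcoll \
  -x LD_LIBRARY_PATH -x PATH -x PWD --app mpi_config.txt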

 

  2. When the WRF simulation is launched, it will notify that profiling has begun and completed:

 

[Screenshots: console messages showing profile collection starting and completing]

 

 

 

  3. Now we are ready to generate the report:

 

~/AMDuProf_Linux_x64_4.0.341/bin/AMDuProfCLI report -g --detail --input-dir …/profout/AMDuProf-wrf-Custom_MPI

 

  4. This will generate a report CSV, a CPU database, and a uProf session file.
  5. Collecting trace data is similar to the steps above, with the exception of the CLI options used and the use of all MPI ranks.
  6. We can now compress the output directory containing the report data and transfer it to a commodity machine where we can use the AMD uProf GUI, as sketched below.
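
A minimal sketch of that compress-and-transfer step, with a placeholder archive name and destination host:

# Compress the profiling output directory.
tar -czf profout.tar.gz <output dir>

# Copy it to a workstation that has the AMD uProf GUI installed.
scp profout.tar.gz <user>@<workstation>:~/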

 

Visualization and Analysis Overview

 

Please refer to the AMD uProf documentation (section 5, 5.5) for how to import and use the AMD uProf GUI.

 

Summary Hot Spots view:

 

[Screenshot: Summary Hot Spots view]

 

 

 

Analyze Function Hotspots

 

[Screenshot: Function Hotspots view]

 

 

 

Analyze Metrics

 

Shows a similar call stack, but with filtering by process, thread, and module.

 

Analyze Flame Graph

 

Analyze Call Graph

 

[Screenshot: Call Graph view]

 

 

 

Sources

 

Maps function calls to assembly instructions for a detailed view.

 

Trace

 

MPI flat profile

 

Time chart:

 

[Screenshots: MPI trace timeline charts]

 

 

 

Support for hardware counters

 

Hardware counter visibility in VMs is exposed through the Virtual Performance Monitoring Unit (VPMU). VPMU is enabled on the following VM SKUs (a quick check is sketched after the list):

 

  • HC
  • HBv2
  • HBv3
  • HBv4/HX
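
A quick way to confirm from inside the VM that hardware events are actually visible, assuming the standard Linux perf tool is installed:

# On a VPMU-enabled SKU these counters report real values instead of <not supported>.
perf stat -e cycles,instructions sleep 1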

 

