damocelj
Introduction
In the fast-paced world of Financial Services, High-Performance Computing (HPC) systems in the cloud have become indispensable. From instrument pricing and risk evaluations to portfolio optimizations and regulatory workloads like CVA and FRTB, the flexibility and scalability of cloud deployments are transforming the industry. Unlike traditional HPC systems that require complex parallelization frameworks (e.g. depending on MPI and InfiniBand networking), many financial calculations can be efficiently executed on general-purpose SKUs in Azure.
Depending on the codes used to perform the calculations, many implementations leverage vendor-specific optimizations such as AVX-512 from Intel. With the recent announcement of the public preview of the 6th generation of Intel-based Dv6 VMs (see here), this article will explore the performance evolution across three generations of D32ds – from D32dsv4 to D32dsv6.
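Before assuming that AVX-512 code paths will be used, it is worth confirming that the guest actually exposes the relevant CPU flags. A minimal pre-flight check (our own sketch, not part of the benchmark suite) could look like this:

```shell
# Hypothetical pre-flight check: does the guest CPU advertise AVX-512F,
# the foundation subset that vectorized pricing libraries typically require?
if grep -qw avx512f /proc/cpuinfo; then
    echo "AVX-512 available"
else
    echo "AVX-512 not available"
fi
```

On the Intel SKUs discussed here, further subsets (e.g. `avx512dq`, `avx512vl`) appear as additional flags in `/proc/cpuinfo`.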
We will follow the testing methodology similar to the article from January 2023 – “Benchmarking on Azure HPC SKUs for Financial Services Workloads” (link here).
Overview of the D-series VMs in focus
The official announcement notes that the upcoming Dv6 series (currently in preview) offers significant improvements over the previous Dv5 generation. Key highlights include:
- Up to 27% higher vCPU performance and a threefold increase in L3 cache compared to the previous generation Intel Dl/D/Ev5 VMs.
- Support for up to 192 vCPUs and more than 18 GiB of memory.
- Azure Boost, which provides:
  - Up to 400,000 IOPS and 12 GB/s remote storage throughput.
  - Up to 200 Gbps VM network bandwidth.
- A 46% increase in local SSD capacity and more than three times the read IOPS.
- NVMe interface for both local and remote disks.
Note: Enhanced security through Total Memory Encryption (TME) technology is not activated in the preview deployment and will be benchmarked once available.
| VM Name | D32ds_v4 | D32ds_v5 | D32ds_v6 |
| --- | --- | --- | --- |
| Number of vCPUs | 32 | 32 | 32 |
| InfiniBand | N/A | N/A | N/A |
| Processor | Intel® Xeon® Platinum 8370C (Ice Lake) or Intel® Xeon® Platinum 8272CL (Cascade Lake) | Intel® Xeon® Platinum 8370C (Ice Lake) | Intel® Xeon® Platinum 8573C (Emerald Rapids) |
| Peak CPU Frequency | 3.4 GHz | 3.5 GHz | 3.0 GHz |
| RAM per VM | 128 GB | 128 GB | 128 GB |
| RAM per core | 4 GB | 4 GB | 4 GB |
| Attached Local Disk | 1,200 GiB SSD | 1,200 GiB SSD | 440 GiB SSD |

Technical specifications for three generations of D32ds SKUs
Benchmarking Setup
For our benchmarking setup, we utilised the user-friendly, open-source Phoronix Test Suite (link) to run two tests from the OpenBenchmarking.org test suite, specifically targeting quantitative finance workloads.
The tests in the "finance suite" are divided into two groups, each running independent benchmarks. In addition to the finance test suite, we also ran the AI-Benchmark to evaluate the evolution of AI inferencing capabilities across three VM generations.
| Finance Bench | QuantLib | AI Benchmark |
| --- | --- | --- |
| Bonds OpenMP | Size XXS | Device Inference Score |
| Repo OpenMP | Size S | Device AI Score |
| Monte-Carlo OpenMP | | Device Training Score |

Software dependencies
| Component | Version |
| --- | --- |
| OS Image | Ubuntu marketplace image: 24_04-lts |
| Phoronix Test Suite | 10.8.5 |
| QuantLib Benchmark | 1.35-dev |
| Finance Bench Benchmark | 2016-07-25 |
| AI Benchmark Alpha | 0.1.2 |
| Python | 3.12.3 |
To run the benchmark on a freshly created D-Series VM, execute the following commands (after updating the installed packages to the latest version):
git clone https://github.com/phoronix-test-suite/phoronix-test-suite.git
cd phoronix-test-suite
sudo apt-get install php-cli php-xml cmake
sudo ./install-sh
phoronix-test-suite benchmark finance
For the AI Benchmark tests, a few additional steps are required: creating a virtual environment for the additional Python packages and installing the tensorflow and ai-benchmark packages:
sudo apt install python3 python3-pip python3-virtualenv
mkdir ai-benchmark && cd ai-benchmark
virtualenv virtualenv
source virtualenv/bin/activate
pip install tensorflow
pip install ai-benchmark
phoronix-test-suite benchmark ai-benchmark
Benchmarking Runtimes and Results
The purpose of this article is to share the results of a set of benchmarks that closely align with the use cases mentioned in the introduction. Most of these use cases are predominantly CPU-bound, which is why we have limited the benchmark to D-Series VMs. For memory-bound codes that would benefit from a higher memory-to-core ratio, the new Ev6 SKU could be a suitable option.
In the picture below, you can see a representative benchmarking run on a Dv6 VM, where nearly 100% of the CPUs were utilised during execution. The individual runs of the Phoronix test suite, starting with Finance Bench and followed by QuantLib, are clearly visible.
Runtimes
Figure 1: CPU Utilization for a full Finance Benchmark Run
| Benchmark | VM Size | Start Time | End Time | Duration (hh:mm) | Minutes |
| --- | --- | --- | --- | --- | --- |
| Finance Benchmark | Standard D32ds v4 | 12:08 | 15:29 | 03:21 | 201 |
| Finance Benchmark | Standard D32ds v5 | 11:38 | 14:12 | 02:34 | 154 |
| Finance Benchmark | Standard D32ds v6 | 11:39 | 13:27 | 01:48 | 108 |
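As a plausibility check on the runtimes above, the Minutes column can be recomputed from the wall-clock start and end times. A small awk sketch of our own (not part of the original setup), with the times from the table inlined:

```shell
# Recompute each run's duration in minutes from its start/end times (HH:MM).
awk -F'[ :]' '{ d = ($3*60 + $4) - ($1*60 + $2); printf "%s-%s: %d min\n", $1":"$2, $3":"$4, d }' <<'EOF'
12:08 15:29
11:38 14:12
11:39 13:27
EOF
```

This reproduces the 201, 154, and 108 minute durations reported in the table.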
Finance Bench Results
QuantLib Results
AI Benchmark Alpha Results
Discussion of the results
The results show significant performance improvements in QuantLib across the D32v4, D32v5, and D32v6 versions. Specifically, the tasks per second for Size S increased by 47.18% from D32v5 to D32v6, while Size XXS saw an increase of 45.55%.
Benchmark times for 'Repo OpenMP' and 'Bonds OpenMP' also decreased, indicating better performance. 'Repo OpenMP' times were reduced by 18.72% from D32v4 to D32v5 and by 20.46% from D32v5 to D32v6. Similarly, 'Bonds OpenMP' times decreased by 11.98% from D32v4 to D32v5 and by 18.61% from D32v5 to D32v6.
In terms of Monte-Carlo OpenMP performance, the D32v6 showed the best result with a runtime of 51,927.04 ms, followed by the D32v5 at 56,443.91 ms and the D32v4 at 57,093.94 ms. Runtimes decreased by 1.14% from D32v4 to D32v5 and by 8.00% from D32v5 to D32v6.
AI Benchmark Alpha scores for device inference and training also improved significantly. Inference scores increased by 15.22% from D32v4 to D32v5 and by 42.41% from D32v5 to D32v6. Training scores saw an increase of 21.82% from D32v4 to D32v5 and 43.49% from D32v5 to D32v6.
Finally, Device AI scores improved across the versions, with D32v4 scoring 6726, D32v5 scoring 7996, and D32v6 scoring 11436. The percentage increases were 18.88% from D32v4 to D32v5 and 43.02% from D32v5 to D32v6.
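The percentage figures quoted above follow directly from the raw numbers; a minimal awk sketch to reproduce a sample of them (Monte-Carlo runtimes in ms, Device AI scores):

```shell
# Reproduce generation-over-generation deltas from the raw values above.
# Negative values mean lower (better) runtimes; positive mean higher scores.
awk 'BEGIN {
    printf "Monte-Carlo v4->v5: %+.2f%%\n", (56443.91 / 57093.94 - 1) * 100
    printf "Monte-Carlo v5->v6: %+.2f%%\n", (51927.04 / 56443.91 - 1) * 100
    printf "Device AI   v4->v5: %+.2f%%\n", (7996 / 6726 - 1) * 100
    printf "Device AI   v5->v6: %+.2f%%\n", (11436 / 7996 - 1) * 100
}'
```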
Next Steps & Final Comments
The public preview of the new Intel SKUs has already shown very promising benchmarking results, indicating a significant performance improvement over the previous D-series generations, which are still widely used in FSI scenarios.
It's important to note that your custom code or purchased libraries might exhibit different characteristics than the benchmarks selected. Therefore, we recommend validating the performance indicators with your own setup.
In this benchmarking setup, we did not disable Hyper-Threading on the CPUs, so each vCPU corresponds to a hardware thread rather than a physical core. If benchmarking without Hyper-Threading is of interest to you, please reach out to the authors for more information.
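How SMT presents itself inside the guest can be inspected directly (a quick check of our own, not part of the benchmark runs):

```shell
# With Hyper-Threading enabled, lscpu reports 2 threads per core, i.e. the
# 32 vCPUs of a D32ds map to 16 physical cores.
lscpu | grep -E '^(CPU\(s\)|Thread\(s\) per core|Core\(s\) per socket):'
```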
Additionally, Azure offers a wide range of VM families to suit various needs, including F, FX, Fa, D, Da, E, Ea, and specialized HPC SKUs like HC and HB VMs.
A dedicated validation, based on your individual code / workload, is recommended here as well, to ensure the best suited SKU is selected for the task at hand.