Intel Sapphire Rapids on Google Compute Engine C3 — MPI & Storage Performance Evaluation

Federico Iezzi
Google Cloud - Community
Apr 18, 2023 · 10 min read


Over a month ago I wrote about Intel Sapphire Rapids leveraged by the GCE C3 instances at Google Cloud in the context of SPEC CPU® 2017.

In my previous article, I discussed Intel Sapphire Rapids and the Intel E2000 IPU architecture in depth. However, for this article, I will be focusing on the specific strengths of the IPU itself, namely its storage and network capabilities.

Initially, my plan was to run a DPDK application, such as FD.io VPP or TestPMD. However, due to time constraints, I shifted my focus to an MPI application instead. In terms of storage testing, I will be using FIO, a reliable and flexible tool that generates synthetic I/O patterns.

Table of Contents

  1. FIO, what’s that?
  2. I/O Storage System Setup
  3. Single PD IOPS and Bandwidth Numbers
  4. Multiple PD IOPS and Bandwidth Numbers
  5. MPI, what’s that?
  6. MPI Latency and Throughput Numbers
  7. Conclusions

FIO, what’s that?

Let’s start by looking at the official definition:

Fio was originally written to save the hassle of writing special test case programs when required to test a specific I/O workload, either for performance reasons or to find/reproduce a bug. The process of writing such a test app can be tiresome, especially if you have to do it often. A test workload is difficult to define, though. There can be any number of processes or threads involved, and they can each be using their own way of generating I/O.

Fio spawns a number of threads or processes doing a particular type of I/O action as specified by the user. fio takes a number of global parameters, each inherited by the thread. The typical use of fio is to write a job file matching the I/O load one wants to simulate.

Oldest commit available in fio's git history

Over the years (well, decades), fio has become the de facto standard for testing storage performance.

More info about the project is available at the official Git repository.
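To make the job-file idea concrete, here is a trivial example; it is only a sketch (the file path, size, and runtime are arbitrary) and not one of the jobs used later in this article.

```bash
# Minimal fio job file: a 60-second 4k random-read job against a scratch file.
cat > example.fio <<'EOF'
[global]
ioengine=libaio      # Linux asynchronous I/O
direct=1             # O_DIRECT, bypass the page cache
time_based
runtime=60

[randread-4k]
rw=randread
bs=4k
iodepth=32
filename=/tmp/fio-testfile
size=1G
EOF

fio example.fio
```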

I/O Storage System Setup

Sixteen benchmarks are executed to identify:

  • Read and Write performance at 4k random with a queue depth of 256, using O_DIRECT and hence bypassing all of the Linux kernel’s I/O cache layers;
  • Read and Write latency numbers, still at 4k random, but this time with a queue depth of 4;
  • Sequential Read and Write performance at 1 MB with a queue depth of 64, still using O_DIRECT;
  • Mixed Read/Write, still at 4k, a queue depth of 256, and O_DIRECT: the Read IOPS numbers come from a typical 75% read and 25% write mix, while the Write numbers use the opposite split, 25% read and 75% write.

The choice of a queue depth of 256 comes as a recommendation from Google; it is also the upper boundary of legacy SAS devices. Below is a sketch of what such a job looks like.
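To tie this to the matrix above, here are the random read and mixed cases expressed as plain fio command lines; this is a sketch only, the /dev/sdb device path is an assumption (check lsblk first), and any job issuing writes against a raw block device is destructive.

```bash
# 4k random read, queue depth 256, O_DIRECT, straight against the PD block device.
fio --name=randread-4k --filename=/dev/sdb --ioengine=libaio --direct=1 \
    --rw=randread --bs=4k --iodepth=256 --time_based --runtime=120 \
    --group_reporting

# Mixed 75% read / 25% write variant: only the workload type changes.
fio --name=mixed-4k --filename=/dev/sdb --ioengine=libaio --direct=1 \
    --rw=randrw --rwmixread=75 --bs=4k --iodepth=256 --time_based \
    --runtime=120 --group_reporting
```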

The test setup is as follows (a short guest-preparation sketch is shown below the list):

  • There are two types of test cases: one that uses a single Persistent Disk SSD with a capacity of 65 TiB, which eliminates I/O caps from the hypervisor, and another that uses 10 PD SSDs with a capacity of 3 TiB each. The 10x PDs test case is to determine whether a single or multiple PDs are needed to saturate the available throughput, and to measure the impact of these configurations on latency;
  • As a side note, at the time of this preview, C3 doesn’t support either local NVMe or the Extreme PD;
  • The largest available C2, C2D, and C3 instances (respectively c2-standard-60, c2d-standard-112, and c3-highcpu-176) also ensure virtually no resource bottlenecks in the guest, as well as not hitting the per-instance I/O caps;
  • Rocky Linux 9.1 GCP Optimized was used across the board, with the same patch level and same kernel: 5.14.0–162.18.1.el9_1.cloud.x86_64;
  • Tuned version 2.19.0–1.el9.noarch configured with the throughput-performance profile;
  • FIO was installed from the OS package, with version fio-3.27–7.el9 available.
Screenshot while running fio and monitoring the nodes’ performance through iostat and top
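For reference, the guest-side preparation boils down to a handful of commands; this is a minimal sketch assuming the Rocky Linux 9 GCP Optimized image, not a verbatim transcript of the test environment.

```bash
# Install the tools and apply the tuned profile used for the storage runs.
sudo dnf install -y fio tuned
sudo systemctl enable --now tuned
sudo tuned-adm profile throughput-performance
tuned-adm active              # expect: Current active profile: throughput-performance
fio --version                 # expect: fio-3.27
lsblk -d -o NAME,SIZE,TYPE    # confirm the attached PD SSD(s); device names may vary
```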

Single PD IOPS and Bandwidth Numbers

As with all the previous publications, full access to the raw results is available through the following Google Sheet:

The system topology follows:

C3 topology overview with a single PD mapped on the vNUMA0

Here are all the storage numbers:

With the sole exception of C2’s superior latency (C2 is based on Intel Cascade Lake), we see a substantial tie between C2 and C2D (AMD Milan). C3 competes in a very different league in both pure throughput and latency:

  • Random read is 2.6 times faster than C2;
  • Random write is an impressive 5.3 times faster than C2; this is mind-blowing and was triple-checked;
  • Perhaps most impressive, C3 always delivers comparable throughput between random read and write, while both C2 and C2D can show up to a roughly 50% variation;
  • Mixed read/write scores 3.3 to 4.6 times faster than C2. Beyond the pure throughput, it is quite remarkable that the former C2/C2D generation cannot even handle more than 25,000 combined Read and Write IOPS, while C3 always goes beyond 80,000 combined IOPS;
  • Read latency is about 6% better, while write latency achieves a respectable 25% improvement, both still against C2;
  • C3 tends to deliver consistent latency results and a lower run-to-run variation than both C2 and C2D.

It is worth pointing out that the results have been repeatedly obtained across several runs and over multiple days. Furthermore, the default Speculative Execution mitigation behavior was left in place. This is quite important given the amount of context switching involved in this type of test.
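If you want to verify the mitigation state on your own instances, the kernel exposes it directly; a quick check, assuming any recent Linux guest:

```bash
# Per-vulnerability mitigation status as reported by the kernel.
grep . /sys/devices/system/cpu/vulnerabilities/*
# No "mitigations=off" on the command line means the kernel defaults apply.
cat /proc/cmdline
```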

To put these numbers in perspective:

FIO IOPs and latency numbers for c2 vs. c2d vs. c3

Multiple PD IOPS and Bandwidth Numbers

Again, the system topology follows:

C3 topology overview with multiple PDs mapped on the vNUMA0

Once again, here are all the storage numbers:

The picture doesn’t change much from the already solid improvements observed in the single PD test case:

  • Sequential performance degrades with C2 while it improves with C3;
  • Read latency stays roughly constant on C2 while dropping considerably on C3, which now achieves a 47% improvement;
  • Write latency generally degrades, but C2 and C2D cross the half-millisecond mark, which is not pretty. C3 stays around 400 μs, and its lead over C2 remains at around 25%.

Quite interesting to notice that:

  • Even a single PD can consume all available I/O resources;
  • C2 and C2D share a common I/O pattern, probably pointing to a comparable underlying storage infrastructure (well, perhaps already clear from the name 😁).

While Intel Sapphire Rapids is a step ahead in regard to pure IPC performance, what Google was able to pull off with the E2000 and the IPU infrastructure is nothing short of spectacular. Any I/O-bound workload, combined with the superior C3 IPC, will see tangible improvements. I can also see the opposing view that Google is simply relaxing the I/O caps. Assuming that were true, I don’t see how the latency could go down, nor how such massive IOPS numbers would be sustainable in the long run. I truly believe the IPU architecture unlocks resources that were suboptimally exposed up until the C2 architecture.

If you want to test the storage I/O performance on your own, you’re free to take my fio config file (also available in the Google Sheet) or look at the official GCP documentation.
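For the multi-PD scenario specifically, a single fio job can span several devices; a minimal sketch is shown below, where /dev/sdb through /dev/sdk are assumed device names (verify with lsblk, and remember that write jobs on raw devices are destructive).

```bash
# One 4k random-read job spread across ten PD SSDs via fio's colon-separated filename list.
cat > multi-pd.fio <<'EOF'
[global]
ioengine=libaio
direct=1
bs=4k
iodepth=256
time_based
runtime=120
group_reporting

[randread-10pd]
rw=randread
filename=/dev/sdb:/dev/sdc:/dev/sdd:/dev/sde:/dev/sdf:/dev/sdg:/dev/sdh:/dev/sdi:/dev/sdj:/dev/sdk
EOF

fio multi-pd.fio
```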

MPI, what’s that?

The Message Passing Interface (MPI) is an open library standard for distributed memory parallelization. The library API specification is available for C and Fortran and the first standard document was released in 1994. MPI has become the de-facto standard to program HPC cluster systems, allowing for the development of portable parallel programs for a variety of parallel systems ranging from small shared memory nodes to petascale cluster systems. Popular implementations include OpenMPI, MPICH, and Intel MPI (now under the OneAPI umbrella).

Intel MPI is the recommended MPI implementation for GCP, hence it was the one used for this exercise. From my (admittedly limited) testing with Open MPI, AMD Milan (C2D) doesn’t suffer any particular performance degradation with the Intel libraries.

Intel has developed a set of MPI benchmarks that can help us figure out network latency and throughput. Specifically, PingPong sends many messages between two endpoints at a variable size and calculates the respective latency and achieved bandwidth. To stress the IPU infrastructure, this test was executed inter-node.

PingPong pattern
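A minimal inter-node PingPong launch with Intel MPI looks roughly like this; the hostnames and the oneAPI install path are placeholders, not the exact cluster layout described below.

```bash
# Load the Intel MPI environment (default oneAPI location) and run PingPong
# with one rank per node so that traffic crosses the network / IPU.
source /opt/intel/oneapi/setvars.sh
mpirun -n 2 -ppn 1 -hosts mpi-node-1,mpi-node-2 IMB-MPI1 PingPong
```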

MPI Cluster Setup

The MPI Cluster is quite straightforward: a single controller node and two MPI worker nodes.

The controller node acts as a jump host, NFS server, and the place where all the MPI commands are executed. Since I did not want to set up a Slurm cluster for running PingPong, everything was statically deployed.

Two MPI nodes are configured with all the best practices for running tightly coupled HPC applications, namely:

  • Running Rocky Linux 9.1 Optimized for GCP (latest image rocky-linux-9-optimized-gcp-v20230411);
  • All latest updates and kernel (5.14.0–162.18.1.el9_1.cloud.x86_64) applied;
  • Both nodes use the instrumental Compact Placement Policy to ensure the closest possible proximity and reduce network latency. During my testing, I paid special attention to ensuring that both instances would land on the same physical node! This is the best possible condition for analyzing the quality of the work done by the IPU/E2000, excluding everything else from the mix;
  • The nodes are connected to a dedicated VPC with Jumbo Frames (MTU 8896) enabled;
  • SMT disabled;
  • SELinux disabled;
  • IPTables / NFTables disabled;
  • Speculative Execution mitigations disabled;
  • Running with gVNIC and using the latest gVNIC drivers (version 1.4.0rc1);
  • The tuned profile used was hpc-compute, although network-latency or network-throughput may have yielded better results. Something I did try was running cpu-partitioning together with taskset in the mpirun CLI, but the performance was worse than with the generic hpc-compute profile (a sketch of this node preparation follows the list);
  • The testing was carried out on c2-standard-8, c2d-standard-8, and c3-highcpu-8 instances. This gives four physical cores to work with;
  • The MPI libraries selected were Intel MPI 2021.9.0 and the recommended 2018.4.274. In my personal experience Open MPI performs quite poorly on GCP, hence it was excluded;
  • I found that for Intel MPI 2021.9 the most efficient fabric was ofi, while for 2018.4 it was shm:tcp;
  • Perhaps a bit of a controversial point, I chose not to run the MPI tuning that generates tunables for every combination in the MPI cluster. As someone who is not an HPC expert, I find this approach overly specific and difficult to implement in each application. Therefore, I intentionally excluded it from this exercise.
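Translated into commands, the node preparation above amounts to something like the following; a sketch assuming Rocky Linux 9 and a gVNIC interface named ens4 (adjust to your environment), with the fabric choice matching what worked best in these tests.

```bash
# Guest-side tuning applied to both MPI worker nodes.
sudo tuned-adm profile hpc-compute                         # profile used for the runs
sudo setenforce 0                                          # SELinux off for the session
sudo systemctl disable --now firewalld nftables            # no firewall in the data path
sudo grubby --update-kernel=ALL --args="mitigations=off"   # spec-exec mitigations off (reboot needed)
ethtool -i ens4 | head -n 2                                # expect the gve (gVNIC) driver
ip link show ens4 | grep -o 'mtu [0-9]*'                   # expect mtu 8896 from the Jumbo Frames VPC
# SMT is disabled at instance-creation time (gcloud --threads-per-core=1), not inside the guest.

# Fabric selection that proved most efficient for each Intel MPI release:
export I_MPI_FABRICS=ofi          # Intel MPI 2021.9
# export I_MPI_FABRICS=shm:tcp    # Intel MPI 2018.4
```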

The high-level topology follows:

MPI Latency and Throughput Numbers

As mentioned in the introduction, the MPI Benchmarks allow the collection of both latency and bandwidth results. As usual, all findings and scripts are collected in a Google Sheet:

Screenshot while running IMB-MPI1 Allreduce and monitoring the nodes’ performance through top

Looking at the numbers for Intel MPI 2021.9, the following tables should be self-explanatory: the first column represents the message size in bytes, the second the number of repetitions, the next three the latency in μs for C2, C2D, and C3, and the last three the bandwidth in MB/s, again for C2, C2D, and C3.

  • We can immediately see that, at all small message sizes, C3 performs significantly better than C2, while C2D delivers poor results;
  • At large message sizes it’s the same story: the throughput of C3 is superior to both C2 and C2D;
  • At 64 bytes, C3 is 48% faster than C2 and a whopping 3.86 times faster than C2D (the Open MPI numbers were even worse for C2D).

Let’s see this also in two graphs:

Intel MPI 2021.9 — C2 vs. C2d vs. C3 where the left axis is the latency and the right axis the bandwidth
Intel MPI 2021.9 — C2 vs. C3 where the left axis is the latency and the right axis the bandwidth

Moving to Intel MPI 2018, the story is pretty much the same:

  • C3 has a smaller lead over C2, but it is still large enough to warrant the switch to the newer solution;
  • C2D (triple-checked) has really bad latency results; I witnessed this first-hand last year on a customer project where both C2 and C2D were compared.
Intel MPI 2018.4 — C2 vs. C2d vs. C3 where the left axis is the latency and the right axis the bandwidth
Intel MPI 2018.4 — C2 vs. C3 where the left axis is the latency and the right axis the bandwidth

Conclusions

This was a fun and exciting exercise. It started with SPEC CPU 2017 and later evolved into more general benchmarks. Something (hopefully) clear from this deep dive is that the pure C2D IPC performance is outclassed by the whole package available with C3, which is a mix of the newest Intel CPU and the IPU architecture. The E2000 is truly a game-changer.

Hopefully, soon I’ll do something like SPEC CPU but using MLPerf. Meanwhile, stay safe!
