A performance tuning tale: Optimizing SMXV (sparse Matrix-Vector-Multiplication) on Windows [part 1 of 2]

Since a while I am involved in several teaching activities on parallel programming and in my humble opinion this also includes talking about parallel computer architectures. As I am usually responsible for Shared-Memory parallel programming with OpenMP and TBB and the like, examples and exercises include learning about and tuning for the recent multi-core architectures we are using, namely Opteron-based and Xeon-based multi-socket systems. Well, understanding the perils of Shared-Memory parallel programming is not easy, but my impression is that several students are challenged when they are asked to carry the usual obstacles of parallel programming (e.g. load imbalance) forward to the context of different systems (e.g. UMA versus cc-NUMA). So this blog post has two goals: Examine and tune a sparse Matrix-Vector-Multiplication (SMXV) kernel on several architectures with (1) putting my oral explanations into text as a brief reference and (2) showing that one can do all the analysis and tuning work on Windows as well.

From school you probably know how to do a Matrix-Vector-Multiplication for dense matrices. In the field of high performance technical computing, you typically have to deal with sparse linear algebra (unless you do a LINPACK 🙂 benchmark). In my example, the matrix is stored in CRS format and has the following structure:

Matrix Structure Plot: DROPS.

The CRS format stores just the nonzero elements of the matrix in three vectors: The val-vector contains the values of all nonzero elements, the col-vector has the same dimension as the val-vector and contains the column indices for each nonzero element, the row-vector is of the same length as there are rows in the matrix (+1) and points to the first nonzero element index (in val and col) for each matrix row. While there a several different format to save sparse matrices, the CRS format is well-suited for matrices without special properties and allows for an efficient implementation of the Matrix-Vector-Multiplication kernel. The intuitive approach for a parallel SMXV kernel may look as shown below. Let Aval, Acol and Arow be the vector-based implementations of val, col and row:

01  #pragma omp parallel for private(sum, rowend, rowbeg, nz)
02     for (i = 0; i < num_rows; i++){
03        sum = 0.0;
04        rowend = Arow[i+1];
05        rowbeg = Arow[i];
06        for (nz=rowbeg; nz<rowend; ++nz)
07        {
08           sum += Aval[nz] * x[Acol[nz]];
09        }
10        y[i] = sum;
11     }

How good is this parallelization for the matrix as shown above? Lets take a look at a two-socket quad-core Intel Xeon E5450-based system (3.0 GHz), Below, I am plotting the performance in MFLOP/s for one to eight threads using just the plain Debug configuration of Visual Studio 2008 in which OpenMP has been enabled:

Intuitive Parallelization.
Performance plot of a parallel SMXV: Intuitive Parallelization.

The speedup for two threads (about 1.7) is not too bad, but the best speedup of just 2.1 is achieved with eight threads. It does not pay off significantly to use more than four threads. This is because the Frontside Bus has an insuperable limit of about eight GB/s in total and using dedicated memory bandwidth benchmarks (e.g. STREAM) one can see that this limit can already be reached with four threads (sometimes even using just two threads). Since we are working with a sparse matrix, most accesses are quasi-random and neither the hardware prefetcher nor the compiler inserting prefetch instructions can help us any more.

In many cases, thread binding can be of some help to improve the performance. The result of thread binding is also shown as Debug w/ “scatter” binding – using this approach the threads are distributed over the machine as far away from each other as possible. For example with two threads, each thread is running on a separate socket. This strategy has the advantage of using the maximal possible cache size, but does not improve the performance significantly for this application (or: Windows is already doing a similarly good job with respect to thread binding). Nevertheless, I will use the scattered thread binding strategy in all following measurements. Now, what can we do? Let’s try compiler optimization:

Compiler Optimization.
Performance plot of a parallel SMXV: Compiler Optimization.

Switching to the Release configuration does not require any work from the user, but results in a pretty nice performance improvement. I usually enabled architecture-specific optimization as well (e.g. SSE-support is enabled in the ReleaseOpt configuration), but that does not result in any further performance improvement for this memory-bound application / benchmark. Anyway, as the compiler has optimized our code for example with respect to cache utilization, this also increases the performance when using more than one thread!

In sequential execution (aka with one thread only) we get about 570 MFLOP/s. This is only a small fraction of the peak performance one core could deliver theoretically (1 core * 3 GHz * 4 instructions/sec = 12 GFLOP/s), but this is what you have to live with given the gap between CPU speed and memory speed. In order to improve the sequential performance, we would have to examine the matrix access pattern and re-arrange / optimize this with respect to the given cache hierarchy. But for now, I would rather like to think about the parallelization again: When you look at the matrix structure plot above, you will find that the density of nonzero elements is decreasing with the matrix rows counting. Our parallelization did not respect this, so we should expect to have a load imbalance limiting our parallelization. I used the Intel Thread Profiler (available on Windows as well as on Linux) to visualize this:

Load Imbalance with SMXV.
Intel Thread Profiler: Load Imbalance with SMXV.

The default for-loop scheduling in OpenMP is static (well, on all implementations I know), thus the iteration space is divided into as many chunks as we have threads, all of approximately equal size. So the first thread (T1 in the image above) gets the part of the matrix containing the more dense rows, thus it has more work to do than the other threads. Note: The reason why the Thread Profiler claims the threads two to four have “Barrier”-overhead instead of “Imbalance”-overhead is caused by my benchmark kernel, which looks slightly different than the code snippet above, but let’s ignore that differentiation here.

So, what can we do about it? Right, OpenMP allows for pretty easy and efficient ways of influencing the for-loop scheduling strategy. We just have to extend the line 01 of the code snippet above to look like this:

01 #pragma omp parallel for private(sum, rowend, rowbeg, nz) schedule(guided)

With guided scheduling, the initial chunks have an implementation-specific size which is decreased exponentially down to the chunksize specified, or 1 in our case. For the matrix with a structure as shown above, this results in a good load balance. So this is the performance we get including all optimization we discussed so far:

Load Balancing.
Performance plot of a parallel SMXV: Load Balancing.

We started with an non-optimized serial code delivering about 350 MFLOP/s and finished with a parallel code delivering about 1000 MFLOP/s! This is still far away from a linear scaling, but this is what you see in reality with complex (aka memory-bound) applications. Regarding these results, please note the following:

  • We did not apply any dataset-specific optimization. That means if the matrix structure changes (which it does over the time of a program run in the application I took this benchmark from) we will still do well and not run into any new load balance. This is clearly an advantage of OpenMP over manual threading!
  • We did not apply any architecture-specific optimization. This code will deliver a reasonable performance on most machines. But we did not yet take a look at cc-NUMA machines (e.g AMD Opteron-based or Intel Nehalem-based systems), this will be done in part 2. On a cc-NUMA system, there is a lot of performance to win or to loose, depending on if you are doing everything right or making a mistake.
  • Was anything in here OS-specific? No, it wasn’t. I did the experiments on Windows, but could have done everything on Linux in exactly the same way. More on this in the next post as well…