
1.5 hours for an Introduction to Parallel Programming

Whenever Prof. Christian Bischof, the head of our institute, is on duty to give the Introduction to Programming (de) lecture for first-year Computer Science students, he is keen on giving the students a glimpse of parallel programming. As in 2006, I was the guest lecturer to whom this task was assigned. While dealing with parallelism in its various aspects consumes most of my work day, these students have just started to learn Java as their first programming language. As in 2006, I worried about how to motivate the students, what level of detail would be reasonable, and which tools and techniques to present within a timeframe of just 1.5 hours. In the following paragraphs I briefly explain what I did, and why. The slides used in the lecture are available online: Introduction to Parallel Programming; and my student Christoph Rackwitz made a screencast of the lecture available here (although the slides are in English, I am speaking German).

Programming Language: As stated above, the target audience is first-year Computer Science students attending the Introduction to Programming course. The programming language taught in the course is Java. In a previous lecture we once tried to present the examples and exercises in C, assuming that C is very similar to Java, but the students did not like that very much. Although they were able to follow the lecture and were mostly successful in the exercises, C just felt foreign to most of them. Furthermore, C is not widely used in later courses, except for System Programming. The problem with Java, however, is that it is not commonly used in technical computing, and the native approach to parallel programming in Java is oriented more towards building concurrent (business) applications than towards reasoning about parallel algorithms. Despite this issue we decided to use Java in the Introduction to Parallel Programming lecture in order to keep the students comfortable, and not to disturb the example and exercise environment already provided for them. The overall goal of the lecture was to give the students an idea of the fundamental change towards parallelism, to explain the basic concepts, and to motivate them to develop an interest in this topic. We considered this to be independent of the programming language.

Parallelization Model: The two dominant models are Shared-Memory parallelization and Message-Passing for clusters. It would be great to teach both, and of course we do that in advanced courses, but I do not think it is reasonable to cover both in an introductory session. The trend towards multi-core and many-core processors is the obvious foundation for motivating the growing need for parallel programming in the first place. Given that, and the requirement to allow the students to work on examples and exercises on their systems at home, we decided to discuss multicore architectures, present one model for Shared-Memory parallel programming in detail, and just provide an overview of what a cluster is. Furthermore, we hoped that the ability to speed up the example programs and experience parallelism on their very own desktops or laptops would add some motivation; this feels more real than logging in to a remote system in our cluster. In addition, providing instructions to set up a Shared-Memory parallelization tool on a student's laptop was expected to be simpler than for a Message-Passing environment (this turned out to be true).

Parallelization Paradigm: Given our choice to cover Shared-Memory parallelization, and the requirement to use Java and to provide a suitable environment to work on examples and exercises, we basically had three choices: (i) Java threads, (ii) OpenMP for Java, and (iii) Parallel Java (PJ) – maybe we could have looked at some other, more obscure paradigms as well, but I do not think they would have contributed any new aspects. In essence, Java threads are similar to Posix-Threads and Win32-Threads: well-suited for building server-type programs, but not well-suited for parallelizing algorithms or for introductory courses. Using this model, you first have to talk about setting up threads and implementing synchronization before you can start to think parallel ;-). I like OpenMP a lot for this purpose, but there is no official standard of OpenMP for Java. We looked at two implementations:

  1. JOMP, by the Edinburgh Parallel Computing Center (EPCC). To our knowledge, this was the first implementation of OpenMP for Java. It comes as a preprocessor and is easy to use. But development stopped long ago, and it does not work well with Java 1.4 and later.
  2. JaMP, by the University of Erlangen. This implementation is based on the Eclipse compiler and even extends Java for OpenMP with more constructs than the original standard, while still not providing full support for OpenMP 2.5. Anyway, it worked fine with Java 1.6 and was easy to install and distribute among the students, so we used it in the lecture.

Parallel Java (short: PJ), by Alan Kaminsky at the Rochester Institute of Technology, also provides means for Shared-Memory parallelization, but it is primarily oriented towards Message-Passing. Since it provides a very nice and simplified MPI-style API, we would have used it had we included cluster programming; but sticking to Shared-Memory parallelization, we went with JaMP.

Content: What should be covered in just 1.5 hours? Well, of course we needed to start with a motivation of why parallel programming will become more and more important in the future. We also explained why the industry is shifting towards multicore architectures, and what implications this will or may have. As explained above, the largest part of the lecture was spent on OpenMP for Java along with some examples. We started with a brief introduction on how to use JaMP and what OpenMP programs look like, then covered Worksharing and Data Scoping with several examples. I think experiencing a Data Race first-hand is something every parallel programmer should have gone through :-), as is learning about reductions. That was about it for the OpenMP part. The last minutes of the lecture were spent on clusters and the principal ideas behind them, followed by a summary.

Given the constraints and our reasoning outlined above, we ended up using Java as the programming language and JaMP as the paradigm to teach Shared-Memory parallelization, just mentioning that clusters exist as well. Although the official course evaluation is not done yet, we got pretty positive feedback regarding the lecture itself, and the exercises were well-received. What unnerves me is the fact that there is no real OpenMP for Java. The Erlangen team provided a good implementation along with a compiler to serve our examples and exercises, but it does not offer full OpenMP 2.5 support, not to speak of OpenMP 3.0. Having a complete OpenMP for Java implementation at hand would be a very valuable tool for teaching parallel programming to first-year students, since Java is the language of choice not only at RWTH Aachen University.

Do you have other opinions, experiences, or ideas? I am always up for a discussion.

A performance tuning tale: Optimizing SMXV (sparse Matrix-Vector-Multiplication) on Windows [part 1.5 of 2]

Although it is high time to deliver the second part of this blog post series, I decided to squeeze in one additional post, which I named part “1.5”, as it covers some experiments with SMXV in C#. Since I am currently preparing a lecture named Multi-Threading for Desktop Systems (it will be held in German, though) in which C# plays an important role, we took a closer look into how parallelism has made its way into the .NET framework versions 3.5 and 4.0. The final post will then cover some more tools and performance experiments (especially regarding cc-NUMA architectures) with the focus back on native coding.

First, let us briefly recap how the SMXV was implemented and examine what this can look like in C#. As explained in my previous post, the CRS format stores just the nonzero elements of the matrix in three vectors: the val-vector contains the values of all nonzero elements, the col-vector contains the column indices for each nonzero element, and the row-vector points to the first nonzero element index (in val and col) for each matrix row. Having one class to represent a CRS matrix and using an array of doubles to represent a vector, the SMXV operation encapsulated by operator* can be implemented as follows, independent of whether you use managed or unmanaged arrays:

public static double[] operator *(matrix_crs lhs, double[] rhs)
{
   double[] result = new double[lhs.getNumRows()];
   for (long i = 0; i < lhs.getNumRows(); ++i)
   {
      double sum = 0;
      long rowbeg = lhs.row(i);
      long rowend = lhs.row(i + 1);
      for (long nz = rowbeg; nz < rowend; ++nz)
         sum += lhs.val(nz) * rhs[ lhs.col(nz) ];
      result[i] = sum;
   }
   return result;
}
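
For reference, here is a minimal sketch of what such a CRS container could look like (this is my illustration, not the actual class from our experiment framework; the field layout and accessor names are simply derived from the calls used above):

//Hypothetical sketch of the CRS container; only the pieces the operator relies on
//are shown. The operator* from above would live inside this class.
public class matrix_crs
{
   private double[] _val;  //values of the nonzero elements
   private long[]   _col;  //column index of each nonzero element
   private long[]   _row;  //index of the first nonzero of each row (length: rows + 1)

   public matrix_crs(double[] val, long[] col, long[] row)
   {
      _val = val; _col = col; _row = row;
   }

   public long getNumRows()   { return _row.Length - 1; }
   public double val(long nz) { return _val[nz]; }
   public long col(long nz)   { return _col[nz]; }
   public long row(long i)    { return _row[i]; }

   //Used by the threaded variants further below; defaults to the number of cores.
   public int NumThreads = System.Environment.ProcessorCount;
}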

We have several options to parallelize this code, which I will present and briefly discuss in the rest of this post.

Threading. In this approach, the programmer is responsible for managing the threads and distributing the work onto the threads. It is not too hard to implement a static work-distribution for any given number of threads, but implementing a dynamic or adaptive work-distribution is a lot of work and also error-prone. In order to implement the static approach, we need an array of threads, have to compute the iteration chunk for each thread, put the threads to work and finally wait for the threads to finish their computation.

//Allocate the result vector and one thread object per chunk of work:
double[] result = new double[lhs.getNumRows()];
Thread[] threads = new Thread[lhs.NumThreads];
//Compute the chunk size:
long chunkSize = lhs.getNumRows() / lhs.NumThreads;
//Start threads with their respective chunks:
for (int t = 0; t < threads.Length; ++t)
{
   threads[t] = new Thread(delegate(object o)
   {
      int thread = (int)o;
      long firstRow = thread * chunkSize;
      long lastRow = (thread + 1) * chunkSize;
      if (thread == lhs.NumThreads - 1) lastRow = lhs.getNumRows();
      for (long i = firstRow; i < lastRow; ++i)
      { /* ... SMXV ... */ }
   });
   //Start the thread and pass the ID:
   threads[t].Start(t);
}
//Wait for all threads to complete:
for (int t = 0; t < threads.Length; ++t) threads[t].Join();
return result;

Instead of managing the threads on our own, we could use the thread pool of the runtime system. From a usage point of view this is largely equivalent to the version shown above, so I will not discuss it in detail; a rough sketch is shown below.
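
Just to make the idea concrete, here is a minimal sketch of the thread-pool variant (my illustration, not the code used in the measurements; it assumes the same matrix_crs accessors and the same static chunking as above). Since there is no Join() for pool threads, a shared counter plus an event signals completion:

double[] result = new double[lhs.getNumRows()];
long chunkSize = lhs.getNumRows() / lhs.NumThreads;
int pending = lhs.NumThreads;
using (ManualResetEvent done = new ManualResetEvent(false))
{
   for (int t = 0; t < lhs.NumThreads; ++t)
   {
      //Queue one work item per chunk onto the runtime's thread pool:
      ThreadPool.QueueUserWorkItem(delegate(object o)
      {
         int chunk = (int)o;
         long firstRow = chunk * chunkSize;
         long lastRow = (chunk == lhs.NumThreads - 1) ? lhs.getNumRows()
                                                      : (chunk + 1) * chunkSize;
         for (long i = firstRow; i < lastRow; ++i)
         { /* ... SMXV ... */ }
         //The last chunk to finish signals the event:
         if (Interlocked.Decrement(ref pending) == 0) done.Set();
      }, t);
   }
   //Wait for all queued work items to complete:
   done.WaitOne();
}
return result;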

Tasks. The problem with the approach discussed above is the static work-distribution, which may lead to load imbalances; implementing a dynamic work-distribution is error-prone and, depending on the code, may also be a lot of work. The goal should be to split the workload into smaller packages, but doing this with threads is not optimal: threads are quite costly in the sense that creating or destroying a thread takes a lot of time (in computer terms) since the OS is involved, and threads also need some amount of memory. A solution to this problem is tasks. Well, tasks are quite “in” nowadays, with many people thinking about how to program multicore systems, and therefore there are many definitions of what a task really is. I have given mine in previous posts on OpenMP and repeat it here briefly: a task is a small package consisting of some code to execute and some private data (access to shared data is possible, of course) which the runtime schedules for execution by a team of threads. It is actually pretty simple to parallelize the code from above using tasks: we have to manage a list of tasks and decide how much work a task should do (in terms of matrix rows), and of course we have to create and start the tasks and finally wait for them to finish. See below:

//Allocate the result vector and a list to hold the tasks:
double[] result = new double[lhs.getNumRows()];
List<Task> taskList = new List<Task>();
//Set the size of the tasks (matrix rows per task):
int chunkSize = 1000;
//Create the tasks that calculate the parts of the result:
for (long i = 0; i < lhs.getNumRows(); i += chunkSize)
{
   taskList.Add(Task.Create(delegate(object o)
   {
      long chunkStart = (long)o;
      for (long index = chunkStart;
           index < System.Math.Min(chunkStart + chunkSize, lhs.getNumRows()); index++)
      { /* ... SMXV ... */ }
   }, i));
}
//Wait for all tasks to finish:
Task.WaitAll(taskList.ToArray());
return result;
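
The chunk size is the one tuning knob left to us here: with very small chunks the per-task scheduling overhead starts to dominate, while with very large chunks we drift back towards the load-imbalance problem of the static distribution. The value of 1000 rows per task used above is just one reasonable choice; the best value is of course matrix- and machine-dependent.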

Using the TPL. The downside of the approaches discussed so far is that we (= the programmer) have to distribute the work manually. In OpenMP, this is done by the compiler + runtime – at least when Worksharing constructs can be employed, which is what one would use for for-loops. With the upcoming .NET Framework version 4.0 there will be something similar (but not as powerful) available for C#: the Parallel class allows for the parallelization of for-loops when certain conditions are fulfilled (always think about possible Data Races!). Using it is pretty simple thanks to the support for delegates / lambda expressions in C#, as you can see below:

Parallel.For(0, (int)lhs.getNumRows(), delegate(int i)
{
   /* ... SMXV ... */
});
return result;

Nice? I certainly like this! It is very similar to Worksharing in the sense that you instrument your code with further knowledge to (incrementally) add parallelization, and it is also nicely integrated into the core language (which OpenMP is not). But note that this Worksharing-like functionality differs from OpenMP in certain important aspects:

  • Tasks are used implicitly. There is a significant difference between using tasks underneath to implement this parallel for-loop and Worksharing in OpenMP: Worksharing uses explicit threads that can be bound to cores / NUMA nodes, while tasks are scheduled onto threads by the runtime system. Performance will be discussed in my next blog post, but tasks can easily be moved between NUMA nodes, and that can really spoil your performance. OpenMP has no built-in support for affinity, but the tricks for dealing with Worksharing on cc-NUMA architectures are well known.
  • The runtime system has full control. To my current knowledge, there is no reliable way of influencing how many threads will be used to execute the implicit tasks. Even more: I think this is by design. While it is probably nice for many users and applications when the runtime figures out how many threads should be used, this is bad for the well-educated programmer, who often knows more about the application (its data access patterns, for instance) than the compiler + runtime could ever figure out. If you want to fine-tune this parallelization, you have hardly any options (note: this is still beta and the options may change until .NET 4.0 is released); in OpenMP, you can influence the work-distribution in many ways. One thing that does remain in your hands is the granularity of the loop, as shown in the sketch after this list.
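
If the per-row delegate overhead or the row-by-row distribution turns out to be a problem, nothing stops you from chunking the iteration space yourself and handing the chunks to Parallel.For. A minimal sketch (my illustration, assuming the same matrix_crs accessors as above and a hypothetical chunk size of 1000 rows):

int chunkSize = 1000;
int numChunks = (int)((lhs.getNumRows() + chunkSize - 1) / chunkSize);
Parallel.For(0, numChunks, delegate(int chunk)
{
   long firstRow = (long)chunk * chunkSize;
   long lastRow = System.Math.Min(firstRow + chunkSize, lhs.getNumRows());
   for (long i = firstRow; i < lastRow; ++i)
   { /* ... SMXV ... */ }
});
return result;

The work-distribution across the chunks is still up to the runtime, but at least the amount of work per scheduled item is now under our control.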

PLINQ. LINQ stands for language-integrated query and allows for declarative data access. When I first heard about this technology, it was demonstrated in the context of data access, and I found it interesting but not closely related to the kind of parallelism I am interested in. Well, it turns out that PLINQ (the P stands for parallel) can be used to parallelize an SMXV code as well (the matrix_crs class has to implement the IEnumerable / IParallelEnumerable interfaces):

public static double[] operator *(matrix_crs_plinq lhs, double[] rhs)
{
   var res = from rowIndices in lhs.AsParallel().AsOrdered()
             select RowSum(rowIndices, lhs, rhs);
   double[] result = res.ToArray();
   return result;
}

public static double RowSum(long[] rowIndices, matrix_crs_plinq lhs, double[] rhs)
{
   double rowSum = 0;
   for (long i = rowIndices[0]; i < rowIndices[1]; i++)
      rowSum += lhs.val(i) * rhs[lhs.col(i)];
   return rowSum;
}

Did you notice the AsParallel() in there? That is all you have to do once the required interfaces have been implemented. Would I recommend using PLINQ for this type of code? No, it is meant to parallelize queries on object collections and more general data sources (think of databases). But (for me at least) it is certainly interesting to see this paradigm applied to a code snippet from the scientific-technical world. As PLINQ uses the TPL internally, you will probably run into the same locality issues, although I have not looked into this too closely yet.

Let me give credit to Ashwani Mehlem, who is one of the student workers in our group. He did some of the implementation work (especially the PLINQ version) and code maintenance of the experiment framework.

A performance tuning tale: Optimizing SMXV (sparse Matrix-Vector-Multiplication) on Windows [part 1 of 2]

For a while now I have been involved in several teaching activities on parallel programming, and in my humble opinion this also includes talking about parallel computer architectures. As I am usually responsible for Shared-Memory parallel programming with OpenMP, TBB and the like, the examples and exercises include learning about and tuning for the recent multi-core architectures we are using, namely Opteron-based and Xeon-based multi-socket systems. Understanding the perils of Shared-Memory parallel programming is not easy, and my impression is that several students struggle when asked to carry the usual obstacles of parallel programming (e.g. load imbalance) over to the context of different systems (e.g. UMA versus cc-NUMA). So this blog post has two goals: examine and tune a sparse Matrix-Vector-Multiplication (SMXV) kernel on several architectures, (1) putting my oral explanations into text as a brief reference and (2) showing that all the analysis and tuning work can be done on Windows as well.

From school you probably know how to do a Matrix-Vector-Multiplication for dense matrices. In the field of high performance technical computing, you typically have to deal with sparse linear algebra instead (unless you are running a LINPACK benchmark :-)). In my example, the matrix is stored in CRS format and has the following structure:


Matrix Structure Plot: DROPS.

The CRS format stores just the nonzero elements of the matrix in three vectors: the val-vector contains the values of all nonzero elements, the col-vector has the same length as the val-vector and contains the column index of each nonzero element, and the row-vector has one entry per matrix row (plus one) and points to the first nonzero element index (in val and col) of each row. While there are several different formats for storing sparse matrices, the CRS format is well-suited for matrices without special properties and allows for an efficient implementation of the Matrix-Vector-Multiplication kernel. The intuitive approach for a parallel SMXV kernel may look as shown below. Let Aval, Acol and Arow be the vector-based implementations of val, col and row:

01  #pragma omp parallel for private(sum, rowend, rowbeg, nz)
02     for (i = 0; i < num_rows; i++){
03        sum = 0.0;
04        rowend = Arow[i+1];
05        rowbeg = Arow[i];
06        for (nz=rowbeg; nz<rowend; ++nz)
07        {
08           sum += Aval[nz] * x[Acol[nz]];
09        }
10        y[i] = sum;
11     }

How good is this parallelization for the matrix shown above? Let's take a look at a two-socket quad-core Intel Xeon E5450-based system (3.0 GHz). Below, I am plotting the performance in MFLOP/s for one to eight threads, using just the plain Debug configuration of Visual Studio 2008 with OpenMP enabled:

Performance plot of a parallel SMXV: Intuitive Parallelization.

The speedup for two threads (about 1.7) is not too bad, but the best speedup, of just 2.1, is achieved with eight threads; it does not pay off significantly to use more than four threads. This is because the Frontside Bus has a hard limit of about 8 GB/s in total, and dedicated memory bandwidth benchmarks (e.g. STREAM) show that this limit can already be reached with four threads (sometimes even with just two). Since we are working with a sparse matrix, most accesses are quasi-random, and neither the hardware prefetcher nor compiler-inserted prefetch instructions can help us much anymore.
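
A rough back-of-the-envelope estimate shows why the bus limit bites so early (my own estimate, assuming 8-byte values and 4-byte column indices, and ignoring the accesses to x, y and the row-vector): every nonzero element contributes two floating-point operations (one multiply, one add) but requires at least 12 bytes to be streamed from memory, so an 8 GB/s bus caps the kernel at roughly 8 GB/s / 12 bytes * 2 flops ≈ 1.3 GFLOP/s, no matter how many cores are computing.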

In many cases, thread binding can help to improve the performance. The result of thread binding is also shown as Debug w/ “scatter” binding – with this approach the threads are distributed over the machine as far away from each other as possible; with two threads, for example, each thread runs on a separate socket. This strategy has the advantage of using the maximum possible aggregate cache size, but it does not improve the performance significantly for this application (or: Windows is already doing a similarly good job with respect to thread placement). Nevertheless, I will use the scattered thread binding strategy in all following measurements. Now, what can we do? Let's try compiler optimization:

Performance plot of a parallel SMXV: Compiler Optimization.

Switching to the Release configuration does not require any work from the user, but results in a pretty nice performance improvement. I usually enable architecture-specific optimization as well (e.g. SSE support is enabled in the ReleaseOpt configuration), but that does not result in any further performance improvement for this memory-bound application / benchmark. Anyway, as the compiler has optimized our code, for example with respect to cache utilization, this also increases the performance when using more than one thread!

In sequential execution (that is, with one thread only) we get about 570 MFLOP/s. This is only a small fraction of the peak performance one core could deliver theoretically (1 core * 3 GHz * 4 flops per cycle = 12 GFLOP/s), but this is what you have to live with given the gap between CPU speed and memory speed. In order to improve the sequential performance, we would have to examine the matrix access pattern and re-arrange / optimize it with respect to the given cache hierarchy. But for now, I would rather think about the parallelization again: when you look at the matrix structure plot above, you will find that the density of nonzero elements decreases with increasing row number. Our parallelization did not respect this, so we should expect a load imbalance to limit our scalability. I used the Intel Thread Profiler (available on Windows as well as on Linux) to visualize this:

Intel Thread Profiler: Load Imbalance with SMXV.

The default for-loop scheduling in OpenMP is static (well, on all implementations I know of), so the iteration space is divided into as many chunks as there are threads, all of approximately equal size. The first thread (T1 in the image above) therefore gets the part of the matrix containing the denser rows and thus has more work to do than the other threads. Note: the reason the Thread Profiler attributes “Barrier” overhead instead of “Imbalance” overhead to threads two to four lies in my benchmark kernel, which looks slightly different from the code snippet above, but let's ignore that distinction here.

So, what can we do about it? Right, OpenMP offers pretty easy and efficient ways of influencing the for-loop scheduling strategy. We just have to extend line 01 of the code snippet above to look like this:

01 #pragma omp parallel for private(sum, rowend, rowbeg, nz) schedule(guided)

With guided scheduling, the initial chunks have an implementation-specific size which decreases exponentially down to the specified chunk size – or down to 1 in our case, since we did not specify one. For a matrix with a structure as shown above, this results in a good load balance. So this is the performance we get with all the optimizations we have discussed so far:

Performance plot of a parallel SMXV: Load Balancing.

We started with a non-optimized serial code delivering about 350 MFLOP/s and finished with a parallel code delivering about 1000 MFLOP/s! This is still far away from linear scaling, but it is what you see in reality with complex (read: memory-bound) applications. Regarding these results, please note the following:

  • We did not apply any dataset-specific optimization. That means that if the matrix structure changes (which it does over the course of a program run in the application I took this benchmark from), we will still do well and not run into a new load imbalance. This is clearly an advantage of OpenMP over manual threading!
  • We did not apply any architecture-specific optimization. This code will deliver reasonable performance on most machines. But we have not yet taken a look at cc-NUMA machines (e.g. AMD Opteron-based or Intel Nehalem-based systems); this will be done in part 2. On a cc-NUMA system there is a lot of performance to win or to lose, depending on whether you do everything right or make a mistake.
  • Was anything in here OS-specific? No, it wasn’t. I did the experiments on Windows, but could have done everything on Linux in exactly the same way. More on this in the next post as well…