Monthly Archives: December 2009

1.5 hours for an Introduction to Parallel Programming

Whenever Prof. Christian Bischof, the head of our institute, is on duty to give the Introduction to Programming (de) lecture for first-year Computer Science students, he is keen on giving the students a glimpse of parallel programming. As in 2006, I was the guest lecturer to whom this task was assigned. While coping with parallelism in various aspects consumes most of my work day, these students have just started to learn Java as their first programming language. As in 2006, I worried about how to motivate the students, what level of detail would be reasonable, and what tools and techniques to present within a timeframe of just 1.5 hours. In the following paragraphs I briefly explain what I did, and why. The slides used in the lecture are available online: Introduction to Parallel Programming; and my student Christoph Rackwitz made a screencast of the lecture available here (although the slides are in English, I am speaking German).

Programming Language: As stated above, the target audience is first-year Computer Science students attending the Introduction to Programming course, in which the programming language taught is Java. In a previous lecture we once tried to present the examples and exercises in C, assuming that C is very similar to Java, but the students did not like that very much. Although they were able to follow the lecture and were mostly successful in the exercises, C just felt foreign to most of them. Furthermore, C is hardly used in later courses of the curriculum, except for System Programming. The problem with Java, however, is that it is not commonly used in technical computing, and the native approach to parallel programming in Java is oriented more towards building concurrent (business) applications than towards reasoning about parallel algorithms. Despite this issue we decided to use Java in the Introduction to Parallel Programming lecture, in order to keep the students comfortable and not to disturb the example and exercise environment already provided for them. The overall goal of the lecture was to give the students an idea of the fundamental change towards parallelism, to explain the basic concepts, and to motivate them to develop an interest in this topic. We considered this goal to be independent of the programming language.

Parallelization Model: There is Shared-Memory parallelization, and there is Message-Passing for clusters. It would be great to teach both, and of course we do that in advanced courses, but I do not think it is reasonable to cover both in an introductory session. To motivate the growing need for parallel programming, the trend towards Multi-Core and Many-Core is an obvious foundation. Given that, and the requirement to allow the students to work on examples and exercises on their systems at home, we decided to discuss multicore architectures and present one model for Shared-Memory parallel programming in detail, and to provide just an overview of what a cluster is. Furthermore, we hoped that the ability to speed up the example programs by experiencing parallelism on their very own desktops or laptops would add some motivation; this feels more real than logging in to a remote system in our cluster. In addition, providing instructions to set up a Shared-Memory parallelization tool on a student's laptop was expected to be simpler than for a Message-Passing environment (this turned out to be true).

Parallelization Paradigm: Given our choice to cover Shared-Memory parallelization, and the requirement to use Java and to provide a suitable environment to work on examples and exercises, we basically had three choices: (i) Java Threads, (ii) OpenMP for Java, and (iii) Parallel Java (PJ) – maybe we could have looked at some other, more obscure paradigms as well, but I do not think they would have contributed any new aspects. In essence, Java Threads are similar to POSIX Threads and Win32 Threads: well-suited for building server-type programs, but not good for parallelizing algorithms or for introductory courses. Using this model, you first have to talk about setting up threads and implementing synchronization before you can start to think parallel ;-). I like OpenMP a lot for this purpose, but there is no official OpenMP standard for Java. We looked at two implementations:

  1. JOMP, by the Edinburgh Parallel Computing Centre (EPCC). To our knowledge, this was the first implementation of OpenMP for Java. It comes as a preprocessor and is easy to use, but development stopped long ago and it does not work well with Java 1.4 and later.
  2. JaMP, by the University of Erlangen. This implementation is based on the Eclipse compiler and even extends Java for OpenMP, providing more constructs than the original standard, while still not offering full support for OpenMP 2.5. Anyhow, it worked fine with Java 1.6 and was easy to install and distribute among the students, thus we used it in the lecture.

Parallel Java (short: PJ), by Alan Kaminsky at the Rochester Institute of Technology, also provides means for Shared-Memory parallelization, but it is primarily oriented towards Message-Passing. Since it provides a very nice and simplified MPI-style API, we would have used it had we included cluster programming; but sticking to Shared-Memory parallelization, we went for JaMP.
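To give an impression of why this style suits an introductory course, here is a sketch of what an OpenMP-style loop in Java looks like. The directive spelling is my recollection of JaMP's comment-based syntax and may differ between versions, so treat this as illustrative rather than verbatim JaMP code; since the directive is an ordinary comment, a plain Java compiler simply ignores it and runs the loop serially.

```java
public class VectorAdd {
    static double[] add(double[] a, double[] b) {
        double[] c = new double[a.length];
        // With the JaMP compiler, the iterations of this loop are
        // distributed among threads; plain javac sees only a comment.
        //#omp parallel for
        for (int i = 0; i < a.length; i++) {
            c[i] = a[i] + b[i];
        }
        return c;
    }

    public static void main(String[] args) {
        int n = 1000000;
        double[] a = new double[n], b = new double[n];
        for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2 * i; }
        double[] c = add(a, b);
        System.out.println(c[n - 1]); // 3 * (n - 1) = 2999997.0
    }
}
```

The appeal for teaching is exactly this: the sequential program stays readable and correct, and parallelism is added declaratively instead of via explicit thread management.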

Content: What can be covered in just 1.5 hours? Well, of course the lecture needs to start with a motivation of why parallel programming will become more and more important in the future. We also explained why the industry is shifting towards multicore architectures, and what implications this will or may have. As explained above, the largest part of the lecture was spent on OpenMP for Java, along with some examples. We started with a brief introduction on how to use JaMP and what OpenMP programs look like, then covered Worksharing and Data Scoping with several examples. I think experiencing a data race first-hand is something every parallel programmer should have done :-), as is learning about reductions. That was about it for the OpenMP part. The last minutes of the lecture were spent on clusters and their principal ideas, followed by a summary.
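The data-race-plus-reduction combination can be condensed into one tiny example (hypothetical code, not the actual lecture material, and again using my recollection of JaMP's comment-directive syntax): summing an array in parallel. Without the reduction clause, all threads would update the shared accumulator concurrently and the result would be wrong; the reduction gives each thread a private copy that is combined at the end.

```java
public class Reduction {
    static double sum(double[] a) {
        double sum = 0.0;
        // Without "reduction(+:sum)" this would be a textbook data race:
        // the read-modify-write of 'sum' is not atomic across threads.
        //#omp parallel for reduction(+:sum)
        for (int i = 0; i < a.length; i++) {
            sum += a[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] a = new double[1000];
        for (int i = 0; i < a.length; i++) a[i] = 1.0;
        System.out.println(sum(a)); // 1000.0
    }
}
```

Letting the students run the racy variant first and watch the result change from run to run is, in my experience, the moment the topic clicks.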

Given the constraints and our reasoning outlined above, we ended up using Java as the programming language and JaMP as the paradigm to teach Shared-Memory parallelization, just mentioning that there are clusters as well. Although the official course evaluation is not done yet, we got pretty positive feedback regarding the lecture itself, and the exercises were well-accepted. What unnerves me is the fact that there is no real OpenMP for Java. The Erlangen team provided a good implementation along with a compiler to serve our examples and exercises, but it does not provide full OpenMP 2.5 support, not to speak of OpenMP 3.0. Having a full OpenMP for Java implementation at hand would be a very valuable tool for teaching parallel programming to first-year students, since Java is the language of choice not only at RWTH Aachen University.

Do you have other opinions, experiences, or ideas? I am always up for a discussion.

Daily cc-NUMA Craziness

Since cc-NUMA architectures have become ubiquitous in the x86 server world, it is very important to optimize memory and thread or process placement, especially for Shared-Memory parallelization. Using manual binding strategies, I have been pretty successful in optimizing several of our user codes for cc-NUMA architectures. I like the cpuinfo tool that comes with Intel MPI 3.x a lot; it can be used to query how all the cores are related (i.e. which cores share a cache). Based on that output I used to figure out my strategies for every architecture that we have in our center or that I have access to elsewhere. However, during the last couple of days I observed some benchmark results that did not make much sense to me, and today I stumbled upon the cause – something I just did not expect. I will tell you in a second, but my statement is: manual binding can be a bad thing. Although one can achieve a nice speedup by doing it right, even experts can easily be fooled, so it is high time we got access to a standardized interface to communicate with the threading runtime and the OS!

We have dual-socket Intel Nehalem-EP systems from two different vendors: Sun and HP. The Sun systems are intended for HPC and are equipped with Xeon X5570 (2.93 GHz) CPUs; the HP systems are intended for infrastructure services and are equipped with Xeon E5540 (2.53 GHz) CPUs. Anyhow, I got hold of both, put some jobs on the boxes, and was really disappointed by the speedup measurements on the HP system. While investigating the reason I found out that the numbering of the logical cores differs between the two systems. Oh dear: two dual-socket systems with Intel Nehalem-EP processors, and in one system the cores 0 and 1 are on the same socket, while in the other they are on different sockets. Let's take a look at the output of cpuinfo on the Sun system:

Sun Nehalem-EP (linux)

Processor composition

Processors(CPU)   : 16
Packages(sockets) : 2
Cores per package : 4
Threads per core  : 2

Processor identification

Processor  Thread Id.  Core Id.  Package Id.
0          0           0         0
1          0           1         0
2          0           2         0
3          0           3         0
4          0           0         1
5          0           1         1
6          0           2         1
7          0           3         1
8          1           0         0
9          1           1         0
10         1           2         0
11         1           3         0
12         1           0         1
13         1           1         1
14         1           2         1
15         1           3         1

Placement on packages

Package Id.  Core Id.  Processors
0            0,1,2,3   (0,8)(1,9)(2,10)(3,11)
1            0,1,2,3   (4,12)(5,13)(6,14)(7,15)

Cache sharing

Cache  Size    Processors
L1     32 KB   (0,8)(1,9)(2,10)(3,11)
               (4,12)(5,13)(6,14)(7,15)
L2     256 KB  (0,8)(1,9)(2,10)(3,11)
               (4,12)(5,13)(6,14)(7,15)
L3     8 MB    (0,1,2,3,8,9,10,11)
               (4,5,6,7,12,13,14,15)

And this is the output on the HP system:

HP Nehalem-EP (linux)

Processor composition

Processors(CPU)   : 16
Packages(sockets) : 2
Cores per package : 4
Threads per core  : 2

Processor identification

Processor  Thread Id.  Core Id.  Package Id.
0          0           0         1
1          0           0         0
2          0           2         1
3          0           2         0
4          0           1         1
5          0           1         0
6          0           3         1
7          0           3         0
8          1           0         1
9          1           0         0
10         1           2         1
11         1           2         0
12         1           1         1
13         1           1         0
14         1           3         1
15         1           3         0

Placement on packages

Package Id.  Core Id.  Processors
1            0,2,1,3   (0,8)(2,10)(4,12)(6,14)
0            0,2,1,3   (1,9)(3,11)(5,13)(7,15)

Cache sharing

Cache  Size    Processors
L1     32 KB   (0,8)(1,9)(2,10)(3,11)
               (4,12)(5,13)(6,14)(7,15)
L2     256 KB  (0,8)(1,9)(2,10)(3,11)
               (4,12)(5,13)(6,14)(7,15)
L3     8 MB    (0,2,4,6,8,10,12,14)
               (1,3,5,7,9,11,13,15)

Let's take a closer look at these tables. Wherever you find the identification ‘processor’, this refers to the logical core as visible to the operating system; a ‘package’ is a socket, and there are two ‘(hyper-)threads’ per core. On the Sun system, the logical cores 0 and 1 are located on the same socket, and processors 0 to 3 occupy four full cores on a single socket, sharing one L3 cache. On the HP system, the logical cores 0 and 1 are located on different sockets, and processors 0 to 3 are four full cores spread across both sockets and both L3 caches. I am not saying one of the two numbering strategies is better – but if you use one machine to determine the best binding for your application, put it into a start-up script, and then change machines between your measurements, you will be surprised (and not in a good way).
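The effect is easy to demonstrate. The following sketch transcribes the processor-to-package mapping from the two cpuinfo tables above (the class and method names are mine, purely for illustration) and asks how many sockets a naive "first k processors" binding actually spans on each box.

```java
import java.util.HashSet;
import java.util.Set;

public class TopologyCheck {
    // Package Id for each logical processor, transcribed from the
    // "Processor identification" tables of the two cpuinfo outputs.
    static final int[] SUN = {0,0,0,0, 1,1,1,1, 0,0,0,0, 1,1,1,1};
    static final int[] HP  = {1,0,1,0, 1,0,1,0, 1,0,1,0, 1,0,1,0};

    // How many distinct sockets do the first k logical processors span?
    static int socketsSpanned(int[] packageOf, int k) {
        Set<Integer> pkgs = new HashSet<>();
        for (int cpu = 0; cpu < k; cpu++) {
            pkgs.add(packageOf[cpu]);
        }
        return pkgs.size();
    }

    public static void main(String[] args) {
        // Binding 4 threads to processors 0-3 lands on one socket on the
        // Sun box, but on both sockets on the HP box.
        System.out.println("Sun, cpus 0-3: " + socketsSpanned(SUN, 4) + " socket(s)");
        System.out.println("HP,  cpus 0-3: " + socketsSpanned(HP, 4) + " socket(s)");
    }
}
```

The same hard-coded core list thus implements two entirely different placement policies depending on which vendor's box it runs on – exactly the trap my benchmark scripts fell into.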

How is the core numbering determined? Well, the short answer is “not by the OS, but by the BIOS”; the honest answer is “I don’t know exactly”. The BIOS has a lot of influence: for example, the Advanced Configuration and Power Interface Specification (ACPI: http://www.acpi.info/DOWNLOADS/ACPIspec40.pdf) describes in section 5.2.17 a System Locality Distance Information Table (SLIT) that lists the distance between hardware resources on different NUMA nodes. In theory the OS kernel can make use of that table, and it does – or it fills in constant values (i.e. 10 = local, 20 = remote) in case the table is empty. But the ACPI specification does not specify how the table is generated; that is up to the BIOS implementation itself, and probably up to BIOS settings. The important take-aways are that (i) BIOS settings influence the core numbering scheme, (ii) BIOS settings are obviously not the same across vendors, (iii) the numbering can change over time anyhow, and other OSes (i.e. Windows) do it differently – so (iv) do not rely on the numbering scheme being static.

What should you do instead? We do not have a standardized way to influence thread / process binding. Tools such as numactl (Linux) or start /affinity (Windows) accept core ids as arguments, which is far from optimal; the same holds for explicit API calls to do the binding. The Intel compiler follows a better path: the environment variable KMP_AFFINITY can be used to define an explicit thread-to-core mapping, but it also accepts two strategies: scatter and compact. The idea of scatter is to bind the threads as far apart as possible (to use all the caches and to have the full memory bandwidth available); the idea of compact is to bind the threads as close together as possible (to profit from shared caches). Running a program with two threads using the scatter strategy on the Sun system results in binding thread 0 to the core set {0,8} and thread 1 to the core set {4,12} (-> two sockets). The same experiment on the HP system results in binding thread 0 to the core set {1,9} and thread 1 to the core set {0,8} (-> two sockets, again). This abstracts from the hardware / system details and allows a user, who might not be an HPC expert, to concentrate on optimizing the application by choosing from just two strategies, still getting “portable performance” on Intel CPUs. A portable thread binding interface is under discussion for OpenMP 3.1 (see my previous blog post), and I strongly favor allowing the user to choose from such strategies. The one shortcoming of Intel’s current implementation occurs when you have multiple levels of Shared-Memory parallelization in one application and want to mix strategies – which might make sense – but this could easily be overcome. Let’s see what the future brings; for now I have fixed my scripts to include a sanity check that the core numbering is indeed as expected…
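To make the scatter/compact idea concrete, here is a toy model of the two strategies – emphatically not Intel's actual algorithm, just a sketch over the Sun topology transcribed from the "Placement on packages" table above, with hypothetical names of my own choosing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class AffinitySketch {
    // Hyper-thread sibling sets per package on the Sun box, transcribed
    // from the "Placement on packages" table of the cpuinfo output.
    static final int[][][] SUN = {
        {{0, 8}, {1, 9}, {2, 10}, {3, 11}},     // package 0
        {{4, 12}, {5, 13}, {6, 14}, {7, 15}},   // package 1
    };

    // scatter: round-robin over packages, threads as far apart as possible.
    // compact: fill one package completely before moving to the next.
    static List<int[]> bind(int[][][] topo, int threads, boolean scatter) {
        List<int[]> binding = new ArrayList<>();
        int packages = topo.length, coresPerPackage = topo[0].length;
        for (int t = 0; t < threads; t++) {
            int pkg  = scatter ? t % packages : t / coresPerPackage;
            int core = scatter ? t / packages : t % coresPerPackage;
            binding.add(topo[pkg][core]);
        }
        return binding;
    }

    public static void main(String[] args) {
        // scatter with 2 threads yields {0,8} and {4,12} – one thread per
        // socket, matching the KMP_AFFINITY=scatter result reported above.
        for (int[] coreSet : bind(SUN, 2, true)) {
            System.out.println(Arrays.toString(coreSet));
        }
    }
}
```

The point of the sketch is that the strategy is expressed against the topology, not against raw core ids – which is exactly why it survives a change of numbering scheme while a hard-coded binding does not.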