Since 2001 already, the IT Center (formerly: Center for Computing and Communication) of RWTH Aachen University offers a one week HPC workshop on Parallel Programming during spring time. This course is not restricted to scientists and engineers from our university, in fact we have about 30% of external attendees each time. This year we were very happy about a record attendance of up to 85 persons for the OpenMP lectures on Wednesday. As usual we publish all course materials online, but this year we also created screencasts from all presentations. That means you see the slides and the live demos and you hear the presenter talk. This blog post contains links to both the screencasts as well as the other course material, sorted by topic.
We have three talks as an introduction to OpenMP from Wednesday and two talks on selected topics from Thursday, which were vectorization and tools.
Introduction to OpenMP Programming (part 1), by Christian Terboven:
Getting OpenMP up to Speed, by Ruud van der Pas:
Introduction to OpenMP Programming (part 2), by Christian Terboven:
Vectorization with OpenMP, by Dirk Schmidl:
Tools for OpenMP Programming, by Dirk Schmidl:
We have two talks as an introduction to MPI and one on using the Vampir toolchain, all from Tuesday.
Introduction to MPI Programming (part 1), by Hristo Iliev:
Introduction to MPI Programming (part 2), by Hristo Iliev:
Introduction to VampirTrace and Vampir by Hristo Iliev:
Intel Xeon Phi
We put a special focus on presenting this architecture and we have one overview talk and one talk on using OpenMP 4.0 constructs for this architecture.
Programming the Intel Xeon Phi Coprocessor Overview, by Tim Cramer:
OpenMP 4.0 for Accelerators, by Christian Terboven:
These are just some announcements of upcoming events in which I am involved in a varying degree. The first two will be take place at RWTH Aachen University and attendance is free of charge, the second is part of the SC12 conference in Salt Lake City, UT in the US.
Tuning for bigSMP HPC Workshop – aixcelerate (October 8th – 10th, 2012). The number of cores per processor chip is increasing. Today’s “fat” compute nodes are equipped with up to 16 eight-core Intel Xeon processors, resulting in 128 phyiscal cores, with up to 2 TB of main memory. Furthermore, special solutions like a ScaleMP vSMP system may consist of 16 nodes with 4 eight-core Intel Xeon processors each and 4 TB of accumulated main memory, scaling the number of cores even further up to 1024 per machine. While message-passing with MPI is the dominating paradigm for parallel programming in the domain of high performance computing (HPC), with the growing number of cores per cluster node the combination of MPI with shared memory programming is gaining importance. The efficient use of these systems also requires NUMA-aware data management. In order to exploit different levels of parallelism, namely through shared memory programming within a node and message-passing across the nodes, obtaining good performance becomes increasingly difficult. This tuning workshop will in detail cover tools and methods to program big SMP systems. The first day will focus on OpenMP programming on big NUMA systems, the second day will focus on Intel Performance Tools as well as the ScaleMP machine, and the third day will focus on Hybrid Parallelization. Attendees are kindly requested to prepare and bring in their own code, if applicable. If you do not have an own code, but you are interested in the presented topics, you may work on prepared exercises during the lab time (hands-on). It is recommended to have good knowledge in MPI and/or OpenMP. More details and the registration link can be found at the event website.
OpenACC Tutorial Workshop (October 11th to 12th, 2012). OpenACC is a directive-based programming model for accelerators which enables delegating the responsibility for low-level (e.g. CUDA or OpenCL) programming tasks to the compiler. To this end, using the OpenACC API, the programmer can easily offload compute-intensive loops to an attached accelerator. The open industry standard OpenACC has been introduced in November 2011 and supports accelerating regions of code in standard C, C++ and Fortran. It provides portability across operating systems, host CPUs and accelerators. Up to know, OpenACC compilers exist from Cray, PGI and CAPS. During this workshop, you will work with PGI’s OpenACC implementation on Nvidia Quadro 6000 GPUs. This OpenACC workshop is divided into two parts (with separate registrations!). In the first part, we will give an introduction to the OpenACC API while focusing on GPUs. It is open for everyone who is interested in the topic. In contrast to the first part, the second part will not contain any presentations or hands-on sessions. To the second day, we invite all programmers who have their own code and want to give it a try to accelerate it on a GPU using OpenACC and with the help of our team members and Nvidia staff. More details and the registration link can be found at the event website.
Advanced OpenMP Tutorial at SC12 (November 12th, 2012). With the increasing prevalence of multicore processors, shared-memory programming models are essential. OpenMP is a popular, portable, widely supported and easy-to-use shared-memory model. Developers usually find OpenMP easy to learn. However, they are often disappointed with the performance and scalability of the resulting code. This disappointment stems not from shortcomings of OpenMP but rather with the lack of depth with which it is employed. Our “Advanced OpenMP Programming” tutorial addresses this critical need by exploring the implications of possible OpenMP parallelization strategies, both in terms of correctness and performance. While we quickly review the basics of OpenMP programming, we assume attendees understand basic parallelization concepts and will easily grasp those basics. We discuss how OpenMP features are implemented and then focus on performance aspects, such as data and thread locality on NUMA architectures, false sharing, and private versus shared data. We discuss language features in-depth, with emphasis on features recently added to OpenMP such as tasking. We close with debugging, compare various tools, and illustrate how to avoid correctness pitfalls. More details can be found on the event website.
You probably have noticed that: Visual Studio 2010 (VS2010) is about to be released. As of today, the Microsoft website states that VS2010 will be launched on April 12th. I have been playing with various builds since more than a year and I am really looking forward to taking this new version into production, since it comes loaded with plenty of new features for parallel programmers. After the launch you probably have to pay money for it, so grabbing the release candidate (RC) and taking a look at it right now may be worth it!
The feature I am talking about right now is the improved MPI Cluster Debugger that lets you execute MPI debugging jobs either locally or on a cluster, with only a minor configuration task involved. A few days ago at the German Windows-HPC User Group event Keith Yedlin from Microsoft Corp was talking about it and I demoed it live. Daniel Moth has a blog post providing an overview of that feature, MSDN has a walk-through on how to set it up, so I am not going to repeat all that content but instead explain how I am using it (and what I am still missing).
Examining variable values across MPI processes. This is a core requirement on a parallel debugger, as I stated in previous posts already. Visual Studio 2008 did allow for this already, but Visual Studio 2010 improved the way in which you inspect variables, especially if you are switching between threads and / or processes. I am not sure about the official name of the feature, but let me just call it laminate: when you put the mouse pointer over a variable the menu that will appear does not only show you the variable value, in VS2010 it also contains a sticker that you can click to keep this window persistent in front of the editor.
In my debugging workflow I got used to laminate exactly those variables that have different values on the threads and / or processes involved in my program. Whenever I switch to a different thread and / or process that in fact has a different value for that particular variable, the view will become red. This turned out to be very handy!
Debugging MPI applications on a Cluster. This became usable only with Visual Studio 2010 – before it was possible, but involved many configuration steps. In Visual Studio 2008 I complained about the task of setting up the right paths to mpiexec and mpishim – gone in Visual Studio 2010, thanks a lot. If you intend on using Microsoft MPI v2 (either on a Cluster or on a local Workstation) there is no need to configure anything at all, just switch to the MPI Cluster Debugger in the Debugging pane of your project settings. It gets even better: The field Run Environment allows you to select between the execution on your local machine, or on a Windows HPC Server 2008 Cluster:
Debugging on a cluster is particularly useful if you program is not capable of being executed with a small number of processes, or if you have a large dataset to do the debugging with and not enough memory on the local machine. The debugging session is submitted to the cluster just like a regular compute job. Setting up your debugging project for the cluster is pretty simple: just select the head node of your cluster and then the necessary resources, that is all. Hint: After selecting the head node immediately select the node group (if any) to reduce the number of compute nodes status information are being queried from, since this may take a while and let the Node Selector dialog become unresponsive for some moments.
After the configuration step you are all set up to F5 to the cluster – once your job has been started you will see no difference to a local debugging session. If you open the Debug -> Processes window from the VS2010 menu you can take a look at the Transport Qualifier column to see which node the debug processes are running on:
If you are interested, you can start the HPC Job Manager program to examine your debugging session. Unless explicitly object, Visual Studio does the whole job of deploying the runtime for your application and afterwards cleaning everything up again:
What I am still missing. The new features all are really nice, but there are still two very important things that I am missing for an MPI debugger: (i) better management of the MPI processes during debugging, and (ii) a better way to investigate variable values over multiple processes (this may include arrays). Microsoft is probably working on Dev11 already, so I hope these two points will make it into the next product version, maybe even more…
As announced in a previous post already, I was involved in two workshops attached to the HPCS 2009, hosted by the HPCVL in Kinston, ON, Canada. Being back in the office now I found some time to upload my slide sets. Obviously I can only make my own slides public.
Using OpenMP 3.0 for Parallel Programming on Multicore Systems [abstract]
Ruud van der Pas, Sun Microsystems; Dieter an Mey and Christian Terboven, RWTH Aachen University.
Just this week Allinea released it’s DDTLite plugin for Visual Studio 2008. I have been using a beta version for a couple of weeks now and in my humble opinion, DDTLite extends the MPI cluster debugger of Visual Studio 2008 with a must-have feature for any parallel debugger: Individual process control for MPI programs. With the capabilities provided by the MPI cluster debugger of Visual Studio, debugging MPI programs can be a pain as it is not possible to control MPI processes individually. That means if you select one process and execute it step-by-step, the other process will continue as well and there is no chance of stopping it from doing so (e.g. freezing as you can do with threads). This blog post is not intended to become an Allinea commercial, but I want to briefly demonstrate what DDTLite can do for you.
In order to debug MPI programs, you have to go to the project properties, choose Debugging in the left column, and select the MPI Cluster Debugger as the debugger to launch. Additionally you have to provide the following options (listed below along with my advices):
MPIRun Command: The location of mpirun. Specify the full path to the mpiexec program here, do not use “”, and do not omit the .exe extension.
MPIRun Arguments: Arguments to pass to mpirun, such as number of processes to start.
MPIShim Location: Location of mpishim.exe. As far as my experience goes, you avoid trouble if you copy mpishim.exe to a path that does not contain any white space (the original location is C:\Program Files\Microsoft Visual Studio 9.0\Common7\IDE\Remote Debugger\x86 on a 32-bit system), again do not omit the .exe extension.
That said, your configuration could look like this:
If you then start the debugger (e.g. via F5), two MPI processes will be started and you can switch between them using the Processes window (you can enable that window via the menu: Debug –> Windows –> Processes):
From the menu via Tools –> Options… –> Debugging (in the left column) you can set the option Break all processes when one process breaks to influence what happens when a breakpoint is encountered. For the case of debugging MPI programs, you probably want this option to be enabled! But – as already mentioned above – when all processes were interrupted after a breakpoint has been hit, you cannot continue with just one process step-by-step, as the other process will always do a step as well. And this is where DDTLite comes into play…
After the plugin has been enabled (via the menu: Tools –> Add-in Manager…) you are presented with several additional windows, among this is the Selected Processes and Threads window to select and switch between processes and threads, as shown above. Via the Groups – Parallel View window you can select individual processes (in the screenshot above you can see that only the MPI process with rank 0, out of two MPI processes, is selected) and then control the selection (selecting a group of processes is possible as well) using the Visual Studio debugger as you do with a serial program. All MPI processes not currently selected stand still!
There is more in DDTLite: For example you can select a variable and go to the Variable – Parallel View window to receive a list of variable values by MPI rank (the screenshot below shows the iMyRank member of a struct type named data, which denotes the MPI rank).
Of course there are even more capabilities provided by DDTLite, but you can go to the product homepage and find out for yourself by grabbing a 30-day trial version (I used that trial to create the screenshots shown in this blog post). But I would like to add one additional note on the question of how many MPI processes you should use for debugging. Most parallel debuggers (including DDTLite and DDT) are advertised that they are capable of controlling hundreds (and even thousands) of MPI processes. I think that you will hardly ever need that! Instead, I bet that in 99% of the cases in which your MPI programs works fine with one and two processes but fails when using more, you will find the issue by using three or maybe five processes with your debugger. That is all you need for finding the usual off-by-one work-distribution error and similar things.