Tag Archives: Teaching

SC14 Video: A Short Stroll Through OpenMP 4.0

During SC14, Michael Klemm from Intel and I teamed up to give an OpenMP 4.0 overview talk at the OpenMP booth. Our goal was to touch on all the important aspects, from thread binding through tasking to accelerator support, and to entertain our audience while doing so. Although not all jokes translate from German to English as intended, I think the resulting video is an entertaining 25-minute run-down of OpenMP 4.0 and worth sharing here:
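If you would rather skim code than watch the video, the snippet below is a minimal, illustrative sketch (my own, not taken from the talk) of the three areas it touches: thread binding, tasking, and accelerator support. All directives are standard OpenMP 4.0.

```c
/* Illustrative sketch of three OpenMP 4.0 areas from the talk:
 * thread binding, tasking, and accelerator support. Not from the
 * talk itself; compile with an OpenMP-4.0-capable compiler. */
#include <stdio.h>

int fib(int n) {
    if (n < 2) return n;
    int x, y;
    #pragma omp task shared(x)   /* tasking: spawn a child task */
    x = fib(n - 1);
    #pragma omp task shared(y)
    y = fib(n - 2);
    #pragma omp taskwait         /* wait for both children */
    return x + y;
}

int main(void) {
    /* Thread binding: place the threads close together. */
    #pragma omp parallel proc_bind(close)
    #pragma omp single
    printf("fib(30) = %d\n", fib(30));

    /* Accelerator support: offload a loop via the target construct. */
    double a[1000];
    #pragma omp target map(tofrom: a)
    #pragma omp parallel for
    for (int i = 0; i < 1000; i++)
        a[i] = 2.0 * i;
    printf("a[999] = %f\n", a[999]);
    return 0;
}
```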

Upcoming OpenMP Tutorials

This blog post announces three OpenMP tutorial events that I have committed to. As usual, my OpenMP tutorials cover tasking early on, and when it comes to performance, I talk in detail about dealing with NUMA architectures and thread and data affinity. So if you are interested in learning more about these topics and in getting hands-on experience, these tutorials might be of interest to you.
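As a taste of the affinity material, here is a minimal, illustrative sketch of the first-touch recipe the tutorials teach: initialize data in parallel with the same loop schedule and thread binding you later compute with, so pages land on the NUMA node that works on them. This is a sketch, not tutorial code; run it with something like OMP_PLACES=cores.

```c
/* Illustrative first-touch sketch: each memory page is placed on the
 * NUMA node of the thread that first writes it, so initialization and
 * computation should use the identical schedule and binding. */
#include <stdlib.h>

#define N (64L * 1024 * 1024)

int main(void) {
    double *a = malloc(N * sizeof(double));

    /* Parallel first touch, threads spread across the NUMA nodes. */
    #pragma omp parallel for schedule(static) proc_bind(spread)
    for (long i = 0; i < N; i++)
        a[i] = 0.0;

    /* Compute loop with the same schedule: mostly local accesses. */
    #pragma omp parallel for schedule(static) proc_bind(spread)
    for (long i = 0; i < N; i++)
        a[i] += 1.0;

    free(a);
    return 0;
}
```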

The first one takes place in about two weeks at the Hartree Centre in the UK, as part of the Hartree Summer School Series 2014. The summer school spans three weeks in total: the first is dedicated to visualization, the second to High Performance Computing (HPC), and the third and last to Big Data. The HPC week covers all the HPC programming foundations you might need (I would say), including my part on OpenMP.

The second tutorial event is in September as part of the IWOMP 2014 workshop in Salvador, Brazil. This year's IWOMP will host two tutorials. The first is a full-day Introduction to OpenMP given by my colleague Dirk Schmidl and myself. As an experiment this year, we partition the tutorial into many small parts of roughly 20 minutes per topic: each short slot presents a specific topic and is directly followed by practical hands-on exercises or live demos on that topic. The second tutorial at IWOMP 2014 is a half-day tutorial on the OpenMP Accelerator Model given by Eric Stotzer. The plan is that attendees can choose their specialization: we teach the basics in the morning and go into performance tuning for "traditional" architectures in the afternoon, while Eric covers the target construct in detail in the afternoon.
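Since the target construct may be new to many readers, here is a hedged sketch of the kind of thing Eric's half day revolves around: offloading a loop to a device with OpenMP 4.0's target and target data directives. The SAXPY kernel is my illustration, not material from the tutorial.

```c
/* Hedged sketch of the OpenMP 4.0 target construct: offload a SAXPY
 * loop to an attached device with explicit data transfers. */
void saxpy(int n, float a, float *x, float *y) {
    /* x is only read on the device; y is copied there and back. */
    #pragma omp target data map(to: x[0:n]) map(tofrom: y[0:n])
    {
        #pragma omp target
        #pragma omp teams distribute parallel for
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }
}
```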

Finally, the third tutorial will be at SC14 in New Orleans in November, as our Advanced OpenMP Tutorial has been accepted again. This tutorial is really about advanced OpenMP programming for performance: we want to convey an in-depth understanding of advanced OpenMP constructs and features, and to provide attendees with a set of performance and scalability recipes that can be applied to improve the performance of OpenMP applications. We will also explain how to write new code for, and extend existing OpenMP code to, compute accelerators with the new OpenMP 4.0 capabilities. To cover this aspect in detail, we extended the team of previous years (Bronis R. de Supinski, Michael Klemm, Ruud van der Pas, and myself) with Eric Stotzer.

PPCES Video Lectures on OpenMP, MPI and Xeon Phi released

Since 2001, the IT Center (formerly: Center for Computing and Communication) of RWTH Aachen University has offered a one-week HPC workshop on parallel programming each spring. This course is not restricted to scientists and engineers from our university; in fact, about 30% of the attendees are external each time. This year we were very happy about a record attendance of up to 85 people for the OpenMP lectures on Wednesday. As usual, we publish all course materials online, but this year we also created screencasts of all presentations: you see the slides and the live demos, and you hear the presenter talk. This blog post contains links to both the screencasts and the other course material, sorted by topic.


OpenMP

We have three talks introducing OpenMP from Wednesday, and two talks on selected topics, vectorization and tools, from Thursday.

Introduction to OpenMP Programming (part 1), by Christian Terboven:


Getting OpenMP up to Speed, by Ruud van der Pas:


Introduction to OpenMP Programming (part 2), by Christian Terboven:


Vectorization with OpenMP, by Dirk Schmidl:


Tools for OpenMP Programming, by Dirk Schmidl:



MPI

We have two talks introducing MPI and one on using the Vampir toolchain, all from Tuesday.

Introduction to MPI Programming (part 1), by Hristo Iliev:


Introduction to MPI Programming (part 2), by Hristo Iliev:


Introduction to VampirTrace and Vampir, by Hristo Iliev:


Intel Xeon Phi

We put a special focus on presenting this architecture: there is one overview talk and one talk on using the new OpenMP 4.0 constructs for it.

Programming the Intel Xeon Phi Coprocessor Overview, by Tim Cramer:


OpenMP 4.0 for Accelerators, by Christian Terboven:


Other talks

Some more talks, for instance on using our cluster or on the basics of parallel computer architectures, can be found on the YouTube channel: https://www.youtube.com/channel/UCtdrEoe46tD2IvJJRs_JH1A.

Advanced OpenMP Tutorial @ ISC in June in Leipzig

The International Supercomputing Conference (ISC) will take place in Leipzig, Germany, next week, from June 16th to June 20th, 2013. This year the program contains tutorials again, and the team of Bronis de Supinski (LLNL), Michael Klemm (Intel) and myself will offer the Advanced OpenMP Programming tutorial on June 16th, 9:00 AM to 1:00 PM. If you are interested in learning about performance-focused OpenMP programming and the new features in OpenMP 4.0, this might be the right one for you, although we obviously cannot cover everything in detail in the four hours we have: we asked for a full day, but got only half a day.

While we quickly review the basics of OpenMP programming, we assume attendees understand basic parallelization concepts and will easily grasp those basics. We focus on performance aspects such as data and thread locality on NUMA architectures, false sharing, and exploitation of SIMD vector units. We discuss language features in depth, with emphasis on features recently added to OpenMP such as tasking. We close with an overview of the new OpenMP 4.0 directives for attached compute accelerators. This is our detailed agenda (a small code sketch illustrating one of the performance recipes follows the agenda):

  1. OpenMP Overview (15 minutes)
    1. Core Concepts: Parallel Region, Worksharing, Nesting
    2. Synchronization: Synchronization Constructs and the Memory Model
  2. Techniques to Obtain High Performance with OpenMP: Memory Access (45 minutes)
    1. Understanding Memory Access Patterns
    2. Memory Placement and Thread Binding
    3. Performance Tips and Tricks: Avoiding False Sharing, Private versus Shared Data
  3. Techniques to Obtain High Performance with OpenMP: Vectorization (30 minutes)
    1. Understanding Vector Microarchitectures
    2. Vectorization with OpenMP 4.0
  4. Advanced Language features (60 minutes)
    1. The OpenMP Tasking Model
    2. Tasking in Detail: Final, Mergeable, and Dependencies
    3. Cancellation
    4. Misc. OpenMP 4.0 Features: Controlling the Implementation, Reduction Extensions, Improved Atomic Support
  5. OpenMP for Attached Compute Accelerators (45 minutes)
    1. The OpenMP Execution Model for Devices
    2. Target Construct
    3. OpenMP on the Intel Xeon Phi Coprocessor Examples
  6. Future OpenMP Directions (15 minutes)
    1. Comprehensive OpenMP new Features Overview
    2. OpenMP 4.0 and beyond Status, Directions and Schedule
    3. Open Discussion of Possible OpenMP Extensions (until we get thrown out of the room or people leave for lunch)
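To give a flavor of the recipes behind agenda item 2.3, here is a small illustrative sketch of false sharing and two ways around it. The 64-byte cache line is an assumption, and the code is mine rather than an excerpt from the tutorial.

```c
/* Sketch for avoiding false sharing: adjacent per-thread counters
 * share a cache line and ping-pong between cores. Padding to an
 * assumed 64-byte line fixes the layout; a reduction avoids the
 * shared array entirely. Illustrative only. */
#define NTHREADS 8

long sums_bad[NTHREADS];            /* adjacent longs: false sharing */

struct padded { long v; char pad[64 - sizeof(long)]; };
struct padded sums_good[NTHREADS];  /* one counter per cache line */

long count_even(const int *data, long n) {
    long total = 0;
    /* Preferred recipe: a private accumulator via reduction, so no
     * thread writes to a cache line another thread is using. */
    #pragma omp parallel for reduction(+: total)
    for (long i = 0; i < n; i++)
        if (data[i] % 2 == 0)
            total++;
    return total;
}
```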

OpenMP 4.0 RC1 and the Accelerator TR available

Quoting from openmp.org: "OpenMP, the de-facto standard for parallel programming on shared memory systems, continues to extend its reach beyond pure HPC to include embedded systems, real time systems, and accelerators. Release Candidate 1 of the OpenMP 4.0 API specifications, currently under development, is now available for public discussion. This update includes thread affinity, initial support for Fortran 2003, SIMD constructs to vectorize both serial and parallelized loops, user-defined reductions, and sequentially consistent atomics. The OpenMP ARB plans to integrate the Technical Report on directives for attached accelerators, as well as more new features, in a final Release Candidate 2, to appear sometime in the first quarter of 2013, followed by the finalized full 4.0 API specifications soon thereafter."
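To make the announced features more concrete, below is a short, hedged sketch of three of them: a user-defined reduction, the simd construct, and a sequentially consistent atomic. It is written against the RC1 draft syntax and is my illustration, not an official example.

```c
/* Hedged sketch of three OpenMP 4.0 RC1 features. */
#include <math.h>

/* User-defined reduction: largest absolute value in an array. */
#pragma omp declare reduction(maxabs : double : \
        omp_out = fmax(omp_out, omp_in)) initializer(omp_priv = 0.0)

double max_abs(const double *x, int n) {
    double m = 0.0;
    #pragma omp parallel for reduction(maxabs: m)
    for (int i = 0; i < n; i++)
        m = fmax(m, fabs(x[i]));
    return m;
}

/* SIMD construct: vectorize a serial loop, with a SIMD reduction. */
float dot(const float *x, const float *y, int n) {
    float s = 0.0f;
    #pragma omp simd reduction(+: s)
    for (int i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;
}

/* Sequentially consistent atomic via the seq_cst clause. */
int counter;
void bump(void) {
    #pragma omp atomic update seq_cst
    counter++;
}
```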

The OpenMP Language Committee put a lot of effort and dedicated work into both documents, and we hope for good, constructive feedback. Both documents are available on the OpenMP Specifications webpage: http://openmp.org/wp/openmp-specifications/.

Grab them now while they are hot :-).

Several Event Announcements

These are just some announcements of upcoming events in which I am involved to varying degrees. The first two will take place at RWTH Aachen University and attendance is free of charge; the third is part of the SC12 conference in Salt Lake City, UT, in the US.

Tuning for bigSMP HPC Workshop – aixcelerate (October 8th – 10th, 2012). The number of cores per processor chip is increasing. Today's "fat" compute nodes are equipped with up to 16 eight-core Intel Xeon processors, resulting in 128 physical cores and up to 2 TB of main memory. Furthermore, special solutions like a ScaleMP vSMP system may consist of 16 nodes with 4 eight-core Intel Xeon processors each and 4 TB of accumulated main memory, scaling the machine even further, up to 512 physical cores (1024 hardware threads). While message passing with MPI is the dominating paradigm for parallel programming in high performance computing (HPC), with the growing number of cores per cluster node the combination of MPI with shared memory programming is gaining importance. The efficient use of these systems also requires NUMA-aware data management. Exploiting the different levels of parallelism, shared memory programming within a node and message passing across the nodes, while obtaining good performance, becomes increasingly difficult. This tuning workshop will cover tools and methods to program big SMP systems in detail. The first day will focus on OpenMP programming on big NUMA systems, the second day on the Intel performance tools as well as the ScaleMP machine, and the third day on hybrid parallelization. Attendees are kindly requested to prepare and bring their own code, if applicable. If you do not have your own code but are interested in the presented topics, you may work on prepared exercises during the lab time (hands-on). Good knowledge of MPI and/or OpenMP is recommended. More details and the registration link can be found at the event website.
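For readers new to the hybrid theme of day three, the following is a minimal, illustrative MPI + OpenMP sketch: threads do the shared-memory work inside a node, and MPI combines the per-node results. It is a sketch only; real codes must choose a threading level (here MPI_THREAD_FUNNELED, since only the master thread calls MPI) that matches how they use MPI.

```c
/* Minimal hybrid sketch: OpenMP within a node, MPI across nodes. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank, size;
    /* FUNNELED: only the master thread will make MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    long local = 0, global = 0;
    /* Shared-memory parallelism inside the node. */
    #pragma omp parallel reduction(+: local)
    local = omp_get_thread_num() + 1;   /* stand-in for real work */

    /* Message passing across the nodes. */
    MPI_Reduce(&local, &global, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("global sum over %d ranks: %ld\n", size, global);
    MPI_Finalize();
    return 0;
}
```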

OpenACC Tutorial Workshop (October 11th to 12th, 2012). OpenACC is a directive-based programming model for accelerators which delegates the responsibility for low-level (e.g. CUDA or OpenCL) programming tasks to the compiler. Using the OpenACC API, the programmer can easily offload compute-intensive loops to an attached accelerator. The open industry standard OpenACC was introduced in November 2011 and supports accelerating regions of code in standard C, C++ and Fortran. It provides portability across operating systems, host CPUs and accelerators. Up to now, OpenACC compilers exist from Cray, PGI and CAPS. During this workshop, you will work with PGI's OpenACC implementation on Nvidia Quadro 6000 GPUs. The workshop is divided into two parts (with separate registrations!). In the first part, we give an introduction to the OpenACC API while focusing on GPUs; it is open to everyone who is interested in the topic. In contrast to the first part, the second part will not contain any presentations or hands-on sessions: for the second day, we invite all programmers who have their own code and want to try accelerating it on a GPU using OpenACC, with the help of our team members and Nvidia staff. More details and the registration link can be found at the event website.
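To illustrate the programming style the workshop teaches, here is a hedged OpenACC sketch: a single directive offloads a vector addition and describes the required data movement. It is my example, not workshop material.

```c
/* Hedged OpenACC sketch: one directive offloads the loop and states
 * the data movement; the compiler generates the low-level GPU code. */
void vecadd(int n, const float *restrict a,
            const float *restrict b, float *restrict c) {
    /* copyin: host-to-device only; copyout: device-to-host only. */
    #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```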

Advanced OpenMP Tutorial at SC12 (November 12th, 2012). With the increasing prevalence of multicore processors, shared-memory programming models are essential. OpenMP is a popular, portable, widely supported and easy-to-use shared-memory model. Developers usually find OpenMP easy to learn; however, they are often disappointed with the performance and scalability of the resulting code. This disappointment stems not from shortcomings of OpenMP itself, but rather from the lack of depth with which it is employed. Our "Advanced OpenMP Programming" tutorial addresses this critical need by exploring the implications of possible OpenMP parallelization strategies, both in terms of correctness and performance. While we quickly review the basics of OpenMP programming, we assume attendees understand basic parallelization concepts and will easily grasp those basics. We discuss how OpenMP features are implemented and then focus on performance aspects, such as data and thread locality on NUMA architectures, false sharing, and private versus shared data. We discuss language features in depth, with emphasis on features recently added to OpenMP such as tasking. We close with debugging, compare various tools, and illustrate how to avoid correctness pitfalls. More details can be found on the event website.
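As a taste of the correctness pitfalls the tutorial covers, here is a small illustrative example of the private versus shared issue: a temporary declared outside a parallel loop silently defaults to shared and races.

```c
/* Classic pitfall sketch: a scratch variable defaulting to shared. */
void scale_rows(double a[][1024], const double *f, int rows) {
    double tmp;  /* declared outside the loop... */

    /* WRONG: with a plain "#pragma omp parallel for", tmp is shared
     * and the threads overwrite each other's value (a data race). */

    /* RIGHT: make it explicitly private, or declare it in the loop. */
    #pragma omp parallel for private(tmp)
    for (int i = 0; i < rows; i++) {
        tmp = f[i] * 2.0;
        for (int j = 0; j < 1024; j++)
            a[i][j] *= tmp;
    }
}
```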