Articles and Tutorials on OpenMP 4.0

You have probably heard by now that OpenMP 4.0 has finally been released, together with an official statement. It really is a major new release, and therefore it will take a while until all implementations have incorporated all new features. Nevertheless, as some implementers already offer beta releases of their compiler products with some new OpenMP 4.0 features available, you might be interested in learning more about the new standard and getting your hands dirty. In this blog post I have collected links to the OpenMP 4.0 material I am currently aware of and give pointers to places and events at which you can learn more.

First, if you are fine with reading a German article, my friend Michael Klemm and I have written an overview piece discussing the most important changes and new additions (from our point of view), including some code examples. It has been published at heise Developer. Together we also gave a corresponding presentation at parallel 2013, for which I made the slides available on my blog (slides in English), again with several code snippets.

At the end of July and in early August we held our “Parallel Programming Summer Course” in Aachen, during which OpenMP occupied two days of the agenda. The course material contains three slide decks on OpenMP which give a thorough introduction (I hope) to OpenMP programming and touch on the following new OpenMP 4.0 features: device constructs, task dependencies, thread affinity, array sections and user-defined reductions. I gave very similar talks at the Hartree Centre Summer School 2013.

Rolf Rabenseifner from HLRS also holds many very good courses on parallel programming. He is currently extending his material to cover selected OpenMP 4.0 topics, probably for the next course instance already.

If you attended ISC’13 in Leipzig, you had the chance to hear Bronis de Supinski, Michael Klemm and myself in the half-day Advanced OpenMP Tutorial. Our slides are part of the tutorial proceedings.

At SC13 in Denver the same group plus Ruud van der Pas will talk about Advanced OpenMP: Performance and 4.0 Features. This will be the first time we focus in great detail on the new features of OpenMP 4.0 and how to exploit them for programmability and performance. And finally, at Euro-Par 2013, Tim Mattson and I will be giving a half-day tutorial on Advanced OpenMP again, this time focussing even more on lower-level system details like the memory model and cache coherency mechanisms.

OpenMP 4.0 almost ready after recent F2F meeting

Last week’s OpenMP Language Committee face-to-face (F2F) meeting was meant to resolve the final outstanding issues to get the OpenMP 4.0 specification ready. With this week’s concall I assume we achieved just that and now it is our editor’s turn to apply all remaining tickets to the spec document. After that, the OpenMP ARB will perform the official vote on July 11th (if my calendar is correct), which in case of a positive vote will then also be the release date of the OpenMP 4.0 spec. This voting is generally considered just a formality, as the OpenMP member companies and institutions sending staff to the Language Committee also constitute the OpenMP ARB. OpenMP 4.0 will not break existing codes.

If you are interested in learning about the new features, you may want to stop by the JARA-HPC booth #755 at ISC in Leipzig next week. We have (preliminary) OpenMP 4.0 syntax reference cards as handouts for you. If you want to meet me in person, you are welcome to visit the booth during my booth duties on Monday (11:30h to 13:00h), Tuesday (11:30h to 13:00h) or Wednesday (13:00h to 14:30h).

Advanced OpenMP Tutorial @ ISC in June in Leipzig

The International Supercomputing Conference (ISC) will take place in Leipzig, Germany, next week from June 16th to June 20th, 2013. This year the program contains tutorials again and the team of Bronis de Supinski (LLNL), Michael Klemm (Intel) and myself will offer the Advanced OpenMP Programming tutorial on June 16th, 9:00 AM to 1:00 PM. If you are interested in learning about performance-focused OpenMP programming and the new features in OpenMP 4.0, this might be the right one for you, although we obviously cannot cover everything in detail in just the 4 hours we got. We asked for a full day, but got only a half one.

While we quickly review the basics of OpenMP programming, we assume attendees understand basic parallelization concepts and will easily grasp those basics. We focus on performance aspects, such as data and thread locality on NUMA architectures, false sharing, and exploitation of SIMD vector units. We discuss language features in-depth, with emphasis on features recently added to OpenMP such as tasking. We close with an overview of the new OpenMP 4.0 directives for attached compute accelerators. This is our detailed agenda:

  1. OpenMP Overview (15 minutes)
    1. Core Concepts: Parallel Region, Worksharing, Nesting
    2. Synchronization: Synchronization Constructs and the Memory Model
  2. Techniques to Obtain High Performance with OpenMP: Memory Access (45 minutes)
    1. Understanding Memory Access Patterns
    2. Memory Placement and Thread Binding
    3. Performance Tips and Tricks: Avoiding False Sharing, Private versus Shared Data
  3. Techniques to Obtain High Performance with OpenMP: Vectorization (30 minutes)
    1. Understanding Vector Microarchitectures
    2. Vectorization with OpenMP 4.0
  4. Advanced Language features (60 minutes)
    1. The OpenMP Tasking Model
    2. Tasking in Detail: Final, Mergeable, and Dependencies
    3. Cancellation
    4. Misc. OpenMP 4.0 Features: Controlling the Implementation, Reduction Extensions, Improved Atomic Support
  5. OpenMP for Attached Compute Accelerators (45 minutes)
    1. The OpenMP Execution Model for Devices
    2. Target Construct
    3. OpenMP on the Intel Xeon Phi Coprocessor Examples
  6. Future OpenMP Directions (15 minutes)
    1. Comprehensive OpenMP new Features Overview
    2. OpenMP 4.0 and Beyond: Status, Directions and Schedule
    3. Open Discussion of Possible OpenMP Extensions (until we get thrown out of the room or people have left for lunch)

OpenMP 4.0 RC2 has been released

In addition to the new features already present in release candidate one (RC1), the second draft of the next OpenMP specification release contains the following additions (quoted):

  • Initial accelerators support: Device Data Environments (p16), target constructs (p68: target, target data, target update, declare target, teams, distribute; p151: map clause; and associated runtime routines (p191).)
  • Task dependency support through the new depend clause. (p91)
  • Initial error model support through cancel and cancellation point constructs to request cancellation of specified region types and to declare a user-defined cancellation point to check for cancellation requests. (Section 2.13, p116: Cancellation Constructs)
  • Support for array sections in C, C++ and Fortran. (Section 2.4, p36: Array Sections)
  • Extended declare simd directive to allow multiple declarations. (p64)
  • New environment variable OMP_DISPLAY_ENV instructing the runtime to display the OpenMP version number and ICV values during initialization. (p219)
  • Additional enhancements to support Fortran 2003.

As we were not yet able to incorporate all the feedback that has been reported so far, a few known issues are still in the document. Additionally, some more minor changes are already in preparation. Feedback and questions are of course still welcome, so head over and download the new document.

OpenMP 4.0 RC2 is well on its way

Long time no blog post. But I have good news to share today: Yesterday the OpenMP Language Committee (LC) held the final votes on a set of tickets (read: extensions or corrections) to find their way into OpenMP 4.0 RC2, the second release candidate of OpenMP 4.0, the anticipated next version of the specification. These tickets are basically the outcome of the last LC meeting in January plus some first feedback we received on RC1. And they bring some very nice new features to OpenMP (some of which are well overdue).

Before I give a brief overview of the new additions, some remarks on the procedure leading to the final OpenMP 4.0 specification. Our aim is to have the RC2 document ready and published in roughly two weeks from now. This is now hard work for our editor Richard Friedman, as all tickets are written as a diff to the currently latest spec, namely RC1. After the release of RC2, we will again solicit feedback on the new spec draft. This feedback is important for the final voting by all OpenMP members in the Architecture Review Board (ARB), as the ARB is the owner of the spec and has to formally accept the new spec proposed by the LC. Only then – given majority acceptance in the ARB vote – will OpenMP 4.0 be released. During the feedback period, the LC still has to complete some aspects of the spec which are not ready yet, like the appendix. The examples in particular are not complete. And if any one of the public reviewers finds a serious flaw in any of the new extensions, we will face the problem of fixing it quickly (if possible) or withdrawing it from OpenMP 4.0. This means the following additions have a very high probability of being part of OpenMP 4.0, but nothing is guaranteed yet.

Cancellation. At so-called cancellation points, implicit and explicit tasks check whether cancellation has been requested; if so, they abort the current region and jump right to the end of it. Cancellation can be requested via the new cancel construct, which is able to cancel either the whole parallel region, the innermost sections worksharing construct, the innermost for worksharing construct, or the innermost taskgroup, with innermost always defined with regard to the thread team encountering the cancel construct. Control flow will resume right after the end of the cancelled region. A thread or task requesting cancellation does not cause all other threads or tasks in the respective region to abort immediately; instead, these will only abort execution once they reach a cancellation point. Cancellation points are part of barriers and of the cancel construct itself, or are user-defined via the cancellation point construct. This addition is the first step towards a fully-featured error model in OpenMP.
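To make this a bit more concrete, here is a minimal sketch of a parallel search that stops early. It assumes the syntax of the final OpenMP 4.0 specification and a runtime on which cancellation has been activated (in the final spec this is done by setting the OMP_CANCELLATION environment variable to true; otherwise the cancel construct has no effect and the loop simply runs to completion). The function name `contains` is made up for illustration.

```c
/* Hypothetical helper (name chosen for this sketch):
 * returns 1 if `key` occurs in data[0..n-1], otherwise 0. */
int contains(const int *data, int n, int key)
{
    int found = 0;

    #pragma omp parallel shared(found)
    {
        #pragma omp for
        for (int i = 0; i < n; i++) {
            if (data[i] == key) {
                /* Record the hit; several threads may find a match. */
                #pragma omp atomic write
                found = 1;

                /* Request cancellation of the innermost for region.
                 * The encountering thread leaves the loop right here. */
                #pragma omp cancel for
            }

            /* User-defined cancellation point: the other threads notice
             * the request here and skip their remaining iterations. */
            #pragma omp cancellation point for
        }
    }
    return found;
}
```

The result is the same with and without cancellation; only the amount of wasted work differs. With GCC or a compatible compiler the sketch would be built with `-fopenmp`.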

Task Dependencies. The optional depend clause on a task enforces additional constraints on the scheduling of a task by enabling dependencies between sibling tasks. It is of the form depend(dependence-type: list), accepting a list of variables. If an in dependence-type is given, the generated task will be a dependent task of all previously generated sibling tasks that reference at least one of the list items in an out or inout clause. If an out or inout dependence-type is given, the generated task will be a dependent task of all previously generated sibling tasks that reference at least one of the list items in an in, out, or inout clause. The principal model you should have in mind with this feature is the flow of data: if a variable appears in an in-type depend clause of a given task, this task has to wait for all previously generated sibling tasks in which this variable appears in an out-type or inout-type depend clause, because those tasks have to write the variable first; otherwise the reader would miss their update. The reverse holds as well: a task that writes a variable has to wait for earlier sibling tasks that read it. This is also why out/inout-type dependences enforce “waiting” for earlier out/inout dependences, as this imposes an ordering on the task execution. In the current form of this feature, there is no observable difference in task scheduling between the out and inout dependence types, but we foresee allowing certain types of optimizations in the future if there is a distinction between out and inout. Additionally, we found that having both types maps better to the data-flow model.
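The data-flow idea can be sketched with three sibling tasks. This is an illustration under the assumption of the final OpenMP 4.0 syntax, not an excerpt from the spec; the function name `pipeline` and the concrete values are made up.

```c
/* Hypothetical three-stage pipeline (names chosen for this sketch). */
int pipeline(void)
{
    int x = 0, y = 0;

    #pragma omp parallel
    #pragma omp single
    {
        /* Task A writes x. */
        #pragma omp task shared(x) depend(out: x)
        x = 20;

        /* Task B reads x, so it depends on task A (in after out). */
        #pragma omp task shared(x, y) depend(in: x) depend(out: y)
        y = x + 1;

        /* Task C updates x, so it depends on task A (inout after out)
         * and on task B (inout after in): B must read the old x first. */
        #pragma omp task shared(x) depend(inout: x)
        x = x * 2;

        /* Wait for all three child tasks before using the results. */
        #pragma omp taskwait
    }
    return x + y;   /* x == 40, y == 21 */
}
```

Without the depend clauses the three tasks could run in any order and the result would be unpredictable; with them, the order A, B, C is enforced even though only one thread creates the tasks.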

Array Sectioning. Several clauses in OpenMP 4.0 require the ability to describe a subset of native arrays, especially the support for accelerators in the target construct (see below). Array sectioning allows you to define a subset of the elements in an array via [lower-bound : length], or [lower-bound :], or [: length] or [:]. The use of array sectioning is restricted to selected constructs and clauses.
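As a small illustration, the following sketch uses an array section in a map clause, assuming the final OpenMP 4.0 syntax. The helper name `scale_range` is hypothetical. Because `v` is a plain pointer, the compiler cannot know how many elements it refers to, so the section v[first:count] states explicitly which elements are meant: `count` elements starting at index `first`.

```c
/* Hypothetical helper (name chosen for this sketch): multiplies the
 * elements v[first] .. v[first + count - 1] by `factor` on the
 * default device. */
void scale_range(double *v, int first, int count, double factor)
{
    /* Array section: lower bound `first`, length `count`.
     * Only this part of the array is mapped to and from the device. */
    #pragma omp target map(tofrom: v[first:count])
    for (int i = first; i < first + count; i++)
        v[i] *= factor;
}
```

If no accelerator is available, implementations typically execute the target region on the host, so the sketch behaves like an ordinary loop there.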

Support for Accelerators. This is really the big new thing in OpenMP 4.0, which the LC aimed to get ready for inclusion. The target construct allows for the execution of OpenMP constructs on a device other than the current device. The target data construct creates a device data environment and allows for the “mapping” of data between the different devices; for current accelerators this means copying data from the host to the device and vice versa. The declare target directive instructs the OpenMP implementation to create device-specific versions of the variables or functions specified, meaning they are available (for execution) on the device. In order to support vector-style operations of current accelerators, there are two new constructs: the teams construct creates several OpenMP thread teams (together called a league), in each of which only the master thread executes the associated region, and the distribute construct specifies that the corresponding loop iterations will be executed by the thread teams. I understand this very brief description cannot serve as a good explanation for this new feature, but I don’t have my examples ready yet. If you know OpenACC and/or the PGI Accelerator programming model, you probably have an idea of what will be in OpenMP. Personally, I regard OpenMP 4.0 with this extension as a superset of OpenACC, in which the common roots are visible. More information and documentation on this new feature, which I like to call “OpenMP for Accelerators (OpenMP4Acc)”, should become available with the release of the OpenMP 4.0 RC2 spec draft. By March 14th at the latest, for our next PPCES event, I will have a more detailed introduction along with some examples ready and will put them here as well.
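A minimal sketch of how these constructs fit together, assuming the syntax of the final OpenMP 4.0 specification; the names `square` and `square_then_halve` are made up for illustration. The map clause is repeated on each target construct so that the pointer is translated to its device copy; since the data is already present inside the target data region, this does not trigger additional copies.

```c
/* declare target: make `square` available for execution on the device. */
#pragma omp declare target
double square(double x)
{
    return x * x;
}
#pragma omp end declare target

/* Hypothetical helper (name chosen for this sketch): squares and then
 * halves every element of v[0..n-1] in two separate device kernels. */
void square_then_halve(double *v, int n)
{
    /* target data: create a device data environment. v[0:n] is copied to
     * the device on entry and back to the host on exit, and stays on the
     * device in between. */
    #pragma omp target data map(tofrom: v[0:n])
    {
        /* First kernel: a league of teams shares the loop iterations. */
        #pragma omp target map(tofrom: v[0:n])
        #pragma omp teams distribute
        for (int i = 0; i < n; i++)
            v[i] = square(v[i]);

        /* Second kernel: reuses the device copy created by target data. */
        #pragma omp target map(tofrom: v[0:n])
        #pragma omp teams distribute
        for (int i = 0; i < n; i++)
            v[i] *= 0.5;
    }
}
```

To also use the threads within each team, the combined `teams distribute parallel for` construct can be used instead of `teams distribute`.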

Print Environment Settings. If the OMP_DISPLAY_ENV environment variable is set to true, the execution environment is instructed to display the OpenMP version number as well as the values of all the ICVs (ICV = Internal Control Variable) after evaluating the user options and before starting the actual program execution. This is very helpful if one uses multiple environments.
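For comparison, a few of these values can also be queried from inside a program. The following is only a sketch with a made-up function name, `report_openmp_settings`; it uses the standard `_OPENMP` macro and the runtime routines `omp_get_max_threads` and `omp_get_dynamic`, and it falls back to a plain message when compiled without OpenMP support.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper (name chosen for this sketch): prints a few of the
 * settings that OMP_DISPLAY_ENV would report and returns the maximum
 * number of threads a parallel region would use. */
int report_openmp_settings(void)
{
#ifdef _OPENMP
    int max_threads = omp_get_max_threads();

    /* _OPENMP encodes the release date (yyyymm) of the supported spec. */
    printf("_OPENMP     = %d\n", _OPENMP);
    printf("max threads = %d\n", max_threads);
    printf("dynamic     = %d\n", omp_get_dynamic());
    return max_threads;
#else
    printf("compiled without OpenMP support\n");
    return 1;
#endif
}
```

Running a program with OMP_DISPLAY_ENV=true set in the environment lets the runtime print its own, more complete listing before the program starts, with no source changes at all.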

Adding so many things to OpenMP 4.0 (don’t forget the new features already present in OpenMP 4.0 RC1) also has at least one obvious downside: the specification itself has become almost unreadable for the average OpenMP user. I cannot completely exclude myself here, although I spend a reasonable amount of my work time dealing with OpenMP itself. This clearly underlines the need for good books on OpenMP 4.0 programming, but I am not aware of anyone currently working on such a thing. Several members of the LC as well as well-known instructors from academia will for sure add OpenMP 4.0 aspects to their lectures and tutorials soon, but this is only the first tiny step towards OpenMP 4.0 adoption. I am curious to see how programmers will pick up the new goodness…

OpenMP 4.0 RC1 and the Accelerator TR available

Quoting from the announcement: OpenMP, the de-facto standard for parallel programming on shared memory systems, continues to extend its reach beyond pure HPC to include embedded systems, real time systems, and accelerators. Release Candidate 1 of the OpenMP 4.0 API specifications currently under development is now available for public discussion. This update includes thread affinity, initial support for Fortran 2003, SIMD constructs to vectorize both serial and parallelized loops, user-defined reductions, and sequentially consistent atomics. The OpenMP ARB plans to integrate the Technical Report on directives for attached accelerators, as well as more new features, in a final Release Candidate 2, to appear sometime in the first Quarter of 2013, followed by the finalized full 4.0 API specifications soon thereafter.
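Two of the features named in the announcement, SIMD constructs and user-defined reductions, can be sketched in a few lines. The sketch assumes the syntax of the final OpenMP 4.0 specification; the type `vec2`, the reduction identifier `vec2_add` and the function names are made up for illustration.

```c
/* SIMD construct: ask the compiler to vectorize the loop; `sum` is a
 * reduction variable across the SIMD lanes. */
double dot(const double *a, const double *b, int n)
{
    double sum = 0.0;

    #pragma omp simd reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Hypothetical 2D vector type used for the user-defined reduction. */
typedef struct { double x, y; } vec2;

/* Combiner: adds `in` onto `out` component by component. */
void vec2_accumulate(vec2 *out, const vec2 *in)
{
    out->x += in->x;
    out->y += in->y;
}

/* User-defined reduction: tells OpenMP how to combine two partial
 * results of type vec2 and how to initialize each private copy. */
#pragma omp declare reduction(vec2_add : vec2 : \
        vec2_accumulate(&omp_out, &omp_in)) \
        initializer(omp_priv = { 0.0, 0.0 })

vec2 sum_points(const vec2 *p, int n)
{
    vec2 total = { 0.0, 0.0 };

    #pragma omp parallel for reduction(vec2_add : total)
    for (int i = 0; i < n; i++)
        vec2_accumulate(&total, &p[i]);
    return total;
}
```

As with any floating-point reduction, the order of the additions depends on the number of threads and SIMD lanes, so results may differ in the last bits between runs with different settings.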

The OpenMP Language Committee really put a lot of effort and dedicated work into both documents and we hope for good, constructive feedback. Both documents are available at the OpenMP Specifications webpage.

Grab them now while they are hot :-).