OpenMP 4.0 almost ready after recent F2F meeting

Last week’s OpenMP Language Committee face-to-face (F2F) meeting was meant to resolve the final outstanding issues to get the OpenMP 4.0 specification ready. With this week’s concall I assume we achieved just that, and now it is our editor’s turn to apply all remaining tickets to the spec document. After that, the OpenMP ARB will perform the official vote on July 11th (if my calendar is correct); if the vote is positive, that date will also be the release date of the OpenMP 4.0 spec. The vote is generally considered a formality, as the OpenMP member companies and institutions sending staff to the Language Committee also constitute the OpenMP ARB. OpenMP 4.0 will not break existing codes.

If you are interested in learning about the new features, you may want to stop by the JARA-HPC booth (#755) at ISC in Leipzig next week. We have (preliminary) OpenMP 4.0 syntax reference cards as handouts for you. If you want to meet me in person, you are welcome to visit the booth during my booth duties on Monday (11:30h to 13:00h), Tuesday (11:30h to 13:00h) or Wednesday (13:00h to 14:30h).


Advanced OpenMP Tutorial @ ISC in June in Leipzig

The International Supercomputing Conference (ISC) will take place in Leipzig, Germany, next week from June 16th to June 20th, 2013. This year the program contains tutorials again, and the team of Bronis de Supinski (LLNL), Michael Klemm (Intel) and myself will offer the Advanced OpenMP Programming tutorial on June 16th, 9:00 AM to 1:00 PM. If you are interested in learning about performance-focused OpenMP programming and the new features in OpenMP 4.0, this might be the right one for you, although we obviously cannot cover everything in detail in just the four hours we have: we asked for a full day, but got only half a day.

While we quickly review the basics of OpenMP programming, we assume attendees understand basic parallelization concepts and will easily grasp those basics. We focus on performance aspects, such as data and thread locality on NUMA architectures, false sharing, and exploitation of SIMD vector units. We discuss language features in-depth, with emphasis on features recently added to OpenMP such as tasking. We close with an overview of the new OpenMP 4.0 directives for attached compute accelerators. This is our detailed agenda:

  1. OpenMP Overview (15 minutes)
    1. Core Concepts: Parallel Region, Worksharing, Nesting
    2. Synchronization: Synchronization Constructs and the Memory Model
  2. Techniques to Obtain High Performance with OpenMP: Memory Access (45 minutes)
    1. Understanding Memory Access Patterns
    2. Memory Placement and Thread Binding
    3. Performance Tips and Tricks: Avoiding False Sharing, Private versus Shared Data
  3. Techniques to Obtain High Performance with OpenMP: Vectorization (30 minutes)
    1. Understanding Vector Microarchitectures
    2. Vectorization with OpenMP 4.0
  4. Advanced Language features (60 minutes)
    1. The OpenMP Tasking Model
    2. Tasking in Detail: Final, Mergeable, and Dependencies
    3. Cancellation
    4. Misc. OpenMP 4.0 Features: Controlling the Implementation, Reduction Extensions, Improved Atomic Support
  5. OpenMP for Attached Compute Accelerators (45 minutes)
    1. The OpenMP Execution Model for Devices
    2. Target Construct
    3. Examples: OpenMP on the Intel Xeon Phi Coprocessor
  6. Future OpenMP Directions (15 minutes)
    1. Comprehensive Overview of New OpenMP Features
    2. OpenMP 4.0 and Beyond: Status, Directions and Schedule
    3. Open Discussion of Possible OpenMP Extensions (until we get thrown out of the room or people leave for lunch)

OpenMP 4.0 RC2 has been released

In addition to the new features already present in release candidate one (RC1), the second draft of the next OpenMP specification release contains the following additions (quoted from openmp.org):

  • Initial accelerator support: Device Data Environments (p16); target constructs (p68): target, target data, target update, declare target, teams, distribute; the map clause (p151); and associated runtime routines (p191).
  • Task dependency support through the new depend clause. (p91)
  • Initial error model support through cancel and cancellation point constructs to request cancellation of specified region types and to declare a user-defined cancellation point to check for cancellation requests. (Section 2.13, p116: Cancellation Constructs)
  • Support for array sections in C, C++ and Fortran. (Section 2.4, p36: Array Sections)
  • Extended declare simd directive to allow multiple declarations. (p64)
  • New environment variable OMP_DISPLAY_ENV instructing the runtime to display the OpenMP version number and ICV values during initialization. (p219)
  • Additional enhancements to support Fortran 2003.

As we were not yet able to incorporate all the feedback that has been reported so far, a few known issues remain in the document. Additionally, some further minor changes are already in preparation. Feedback and questions are of course still welcome, so head over to http://openmp.org/wp/openmp-specifications/ and download the new document.


OpenMP 4.0 RC2 is well on its way

Long time no blog post. But I have good news to share today: yesterday the OpenMP Language Committee (LC) held the final votes on a set of tickets (read: extensions or corrections) that will find their way into OpenMP 4.0 RC2, the second release candidate of OpenMP 4.0, the anticipated next version of the specification. These tickets are basically the outcome of the last LC meeting in January plus some first feedback we received on RC1. And they bring some very nice new features to OpenMP (some of which are well overdue).

Before I give a brief overview of the new additions, some remarks on the procedure leading to the final OpenMP 4.0 specification. Our aim is to have the RC2 document ready and published roughly two weeks from now. This is now hard work for our editor Richard Friedman, as all tickets are written as a diff to the currently latest spec, namely RC1. After the release of RC2, we will again solicit feedback on the new spec draft. This feedback is important for the final voting by all OpenMP members in the Architecture Review Board (ARB), as the ARB is the owner of the spec and has to formally accept the new spec proposed by the LC. Only then, given majority acceptance in the ARB vote, will OpenMP 4.0 be released. During the feedback period, the LC still has to complete some aspects of the spec that are not ready yet, like the appendix; in particular, the examples are not complete. And if any of the public reviewers finds a serious flaw in one of the new extensions, we will face the problem of fixing it quickly (if possible) or withdrawing it from OpenMP 4.0. This means the following additions have a very high probability of being part of OpenMP 4.0, but nothing is guaranteed yet.

Cancellation. At so-called cancellation points, implicit and explicit tasks check whether cancellation has been requested; if so, they abort the current region and jump right to its end. Cancellation can be requested via the new cancel construct, which is able to cancel either the whole parallel region, the innermost sections worksharing construct, the innermost for worksharing construct, or the innermost taskgroup, with innermost always defined with respect to the thread team encountering the cancel construct. Control flow resumes right after the end of the cancelled region. A thread or task requesting cancellation does not lead to an immediate abort of all other threads or tasks in the respective region; instead, these will only abort execution once a cancellation point has been reached. Cancellation points are part of barriers and the cancel construct itself, or user-defined via the cancellation point construct. This addition is the first step towards a fully-featured error model in OpenMP.
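
A minimal C sketch of how I expect this to be used for an early-exit search; the loop, the stand-in predicate expensive_check and the cut-off value are invented purely for illustration:

```c
#include <stdio.h>

/* stand-in predicate, purely illustrative */
static int expensive_check(int i) { return i == 123456; }

int main(void)
{
    int found = -1;

    /* depending on the implementation, cancellation may have to be enabled,
       e.g. via OMP_CANCELLATION=true in the environment */
    #pragma omp parallel shared(found)
    {
        #pragma omp for
        for (int i = 0; i < 1000000; i++) {
            /* user-defined cancellation point: check whether cancellation of
               the innermost for region has already been requested */
            #pragma omp cancellation point for

            if (expensive_check(i)) {
                #pragma omp critical
                found = i;
                /* request cancellation of the innermost for worksharing region */
                #pragma omp cancel for
            }
        }
        /* all threads of the team resume here, cancelled or not */
    }

    printf("found = %d\n", found);
    return 0;
}
```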

Task Dependencies. The optional depend clause on a task construct enforces additional constraints on the scheduling of a task by enabling dependences between sibling tasks. It has the form depend(dependence-type: list), accepting a list of variables. If an in dependence-type is given, the generated task will be a dependent task of all previously generated sibling tasks that reference at least one of the list items in an out or inout clause. If an out or inout dependence-type is given, the generated task will be a dependent task of all previously generated sibling tasks that reference at least one of the list items in an in, out, or inout clause. The principal model you should have in mind with this feature is the flow of data: if a variable appears in an in-type depend clause of a given task, this task has to wait for all tasks in which this particular variable appears in an out-type depend clause, as these tasks first have to write to the variable; otherwise their update would be lost. And vice versa. This is also why out/inout-type dependences enforce “waiting” for earlier out/inout-type dependences, as this enforces an ordering of the task execution. In the current form of this feature there is no observable difference in task scheduling between the out and inout dependence types, but we foresee allowing certain types of optimizations in the future if there is a distinction between out and inout. Additionally, we found that having both types maps better to the data-flow model.
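
A small C sketch of this data-flow idea; the variable x and the three work functions are invented for illustration:

```c
#include <stdio.h>

/* stand-in work functions, purely illustrative */
static void produce(double *x) { *x = 42.0; }
static void update (double *x) { *x += 1.0; }
static void consume(const double *x) { printf("x = %f\n", *x); }

int main(void)
{
    double x = 0.0;

    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: x)   /* T1 writes x */
        produce(&x);

        #pragma omp task depend(inout: x) /* T2 reads and writes x: waits for T1 */
        update(&x);

        #pragma omp task depend(in: x)    /* T3 and T4 only read x: both wait    */
        consume(&x);                      /* for T2, but may run concurrently    */
        #pragma omp task depend(in: x)
        consume(&x);

        #pragma omp taskwait              /* wait for all four tasks */
    }
    return 0;
}
```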

Array Sectioning. Several clauses in OpenMP 4.0 require the ability to describe a subset of a native array, especially the support for accelerators in the target constructs (see below). Array sectioning allows a subset of the elements of an array to be specified via [lower-bound : length], [lower-bound :], [: length] or [:]. The use of array sections is restricted to selected constructs and clauses.
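
For example, combined with the map clause of the target construct described below, an array section selects just the part of an array that actually has to be moved to the device. A hedged C sketch, with the array name and sizes invented:

```c
#include <stdio.h>

#define N 1000

int main(void)
{
    double a[N];
    for (int i = 0; i < N; i++)
        a[i] = (double)i;

    /* a[0:N/2] is the section starting at element 0 with length N/2;
       only this half of the array is mapped to and from the device */
    #pragma omp target map(tofrom: a[0:N/2])
    {
        for (int i = 0; i < N/2; i++)
            a[i] *= 2.0;
    }

    printf("a[1] = %f, a[N-1] = %f\n", a[1], a[N - 1]);
    return 0;
}
```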

Support for Accelerators. This is really the big new thing in OpenMP 4.0 which the LC aimed to get ready for inclusion. The target construct allows for the execution of OpenMP constructs on a device other than the host. The target data construct creates a device data environment and allows for the “mapping” of data between the different devices; for current accelerators this means copying data from the host to the device and vice versa. The declare target directive instructs the OpenMP implementation to create device-specific versions of the variables or functions specified, meaning they are available (for execution) on the device. In order to support the vector-style operation of current accelerators, there are two new constructs: the teams construct creates several OpenMP thread teams (then called a league), in each of which initially only the master thread executes the associated region, and the distribute construct specifies that the corresponding loop iterations will be distributed across the thread teams. I understand this very brief description cannot serve as a good explanation of this new feature, but I don’t have my examples ready yet. If you know OpenACC and/or the PGI Accelerator programming model, you probably have a good idea of what will be in OpenMP. Personally, I regard OpenMP 4.0 with this extension as a superset of OpenACC, in which the common roots are visible. More information and documentation on this new feature, which I like to call “OpenMP for Accelerators (OpenMP4Acc)”, should become available with the release of the OpenMP 4.0 RC2 spec draft. By March 14th at the latest, for our next PPCES event, I will have a more detailed introduction along with some examples ready and will put them here as well.
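
Since my examples are not ready yet, here is at least a rough, hedged sketch of how I currently understand these pieces to fit together; the helper function, array names and sizes are invented, and the exact spelling may still change before the final spec:

```c
#include <stdio.h>

#define N 4096

/* make scale_val available for execution on the device as well */
#pragma omp declare target
static double scale_val(double x, double a) { return a * x; }
#pragma omp end declare target

int main(void)
{
    static double x[N], y[N];
    double a = 2.0;
    for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    /* create a device data environment and map x and y into it */
    #pragma omp target data map(to: x[0:N]) map(tofrom: y[0:N])
    {
        /* offload the following region to the default device */
        #pragma omp target
        /* create a league of thread teams ... */
        #pragma omp teams
        /* ... and distribute the loop iterations across the teams; a real code
           would typically add a nested parallel for to use all threads per team */
        #pragma omp distribute
        for (int i = 0; i < N; i++)
            y[i] += scale_val(x[i], a);
    }

    printf("y[0] = %f\n", y[0]);
    return 0;
}
```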

Print Environment Settings. If the OMP_DISPLAY_ENV environment variable is set to true, the execution environment is instructed to display the OpenMP version number as well as the values of all ICVs (ICV = Internal Control Variable) after evaluating the user options and before starting the actual program execution. This is very helpful if one uses multiple environments.

Adding so many things to OpenMP 4.0 (don’t forget the new features already present in OpenMP 4.0 RC1) also has at least one obvious downside: the specification itself has become almost unreadable for the average OpenMP user. I cannot completely exclude myself here, although I spend a reasonable amount of my work time dealing with OpenMP itself. This clearly underlines the need for good books on OpenMP 4.0 programming, but I am not aware of anyone currently working on such a thing. Several members of the LC as well as well-known instructors from academia will for sure add OpenMP 4.0 aspects to their lectures and tutorials soon, but this is only the first tiny step towards OpenMP 4.0 adoption. I am curious to see how programmers will pick up the new goodness…


OpenMP 4.0 RC1 and the Accelerator TR available

Quoting from openmp.org: OpenMP, the de-facto standard for parallel programming on shared memory systems, continues to extend its reach beyond pure HPC to include embedded systems, real time systems, and accelerators. Release Candidate 1 of the OpenMP 4.0 API specifications currently under development is now available for public discussion. This update includes thread affinity, initial support for Fortran 2003, SIMD constructs to vectorize both serial and parallelized loops, user-defined reductions, and sequentially consistent atomics. The OpenMP ARB plans to integrate the Technical Report on directives for attached accelerators, as well as more new features, in a final Release Candidate 2, to appear sometime in the first quarter of 2013, followed by the finalized full 4.0 API specifications soon thereafter.
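
As one example from that list, user-defined reductions allow reductions to be declared for combinations the built-in operators do not cover. Here is a minimal, hedged C sketch; the absmax reduction (largest magnitude, sign preserved) and the test data are invented for illustration, and the spelling follows the current draft:

```c
#include <math.h>
#include <stdio.h>

/* user-defined reduction: keep the value of largest magnitude, sign included
   (not expressible with the built-in min/max reduction operators) */
#pragma omp declare reduction(absmax : double : \
        omp_out = (fabs(omp_in) > fabs(omp_out) ? omp_in : omp_out)) \
        initializer(omp_priv = 0.0)

int main(void)
{
    double v[1000], peak = 0.0;
    for (int i = 0; i < 1000; i++)
        v[i] = (i % 2 ? -1.0 : 1.0) * (i % 17);   /* arbitrary test data */

    #pragma omp parallel for reduction(absmax: peak)
    for (int i = 0; i < 1000; i++)
        if (fabs(v[i]) > fabs(peak))
            peak = v[i];

    printf("value of largest magnitude: %f\n", peak);
    return 0;
}
```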

The OpenMP Language Committee really put a lot of effort and dedicated work into both documents, and we hope for good, constructive feedback. Both documents are available at the OpenMP Specifications webpage: http://openmp.org/wp/openmp-specifications/.

Grab them now while they are hot :-).


Expect big OpenMP 4.0 news for SC12

Expect big news on OpenMP 4.0 for next week’s SC12. The OpenMP Language Committee, responsible for developing the standard, always planned to release the next version of the standard as a draft for public comment in time for SC12. We worked very hard over the last weeks to stay on schedule. And we will do the following:

  • Release OpenMP 4.0 RC1 as a draft for public review. This document will be in pretty good shape and will represent the foundation of OpenMP 4.0. It will contain several new features, to be discussed and explained during SC12 at our booth and/or the OpenMP BoF. Among these new features are the SIMD constructs to vectorize both serial and parallelized loops (see the sketch after this list), taskgroups (no task dependencies yet), thread binding via places (I have already talked a lot about this), array sectioning, basic support for Fortran 2003, and some other minor corrections and improvements.
  • Publish a Technical Report on OpenMP for Accelerators, more specifically on “Directives for Attached Accelerators”. This was always planned to be the major addition for OpenMP 4.0. However, integrating support for accelerators with the rest of OpenMP is a hard task and a lot of work, and it is not 100% done yet. There were many discussions on how to deal with this situation: do as outlined here, wait just some more weeks, come up with a completely new schedule and wait until we are completely done, … . Almost all technical aspects have been discussed and answered, but the wording is not yet complete. And support for NVIDIA-like GPUs might not be optimal. However, I personally think the proposal is really good, and the big opportunity in making the current state of work public is that the HPC community can take a look at it, think about it, comment on it, and possibly improve it. It is already online: http://openmp.org/wp/openmp-specifications/.
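
As a small taste of the SIMD constructs mentioned in the first item above, here is a hedged C sketch; the arrays and sizes are invented, and the spelling follows the current draft and may still change:

```c
#include <stdio.h>

#define N 1024

int main(void)
{
    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { b[i] = (float)i; c[i] = 2.0f * i; }

    /* vectorize a serial loop */
    #pragma omp simd
    for (int i = 0; i < N; i++)
        a[i] = b[i] + c[i];

    /* distribute iterations across threads and vectorize each thread's chunk */
    #pragma omp parallel for simd
    for (int i = 0; i < N; i++)
        a[i] = a[i] * b[i];

    printf("a[10] = %f\n", a[10]);
    return 0;
}
```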

Hoping for constructive feedback and taking the additional time to work on the OpenMP for Accelerators extension, the current plan is to come up with a second draft for public comment (RC2) in January 2013 and then finalize the standard quickly after that, quickly meaning within a few weeks. This plan is still ambitious, but I think it is a good plan.

If you want to learn more, come to the OpenMP booth, and come to the BoF on Tuesday afternoon at 17:30h, which unfortunately I will not be able to attend myself :-/. Listen to what the people there will show you and let us know what you like and what you dislike.


A Glimpse at OpenMP for Accelerators (aka OpenACC v2?)

During our OpenACC Workshop I contributed a brief talk on the current status of the OpenMP for Accelerators proposal. It caused some interest because, if successful, this proposal will be the de-facto successor of OpenACC, fully integrated into the rest of OpenMP. Hence I wanted to share this slide deck, but please understand that some of the information presented in it is changing on a daily basis! We expect the concepts to remain valid, but e.g. the spelling seems to change quite quickly, so understand this slide deck as a snapshot of October 11th, 2012. I was once critical of how OpenACC was born, but have since come to realize it was a good move and helped a lot in gaining experience with a pragma-based paradigm for programming accelerators. Furthermore, users already have something to work with, instead of still waiting for a standard to be completed…

The OpenMP for Accelerators subcommittee is run by James Beyer (Cray) and Eric Stotzer (TI), who do a great job of documenting the current state of the discussion. This made it pretty easy for me to compile the slide deck and keep colleagues as well as users informed.
