OpenMP 4.0 RC2 is well on its way

Long time no blog post. But I have good news to share today: yesterday the OpenMP Language Committee (LC) held the final votes on a set of tickets (read: extensions or corrections) to find their way into OpenMP 4.0 RC2, the second release candidate of OpenMP 4.0, the anticipated next version of the specification. These tickets are essentially the outcome of the last LC meeting in January plus some first feedback we received on RC1. And they bring some very nice new features to OpenMP (some of which are well overdue).

Before I give a brief overview of the new additions, some remarks on the procedure leading to the final OpenMP 4.0 specification. Our aim is to have the RC2 document ready and published roughly two weeks from now. This means hard work for our editor Richard Friedman, as all tickets are written as a diff against the currently latest spec, namely RC1. After the release of RC2, we will again solicit feedback on the new spec draft. This feedback is important for the final voting by all OpenMP members in the Architecture Review Board (ARB), as the ARB owns the spec and has to formally accept the new version proposed by the LC. Only then – given majority acceptance in the ARB vote – will OpenMP 4.0 be released. During the feedback period, the LC still has to complete some parts of the spec that are not ready yet, like the appendix. The examples in particular are not complete. And if any of the public reviewers finds a serious flaw in one of the new extensions, we will face the problem of fixing it quickly (if possible) or withdrawing it from OpenMP 4.0. This means the following additions have a very high probability of being part of OpenMP 4.0, but nothing is guaranteed yet.

Cancellation. At so-called cancellation points, implicit and explicit tasks check whether cancellation has been requested; if so, they abort the current region and jump right to its end. Cancellation can be requested via the new cancel construct, which is able to cancel either the whole parallel region, the innermost sections worksharing construct, the innermost for worksharing construct, or the innermost taskgroup, with innermost always defined with respect to the thread team encountering the cancel construct. Control flow resumes right after the end of the cancelled region. A thread or task requesting cancellation does not lead to an immediate abort of all other threads or tasks in the respective region; instead, these will only abort execution once a cancellation point has been reached. Cancellation points are part of barriers and the cancel construct itself, or can be user-defined via the cancellation point construct. This addition is the first step towards a fully-featured error model in OpenMP.
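To illustrate, here is a minimal sketch of my own (not taken from the spec draft) of cancelling a parallel search once a match is found. The function name is hypothetical, and keep in mind that an implementation may require cancellation to be activated at runtime, e.g. via the OMP_CANCELLATION environment variable:

```c
#include <stddef.h>

/* find_first is a hypothetical helper: it returns the smallest index of
   key in a[0..n-1], or -1 if key is absent. Once a thread finds a match,
   it requests cancellation of the for region; the other threads notice
   the request at their next cancellation point and stop early. */
int find_first(const int *a, size_t n, int key)
{
    int found = -1;
    #pragma omp parallel shared(found)
    {
        #pragma omp for
        for (size_t i = 0; i < n; i++) {
            if (a[i] == key) {
                #pragma omp critical
                {
                    if (found < 0 || (int)i < found)
                        found = (int)i;   /* keep the smallest index */
                }
                #pragma omp cancel for            /* request cancellation */
            }
            #pragma omp cancellation point for    /* user-defined check   */
        }
    }
    return found;
}
```

Without the cancel construct, all threads would always run their full iteration range even after a match has been found.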

Task Dependencies. The optional depend clause on a task enforces additional constraints on the scheduling of a task by enabling dependences between sibling tasks. It has the form depend(dependence-type: list), accepting a list of variables. If an in dependence-type is given, the generated task will be a dependent task of all previously generated sibling tasks that reference at least one of the list items in an out or inout clause. If an out or inout dependence-type is given, the generated task will be a dependent task of all previously generated sibling tasks that reference at least one of the list items in an in, out, or inout clause. The mental model you should have for this feature is the flow of data: if a variable appears in an in-type depend clause of a given task, this task has to wait for all tasks in which this particular variable appears in an out-type depend clause, as these tasks first have to write to the variable; otherwise the update would be lost. And vice versa. This is why out/inout-type dependences also enforce “waiting” for out/inout dependences, as this enforces an ordering of the task execution. In the current form of this feature, there is no observable difference in task scheduling between the out and inout dependence types, but we foresee allowing certain types of optimizations in the future if there is a distinction between out and inout. Additionally, we found that having both types maps better to the data-flow model.
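A small sketch of this producer/consumer flow (variable and function names are mine, not from the spec): a task with an out dependence on x must complete before a sibling task with an in dependence on x may start:

```c
/* dependent_tasks is a hypothetical example: the first task writes x
   (out dependence), the second reads it (in dependence), so the runtime
   must schedule the producer before the consumer. */
int dependent_tasks(void)
{
    int x = 0, y = 0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: x)
        x = 42;                 /* producer: guaranteed to run first   */

        #pragma omp task depend(in: x)
        y = x + 1;              /* consumer: always observes x == 42   */
    }                           /* barrier waits for both tasks        */
    return y;
}
```

Without the depend clauses, the two tasks could run in either order (or concurrently), and y might read the stale value of x.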

Array Sectioning. Several clauses in OpenMP 4.0 require the ability to describe a subset of a native array, especially the support for accelerators in the target construct (see below). Array sectioning allows you to define a subset of the elements in an array via [lower-bound : length], [lower-bound :], [: length], or [:]. The use of array sections is restricted to selected constructs and clauses.
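A sketch of the syntax, assuming the RC-draft map clause (which may still change before the final spec); the function name is hypothetical:

```c
/* scale_section is a hypothetical example: only the section
   a[100..199] is mapped to the device and doubled there; the rest
   of the array is neither transferred nor touched. */
void scale_section(double *a)
{
    #pragma omp target map(tofrom: a[100:100])  /* [lower-bound : length] */
    for (int i = 100; i < 200; i++)
        a[i] *= 2.0;
}
```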

Support for Accelerators. This is really the big new thing in OpenMP 4.0 that the LC aimed to get ready for inclusion. The target construct allows for the execution of OpenMP constructs on a device other than the current host device. The target data construct creates a device data environment and allows for the “mapping” of data between the different devices – for current accelerators this means copying data from the host to the device and vice versa. The declare target directive instructs the OpenMP implementation to create device-specific versions of the specified variables or functions, meaning they are available (for execution) on the device. In order to support the vector-style operations of current accelerators, there are two new constructs: the teams construct creates several OpenMP thread teams (then called a league), of which only the masters execute the associated region, and the distribute construct specifies that the corresponding loop iterations will be executed by the thread teams. — I understand this very brief description cannot serve as a good explanation of this new feature, but I don’t have my examples ready yet. If you know OpenACC and/or the PGI Accelerator programming model, you probably have a clue of what will be in OpenMP. Personally, I regard OpenMP 4.0 with this extension as a superset of OpenACC, in which the common roots are visible. More information and documentation on this new feature, which I like to call “OpenMP for Accelerators (OpenMP4Acc)”, should become available with the release of the OpenMP 4.0 RC2 spec draft. By March 14th at the latest, for our next PPCES event, I will have a more detailed introduction along with some examples ready and will put them here as well.
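In lieu of proper examples, here is my best guess at what a simple SAXPY could look like with these constructs – a sketch only, since the syntax may still change before the final spec:

```c
/* SAXPY sketch: target offloads to the default device, the map clauses
   transfer x (to the device) and y (to and from the device), teams
   creates a league of thread teams, and distribute spreads the loop
   iterations across those teams. */
void saxpy(int n, float a, const float *x, float *y)
{
    #pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])
    #pragma omp teams
    #pragma omp distribute
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

On a host without an attached device, an implementation would simply execute the region on the host itself.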

Print Environment Settings. If the OMP_DISPLAY_ENV environment variable is set to true, the execution environment is instructed to display the OpenMP version number as well as the values of all ICVs (ICV = Internal Control Variable) after evaluating the user options and before starting the actual program execution. This is very helpful if one uses multiple environments.
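Usage is as simple as setting the variable before launching the program (the program name below is a placeholder; the exact output format is up to the implementation):

```shell
# Dump the OpenMP version and all ICV values before ./my_program
# (a hypothetical binary) starts executing:
OMP_DISPLAY_ENV=true ./my_program
```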

Adding so many things to OpenMP 4.0 (don’t forget the new features already present in OpenMP 4.0 RC1) also has at least one obvious downside: the specification itself has become almost unreadable for the average OpenMP user. I cannot completely exclude myself here, although I spend a reasonable amount of my work time dealing with OpenMP itself. This clearly underlines the need for good books on OpenMP 4.0 programming, but I am not aware of anyone currently working on such a thing. Several members of the LC as well as well-known instructors from academia will for sure add OpenMP 4.0 aspects to their lectures and tutorials soon, but this is only a first tiny step towards OpenMP 4.0 adoption. I am curious to see how programmers will pick up the new goodness…

This entry was posted in Future of HPC, OpenACC, OpenMP. Bookmark the permalink.

4 Responses to OpenMP 4.0 RC2 is well on its way

  1. Graham says:

    I’ve been reviewing the support for accelerators in the March RC2 document, but am sorely missing examples; a simple GPU array element increment would be marvelous. Did you complete the detailed introduction with examples you were hoping for by March 14th?

  2. terboven says:

    The release of the final OpenMP 4.0 specification is currently expected for early July. But we decided to decouple the examples from the actual specification and they might be released later. The (normative) spec is really meant for OpenMP implementers, not for users. Personally, I think that with every version the spec is becoming less readable for OpenMP users and we really need good books on that topic!

    I did not find the time to write about my OpenMP 4.0 for Accelerators examples here on this blog. Below are two links to slide decks I used on some occasions to talk about OpenMP 4.0 support for Accelerators and SIMD vectorization; they contain code snippets / examples and should pretty much resemble the final API – however, things might still change as long as the spec has not been released. Please let me know if the links do not work for you.

    OpenMP for Accelerators: https://skydrive.live.com/redir.aspx?cid=b7be35f701a0d7d4&resid=B7BE35F701A0D7D4%217804&parid=B7BE35F701A0D7D4%21818&authkey=%21AOuDsuIkq2cmFEk
    Excerpt from Parallel 2013 talk, SIMD and OpenMP for Accelerators: https://skydrive.live.com/redir.aspx?cid=b7be35f701a0d7d4&resid=B7BE35F701A0D7D4%217805&parid=B7BE35F701A0D7D4%21818&authkey=%21AOKGMCEuPmK1TzM

  3. Graham says:

    Many thanks. As an implementer I also find the examples very useful, but I will be glad if their separation from the standard can allow them to flourish; though I expect they must remain static between spec releases.

    In OMP4-OpenMP_for_Accelerators.pdf, in the SAXPY example, I’m surprised to see that you don’t need to specify a size for x and y. In this specific example, the compiler should have no trouble, but this will not be the case in general. May I ask, is that aspect conformant? Also, on slide 9, I had thought the distribute construct was associated with the loop nest following it. Wouldn’t that be the outermost loop here?

  4. terboven says:

    Which implementation are you working on?

    Regarding your first question: The spec states “The list items that appear in a map clause may include array sections” (array sections are also a new OpenMP 4.0 feature). You are right, in the given example the compiler can figure out the size automatically, but you could also write “#pragma omp target data map(to:x[0:10240])”. When the size of the array dimension is not known, the length of an array section must be specified explicitly.

    Regarding your second question: The final release will allow writing “#pragma omp teams distribute parallel for” (very similar to “#pragma acc loop gang vector”), so the blocking I did on that slide is no longer necessary. In my example, which used only RC2 functionality, I needed the “distribute” to distribute the i-loop iterations among the teams, and then start one parallel region per team using the “parallel for”. Does this clarify my code example?

Comments are closed.