
OpenMP 3.1 spec published as Draft for Public Comment

You might have heard it already: the next incarnation of the OpenMP specification, which is targeted to be released as version 3.1 around June, in time for IWOMP 2011 in Chicago, has been published as a Draft for Public Comment. You may think of it as a beta :-).

Back in October 2009, I commented on some of the goals for versions 3.1 and 4.0. OpenMP 3.1 addresses some issues found in the 3.0 specification and brings only minor functional improvements; still, it will be released almost one year behind our initially planned schedule. However, work on version 4.0 has already made significant progress, including support for accelerators (GPUs), further enhancements to the tasking model, and support for error handling. Following the outline of my previous post on the development of OpenMP, this is the list of updates to be found in OpenMP 3.1 and the status of the development towards OpenMP 4.0 (expressed in my own words and stating my own beliefs and opinions):

1: Development of an OpenMP Error Model. There is nothing new on this topic in OpenMP 3.1. However, with respect to OpenMP 4.0, the so-called done directive has been discussed for quite some time already. It can be used to terminate the execution of a Parallel Region, a Worksharing construct, or a Task construct, and it is a prominent candidate for the next OpenMP spec. It would provide necessary functionality on the way towards full-featured error handling, for which no proposal has yet been found that everyone could agree upon.
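To make this concrete, here is a sketch of how such a directive might be used. Note that the syntax is purely my own illustration: nothing has been standardized, and this will not compile with any current OpenMP compiler.

// Purely illustrative: the "done" directive is only a proposal under
// discussion, and this syntax is my own guess, not from any specification.
void process_match(int i);

void find_first(const int *data, int n, int key)
{
#pragma omp parallel for
    for (int i = 0; i < n; i++)
    {
        if (data[i] == key)
        {
            process_match(i);
            // hypothetical: terminate the enclosing Worksharing construct
            // early instead of iterating over the remaining elements
#pragma omp done
        }
    }
}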

2: Interoperability and Composability. There is nothing new on this topic in OpenMP 3.1. We made several experiments, gained some insights, and the goal is to come up with a set of reliable expectations and assertions in the OpenMP 4.0 timeframe.

3: Incorporating Tools Support into the OpenMP Specification. There is currently no activity on this topic in the OpenMP Language Committee in general.

4: Associating Computation or Memory across Workshares. There is little progress in this direction to be found in OpenMP 3.1. The environment variable OMP_PROC_BIND has been added to control the binding of threads to processors; it accepts a boolean value. If enabled, the OpenMP runtime is instructed not to move OpenMP threads between processors. The mapping of threads to processors is unspecified and thus depends on the implementation. In general, introducing this variable, which controls program-wide behavior, was intended to standardize behavior found in almost all current OpenMP implementations.
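One way to observe the effect is to print the processor each thread runs on. The little test below is my own sketch and assumes Linux with glibc's sched_getcpu(); when run with OMP_PROC_BIND=true set in the environment, the CPU reported per thread should stay the same across the two regions.

#include <omp.h>
#include <sched.h>   // sched_getcpu(), Linux/glibc only
#include <cstdio>

int main()
{
    for (int rep = 0; rep < 2; rep++)
    {
#pragma omp parallel
        {
            // with binding enabled, each thread should report the same
            // CPU number in both repetitions of the parallel region
            std::printf("rep %d: thread %d on CPU %d\n",
                        rep, omp_get_thread_num(), sched_getcpu());
        }
    }
    return 0;
}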

5: Accelerators, GPUs and More. While there is nothing new on this topic in OpenMP 3.1, the Accelerator subcommittee put a lot of effort into coming up with a first (preliminary!) proposal. This is clearly interesting. From my personal point of view, OpenMP 4.0 might provide basic support for programming accelerators such as GPUs, thus delivering a vendor-neutral standard. Do not expect anything as full-featured as CUDA; the current proposal is rather similar in spirit to the PGI Accelerator approach (which I do like). However, this is still far from being done and may (or may not) change direction completely. The crucial aspects are to integrate well with the rest of OpenMP, and to provide an easy-to-use but still powerful approach to bringing certain important code patterns to accelerator devices.

6: Transactional Memory and Thread Level Speculation. There is in general no activity on this topic in the OpenMP Language Committee, and apparently it has dropped from the set of important topics. Personally, I do not (any longer) think TM should be a target for OpenMP in the foreseeable future.

7: Refinements to the OpenMP Tasking Model. There have been some improvements to the Tasking model, with some more on the roadmap for OpenMP 4.0.

  • The taskyield directive has been added to allow for user-defined task scheduling (tsp) points; a usage sketch follows this list. A tsp is a point in the execution of a task at which it can be suspended to be resumed later, or the event of task completion, after which the executing thread may switch to a different task.
  • The mergeable clause has been added to the list of possible task clauses, indicating that the implementation may merge the task's data environment with that of the generating task region.
  • The final clause has been added to the list of possible task clauses. In a final task, all descendant tasks are executed sequentially and immediately in the same region, ignoring any untied clauses. The clause takes a scalar expression, so whether a task becomes final can be decided at runtime.
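Here is a short sketch of my own, loosely following the style of the examples in the spec, showing both taskyield and final in use:

#include <omp.h>

// taskyield: while busy-waiting on a lock, give the executing thread the
// chance to switch to a different task at the explicit scheduling point.
void work_with_lock(omp_lock_t *lock)
{
    while (!omp_test_lock(lock))
    {
#pragma omp taskyield
    }
    /* ... exclusive work ... */
    omp_unset_lock(lock);
}

// final: once the expression evaluates to true, all descendant tasks are
// executed immediately and sequentially by the encountering thread.
int fib(int n, int depth, int cutoff)
{
    if (n < 2) return n;
    int x, y;
#pragma omp task shared(x) final(depth >= cutoff)
    x = fib(n - 1, depth + 1, cutoff);
#pragma omp task shared(y) final(depth >= cutoff)
    y = fib(n - 2, depth + 1, cutoff);
#pragma omp taskwait
    return x + y;
}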

8: Extending OpenMP to C++0x and FORTRAN 2003. There is nothing new on this topic in OpenMP 3.1. We closely follow the development of the base languages, and it remains to be seen what can (or has to) be done for OpenMP 4.0. Anyhow, the fact that the base languages are introducing threading and a thread-aware memory model leads to some simplifications on the one hand, but could also lead to potential conflicts on the other. We are not aware of any such conflict, but digging through the details and implications of both a base language such as C++ and OpenMP is a pretty complex task.

9: Extending OpenMP to Additional Languages. There is nothing new on this topic in OpenMP 3.1, and currently there is no intention of doing so inside the OpenMP Language Committee. Personally, I would like to see an OpenMP binding for Java, since it would really help in teaching parallel programming, but I do not see this happening.

10: Clarifications to the Existing Specifications. There have been plenty of clarifications, corrections, and micro-updates. Most notably, the examples and descriptions in the appendix have been corrected, clarified, and expanded.

11: Miscellaneous Extensions. A couple of miscellaneous extensions made it into OpenMP 3.1:

  • The atomic construct has been extended to accept the following new clauses: read, write, update, and capture; if none is given, it defaults to update. An atomic region allows the value of the variable affected by the construct to be read, written, or updated atomically. Note that not everything inside an atomic region is performed atomically, i.e. the evaluation of “other” variables is not: in an atomic write construct, for example, only the left-hand variable (the one that is written to) is written atomically. A compact sketch of these extensions follows this list.
  • The firstprivate clause now accepts const-qualified types in C/C++ as well as intent(in) in Fortran. This is just a reaction to annoyances reported by some users.
  • The reduction clause has been extended to allow for min and max reductions on built-in datatypes in C/C++. Aggregate types (including arrays) as well as pointer and reference types are still excluded from OpenMP reductions. We had a proposal for powerful user-defined reductions (UDRs) on the table for a long time; it was discussed heavily but did not make it into OpenMP 3.1, although it would have made this release of the spec much stronger. Adding UDRs is a high priority for OpenMP 4.0 for many OpenMP Language Committee members, though.
  • omp_in_final() is a new API routine to determine whether it is called from within a final task region.
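The following is a compact sketch of my own illustrating these additions (the new atomic clauses, a min reduction, and omp_in_final):

#include <omp.h>
#include <limits>

int misc_31_examples(int &counter, const double *data, int n)
{
    int captured;

#pragma omp atomic write
    counter = 0;                 // only the store to counter is atomic

#pragma omp atomic update        // "update" is the default behavior
    counter += 1;

#pragma omp atomic capture
    {
        captured = counter;      // atomically read the old value ...
        counter += 1;            // ... and update the variable
    }

    // min (and max) reductions for built-in datatypes are new in 3.1:
    double dmin = std::numeric_limits<double>::max();
#pragma omp parallel for reduction(min:dmin)
    for (int i = 0; i < n; i++)
        if (data[i] < dmin) dmin = data[i];

    // omp_in_final() lets a task skip creating subtasks once it is final:
#pragma omp task final(n > 1000)
    {
        if (!omp_in_final())
        {
            /* ... creating further subtasks is still worthwhile here ... */
        }
    }
#pragma omp taskwait

    return captured;
}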

12: Additional Task / Threads Synchronization Mechanisms. There is nothing new on this topic in OpenMP 3.1, and I have not noticed much interest in it in the OpenMP Language Committee. However, we are thinking about task dependencies and task reductions for OpenMP 4.0, and both features would probably fall into this category (and then there would be high interest).

How OpenMP is moving towards version 3.1 / 4.0

Not yet carved in stone, but the current plan of the OpenMP Language Committee (LC) is to publish a draft OpenMP 3.1 standard for public comment by IWOMP 2010 and to have the OpenMP 3.1 specification finished for SC 2010 – given that the Architecture Review Board (ARB) accepts the new version. Bronis R. de Supinski (LLNL) has taken on the duty of acting as chair of the LC and has since introduced some process changes. Besides weekly telephone conference calls, there are three face-to-face meetings per year, and attendance is required for voting rights. The first face-to-face meeting was held on June 1st and 2nd in Dresden, attached to IWOMP 2009; the second one was on September 22nd and 23rd in Chicago. This blog post is intended to report on this last meeting and to present an overview of what is going on with OpenMP right now, obviously from my personal point of view.

In the course of resuming work on OpenMP after the 3.0 specification was published, the LC voted on the priority of (small) extensions and clarifications for 3.1 as well as new topics for 4.0. We ended up with 12 major topics and 5 subcommittees, as outlined in Bronis' talk during IWOMP 2009, which are still in use as identifiers for the different topics people are working on.

1: Development of an OpenMP Error Model. This is the feature the LC people think OpenMP is missing most desperately, but in contrast to that it has not received much effort yet. A subcommittee has been formed, led by Tim Mattson (Intel) and Michael Wong (IBM), and currently there are three proposals on the table for discussion: (i) an extension of the API routines and some constructs to return error codes, or the introduction of a global error indication variable; (ii) an exception-based mechanism to catch errors; and (iii) a callback-based mechanism that allows reacting to errors based on their severity and origin. The absence of an error model is clearly a reason for not using OpenMP in applications with certain reliability requirements, but introducing the wrong error model could easily spoil OpenMP for that audience. It seems that most LC people do not like error codes too much (I don't either), and using exceptions is not suitable for C and FORTRAN, so the third approach seems most promising: it allows a program to react to errors depending on their severity while still allowing the compiler to ignore OpenMP if it is not enabled. In fact, such a mechanism was already proposed back in 2006 by Alex Duran (BSC) and friends. Since nothing has been decided yet, I guess the error model is targeted for OpenMP 4.0.
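To give an impression of the third variant, here is a rough sketch of what a callback-based mechanism might look like. Every name in it is hypothetical; nothing of this kind exists in any OpenMP specification.

// Hypothetical sketch only: neither the types nor the registration routine
// below exist anywhere; they merely illustrate the idea of proposal (iii).
typedef enum
{
    omp_error_warning,    // hypothetical severity levels
    omp_error_serious,
    omp_error_fatal
} omp_error_severity_t;

// The application registers a handler; instead of aborting, the runtime
// would invoke it and let the program decide how to react. A compiler that
// ignores the OpenMP pragmas would simply never call it, so the sequential
// program remains valid.
void my_error_handler(omp_error_severity_t severity, const char *origin)
{
    /* log, recover, or terminate, depending on severity and origin */
}

// hypothetical registration call:
// omp_set_error_handler(my_error_handler);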

2: Interoperability and Composability. This subcommittee is led by myself (RWTH) and Bronis R. de Supinski (LLNL) and is looking for ways of allowing OpenMP to coexist with other threading packages, maybe even with other OpenMP runtime environments in the same application. We are also looking into how to allow the creation of parallel software components that can safely be plugged together, something I consider prominently missing in virtually all threading paradigms. This is a very broad topic, and I would not assign any particular OpenMP version as the target for solving it completely, but with a little bit of luck we can make some progress even for version 3.1. We have some ideas on the table on how to specify some basic aspects of OpenMP interacting with the native threading packages (POSIX-Threads on Linux/Unix, Win32-Threads on Windows), driven by application observations and known deficiencies in current OpenMP implementations. We might also attack the problem of orphaned reductions; I am less certain about solving the issue of allowing, or at least detecting, nested Worksharing constructs.

3: Incorporating Tools Support into the OpenMP Specification. This was on the feature wishlist for OpenMP 3.0 already, but there is hardly any activity regarding this topic. Most vendors provide their own tools to analyze the performance (or correctness) of OpenMP programs by making their own runtime talk to their specific tool, but this situation is far from optimal for research / academia tools. As early as 2004 there were some proposals (e.g. POMP by Bernd Mohr and friends), but they did not make it into the specification or into actual implementations.

4: Associating Computation or Memory across Workshares. Today, the world of OpenMP is flat (memory-wise), so this topic is mostly about supporting cc-NUMA architectures in OpenMP. There are two subcommittees working on this issue. The first is led by Dieter an Mey (RWTH), and its goal is to standardize common practices (used in today's applications) of dealing with cc-NUMA optimizations. If nothing comes in between, OpenMP 3.1 will allow the user to bind threads to cores by either specifying an explicit mapping or by telling the runtime a strategy (like compact vs. scatter). Of course there are more ideas (and features needed), like influencing the memory allocation scheme, using page migration if supported by the operating system, or interacting with resource management systems (batch queuing systems), but these are very hard to specify in a portable and extensible fashion. The other subcommittee is led by Barbara Chapman (UH) and deals with thread team control. Using Worksharing in OpenMP, it is very hard to dedicate a special task (e.g. I/O) to just one thread of the Parallel Region. There are applications asking for that, but I don't see a proposal that the LC would agree on for 3.1. Nevertheless, they presented some interesting ideas at the last F2F based on HPCS language capabilities, which hopefully have the potential to influence OpenMP 4.0.

5: Accelerators, GPUs and More. Of course we have to follow the trend / hype ;-). But since no one knows for sure in which direction the hardware is evolving, there are many different ideas on how to deal with this. Off the top of my head I can enumerate: PGI has some directives loosely based on OpenMP Worksharing (plus they have CUDA for FORTRAN), IBM has OpenMP for Cell with several ideas on extensions, BSC has a proposal that is in principle based on their *SS concept, and CAPS Entreprise has the HMPP constructs + compiler. In summary: no clear direction yet, and nothing for OpenMP in the scope of 3.1.

6: Transactional Memory and Thread Level Speculation. Some people thought that OpenMP might need something for Transactional Memory. To the best of my knowledge, no one from the LC did any work in this regard.

7: Refinements to the OpenMP Tasking Model. There are two things that most people agree Tasks are missing: Dependencies and Reductions. With respect to the former, there were three proposals on the table, from Grant Haab (Intel), Federico Massaioli (Caspur), and Alex Duran (BSC), and the BSC proposal looks most promising because it avoids deadlocks. It employs existing program variables to define the dependencies between tasks, i.e. the result of one computation can be the input of another task. With a good portion of luck, Task Dependencies could actually make it into OpenMP 3.1, I think. With respect to the latter, namely Task Reductions, there has been only little progress so far.
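To make the idea behind the BSC proposal concrete, here is a sketch in the spirit of their *SS work. The input/output clauses are illustrative only; this is not valid OpenMP syntax, and the clause names are my own choice.

// Illustrative only: input/output clauses in the spirit of BSC's proposal;
// NOT valid OpenMP. Dependencies are derived from the program variables.
void produce(double *a, int n);
void transform(const double *a, double *b, int n);
void consume(const double *b, int n);

void pipeline(double *a, double *b, int n)
{
#pragma omp task output(a)            // produces a
    produce(a, n);

#pragma omp task input(a) output(b)   // may start only after the first task
    transform(a, b, n);

#pragma omp task input(b)             // may start only after the second task
    consume(b, n);

#pragma omp taskwait
}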

8: Extending OpenMP to C++0x and FORTRAN 2003. Since the C++0x standard dropped Concepts, the work that Michael Wong (IBM) and myself (RWTH) did so far has become obsolete. To the best of my knowledge, there has been no progress on investigating the opportunities or issues that could arise with FORTRAN 2003.

9: Extending OpenMP to Additional Languages. Well, there are Java and C#, and at least for Java there are some implementations of OpenMP available (incomplete, though). Anyhow, there was never any real attempt to write a formal specification of OpenMP for Java, nor for C#, and I don’t think there is one now.

10: Clarifications to the Existing Specifications. The LC already approved several minor corrections (e.g. mistakes in the examples, improvements in the wording, and the like) that will make their way into OpenMP 3.1. Nothing spectacular, but this is something that has to be done.

11: Miscellaneous Extensions. I might be wrong, but I think that User-Defined Reductions (UDRs) belong to this topic. Yes, there is a chance that UDRs will make it into OpenMP 3.1! This will bring obvious things like min and max for C and C++, but we are aiming higher: the goal is to enable the programmer to write any type of reduction operation for any type in the base language (including non-PODs), achieved by introducing an OpenMP declare construct to define a reduction operation that can then be specified in a reduction clause. There are two problems under discussion right now: (i) C++ templates and (ii) pointers / arrays. The first can be addressed by an extension of the current proposal, and I got the feeling that most LC people like the new approach, but the second is a bit more complex. If you want to reduce an array that is described by a pointer, you need to know how much space to allocate for the thread-private copy, i.e. how many elements the array consists of. There has been some discussion on this, but no strong agreement on how to solve this issue in general, as it also arises with the private, firstprivate, … clauses. We only agreed that we need a one-size-fits-all solution. With a good portion of luck we can solve this issue; otherwise we hopefully get UDRs with some limitations in OpenMP 3.1 and the full functionality in a later version of the specification.
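As a sketch of the declare idea, a user-defined reduction might look roughly like this. The syntax is hypothetical, my own rendering of the proposal's direction, and not part of any specification:

#include <complex>

// Hypothetical syntax: declare how two partial results are combined
// (omp_out, omp_in) and how a thread-private copy is initialized
// (omp_priv), then use the new operator like a built-in reduction.
std::complex<double> parallel_sum(const std::complex<double> *v, int n)
{
#pragma omp declare reduction(cplx_add : std::complex<double> : \
        omp_out += omp_in) initializer(omp_priv = std::complex<double>())

    std::complex<double> s;
#pragma omp parallel for reduction(cplx_add : s)
    for (int i = 0; i < n; i++)
        s += v[i];
    return s;
}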

12: Additional Task / Threads Synchronization Mechanisms. Again I might be wrong, but I think that the Atomic Extension proposal by Grant Haab (Intel) belongs in here. This is a feature you will also find in threading-aware languages (such as C++0x), but the current base languages of OpenMP are not of that kind. This will almost certainly make it into OpenMP 3.1 and will allow for a portable way to write atomic updates that capture a value, as well as atomic writes. This is already supported by most machines, and using atomic operations can be much more efficient than using a Critical Region.

If you are interested in more details, you are invited to stop by the OpenMP booth at SC 2009 in Portland and ask the nice guy on booth duty some good questions :-).

C++0x: OpenMP loop parallelization without pragmas?!

Some people complain that OpenMP's approach of using pragmas to annotate a program is not very nice, as pragmas / the OpenMP directives are not well-integrated into the language. Personally, I like the OpenMP approach and think it has some specific advantages. But I am also very interested in researching how the OpenMP language bindings could be improved, especially for C++. This post is about using C++0x features to build parallelization constructs like those that have been praised in the context of other approaches (e.g. Parallel Extensions for C#, or Intel's Threading Building Blocks), but based on OpenMP underneath.

Let’s consider the following sequential loop which is very similar to the example used in the Microsoft Parallel Extensions to the .NET Framework 3.5 (June 2008) documentation:

01   double dStart, dEnd;
02   for (int rep = 0; rep < iNumRepetitions; rep++)
03   {
04       dStart = omp_get_wtime();
05       for (int i = 0; i < iNumElements; i++)
06       {
07           vec[i] = compute(vec[i], iNumIterations);
08       }
09       dEnd = omp_get_wtime();
10   }

The experiment loop (line 05 to line 08) is executed iNumRepetitions times, the time is taken in lines 04 and 09 using OpenMP's time measurement function (portability!), and the time required for each element can be controlled via iNumIterations. I will use this parametrization for my performance experiments – for now, let's just look at how this would be parallelized in OpenMP:

#pragma omp parallel for shared(iNumElements, vec, iNumIterations)
        for (int i = 0; i < iNumElements; i++)
        {
            vec[i] = compute(vec[i], iNumIterations);
        }

Pretty straightforward – as this parallel loop is perfectly balanced, we do not need the schedule clause here. How could that loop look without pragmas? Maybe as shown here:

omp_pfor (0, iNumElements, [&](int i)
{
    vec[i] = compute(vec[i], iNumIterations);
});

Do you like that? No OpenMP pragma is visible in the user's code; one just has to specify the loop iteration space and the loop variable, and the parallelization is done “under the hood”. The implementation is pretty simple using C++0x lambda functions:

template<typename F>
void omp_pfor(int start, int end, F x)
{
    // the loop body is invoked once per iteration, with the iterations
    // distributed over the threads of the team
#pragma omp parallel for
    for (int i = start; i < end; i++)
    {
        x(i);
    }
}

Of course I am still using OpenMP directives here, but they are hidden as an implementation detail. The actual loop body is passed to the omp_pfor function template as a lambda expression, along with the loop boundaries. Please note that this is just a very simple example; of course one can handle all types of loops that are currently supported in OpenMP 3.0 (and maybe even more) as well as STL-style algorithms!
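For instance, a sketch of an iterator-based variant (my own extension of the example, not from the original code) that works with STL containers could look like this:

#include <cstddef>

// Sketch of an iterator-based variant in the same spirit; random-access
// iterators are required so that the loop fits OpenMP's canonical form.
template<typename Iterator, typename F>
void omp_pfor_each(Iterator begin, Iterator end, F x)
{
    const std::ptrdiff_t n = end - begin;
#pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < n; i++)
    {
        x(*(begin + i));
    }
}

Assuming vec is a std::vector<double>, the loop from above would then read omp_pfor_each(vec.begin(), vec.end(), [&](double &v) { v = compute(v, iNumIterations); });.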

In this post I only talked about syntax, but there is more to it. A part of my research looks into how programmers (especially those with a background in computational engineering science in Aachen) can be provided with more powerful language-based tools to ease writing parallel and reusable code / components. I am always happy to discuss such topics – if you like the Live Space comment functionality as little as I do, just drop me a mail at christian@terboven.com.

You can download this example code from my website. In order to compile that code, I recommend using the latest Intel 11.0 beta compiler.