Last week the first ever OpenMPCon and the IWOMP 2015 workshops took place here in Aachen and this week we hosted the OpenMP Language Committee face-to-face meeting. An important goal of this meeting was to address the remaining open issues and then to complete the specification work to make the next version of OpenMP available in time for SC15. The OpenMP 4.1 draft was released in July this year and the comment period was open until the end of September. However, after realizing how many changes and particularly improvements this next version will bring, we decided that we want to call it OpenMP 4.5. Keeping the major version at “4” assures that the changes will not break any existing code.
Within the next few weeks the Language Committee will make the final changes to the spec text and perform a final verification run; then the document will be handed over to the OpenMP Architecture Review Board (ARB) for the vote. It can be expected that the ARB will accept the document as the new version of the OpenMP standard. It will then be released early at SC15. If you would like to know what is in it and talk to us, come to our tutorial on Advanced OpenMP Programming on the Monday right before SC15.
The number of changes is indeed large: about 130 tickets have been passed. Any change to the spec text is represented by a ticket capturing the changes in LaTeX code and of course the corresponding discussion(s). However, the tickets differ in size: some are small and contain only minor corrections, others are large and bring lots of new functionality (I think the target stuff of OpenMP 4.0 was captured in two huge tickets).
OpenMP 4.5 will obviously bring many clarifications and some minor corrections, but also some notable enhancements:
We handled several items from the Fortran 2003 todo list. Fortran 2003 is now supported as a base language with a few exceptions mentioned explicitly.
SIMD and Tasking extensions and refinements made their way into OpenMP 4.5.
Finally, OpenMP will support reductions for C/C++ arrays and templates.
Runtime routines to support cancellation and affinity have been added.
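To make the array reduction item above a bit more concrete, here is a minimal C sketch. It assumes the array-section reduction syntax of the released OpenMP 4.5 specification and a compiler that implements it. The function name histogram and its parameters are made up for illustration. Without OpenMP enabled the pragma is ignored and the loop simply runs serially.

```c
#include <assert.h>

/* Hypothetical helper: counts how often each bin index occurs in data.
   Values in data are assumed to be non-negative. */
void histogram(const int *data, int n, int *hist, int nbins)
{
    assert(nbins > 0);
    /* Each thread works on a private, zero-initialized copy of
       hist[0..nbins-1]; the copies are added into the original
       array at the end of the loop. */
    #pragma omp parallel for reduction(+ : hist[:nbins])
    for (int i = 0; i < n; ++i) {
        hist[data[i] % nbins]++;
    }
}
```

Before array reductions were available, the usual workaround was a hand-written private array per thread plus a critical section or atomic updates to merge the results.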
We also introduced some new features:
Support for doacross loops.
Loops can now be divided into tasks with the taskloop construct.
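The two new features listed above can be sketched as follows. This is only an illustration, assuming the taskloop construct and the ordered(n) / depend(sink) / depend(source) syntax as they appear in the released OpenMP 4.5 specification. The function names scale and prefix_sum as well as the grainsize value are my own choices. Compiled without OpenMP, both loops run serially and give the same result.

```c
#include <assert.h>

/* Hypothetical example: scale a vector, letting taskloop cut the
   iteration space into tasks of roughly 1024 iterations each. */
void scale(double *x, int n, double factor)
{
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp taskloop grainsize(1024)
        for (int i = 0; i < n; ++i) {
            x[i] *= factor;
        }
    }
}

/* Hypothetical example: in-place prefix sum written as a doacross loop.
   Iteration i waits until iteration i-1 has signalled completion. */
void prefix_sum(long *a, int n)
{
    assert(n >= 0);
    #pragma omp parallel for ordered(1)
    for (int i = 1; i < n; ++i) {
        /* Wait for the previous iteration (ignored if it lies outside
           the iteration space, as for i == 1). */
        #pragma omp ordered depend(sink : i - 1)
        a[i] += a[i - 1];
        /* Signal that this iteration's update is done. */
        #pragma omp ordered depend(source)
    }
}
```

In prefix_sum the whole loop body is dependent, so nothing actually runs concurrently. A doacross loop pays off when each iteration has independent work before the sink point or after the source point.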
I plan to talk about some of these here in detail.
At IWOMP, Bronis de Supinski from LLNL, who is the Chair of the OpenMP Language Committee, gave a talk on the State of OpenMP & Outlook on OpenMP 4.1 (back then we had not yet decided to call it 4.5). We will make all the IWOMP talks available on the IWOMP homepage soon, but here are two of his slides outlining the most important new additions beyond what I mentioned already:
During his talk he also outlined what is on the agenda for OpenMP 5.0:
Task dependency support through the new depend clause. (p91)
Initial error model support through cancel and cancellation point constructs to request cancellation of specified region types and to declare a user-defined cancellation point to check for cancellation requests. (Section 2.13, p116: Cancellation Constructs)
Support for array sections in C, C++ and Fortran. (Section 2.4, p36: Array Sections)
Extended declare simd directive to allow multiple declarations. (p64)
New environment variable OMP_DISPLAY_ENV instructing the runtime to display the OpenMP version number and ICV values during initialization. (p219)
Additional enhancements to support Fortran 2003.
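As a small illustration of the depend clause from the list above, the following C sketch orders two tasks through a shared variable. It assumes the depend(in/out) syntax of the released OpenMP 4.0 specification; the function name produce_then_consume is hypothetical.

```c
#include <assert.h>

/* Hypothetical two-stage pipeline: the second task reads what the
   first task wrote, and the depend clauses enforce that order. */
int produce_then_consume(int input)
{
    int a = 0, b = 0;
    #pragma omp parallel
    #pragma omp single
    {
        /* Producer: declares that it writes a. */
        #pragma omp task shared(a) depend(out : a)
        a = input + 1;

        /* Consumer: may only start once the producer has completed. */
        #pragma omp task shared(a, b) depend(in : a) depend(out : b)
        b = 2 * a;

        /* Wait for both child tasks before leaving the single region. */
        #pragma omp taskwait
    }
    assert(b == 2 * (input + 1));
    return b;
}
```

Without the depend clauses the two tasks could run in any order, and the consumer might read a before it has been written.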
As we were not yet able to incorporate all the feedback that has been reported so far, a few known issues are still in the document. Additionally, some more minor changes are already in preparation. Feedback and questions are of course still welcome, so head over to http://openmp.org/wp/openmp-specifications/ and download the new document.
Exascale machines will employ significantly more threads than today, but even on current architectures controlling thread affinity is crucial to fuel all the cores and to maintain data affinity, but both MPI and OpenMP lack a solution to this problem – this is the first sentence of our IWOMP 2012 paper with the same title as this blog post. The need for thread affinity in OpenMP has been demonstrated many times on several occasions. Inside the OpenMP Language Committee we formed the Affinity Subcommittee and we have been working on this topic for several years now. Meanwhile almost all vendors have introduced their own extensions to support thread affinity, but they are all different and thus offer a clearly suboptimal user experience. Furthermore, they do not support nested OpenMP and in general they are static, meaning that only one affinity setting can be used for the whole program. For OpenMP 4.0, which is expected to be released as a draft in November 2012, we have a good thread affinity proposal on the table that not only standardizes existing vendor extensions, but also adds new capabilities. This blog post will present this proposal along with some information on why things are the way they are. I welcome any comments or questions via email.
When we started thinking about Affinity in general, we first tried to define a machine model or rather a machine abstraction and intended to use that to bind threads to cores as well as to possibly define a data layout. Over time I became convinced that this is not the right approach. Whatever method we used to describe the machine topology, we always envisioned systems that would be very complicated to describe. Furthermore, describing the system could end up being a task to be performed by the users, which I think is too complicated for most of them. We also do not want to force users to think about an explicit mapping of threads to cores, because we think this is too low level for 95 % of the OpenMP programmers. And last but not least, if there were a new machine that could not be comfortably described by our method, OpenMP develops too slowly to be extended to support it.
To overcome this problem, the current proposal as developed by Alexandre E. Eichenberger, myself and the members of the OpenMP Language Committee Affinity Subcommittee, introduces the concepts of a place and a place-list. A place is defined as a set of execution units capable of executing OpenMP threads. For now you may think of a place as a set of cores. A place-list is an ordered list of places; the ordering is important. It can be defined either by using abstract names or by constructing the places explicitly by enumerating the cores. The place-list will be used together with an affinity policy to bind the OpenMP threads in a team of a parallel region to the places in the list. It can be specified via the new environment variable OMP_PLACES (the name might still change). Let's illustrate that with an example: The figure below depicts a very standard system (node 0) with two sockets (socket 0 and socket 1), every socket having four cores (core 0 to core 3 on socket 0) and finally every core has two hardware-threads (t0 and t1), i.e. every core can execute two threads simultaneously.
Let's construct a place-list consisting of eight places, every place being a physical core consisting of two hardware-threads (I often call those logical-threads). All of the following methods are equivalent, but we expect almost all users to use the first option:
For now we will define three abstract names to describe the place-list: hwthreads, cores and sockets. It is up to the implementation to define what is meant by a “core” for instance, but of course we will provide some hints. The wording on that is not yet completed, but it will be something along the lines of hwthreads := smallest unit of execution capable of executing an OpenMP thread; cores := set of execution units in which more than one hardware-thread share some resources such as caches; sockets := physical package of multiple cores.
Of course defining a place-list does not by itself lead to any thread affinity. As I said above, the place-list is just used to define the places the threads of a parallel region can be bound to. In our proposal, the user does not have to define an explicit mapping of threads to places (or execution units in a place) – instead, the user can specify a so-called affinity policy via the new affinity clause which can be put on a parallel region. Our proposal currently consists of three affinity policies that allow exploiting the place-list in several possible ways (the names might still change):
SPREAD: spread OpenMP threads as evenly as possible among the places. The place-list will be partitioned, so that subsequent threads (i.e. nested OpenMP) will only be allocated within the partition. Given the place-list outlined above, this policy would provide most dedicated hardware resources to the OpenMP program.
CLOSE: pack OpenMP threads near to the master thread. There is no partitioning. Given the place-list from above, this policy would be used if sharing of resources among threads is desirable.
MASTER: collocate OpenMP threads with the master thread (in the same place). This will ensure maximum locality to the master thread.
It is important to understand that these affinity policies influence the allocation of threads to places – not directly to the system topology. In my example the (ordered!) place-list was designed so that two threads far apart from each other also end up on physical cores far apart in the system. Although we expect this to be the standard use case, it does not necessarily have to be this way.
Let's take a closer look at what the affinity policies do by looking at some examples. The figure below shows what SPREAD will do. The green box denotes the place-list, and for every number of threads >=2 the place-list will be partitioned when a parallel region with this affinity clause is encountered. This will support nested OpenMP, as we will see later on. Every thread will receive its own sub-place-list. If there are more threads than places, more than one thread has to be allocated per place. This is done such that threads sharing a place have consecutive OpenMP thread ids (in this example with 16 threads: the threads with OpenMP thread ids 0 and 1 are on place 0).
Let's also take a brief look at the two other affinity policies we are proposing, namely CLOSE and MASTER. Both are illustrated in the figure below. For CLOSE, threads i and i+1 are meant to reside on places j and j+1, unless more than one thread will be allocated per place. For MASTER, all threads will be put into the same place the master thread is running on, unless this cannot be fulfilled by the implementation for any reason.
When discussing the proprietary support offered by OpenMP implementers, I said that their solutions are static for the whole program lifetime. In our proposal the initial place-list is fixed, but the affinity policy might of course be set dynamically. Furthermore, the figure below shows how nested OpenMP is supported. The outer parallel region uses the SPREAD affinity policy to create partitions and to maximize resource usage. The inner parallel region uses CLOSE to stay within the respective partition.
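The nested scenario just described can be sketched in C as follows. Note the assumptions: in the OpenMP 4.0 specification as finally released, the affinity clause of this proposal is spelled proc_bind, with the policy names master, close and spread, and that is the spelling used here. The function sum_matrix is a made-up example. The place-list would be set outside the program, for instance with OMP_PLACES=cores, and nested parallelism has to be enabled (e.g. via OMP_NESTED) for the inner region to get more than one thread.

```c
#include <assert.h>

/* Hypothetical example: sum all entries of a row-major matrix.
   The outer team is spread over the place-list, each inner team
   stays close to its parent within the parent's partition. */
double sum_matrix(const double *m, int rows, int cols)
{
    assert(rows >= 0 && cols >= 0);
    double total = 0.0;

    #pragma omp parallel for proc_bind(spread) reduction(+ : total)
    for (int r = 0; r < rows; ++r) {
        double row_sum = 0.0;

        #pragma omp parallel for proc_bind(close) reduction(+ : row_sum)
        for (int c = 0; c < cols; ++c) {
            row_sum += m[r * cols + c];
        }
        total += row_sum;
    }
    return total;
}
```

The policies only influence where threads run, not what they compute, so the result is the same whether or not any binding takes place.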
Whenever a new feature is intended to go into the OpenMP specification, we require the existence of at least one reference implementation, not only to prove implementability, but also to get an estimate of the implementation effort. The reference implementation for this proposal was done by Alexandre E. Eichenberger in an experimental OpenMP runtime for the IBM BlueGene/Q system. Our proposal does not affect performance critical parts of the implementation, “just” the thread selection and allocation parts. According to Alexandre’s findings the total overhead was less than 1 %, which is on the order of system noise.
Finally, let me summarize a few important properties / implications that I did not discuss in detail so far:
If the place-list is constructed by enumerating the cores, it will be done with the same naming scheme as used by the operating system. This approach is also used by all vendor-proprietary extensions and removes the need to define an explicit naming scheme, which might confuse users if it differs from that of the operating system and might also become inappropriate for future system topologies that we cannot foresee today.
Every implementation will provide a default place-list to an OpenMP program. It has to document what the default place-list is. I guess that implementations will provide something like cores or hwthreads as a default. This corresponds to the behavior that the number of threads to be used if not specified by the user is also implementation defined (some implementations use just 1 thread, others as many as there are cores in the system).
When one (or more) threads are allocated to a place, they are allowed to migrate within this place if it contains more than one execution unit (i.e. physical core). This will allow for both an explicit thread-to-core binding as well as a more flexible binding, such as threads to a socket, depending on how the place-list is constructed and which affinity policy is used.
The binding of the initial thread may occur as early as the runtime deems appropriate, but not later than when the first parallel region is encountered.
Thanks for reading this far. More details can be found in the paper which is published by Springer in IWOMP 2012. Again, I welcome any comments or questions via email.
You might have heard it already: The next incarnation of the OpenMP specification, which is targeted to be released as version 3.1 around June in time for IWOMP 2011 in Chicago, has been published as a Draft for Public Comment. You may think of it as beta.
Back in October 2009, I already commented on some of the goals for versions 3.1 and 4.0. OpenMP 3.1 addresses some issues found in the 3.0 specification and brings only minor functional improvements, yet it will be released almost one year behind our initially planned schedule. However, work on version 4.0 has already made some significant progress, including support for accelerators (GPUs), further enhancements to the tasking model, and support for error handling. Following the outline of my previous post on the development of OpenMP, this is the list of updates to be found in OpenMP 3.1 and the status of the development towards OpenMP 4.0 (expressed in my own words and stating my own beliefs and opinions):
1: Development of an OpenMP Error Model. There is nothing new on this topic in OpenMP 3.1. However, with respect to OpenMP 4.0, the so-called done directive has been discussed for quite some time already. It can be used to terminate the execution of a Parallel Region, or a Worksharing construct, or a Task construct, and it is a prominent candidate for the next OpenMP spec. It would provide necessary functionality towards full-featured error handling capabilities, for which there is no good proposal that could be agreed upon yet.
2: Interoperability and Composability. There is nothing new on this topic in OpenMP 3.1. We made several experiments, gained some insights, and the goal is to come up with a set of reliable expectations and assertions in the OpenMP 4.0 timeframe.
3: Incorporating Tools Support into the OpenMP Specification. There is currently no activity on this topic in the OpenMP Language Committee in general.
4: Associating Computation or Memory across Workshares. There is little progress in this direction to be found in OpenMP 3.1. The environment variable OMP_PROC_BIND, which accepts a boolean value, has been added to control the binding of threads to processors. If enabled, the OpenMP runtime is instructed to not move OpenMP threads between processors. The mapping of threads to processors is unspecified and thus depends on the implementation. In general, introducing this variable that controls program-wide behavior was intended to standardize behavior found in almost all current OpenMP implementations.
5: Accelerators, GPUs and More. While there is nothing new on this topic in OpenMP 3.1, the Accelerator subcommittee put a lot of effort into coming up with a first (preliminary!) proposal. This is clearly interesting. From my personal point of view, OpenMP 4.0 might provide basic support for programming accelerators such as GPUs, thus delivering a vendor-neutral standard. Do not expect anything as full-featured as CUDA; the current proposal is rather similar in spirit to the PGI Accelerator approach (which I do like). However, this is still far from being done, and may (or may not) change directions completely. The crucial aspects are to integrate well with the rest of OpenMP, and to provide an easy to use but still powerful approach that allows bringing certain important code patterns to accelerator devices.
6: Transactional Memory and Thread Level Speculation. There is in general no activity on this topic in the OpenMP Language Committee and apparently it dropped from the set of important topics. Personally, (now) I do not think TM should be a target for OpenMP in the foreseeable future.
7: Refinements to the OpenMP Tasking Model. There have been some improvements to the Tasking model, with some more on the roadmap for OpenMP 4.0.
The taskyield directive has been added to allow for user-defined task scheduling points (tsp). A tsp is a point in the execution of a task at which it can be suspended to be resumed later; or the event of task completion, after which the executing thread may switch to a different task.
The mergeable clause has been added to the list of possible task clauses, indicating that the task may have the same data region as the generating task region.
The final clause has been added to the list of possible task clauses, denoting the execution of all descending tasks sequentially in the same region. This implies immediate execution of final tasks, and ignoring any untied task clauses. An optional scalar expression allows for conditioning the application of the final clause.
8: Extending OpenMP to C++0x and FORTRAN 2003. There is nothing new on this topic in OpenMP 3.1. We closely follow the development of the base languages and it remains to be seen what can (or has to) be done for OpenMP 4.0. Anyhow, the fact that base languages are introducing threading and a thread-aware memory model leads to some simplifications on the one hand, but could also lead to potential conflicts on the other hand. We are not aware of any such conflict, but digging through the details and implications of a base language such as C++ as well as OpenMP is a pretty complex task.
9: Extending OpenMP to Additional Languages. There is nothing new on this topic in OpenMP 3.1, and currently there is no intention of doing so inside the OpenMP Language Committee. Personally, I would like to see an OpenMP binding for Java, since it would really help teaching parallel programming, but I do not see this happening.
10: Clarifications to the Existing Specifications. There have been plenty of clarifications, corrections, and micro-updates. Most notably the examples and descriptions in the appendix have been corrected, clarified, and expanded.
11: Miscellaneous Extensions. A couple of miscellaneous extensions made it into OpenMP 3.1:
The atomic construct has been extended to accept the following new clauses: read, write, update and capture. If none is given, it defaults to update. Specifying an atomic region allows you to atomically read / write / update the value of the variable affected by the construct. Note that not everything inside an atomic region is performed atomically, i.e. the evaluation of “other” variables is not. For example in an atomic write construct, only the left-hand variable (the one that is written to) is written atomically.
The firstprivate clause now accepts const-qualified types in C/C++ as well as intent(in) in Fortran. This is just a reaction to annoyances reported by some users.
The reduction clause has been extended to allow for min and max reductions for built-in datatypes in C/C++. This still excludes aggregate types (including arrays) as well as pointer and reference types from being used in an OpenMP reduction. We had a proposal for powerful user-defined reductions (UDRs) on the table for a long time; it was discussed heavily, but did not make it into OpenMP 3.1. That would have made this release of the spec much stronger. Adding UDRs is a high priority for OpenMP 4.0 for many OpenMP Language Committee members, though.
omp_in_final() is a new API routine to determine whether it is called from within a final (aka included) task region.
12: Additional Task / Threads Synchronization Mechanisms. There is nothing new on this topic in OpenMP 3.1, and not much interest in the OpenMP Language Committee that I have noticed. However, we are thinking of task dependencies and task reductions for OpenMP 4.0, and both features would probably fall into this category (and then there would be a high interest).
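Some of the OpenMP 3.1 additions from items 7 and 11 above can be sketched in a few lines of C. This is only an illustration under my own assumptions: the function names are made up, and the cutoff value 20 for the final clause is arbitrary. Compiled without OpenMP, the pragmas are ignored and everything runs serially with the same results.

```c
#include <assert.h>

/* Tasking: below the cutoff, final() makes the task and all its
   descendants execute immediately instead of being deferred;
   mergeable allows the runtime to reuse the parent's data environment. */
long fib(int n)
{
    long x, y;
    if (n < 2) {
        return n;
    }
    #pragma omp task shared(x) final(n < 20) mergeable
    x = fib(n - 1);
    #pragma omp task shared(y) final(n < 20) mergeable
    y = fib(n - 2);
    #pragma omp taskwait
    return x + y;
}

long fib_parallel(int n)
{
    long result = 0;
    assert(n >= 0);
    #pragma omp parallel
    #pragma omp single
    result = fib(n);
    return result;
}

/* atomic capture: read the old value and increment in one atomic step. */
int next_ticket(int *counter)
{
    int mine;
    #pragma omp atomic capture
    { mine = *counter; *counter += 1; }
    return mine;
}

/* max reduction on a built-in type, new for C/C++ in OpenMP 3.1. */
int array_max(const int *a, int n)
{
    assert(n > 0);
    int m = a[0];
    #pragma omp parallel for reduction(max : m)
    for (int i = 1; i < n; ++i) {
        if (a[i] > m) {
            m = a[i];
        }
    }
    return m;
}
```

The cutoff in fib is the typical use of final: once the remaining work is small, creating further deferred tasks costs more than it gains.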
5th International Workshop on OpenMP (IWOMP 2009) in Dresden, Germany. The IWOMP workshop series focuses on the development and usage of OpenMP. This year’s conference is titled Evolving OpenMP in an Age of Extreme Parallelism – I think this phrase is a bit funny, but nevertheless one can clearly observe a trend towards Shared-Memory parallelization on the nodes of even the extremely parallel machines. Attached to the conference is a two day meeting of the OpenMP language committee. The language committee is currently discussing a long list of possible items for a future OpenMP 3.1 or 4.0 specification, including but not limited to my favorites Composability (especially for C++) and Performance on cc-NUMA systems. Bronis de Supinski, the recently appointed Chair of the OpenMP Language Committee, will give a talk on the current activities of the LC and what the future of OpenMP might look like – I hope the slides will be made public soon after the talk. Right before the conference there will also be a one day tutorial for all people interested in learning OpenMP (mainly given by Ruud van der Pas – strongly recommended).
High Performance Computing Symposium 2009 (HPCS) in Kingston, Canada. HPCS is a multidisciplinary conference that focuses on research involving High Performance Computing and this year it takes place in Kingston. I have never been to that conference series, so I am pretty curious what it will be like. Attached to the conference are a couple of workshops, including Using OpenMP 3.0 for Parallel Programming on Multicore Systems – run again by Ruud van der Pas and us, and Parallel Programming in Visual Studio 2008 on Windows HPC Server 2008 – organized by us as well. Here in Aachen, the interest in our Windows-HPC compute service is still growing nicely and thus we usually have around 50 new participants in our bi-yearly training events. The HPCVL people asked explicitly to cover parallel programming on Windows in the OpenMP workshop, so we separated this aspect out without further ado to serve it well. The workshop program can be found here.
International Supercomputing Conference (ISC 2009) in Hamburg, Germany. ISC bills itself as Europe’s premier HPC event – while this is probably true it is of course smaller than the SC events in the US, but usually better organized. Without question you will find numerous interesting exhibits and can listen to several talks (mostly by invited speakers), so please excuse my self-marketing in pointing to the Jülich Aachen Research Alliance (JARA) booth in the research space where we will show an interactive visualization of a large-scale numerical simulation (damage of blood cells by a ventricular device – pretty cool) as well as give an overview of our research activities focused on Shared-Memory parallelization (we will distribute OpenMP syntax references again). If you are interested in HPC software development on Windows, feel invited to stop by at our demo station at the Microsoft booth where we will have many demos regarding HPC Application Development on Windows (Visual Studio, Allinea DDTlite and Vampir are confirmed, maybe more …). And if you are closely monitoring the HPC market, you have probably heard about ScaleMP already, the company aggregating multiple x86 systems into a single (virtual) system over InfiniBand – obviously very interesting for Shared-Memory parallelization. If you are interested, you can hear about our experiences with this architecture for HPC.