How to kill OpenMP by 2011 ?!

When I was asked to give an answer to the question of How to kill OpenMP by 2011 during the OpenMP BoF panel discussion at SC08, I decided against listing the most prominent issues and challenges OpenMP is facing. It turned out that the first two speakers – Tim Mattson from Intel and Bronis de Supinski from LLNL – did exactly that very well. Instead, my claim is that OpenMP is doing quite well today and we “just” have to continue riding the multi-core momentum by outfitting OpenMP with a few more features. Our group is pretty involved in the OpenMP community, and I have the feeling that since around early 2008 OpenMP has been gaining momentum, so I tried to present this in an entertaining way. This is a brief textual summary of my panel contribution (please do not take everything too seriously).

RWTH Aachen University is a member of the OpenMP ARB (= Architecture Review Board), as OpenMP is very important for many of our applications: all large codes (in terms of compute cycle consumption) are hybrid today, and in order to serve some complex applications for which no MPI parallelization exists (so far), we offer the largest SPARC- and x86-based SMP systems one could buy. Obviously we would be very sad if OpenMP were to disappear, but in order to find an answer to the question of what a university could do to kill OpenMP by 2011, it took just a few domestic beers and a good chat with friends at one of the nice pubs in Austin, TX: teach goto-based spaghetti-style programming, as branching into and out of Parallel Regions is not allowed by OpenMP, and as such this programming style is inherently incompatible with OpenMP.
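
To illustrate the point, here is a minimal (and deliberately non-conforming) sketch of what such spaghetti code looks like to OpenMP. A compiler with OpenMP support will refuse the jump, because control is not allowed to leave the structured block of a parallel region:

```c
#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0)
            goto bail_out;  /* non-conforming: branches out of the
                               structured block of the parallel region */
        printf("Hello from thread %d\n", omp_get_thread_num());
    }
bail_out:
    return 0;
}
```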

By the next day this idea had lost some of its fascination :-), so I went off to evaluate OpenMP’s current momentum. In 2007, we were invited to write a chapter for David Bader’s book on Petascale Computing. What we did just recently was a keyword search of the book (with some manual postprocessing):

Petascale Computing: Algorithms and Applications, by David Bader.

Keyword                     Hits
MPI                          612
OpenMP                       150
Thread                       109
Posix-Threads                  2
UPC                           30
C++                           87
Fortran                       69
Chapel                        49
HPF                           11
X10, Fortress, Titanium     < 10

This reveals at least the following interesting aspects:

  • MPI is clearly assessed to be the most important programming paradigm for Petascale systems, but OpenMP is also well recognized. Our own chapter on how to exploit SMP building blocks accounted for only 28 of the 150 hits on OpenMP.
  • The term Thread is often used in conjunction with OpenMP, but other threading models are hardly touched at all.
  • C/C++ and Fortran are the programming languages considered for programming current and future Petascale systems.
  • There was one chapter on Chapel, which explains its comparably high number of hits; otherwise the “new” parallel programming paradigms are not (yet?) considered to be significant.

In order to take an even closer look at the recognition of OpenMP we asked our friend Google:

Google Trends: OpenMP versus Native Threading.

One can clearly see that the interest in OpenMP is increasing, as opposed to Posix-Threads and Win32-Threads. At the end of 2007 there is a peak, when OpenMP 3.0 was announced and a draft standard was released for public comment. Since Q3/2008 we have compilers supporting OpenMP 3.0, which accounts for increasing interest again. As there is quite some momentum behind OpenMP, it is hard if not impossible for us, representing a university / the community, to kill OpenMP – which is actually quite nice.

But going back to finding an answer to the question posed to us, we found a suitable assassin: the trend of making shared-memory systems more and more complex in terms of architecture. For example, all current x86-based systems (as announced this week at SC08) are cc-NUMA systems if you have more than one socket, and maybe we will eventually see NUCA (= non-uniform cache architecture) systems as well. So the hardware vendors actually have a chance to kill OpenMP by designing systems that are hard to exploit efficiently with multithreading. Thus the only way to really kill OpenMP by 2011 is to leave it as it is and not equip it with means to aid the programmer in squeezing performance out of systems with an increasing depth of the memory hierarchy. In terms of OpenMP, the world is still flat:

OpenMP 3.0: The World is still flat, no support for cc-NUMA (yet)!

OpenMP is hardware agnostic; it has no notion of data locality.

The affinity problem: how to maintain or improve the proximity between threads and their most frequently used data.

Or:

  • Where to run threads?

  • Where to place data?
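
What this means in practice can be sketched with the usual workaround, which lives entirely outside the OpenMP specification. Here is a minimal example in C, assuming an OS with a first-touch page placement policy (a page lands on the NUMA node of the thread that writes it first):

```c
#include <stdlib.h>

#define N (64L * 1024 * 1024)   /* 512 MB of doubles, spread over the nodes */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    if (a == NULL)
        return 1;

    /* First-touch initialization: same static schedule as the compute
       loop below, so every thread touches (and thereby places on its
       local NUMA node) exactly the pages it will later work on. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 0.0;

    /* Compute loop with the identical access pattern: most accesses
       now hit node-local memory instead of a remote node. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 2.0 * a[i] + 1.0;

    free(a);
    return 0;
}
```

Note that nothing here is expressed in OpenMP itself: the placement falls out of an OS policy plus the hope that the runtime keeps threads where they are, which is exactly the gap a future specification would have to fill.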

4 thoughts on “How to kill OpenMP by 2011 ?!”

  1. I couldn’t agree more with the “world is flat” part. The compiler ISVs are putting in some extensions (Intel has KMP_AFFINITY, PathScale has stride, etc.) – but I would expect this to be part of the spec itself, not left to the compiler vendors.

    There are also many things the runtime libraries need to implement to make different OpenMP constructs more scalable (at large, for very high core counts) and for NUMA specifically; this is, of course, vendor-specific, and I believe Intel is making nice progress on this front.

  2. I cannot see any reason why current OpenMP does not fit the NUMA model… This has to do with how the OS allocates resources and manages native threads, on top of which OpenMP operates.

  3. Hey Dakaz.

    With OpenMP, the programmer explicitly states where the program has to go parallel and how the work is to be distributed onto the participating threads. But, as you stated, where the threads run and where the data is allocated is decided by the operating system (OS). In order to exploit the full performance of a cc-NUMA system, data and thread placement have to be under control. This can be done – but then the program becomes OS-specific, and as the first OpenMP specification was motivated by making shared-memory parallelization directives vendor-independent, this just does not feel right to me. Furthermore, the tasking introduced with OpenMP 3.0 is great, but the programmer has no control over the mapping of tasks to threads (-> data), thus the tricks for improving performance on cc-NUMA architectures become hard (if not impossible) to apply.
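
    To make the tasking point concrete, a minimal sketch (the chunk layout and the process() routine are just placeholders): even if each chunk was carefully first-touched on “its” node, the runtime is free to hand the task to any thread in the team:

    ```c
    #include <omp.h>

    static void process(double *chunk, long n)  /* placeholder workload */
    {
        for (long i = 0; i < n; i++)
            chunk[i] *= 2.0;
    }

    void walk(double **chunks, int nchunks, long n)
    {
        #pragma omp parallel
        #pragma omp single
        for (int i = 0; i < nchunks; i++) {
            double *c = chunks[i];
            #pragma omp task firstprivate(c)
            process(c, n);  /* OpenMP 3.0: may run on ANY thread, possibly
                               on a NUMA node remote to where c was placed */
        }
    }
    ```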

    Addressing the whole cc-NUMA issue is on the agenda for the next OpenMP specification…

    Kind regards,
    Christian

    PS: The ‘tricks’ are: make the OS apply first-touch allocation, bind your threads, initialize the data in parallel and use the same access pattern in the computation, and do page migration if needed and supported by the OS.
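
    PPS: For the ‘bind your threads’ part there is no portable OpenMP mechanism at this time; a Linux-specific sketch (sched_setaffinity, plus the naive assumption that thread number i should sit on core i):

    ```c
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        #pragma omp parallel
        {
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(omp_get_thread_num(), &set);  /* naive 1:1 thread-to-core map */
            if (sched_setaffinity(0, sizeof(set), &set) != 0)
                perror("sched_setaffinity");      /* e.g. thread id >= #cores */
            /* ... first-touch initialization and computation go here ... */
        }
        return 0;
    }
    ```

    Vendor extensions like Intel’s KMP_AFFINITY environment variable (mentioned in the first comment) achieve the same effect without code changes.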
