C++0x: OpenMP loop parallelization without pragmas?!

Some people are complaining that OpenMP’s approach of using pragmas to annotate a program is not very nice, as pragmas / the OpenMP directives are not well-integrated into the language. Personally, I like the OpenMP approach and think it has some specific advantages. But I am also very interested in researching how the OpenMP language bindings could be improved, especially for C++. This post is about using C++0x features to build parallelization constructs that have been praised in the context of other approaches (e.g. Parallel Extensions for C#, or Intel’s Threading Building Blocks), but using OpenMP constructs.

Let’s consider the following sequential loop which is very similar to the example used in the Microsoft Parallel Extensions to the .NET Framework 3.5 (June 2008) documentation:

01   double dStart, dEnd;
02   for (int rep = 0; rep < iNumRepetitions; rep++)
03   {
04       dStart = omp_get_wtime();
05       for (int i = 0; i < iNumElements; i++)
06       {
07           vec[i] = compute(vec[i], iNumIterations);
08       }
09       dEnd = omp_get_wtime();
10   }

The experiment loop (line 05 to line 08) is executed iNumRepetitions times, the time is taken in lines 04 and 09 using OpenMP time measurement functions (portability!), and the time required for each element can be controlled via iNumIterations. I will use that parametrization for my performance experiments – for now, let’s just look at how this loop would be parallelized in OpenMP:
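The compute() function itself is not shown in the post; as a hypothetical stand-in, one could imagine a kernel whose cost scales linearly with iNumIterations, roughly like this:

```cpp
#include <cmath>

// Hypothetical stand-in for the compute() kernel (not from the original post):
// performs iNumIterations floating-point operations on the value, so the
// per-element cost can be tuned via the iNumIterations parameter.
double compute(double value, int iNumIterations)
{
    double result = value;
    for (int it = 0; it < iNumIterations; it++)
        result = std::sqrt(result * result + 1.0);  // cheap, non-optimizable work
    return result;
}
```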

#pragma omp parallel for shared(iNumElements, vec, iNumIterations)
        for (int i = 0; i < iNumElements; i++)
            vec[i] = compute(vec[i], iNumIterations);

Pretty straightforward – as this parallel loop is perfectly balanced, we do not need the schedule clause here. But what could that loop look like without using pragmas? Maybe as shown here:

omp_pfor (0, iNumElements, [&](int i)
{
    vec[i] = compute(vec[i], iNumIterations);
});

Do you like that? No OpenMP pragma is visible in the user’s code: the user just specifies the loop iteration space and the loop body, and the parallelization is done “under the hood”. The implementation of this is pretty simple using the lambda functions of C++0x:

template<typename F>
void omp_pfor(int start, int end, F x)
{
#pragma omp parallel for
    for (int __i = start; __i < end; __i++)
    {
        x(__i);
    }
}

Of course I am still using OpenMP directives here, but they are hidden as an implementation detail. The actual loop body is passed to the omp_pfor template function as a lambda argument, along with the loop boundaries. Please note that this is just a very simple example; of course one can handle all types of loops that are currently supported in OpenMP 3.0 (and maybe even more), as well as STL-style algorithms!

In this post I only talked about syntax, but there is more to it. A part of my research is looking into how programmers (especially those with a background in computational engineering science in Aachen) can be provided with more powerful language-based tools to ease writing parallel and reusable code / components. I am always happy to discuss such topics – if you like the Live Space comment functionality as little as I do, just drop me a mail at christian@terboven.com.

You can download this example code from my website. In order to compile that code, I recommend using the latest Intel 11.0 beta compiler.