Affinity in OpenMP 6.0 on taskloop construct

During last week’s SC24 conference in Atlanta, GA, I briefly reported on the activity of the Affinity subcommittee of the OpenMP language committee. One topic was that, together with the Tasking subcommittee, we brought support for taskloop affinity to OpenMP 6.0, which I am going to describe here.

As you are probably well aware, the OpenMP specification currently allows for the use of the depend and affinity clauses on task constructs. The depend clause provides a mechanism for expressing data dependencies among tasks, and the affinity clause functions as a hint to guide the OpenMP runtime where to execute the tasks, preferably close to the data items specified in the clause. However, this functionality was not available on the taskloop construct, which parallelizes a loop by creating a set of tasks, where each task typically handles one or more iterations of the loop. Specifically, the depend clause could not be used to express dependencies, either between tasks within a taskloop or between tasks generated by a taskloop and other tasks, which limited the construct’s applicability.
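To make the starting point concrete, here is a minimal sketch of the depend clause on plain task constructs, which works with pre-6.0 compilers. The function name scale_then_sum and its parameters are made up for illustration. The affinity clause is only mentioned in a comment, because it requires OpenMP 5.0 support in the compiler.

```c
#include <assert.h>

/* Hypothetical helper: doubles every element of in[] into tmp[] in one task,
 * then sums tmp[] in a second task that depends on the first one. */
double scale_then_sum(const double *in, double *tmp, int n)
{
    assert(n > 0);
    double sum = 0.0;

    #pragma omp parallel
    #pragma omp single
    {
        /* Producer task: writes tmp[0..n-1].
         * With OpenMP 5.0 or later, a clause such as affinity(tmp[0:n])
         * could be added here as a placement hint (not used in this sketch). */
        #pragma omp task depend(out: tmp[0:n])
        for (int i = 0; i < n; i++)
            tmp[i] = 2.0 * in[i];

        /* Consumer task: may only start once the producer has finished. */
        #pragma omp task depend(in: tmp[0:n]) shared(sum)
        for (int i = 0; i < n; i++)
            sum += tmp[i];

        /* Wait for both tasks before leaving the single region. */
        #pragma omp taskwait
    }
    return sum;
}
```

Compiled without OpenMP support, the pragmas are ignored and the code runs serially with the same result.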

OpenMP 6.0 introduced the task_iteration directive, which, when used with a taskloop construct, allows for fine-grained control over the creation and properties of individual tasks within the loop. Each task_iteration directive within a taskloop signals the creation of a new task with corresponding properties. With this functionality, one can express:

  • Dependencies: The depend clause on a task_iteration directive allows you to specify data dependencies between tasks generated by the taskloop as well as between tasks of this taskloop and other tasks (standalone tasks or, e.g., tasks generated by other taskloops).
  • Affinity: The affinity clause can be used to specify data affinity for individual tasks. This enables optimizing data locality and improving cache utilization.
  • Conditions: The if clause can be used to conditionally generate tasks within the taskloop. This can be helpful for situations where not all iterations of the loop need to generate a dependency, in particular to reduce overhead.

Let’s consider the following artificial example code.

// TL1 taskloop
#pragma omp taskloop nogroup
for (int i = 1; i < n; i++)
{
   #pragma omp task_iteration depend(inout: A[i]) depend(in: A[i-1])
   A[i] += A[i] * A[i-1];
}


// TL2 taskloop + grainsize
#pragma omp taskloop grainsize(strict: 4) nogroup
for (int i = 1; i < n; i++)
{
   #pragma omp task_iteration depend(inout: A[i]) depend(in: A[i-4]) \
                              if ((i % 4) == 0 || i == n-1)
   A[i] += A[i] * A[i-1];
}


// T3 other task
#pragma omp task depend(in: A[n-1])

The first taskloop TL1 construct parallelizes a loop that has an obvious dependency: every iteration i depends on the previous iteration i-1. This is expressed with the depend clause accordingly. Consequently, this will manifest in dependencies between tasks generated by this taskloop.

The second taskloop TL2 parallelizes the loop by creating tasks that each execute four iterations, because of the grainsize clause with the strict modifier. In addition, a task dependency is only created if the expression of the if clause evaluates to true, limiting the overall number of dependencies per task.

The remaining standalone task T3 is a regular explicit task that depends on the final element of array A, which is produced by the last task of TL2, and hence ensures the completion of all previously generated tasks.
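For comparison, here is a rough sketch of how a chunked dependency chain in the spirit of TL2 and T3 could be written by hand with plain task constructs, i.e., without task_iteration. It is not an exact equivalent of the taskloop version: the function name chunked_recurrence is made up, the chunk size of 4 is hard-coded, and each chunk depends on the element just before its first iteration, A[start-1].

```c
#include <assert.h>

/* Hypothetical helper: applies A[i] += A[i] * A[i-1] for i = 1..n-1,
 * using one explicit task per chunk of four iterations. */
double chunked_recurrence(double *A, int n)
{
    assert(n >= 1);
    const int grain = 4;       /* iterations per task (assumption) */
    double final_value = A[0];

    #pragma omp parallel
    #pragma omp single
    {
        for (int start = 1; start < n; start += grain) {
            int end = (start + grain < n) ? start + grain : n;
            int last = end - 1;

            /* One dependency pair per chunk instead of one per iteration:
             * the chunk reads the element before it and produces A[last]. */
            #pragma omp task depend(in: A[start - 1]) depend(inout: A[last]) \
                             firstprivate(start, end)
            for (int i = start; i < end; i++)
                A[i] += A[i] * A[i - 1];
        }

        /* Analogue of T3: runs after the task that produced A[n-1]. */
        #pragma omp task depend(in: A[n - 1]) shared(final_value)
        final_value = A[n - 1];

        #pragma omp taskwait
    }
    return final_value;
}
```

With A initialized to 1.0 everywhere, the recurrence yields A[i] = i + 1, which makes the result easy to check. The manual version has to compute the chunk bounds itself, which is the bookkeeping that taskloop with grainsize takes over.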


Learn OpenMP with BNL’s free online tutorial

In today’s world of multi-core processors and accelerated systems, unlocking the power of parallel programming is no longer a niche skill – it’s a game-changer. BNL’s free online tutorial series on OpenMP allows you to learn OpenMP from wherever you are!

OpenMP is a powerful programming interface that simplifies parallel computing, allowing you to tap into the full potential of your hardware. By effectively parallelizing your applications, you can achieve significant performance gains, saving valuable time and resources.

This six-episode program, already underway, provides a comprehensive and convenient learning experience. Each episode, released monthly, delves into OpenMP concepts in an easy-to-follow format. The tutorial is presented by me and Michael Klemm, with one guest lecture by Ruud van der Pas, and technical information on using OpenMP compilers and tools by Helen He.

Head over to the event website to see the presented material, find links to the recordings available on YouTube, and register for the upcoming episodes. This is the complete agenda:

OpenMP Tutorials at SC22

As in previous years, several OpenMP tutorial proposals have been accepted for SC22. I am really looking forward to being in the USA again, and – among other things – to teaching OpenMP to real people, instead of black tiles. In this summary, I would like to highlight the two tutorials in which I am involved.

And by the way: in addition to the content itself, I believe these tutorials provide the extra value of direct access to members of the OpenMP Language Committee. That means we are approachable beyond the tutorial outline to discuss any topics, or any issues, you have with OpenMP.

Mastering Tasking with OpenMP

Since version 3.0, released in 2008, OpenMP has offered tasking to support the creation of composable parallel software blocks and the parallelization of irregular algorithms. Mastering the tasking concept of OpenMP requires a change in the way developers reason about the structure of their code and how to expose its parallelism. Our tutorial addresses this critical aspect by examining the tasking concept in detail and presenting patterns as solutions to many common problems.

Presenters: Christian Terboven, Michael Klemm, Xavier Teruel and Bronis R. de Supinski

Content summary:

  • OpenMP Overview (high-level summary, synchronization, memory model)
  • OpenMP Tasking Model (overview, data sharing, taskloop)
  • Improving Tasking Performance (if + final + mergeable clauses, cut-off strategies, task dependencies, task affinity)
  • Cancellation Construct
  • Future OpenMP directions

Advanced OpenMP: Host Performance and 5.2 Features

Developers usually find OpenMP easy to learn. However, they are often disappointed with the performance and scalability of the resulting code. This stems not from shortcomings of OpenMP, but rather from the lack of depth with which it is employed. Our “Advanced OpenMP Programming” tutorial addresses this critical need by exploring the implications of possible OpenMP parallelization strategies, both in terms of correctness and performance.

Presenters: Christian Terboven, Michael Klemm, Ruud van der Pas, and Bronis R. de Supinski

Content summary:

  • OpenMP Overview (high-level summary, synchronization, memory model)
  • Techniques to obtain High Performance with OpenMP: memory access (memory placement, binding, NUMA) and vectorization (understanding SIMD, vectorization in OpenMP)
  • Advanced Language Features (doacross loops, user-defined reductions, atomics)
  • Future OpenMP directions

For a complete list of SC22 activities around OpenMP and associated with the OpenMP organization, please see this page listing tutorials, the BoF, and booth talks.

Excellent price-performance of SC20 tutorials

You are probably aware that SC20 will be a virtual (= online) event. It will start in about two weeks with the Tutorials (November 9 to 11), followed by the Workshops (November 11 to 13), the Keynotes and Awards and Top500 (and more, November 16) and finally the Technical Program and Invited Talks (and more, November 17 to 19).

However, the switch to an online format brings a great advantage for the SC20 tutorial format that I only became aware of very recently: Tutorials will be recorded and available online on-demand for 6 months. This will give you the unique chance to attend all tutorials you are possibly interested in!

If you are interested in OpenMP, there are three tutorials to choose from. The OpenMP web presence has a nice overview. As usual, I am part of the Advanced OpenMP: Host Performance and 5.0 Features tutorial. Our focus is on performance aspects, e.g., data/thread locality, false sharing, and exploitation of vector units. All topics are accompanied by case studies and we will discuss the corresponding OpenMP language features in-depth. Please note that we will solely cover performance programming for multi-core architectures (not accelerators):

Our title slide: Advanced OpenMP tutorial at SC20

Personal Experience with Online Teaching

Each summer semester, I teach the lecture “Concepts and Models of Parallel and Data-centric Programming” (abbreviated as PDP) at RWTH Aachen University. The lecture mainly attracts students in the Master programs in Computer Science, Data Science, and some Computational Engineering programs. The decision to shift to online teaching was made about two weeks before the start of the lecture period. However, we had about two weeks of additional time for preparations because we anticipated this move. Since we just received the lecture evaluation of this course, I wanted to use this opportunity to write down some thoughts on my personal experience with my first Online Course.

The moment we realized that the summer semester 2020 would not be a “regular” semester, we started discussions about the format that would be best suited for the expected situation. Here, “we” includes the teaching assistants Simon Schwitanski and Julian Miller, who support me, and also Matthias Müller, the head of our institute. Back in March, before the start of the lecture period, we quickly realized that it was hard to predict the developments over the coming weeks and months. But much more importantly, we realized that we had little insight into the situations our students were in, in particular those of the international students. In consequence, we decided to maximize the flexibility that we wanted to offer to those students who chose to take the summer semester 2020 seriously. Back then, it was not yet decided that it would become a so-called optional semester in Germany.

We decided to offer a format that I would call “Flipped Classroom Style”, because we borrowed many aspects of Flipped Classroom (FC) as described in the literature, but did not fully implement it. Similarly to FC, the lectures were recorded, and students were expected to watch these videos ahead of time. The scheduled class time was used to provide a brief review of the past video lectures, to answer students’ questions, and to explore selected topics in more detail. I consider the following details of our offer to be the most important and a good summary:

  • The lecture was already well-structured into Foundations and nine Chapters. We decided against giving 90 minutes of lectures as videos. Instead, we split the existing material into pieces of 20 to 40 minutes, producing 55 videos in total. The videos were recorded and provided in the Moodle course room. Students could watch them whenever they wanted, wherever they were, and with whatever effort they were capable of investing. The videos were recorded and produced with open source software (OBS, Kdenlive, Audacity).
  • We expected the students to watch approx. 150 minutes of video per week. The content of these videos was covered in the discussion slots (= the scheduled class time) in the following week. Each discussion slot contained an overview of the material that it was intended to cover, giving us the opportunity to emphasize the most important aspects and to express complicated things again in different words. We did these sessions with Zoom.
  • For each chapter, we provided a Quiz, which is an element in Moodle. A quiz mostly consists of simple recall and comprehension questions. Some quizzes ask the students to apply their knowledge to a simple task.
  • We offered the students the option to run the exercises bare-metal on the HPC system CLAIX at RWTH Aachen University, or in a Jupyter environment that was developed in parallel in a different activity (see my first report on IkapP). Unlike the lectures, we presented the solutions to the exercises “live” (that means: in real-time on the real system) via Zoom. Of course, we had slides with instructions, code solutions, answers to theoretical questions, experiment details, and so forth available. We provided these slides in the Moodle course room after the live presentation. There was no video recorded for any exercise.

One remark on the effort. We had about 200 students signed up for the lecture in RWTH’s student management system, about 100 students in the Moodle course room, and again about 100 students signed up for the final exam. Let us assume that giving two lectures per week is a time investment of 180 minutes (speaking) plus 90 minutes (preparation) plus a bit of walking between the office and the lecture room. With the format outlined above, I would roughly estimate a factor 2 of additional effort because the videos have to be produced in advance, and the scheduled class time still required the lecturer’s presence. The preparation of new content is more or less the same (and was done before Corona happened).

I was really curious about the evaluation of the course. In summary, it was very good. The German grades go from 1.0 (the best) to 5.0 (the worst). Students who on average invested two to four hours per week in course preparation and follow-up work gave us (here: me) the following grades: 1.3 for both the Lecture Concept and the Exercise Course, and 1.4 for both the Lecture and Exercise Instruction and Behavior.

This year, the questionnaire contained a few new questions with a focus on the Digital Teaching aspect. I believe these questions were asked in the wrong manner to get meaningful insight, but they served to validate that the students were paying attention when answering the questionnaire. Here is just an excerpt:

  • Q: My enthusiasm to get involved with the contents of the course increased thanks to the digital teaching materials. A: no. My comment: well, I would not have expected any increase.
  • Q: The interaction between students and lecturers is better than with face-to-face teaching. A: no. My comment: In case of yes I would have been disappointed.
  • Q: The interaction between students is better than in face-to-face teaching. A: certainly not. My comment: in case of yes I would have been disappointed.
  • Q: I prefer face-to-face teaching to digital teaching/courses. A: no. My comment: I hope that students develop a preference for Blended Learning.

What did students say in a free-text form about the advantages of digital teaching for this course? Here are just the most common answers, which means they were given by multiple students independently:

  • Better health (more sleep, no Covid)
  • Lectures/exercises can be paused or re-watched
  • Less effort to participate, no travel times to the lecture/exercise
  • Less barrier to ask questions

To be honest, the last point prompted me to think about the lecture and to write this article, because according to our subjective impression, we received significantly fewer questions in the lecture and the exercise than in previous instances of the course. Why is that? Today, we got a good answer confirmed by three students: students ask fewer questions because videos make it easier to revisit the lecture content.

What did students say in free-text form about what they particularly liked about this course? Many of them put answers about digital teaching into this field. Here is a digital teaching-related excerpt, again with a focus on the points that were brought forward by multiple students:

  • High quality of the provided digital materials, very well-defined structure, self-contained and clear explanations for students
  • Concept of video lectures and weekly Q & A sessions
  • Many small videos rather than a few big videos
  • Great summaries and examples
  • Very good and interactive exercises
  • The quizzes
  • The flipped classroom concept went pretty well

The last point is my summary of the digital teaching concept we applied for this lecture, and I believe it worked much better than just switching the lecture session to a video broadcast – in particular for our international students. Meanwhile, we learned that a certain portion of them was not able to return to / stay in Germany (unfortunately, we do not have precise numbers), which confirmed our assumption of what would be best for the students.

For the next instance of this lecture, I would like to improve the videos. While I hope to be able to teach in a classroom again, I believe that providing some videos would be beneficial, given the feedback of the students. For instance, we could provide videos for those aspects that turned out to deliver the least points in the exam. I am currently preparing the transcription of the videos (probably Microsoft Azure Cognitive Services) to make the explanations a bit more cohesive. And we are thinking about how to address the two points of criticism that we received:

  • Not uploading exercise videos (students should participate online because they saw the advantage and benefit of doing so. And not because they otherwise miss the presentations and explanations of the teacher. For example, I was one time not able to participate in the exercise lecture because I had got an exam from last semester postponed because of Corona). – I understand that we are not consistent here, but I explained our motivation for the exercise format above. We will think about that.
  • It would have been better if we could actually have a view of our professor moving. It is quite boring to never see him. – Already for the coming winter semester, we are planning for improvement of the video setup that would allow us to record a separate picture of the lecturer. This was not practical in the home-office situation.

Using C++ with OpenMP in jupyter notebooks

Many people might know jupyter notebooks as an interactive, web-based environment for python programming, often in the context of data analysis. However, if the cling kernel is used, it becomes possible to interactively work with C/C++ code. The so-called kernel is the interpreter used for the evaluation of code cells and cling is an interactive C++ interpreter. This could look like this:

C++ in jupyter notebook

In the IkapP project funded by the Stifterverband and state of North Rhine-Westphalia, one goal is to remove entry barriers faced by students using HPC systems in lectures. One step towards this goal is the creation of a virtual lab environment for parallel programming that can also be used for interactive experiments with different parallelization approaches. Users can, e.g., interactively experience performance results of code changes on real HPC systems. There are many parallel programming models in use and of relevance for our lectures, but we wanted to start with OpenMP and MPI. However, cling does not support OpenMP out of the box.

At the time of this writing, the current version of xeus-cling is 0.8.1, which is not based on a recent version of clang. So in principle, OpenMP Version 3.1 should be supported, which means tasking will be available, but offloading will not be available. OpenMP “in action” in a jupyter notebook could look like this:

C++ with OpenMP in jupyter notebook

In order for such notebooks to work correctly, we had to fix a few things in the xeus-cling code, in particular to ensure correct output from multiple threads. The corresponding patches were created and submitted by Jonas Hahnfeld, a student worker in the IkapP project at RWTH. They have been accepted to mainline (#314, #315, #320, #332, #319, #316, #324, #325 (also submitted to xeus-python) and #329), but since our submission there has been no new release.
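Given the OpenMP 3.1 feature level mentioned above, tasking is about the most advanced feature one can expect to use in such a notebook. As a minimal sketch of what that looks like, here is the classic task-based Fibonacci computation, which only uses OpenMP 3.0 constructs. It is written in C (the snippet is also valid C++), the function names are made up, and I have not checked this exact code in a cling cell.

```c
#include <assert.h>

/* Hypothetical example: recursive Fibonacci, one task per recursive call. */
long fib_task(int n)
{
    assert(n >= 0);
    if (n < 2)
        return n;

    long x, y;
    /* n is firstprivate in the tasks; x and y must be shared so that the
     * results are visible to the parent after the taskwait. */
    #pragma omp task shared(x)
    x = fib_task(n - 1);
    #pragma omp task shared(y)
    y = fib_task(n - 2);
    #pragma omp taskwait
    return x + y;
}

/* Starts the recursion from a single thread inside a parallel region,
 * so that the other threads of the team can pick up the generated tasks. */
long fib_parallel(int n)
{
    long result = 0;
    #pragma omp parallel
    #pragma omp single
    result = fib_task(n);
    return result;
}
```

Creating a task for every call is far too fine-grained for real use; a cut-off below some n would be the usual remedy, but it is left out to keep the sketch short.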

Compiling and Installing xeus-cling on CentOS 7.7

The production environment on RWTH’s HPC systems is CentOS 7.7. The build instructions were compiled by Jonas. In order to build xeus-cling for a jupyter environment, run the build steps shown after the list for each of the following projects (in this order):

https://github.com/jarro2783/cxxopts, https://github.com/nlohmann/json, https://github.com/zeux/pugixml, https://github.com/xtensor-stack/xtl, https://github.com/zeromq/libzmq, https://github.com/zeromq/cppzmq, https://github.com/jupyter-xeus/xeus, https://github.com/jupyter-xeus/xeus-cling

$ git clone https://github.com/org/repo src
$ mkdir build
$ cd build
$ cmake -DCMAKE_BUILD_TYPE=Release \
-DCMAKE_INSTALL_PREFIX=/path/to/install/xeus-cling/ \
-DCMAKE_C_COMPILER=/path/to/install/xeus-cling/bin/clang \
-DCMAKE_CXX_COMPILER=/path/to/install/xeus-cling/bin/clang++ \
../src
$ make -j32
$ make install

After that, activate the kernels via

for k in xcpp11 xcpp14 xcpp17; do
cp -r ../../../../xeus-cling/share/jupyter/kernels/$k /path/to/install/jupyter/share/jupyter/kernels/;
done

and add -fopenmp to each kernel.json to enable OpenMP. Finally, let cling find the runtime libraries by adding to jupyterhub_config.py:

c.Spawner.environment = {
  'LD_LIBRARY_PATH': '/path/to/install/xeus-cling/lib/',
}

The Ongoing Evolution of OpenMP

Usually, I do not use this blog to talk directly about my work. I want to make one exception to point to the following article titled The Ongoing Evolution of OpenMP. It appeared online at IEEE and is accessible here: https://ieeexplore.ieee.org/document/8434208/.

From the abstract:
This paper presents an overview of the past, present and future of the OpenMP application programming interface (API). While the API originally specified a small set of directives that guided shared memory fork-join parallelization of loops and program sections, OpenMP now provides a richer set of directives that capture a wide range of parallelization strategies that are not strictly limited to shared memory. As we look toward the future of OpenMP, we immediately see further evolution of the support for that range of parallelization strategies and the addition of direct support for debugging and performance analysis tools. Looking beyond the next major release of the specification of the OpenMP API, we expect the specification eventually to include support for more parallelization strategies and to embrace closer integration into its Fortran, C and, in particular, C++ base languages, which will likely require the API to adopt additional programming abstractions.