Affinity in OpenMP 6.0 on the taskloop construct

During last week’s SC24 conference in Atlanta, GA, I briefly reported on the activity of the Affinity subcommittee of the OpenMP language committee. One topic was that, together with the Tasking subcommittee, we brought support for taskloop affinity to OpenMP 6.0, which I am going to describe here.

As you are probably well aware, the OpenMP specification currently allows for the use of the depend and affinity clauses on task constructs. The depend clause provides a mechanism for expressing data dependencies among tasks, and the affinity clause serves as a hint to the OpenMP runtime about where to execute a task, preferably close to the data items specified in the clause. However, this functionality was not made available for the taskloop construct, which parallelizes a loop by creating a set of tasks, each typically handling one or more iterations of the loop. Specifically, the depend clause could not be used to express dependencies, either between tasks within a taskloop or between tasks generated by a taskloop and other tasks, limiting its applicability.
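
As a brief, hedged illustration (the array B and the functions init and compute are placeholders of mine, not from the original post), combining the two clauses on plain task constructs might look like this:

// hint to execute the producer task close to where B resides
#pragma omp task depend(out: B[0:n]) affinity(B[0:n])
init(B, n);

// same hint for the consumer, ordered after the producer via depend
#pragma omp task depend(in: B[0:n]) affinity(B[0:n])
compute(B, n);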

OpenMP 6.0 introduced the task_iteration directive, which, when used with a taskloop construct, allows for fine-grained control over the creation and properties of individual tasks within the loop. Each task_iteration directive within a taskloop signals the creation of a new task with corresponding properties. With this functionality, one can express:

  • Dependencies: The depend clause on a task_iteration directive makes it possible to specify data dependencies between tasks generated by the taskloop, as well as between tasks of this taskloop and other tasks (standalone tasks or, for example, tasks generated by other taskloop constructs).
  • Affinity: The affinity clause can be used to specify data affinity for individual tasks, enabling better data locality and improved cache utilization (see the sketch after the example walkthrough below).
  • Conditions: The if clause can be used to conditionally generate tasks within the taskloop. This is helpful when not every iteration of the loop needs to generate a dependency, in particular to reduce overhead.

Let’s consider the following artificial example code.

// TL1 taskloop
#pragma omp taskloop nogroup
for (int i = 1; i < n; i++)
{
   #pragma omp task_iteration depend(inout: A[i]) depend(in: A[i-1])
   A[i] += A[i] * A[i-1];
}


// TL2 taskloop + grainsize
#pragma omp taskloop grainsize(strict: 4) nogroup
for (int i = 1; i < n; i++)
{
   #pragma omp task_iteration depend(inout: A[i]) depend(in: A[i-4]) \
                              if ((i % 4) == 0 || i == n-1)
   A[i] += A[i] * A[i-1];
}


// T3 other task
#pragma omp task depend(in: A[n-1])
{ /* consume the final result; an empty block keeps the example minimal */ }

The first taskloop construct, TL1, parallelizes a loop with an obvious dependency: every iteration i depends on the previous iteration i-1. This is expressed with the two depend clauses accordingly and consequently manifests as dependencies between the tasks generated by this taskloop.

The second taskloop, TL2, parallelizes the loop by creating tasks that each execute four iterations, as requested by the grainsize clause with the strict modifier. In addition, a task dependency is only created if the expression of the if clause evaluates to true, limiting the overall number of dependencies per task.

The remaining standalone task T3 is a regular explicit task that depends on the final element of array A, which is produced by the last task of TL2, and hence, via the dependency chain, ensures the completion of all previously generated tasks.
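
To round off the picture, the affinity clause from the list above combines with task_iteration in the same fashion. A minimal sketch of mine, reusing array A from the examples and assuming an OpenMP 6.0 compiler (the clause remains a hint to the runtime):

// taskloop + per-task affinity hint
#pragma omp taskloop grainsize(strict: 4) nogroup
for (int i = 1; i < n; i++)
{
   #pragma omp task_iteration affinity(A[i-1:2])
   A[i] += A[i] * A[i-1];
}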


OpenMP Tutorials at SC23

Similar to previous years, I am involved in two OpenMP-related tutorials at SC23 in Denver. This year, we produced two short videos outlining the content of these tutorials.

Our Advanced OpenMP (see the SC23 program page) tutorial with the subtitle Performance and 5.2 Features focuses on explaining how to achieve performance on modern HPC architectures and presenting the latest features of OpenMP 5.x. This half-day tutorial will be given on Monday afternoon and has the following content:

SC23 Tutorial Overview: Advanced OpenMP

The focus of our Mastering Tasking (see the SC program page) tutorial is to cover all aspects of task parallelism in OpenMP, with many code examples. This half-day tutorial will be given on Monday morning and has the following content:

SC23 Tutorial Overview: Mastering Tasking

Given our backgrounds in OpenMP development, in past instances of this tutorial we used the breaks and discussion time to answer all the questions attendees ever had on OpenMP. Really, you are invited to ask us anything :-).

OpenMP in Small Bites (online tutorial with quizzes)

As a member of the hpc.nrw regional network, I have recorded 10 video sessions for an online OpenMP tutorial. Each part consists of a short video on one selected aspect of OpenMP, followed by a couple of quiz questions for self-assessment. The tutorial has been designed to be platform-independent and to work on any operating system for which an OpenMP-capable compiler is available. However, my examples are limited to C/C++.

All material is provided under a Creative Commons license. The topics that are currently available are:

Overview

This part provides a brief history of OpenMP and then introduces the concept of the parallel region: find it here.
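
If you would like a taste before watching, a minimal parallel region (a sketch of mine, not taken from the video) looks like this:

#include <stdio.h>
#include <omp.h>

int main(void)
{
   // the statement below is executed by every thread of the team
   #pragma omp parallel
   printf("Hello from thread %d of %d\n",
          omp_get_thread_num(), omp_get_num_threads());
   return 0;
}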

Worksharing

This part introduces the concept of OpenMP worksharing, loop scheduling, and the first synchronization mechanisms: find it here.
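
As a small teaser (the arrays a and b, the loop body, and the chunk size are illustrative assumptions of mine), a worksharing loop with a schedule might look like this:

// iterations are distributed among the threads of the team;
// dynamic scheduling hands out chunks of 16 iterations on demand
#pragma omp parallel for schedule(dynamic, 16)
for (int i = 0; i < n; i++)
   b[i] = 2.0 * a[i];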

Data Scoping

This part provides an overview of one of the most challenging parts (well, probably at first sight) of OpenMP: data scoping. It discusses the differences between private, firstprivate, lastprivate and shared variables and also explains the reduction operation: find it here.
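
A minimal sketch of these clauses in action (the variable names are illustrative, not from the video):

int sum = 0;
int offset = 42;
// offset is copied into each thread (firstprivate); every thread
// accumulates a private partial sum that is combined at the end (reduction)
#pragma omp parallel for firstprivate(offset) reduction(+: sum)
for (int i = 0; i < n; i++)
   sum += a[i] + offset;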

False Sharing

This part explains the concept of caches in parallel computer architectures, discusses the problem of false sharing, and shows how to avoid it: find it here.
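
To illustrate the problem (a sketch assuming 64-byte cache lines), per-thread counters placed next to each other end up on the same cache line, and padding gives each counter a line of its own:

// bad: adjacent counters share a cache line, so concurrent updates
// by different threads make the line bounce between their caches
int counter[NUM_THREADS];

// better: pad each counter to the assumed 64-byte cache-line size
struct padded_counter { int value; char pad[64 - sizeof(int)]; };
struct padded_counter counter_padded[NUM_THREADS];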

Tasking

This part introduces task parallelism in OpenMP. This concept enables the programmer to parallelize code regions with non-canonical loops or regions which do not use loops at all (including recursive algorithms): find it here.
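
A classic sketch of such a non-canonical loop (node_t, head, and process are placeholders): one thread traverses a linked list and turns every node into a task that any thread of the team may execute:

#pragma omp parallel
#pragma omp single
for (node_t *p = head; p != NULL; p = p->next)
{
   // p is captured by value, so each task processes its own node
   #pragma omp task firstprivate(p)
   process(p);
}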

Tasking and Data Scoping

This part deepens the knowledge of OpenMP task parallelism and data scoping by using an artificial example: find it here.

Tasking and Synchronization

This session discusses different synchronization mechanisms for OpenMP task parallelism: find it here.
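
Two of these mechanisms in a nutshell (work_a, work_b, and work_c are placeholders; the snippet is assumed to run inside a parallel region, executed by a single thread):

#pragma omp task
work_a();
#pragma omp task
work_b();
#pragma omp taskwait   // waits for the two child tasks, not their descendants

#pragma omp taskgroup
{
   #pragma omp task
   work_c();           // work_c may create further nested tasks
}  // end of taskgroup: waits for work_c and all of its descendant tasks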

Loops and Tasks

This part presents the taskloop construct in OpenMP: find it here.

Task Scheduling

This part explains how task scheduling works in OpenMP: find it here.

Non-Uniform Memory Access

This part explains how a non-uniform memory access (NUMA) architecture may influence the performance of OpenMP programs. It illustrates how to distribute data and bind threads across NUMA domains and how to avoid uncontrolled data or thread migration: find it here.
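
A common recipe in this context (a sketch of mine; the array a is a placeholder, and first-touch page placement is the typical behavior on Linux) is to initialize data with the same access pattern the computation will use later, and to bind threads so they do not migrate:

// first touch: each thread initializes the part of the array it will
// work on later, so the pages are allocated in its NUMA domain
#pragma omp parallel for schedule(static)
for (int i = 0; i < n; i++)
   a[i] = 0.0;

// bind threads before the first touch, e.g., via the environment:
//   export OMP_PLACES=cores
//   export OMP_PROC_BIND=spread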

What is missing? Please let me know which aspects of OpenMP you would like to see covered in one of the next small bites. Just to let you know, some parts on GPU programming with OpenMP are already in preparation and will hopefully be released in the next lecture-free period.

Excellent price-performance of SC20 tutorials

You are probably aware that SC20 will be a virtual (= online) event. It will start in about two weeks with the Tutorials (November 9 to 11), followed by the Workshops (November 11 to 13), the Keynotes and Awards and Top500 (and more, November 16) and finally the Technical Program and Invited Talks (and more, November 17 to 19).

However, the switch to an online format brings a great advantage for the SC20 tutorial format that I only became aware of very recently: Tutorials will be recorded and available online on-demand for 6 months. This will give you the unique chance to attend all tutorials you are possibly interested in!

If you are interested in OpenMP, there are three tutorials to choose from. The OpenMP web presence has a nice overview. As usual, I am part of the Advanced OpenMP: Host Performance and 5.0 Features tutorial. Our focus is on performance aspects, e.g., data/thread locality, false sharing, and exploitation of vector units. All topics are accompanied by case studies and we will discuss the corresponding OpenMP language features in-depth. Please note that we will solely cover performance programming for multi-core architectures (not accelerators):

Our title slide: Advanced OpenMP tutorial at SC20

Webinar: Using OpenMP Tasking

With the increasing prevalence of multi-core processors, shared-memory programming models are essential. OpenMP is a popular, portable, widely supported and easy-to-use shared-memory model. Since version 3.0, released in 2008, OpenMP has offered tasking to support the creation of composable parallel software blocks and the parallelization of irregular algorithms. However, the tasking concept requires a change in the way developers reason about the structure of their code and hence expose its parallelism. In this webinar, we will give an overview of the OpenMP tasking language features and performance aspects, such as introducing cut-off mechanisms and exploiting task dependencies.
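
To give a flavor of the cut-off idea (the threshold of 20 is an illustrative choice of mine): below a certain problem size, creating more tasks only adds overhead, which the final clause can prevent:

// call from within a parallel region, e.g., inside #pragma omp single
long fib(int n)
{
   if (n < 2) return n;
   long x, y;
   // cut-off: for small n, final(...) stops further task creation, and
   // the child calls execute immediately in the encountering task
   #pragma omp task shared(x) final(n < 20) mergeable
   x = fib(n - 1);
   #pragma omp task shared(y) final(n < 20) mergeable
   y = fib(n - 2);
   #pragma omp taskwait
   return x + y;
}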

The recording from the webinar is now available here: https://youtu.be/C8ekL2x4hZk.

Webinar: Getting Performance from OpenMP Programs on NUMA Architectures

Most contemporary shared memory systems expose a non-uniform memory access (NUMA) architecture with implications for application performance. However, the OpenMP programming model does not provide explicit support for that. This 30-minute live webinar will discuss approaches to getting the best performance from OpenMP applications on NUMA architectures.

The recording from the webinar is now available here: https://pop-coe.eu/blog/2nd-pop-webinar-getting-performance-from-openmp-programs-on-numa-architectures.