Using multi-devices with OpenMP
By Roger Ferrer & Xavier Teruel – Barcelona Supercomputing Center (BSC)
Cloud services enhance business performance through agile deployment, robust security, efficient data management, and utility-based sharing models. They offer a low total cost of ownership and high-performance computing. Cloud computing supports quicker adaptation and smoother operations in dynamic market environments, simplifying technology adoption.
Data centres are crucial for cloud computing. They also need to be energy efficient, and they are currently built using systems with many GPUs, which help accelerate the workloads of computationally dense applications.
A common challenge when programming systems featuring many GPUs is making the most of them, so application developers rely on programming models able to exploit such hardware. Typically, algorithms that expose data-parallelism are well suited for acceleration. One programming model that helps developers use GPUs to accelerate their applications is OpenMP.
OpenMP is a directive-based programming model born in the context of High-Performance Computing (HPC). In contrast to other programming models that require using new APIs, OpenMP only expects developers to annotate the relevant parts of their applications that can be accelerated. OpenMP can exploit different kinds of parallelism beyond data-parallelism, such as task-parallelism.
Since its version 4.0, OpenMP provides directives that allow offloading parts of an application to an accelerator, leveraging both task- and data-parallelism. Task-parallelism in this context roughly corresponds to the offload process itself, while data-parallelism exploits the parallel execution of the accelerator. In the context of GPUs, exploiting data-parallelism is key to achieving good performance and speeding up applications. When it comes to accelerators, OpenMP is not tied to a specific vendor, which makes it portable across different technologies.
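As a minimal illustration (our own sketch, not code from the article), a data-parallel loop such as a vector update can be offloaded to a single accelerator with the standard directives available since OpenMP 4.0/4.5:

/* Minimal sketch: offload a data-parallel loop to the default device.
 * The map clauses copy the arrays to the accelerator and bring the
 * result back; teams/distribute/parallel for expose the data-parallelism. */
void saxpy(int n, float a, const float *x, float *y)
{
    #pragma omp target teams distribute parallel for \
            map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}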
OpenMP is flexible and can be used in systems built from nodes where each node has many GPUs, such as those found in the data centres mentioned above. However, the programming model does not provide enough expressiveness in these scenarios. For instance, the programmer needs to explicitly distribute data and execution over the different GPUs in the node, as shown on the left-hand side of Figure 1: the programmer must explicitly compute the bounds of the different chunks of the data-parallel loop in order to distribute it among the GPUs. Computing these bounds adds to the burden of accelerating an application and is not flexible in more complex scenarios, such as those exhibiting load imbalance.
Figure 1: Source code comparison: manual work distribution (left) and target spread approach (right).
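To make this burden concrete, the following sketch illustrates the kind of manual distribution the left-hand side of Figure 1 refers to (our own approximation, not the article's exact code): the programmer queries the number of devices, computes per-device bounds, and issues one asynchronous offload per GPU.

#include <omp.h>

/* Sketch of manual work distribution across all available GPUs.
 * The programmer must compute each chunk's bounds and map the right
 * sub-array to each device explicitly. */
void saxpy_multi(int n, float a, const float *x, float *y)
{
    int ndev = omp_get_num_devices();
    if (ndev == 0) ndev = 1;                 /* fall back to the host */
    int chunk = (n + ndev - 1) / ndev;

    for (int d = 0; d < ndev; ++d) {
        int start = d * chunk;
        int size  = start + chunk > n ? n - start : chunk;
        if (size <= 0) break;

        #pragma omp target teams distribute parallel for \
                device(d) nowait \
                map(to: x[start:size]) map(tofrom: y[start:size])
        for (int i = start; i < start + size; ++i)
            y[i] = a * x[i] + y[i];
    }
    #pragma omp taskwait                     /* wait for all offloads */
}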
At BSC, within the RISER project, we have been working on providing an easier way to program systems with several accelerators. In particular, a scenario with homogeneous accelerators, like systems with many GPUs found in data centres.
To achieve that, we have extended OpenMP with a new directive that we call spread. An example is shown on the right-hand side of Figure 1: spread extends the target directive when applied to a parallel loop. The user can now specify a set of devices using a new devices clause. For each device, the programmer can also use special variables, such as omp_spread_start and omp_spread_size, which contain the boundaries of the chunk offloaded to that accelerator. These variables, implemented by the compiler, help map the data to the different devices.
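Based on the clause and variable names described above, code using the proposed extension might look roughly like the sketch below. The exact syntax is defined by the prototype and shown in Figure 1, so details such as the combined directive form and the devices list are our own assumptions.

/* Hedged sketch of the proposed spread extension: one directive
 * distributes the loop across the listed devices, and the
 * compiler-provided omp_spread_start / omp_spread_size describe the
 * chunk assigned to each device, here used to map the matching
 * sub-arrays. The devices list (0-3) is purely illustrative. */
void saxpy_spread(int n, float a, const float *x, float *y)
{
    #pragma omp target spread teams distribute parallel for \
            devices(0, 1, 2, 3) \
            map(to: x[omp_spread_start:omp_spread_size]) \
            map(tofrom: y[omp_spread_start:omp_spread_size])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

Compared with the manual version, the chunk bounds and per-device offloads are no longer written by hand; the compiler and runtime take care of splitting the iteration space and the mapped data.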
Our goal here is not necessarily better performance. Instead, we focus on improving the expressiveness and productivity of programmers who want to make use of systems with many GPUs.
Currently, we have a prototype implementation of the spread directive on top of the Clang/LLVM compiler infrastructure. We are evaluating its usefulness and investigating its performance impact on codes that benefit from this new directive. The selection of codes being studied includes numerical applications, machine learning, cryptography and image processing. These codes are representative of the applications typically run on cloud platforms.
The RISER accelerated platform presents itself as a system with many accelerators. The spread directive provides a tool to exploit this scenario as well, thanks to the portable nature of OpenMP.