Transparent CPU-GPU Collaboration for Data-Parallel Kernels on Heterogeneous Systems

Janghaeng Lee, Mehrzad Samadi, Yongjun Park and Scott Mahlke
Advanced Computer Architecture Laboratory
University of Michigan - Ann Arbor, MI
Email: {jhaeng, mehrzads, yjunpark, mahlke}@umich.edu

Abstract—Heterogeneous computing on CPUs and GPUs has traditionally used fixed roles for each device: the GPU handles data-parallel work by taking advantage of its massive number of cores, while the CPU handles non-data-parallel work, such as sequential code or data transfer management. Unfortunately, this work distribution can be a poor solution, as it underutilizes the CPU, has difficulty generalizing beyond the single CPU-GPU combination, and may waste a large fraction of time transferring data. Further, CPUs are performance-competitive with GPUs on many workloads, so simply partitioning work based on the fixed roles may be a poor choice. In this paper, we present the single kernel multiple devices (SKMD) system, a framework that transparently orchestrates collaborative execution of a single data-parallel kernel across multiple asymmetric CPUs and GPUs. The programmer is responsible for developing a single data-parallel kernel in OpenCL, while the system automatically partitions the workload across an arbitrary set of devices, generates kernels to execute the partial workloads, and efficiently merges the partial outputs together. The goal is performance improvement by maximally utilizing all available resources to execute the kernel. SKMD handles the difficult challenges of exposed data transfer costs and the performance variation GPUs exhibit with respect to input size. On real hardware, SKMD achieves an average speedup of 29% on a system with one multicore CPU and two asymmetric GPUs, compared to a fastest-device execution strategy, for a set of popular OpenCL kernels.

Index Terms—GPGPU, OpenCL, Collaboration, Data parallel

I. INTRODUCTION

Heterogeneous computing that combines traditional processors (CPUs) with graphics processing units (GPUs) has become the standard in most systems, from cell phones to servers. GPUs achieve higher performance by providing a massively parallel architecture with hundreds of relatively simple cores while exposing parallelism to the programmer. By leveraging new programming models, such as OpenCL [13] and CUDA [1], programmers are able to effectively develop highly threaded data-parallel kernels to execute on GPUs. Meanwhile, CPUs also provide affordable performance on data-parallel applications, armed with higher clock frequencies, low memory access latency, an efficient cache hierarchy, single-instruction multiple-data (SIMD) units, and multiple cores. With these hardware characteristics, many studies have been done to improve the performance of data-parallel kernels on both CPUs and GPUs [18], [26], [3], [7], [10], [8], [5].

More recently, systems are configured with several different types of processing devices, such as CPUs with integrated GPUs and multiple discrete GPUs for higher performance. However, as most data-parallel applications are written to target a single device, the other devices will likely sit idle, which results in underutilization of the available computing resources. One solution to improve utilization is to asynchronously execute data-parallel kernels on both CPUs and GPUs, which enables each device to work on an independent kernel [4]. Unfortunately, applications that launch multiple independent kernels are rare, and they require programmer effort to ensure there are no inter-kernel data dependences. When dependences cannot be eliminated, the default execution model of one kernel at a time must be used.

To alleviate this problem, several prior works have proposed the idea of splitting the threads of a single data-parallel kernel across multiple devices [21], [14], [12]. Luk et al. [21] proposed the Qilin system, which automatically partitions threads between CPUs and GPUs by providing new APIs. However, Qilin only works for two devices (one CPU and one GPU), and the applicable data-parallel kernels are limited by usage of the APIs, which requires the access locations of all threads to be analyzed statically. Kim et al. [14] proposed the illusion of a single compute device image for multiple equivalent GPUs. Although they improved portability by using OpenCL as their input language, their work also places several constraints on the types of kernels that can benefit from multiple equivalent GPUs. For example, the access locations of each thread must follow regular patterns, and the number of threads must be a multiple of the number of GPUs.

Despite these individual successes, the majority of data-parallel kernels still cannot benefit from multiple computing devices due to strict limitations on the underlying hardware and on the types of data-parallel kernels supported. As hardware systems are configured with more than two computing devices, and as more scientific applications are converted to more complicated OpenCL/CUDA data-parallel kernels to benefit from heterogeneous architectures, these limitations become more significant. To overcome them, we have identified three central challenges that must be solved to effectively utilize multiple computing devices:

Challenge 1: Data-parallel kernels with irregular memory access patterns are hard to partition over multiple
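To make the collaboration model concrete: the approach described above splits a single kernel's flat index space into contiguous chunks, one per device. As a minimal sketch (not SKMD's actual partitioning algorithm, which also accounts for data-transfer cost and the variation of GPU performance with input size), the following hypothetical `partition_range` helper splits a 1-D work-item range in proportion to assumed per-device throughput weights:

```python
# Minimal sketch: split a 1-D NDRange of n_items work-items across
# devices in proportion to assumed throughput weights, so each device
# receives a contiguous chunk [start, end) of work-item indices.
# (Hypothetical helper -- not part of SKMD or the OpenCL API.)
def partition_range(n_items, throughputs):
    total = sum(throughputs)
    bounds = []
    start = 0
    for i, weight in enumerate(throughputs):
        # The last device takes the remainder so the chunks tile the
        # whole range exactly despite integer division.
        if i == len(throughputs) - 1:
            end = n_items
        else:
            end = start + (n_items * weight) // total
        bounds.append((start, end))
        start = end
    return bounds

# Example: 1000 work-items over a CPU (weight 1) and two asymmetric
# GPUs (weights 4 and 5).
print(partition_range(1000, [1, 4, 5]))  # [(0, 100), (100, 500), (500, 1000)]
```

In a real OpenCL host program, each chunk could then be launched on its device (for example, via the global work offset argument of clEnqueueNDRangeKernel), with the partial output buffers merged afterward; SKMD's contribution is automating these steps transparently.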
