Synchronization in OpenCL

Slides taken from "Hands On OpenCL" by Simon McIntosh-Smith, Tom Deakin, James Price, Tim Mattson and Benedict Gaster, under the Creative Commons Attribution (CC BY) license.
Consider an N-dimensional domain of work-items

• Global dimensions: 1024x1024 (the whole problem space)
• Local dimensions: 128x128 (a work-group, which executes together)

Synchronization between work-items is possible only within a work-group, using barriers and memory fences. You cannot synchronize between work-groups within a kernel.

Synchronization: when multiple units of execution (e.g. work-items) are brought to a known point in their execution. The most common example is a barrier, i.e. all units of execution “in scope” arrive at the barrier before any are allowed to proceed.
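For illustration, here is a minimal kernel sketch (the kernel name and output buffer are hypothetical, not part of the slides) showing how a work-item locates itself within the global and local dimensions described above:

    __kernel void where_am_i(__global int* group_ids)
    {
        // Position in the whole 1024x1024 problem space.
        size_t gx = get_global_id(0);
        size_t gy = get_global_id(1);

        // Which 128x128 work-group this work-item belongs to. Work-items
        // sharing a group id can synchronize with barriers and memory
        // fences; work-items in different work-groups cannot.
        size_t wgx = get_group_id(0);
        size_t wgy = get_group_id(1);

        size_t groups_x = get_num_groups(0);
        group_ids[gy * get_global_size(0) + gx] = (int)(wgy * groups_x + wgx);
    }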
Simple parallel reduction

A reduction can be carried out in three steps:
1. Each work-item sums its private values into a local array indexed by the work-item’s local id.
2. When all the work-items have finished, one work-item sums the local array into an element of a global array (indexed by work-group id).
3. When all work-groups have finished the kernel execution, the global array is summed on the host.

Note: this is a simple reduction that is straightforward to implement. More efficient reductions do the work-group partial reductions in parallel on the device rather than on the host. These more scalable reductions are considerably more complicated to implement. A sketch of the simple scheme follows.
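A minimal OpenCL C sketch of steps 1 and 2 (the kernel name and buffer arguments are illustrative, not taken from the course code); step 3 happens on the host:

    __kernel void reduce(__global const float* in,
                         __local  float* scratch,
                         __global float* group_sums,
                         const int items_per_work_item)
    {
        int lid = get_local_id(0);
        int gid = get_global_id(0);

        // Step 1: each work-item sums its private values into the
        // local array, indexed by its local id.
        float sum = 0.0f;
        for (int i = 0; i < items_per_work_item; i++)
            sum += in[gid * items_per_work_item + i];
        scratch[lid] = sum;

        // Wait until every work-item in the group has written its sum.
        barrier(CLK_LOCAL_MEM_FENCE);

        // Step 2: one work-item sums the local array into the global
        // array element for this work-group.
        if (lid == 0) {
            float group_sum = 0.0f;
            for (int i = 0; i < (int)get_local_size(0); i++)
                group_sum += scratch[i];
            group_sums[get_group_id(0)] = group_sum;
        }
        // Step 3: the host sums the group_sums array.
    }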
Work-Item Synchronization

Ensure correct order of memory operations to local or global memory (with flushes or by queuing a memory fence).

• Within a work-group: void barrier(cl_mem_fence_flags flags)
  – Takes the flags CLK_LOCAL_MEM_FENCE and/or CLK_GLOBAL_MEM_FENCE.
  – A work-item that encounters a barrier() will wait until ALL work-items in its work-group reach the barrier().
  – Corollary: if a barrier() is inside a branch, then the branch must be uniform, i.e. taken by either ALL work-items in the work-group, or NO work-items in the work-group.
• Between different work-groups:
  – No guarantees as to where and when a particular work-group will be executed relative to other work-groups.
  – Cannot exchange data, or have barrier-like synchronization, between two different work-groups! (Critical issue!)
  – Only solution: finish executing the kernel and start executing another.
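For concreteness, a sketch of the uniform-branch rule (the kernel is illustrative, not from the slides):

    __kernel void branch_barrier(__global float* data, const int flag)
    {
        // Legal: 'flag' is a kernel argument, so the condition is the same
        // for every work-item -- either ALL work-items in the work-group
        // reach the barrier or NONE do.
        if (flag > 0) {
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        // NOT legal: the condition depends on the work-item id, so only
        // some work-items would reach the barrier. The behavior is
        // undefined and typically hangs the kernel:
        //
        //   if (get_local_id(0) < 32)
        //       barrier(CLK_LOCAL_MEM_FENCE);

        data[get_global_id(0)] *= 2.0f;
    }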
Tree Reduction

• Perform multiple rounds of binary reduction on local memory.
• Mask or exclude threads at each round of reduction.
• Still need to reduce across work-group results in global memory.

A sketch of the in-work-group part is shown below. See also: https://devblogs.nvidia.com/faster-parallel-reductions-kepler/
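A minimal sketch of the in-work-group tree reduction (assuming the work-group size is a power of two; the kernel and argument names are illustrative):

    __kernel void tree_reduce(__global const float* in,
                              __local  float* scratch,
                              __global float* group_sums)
    {
        int lid = get_local_id(0);

        // Load one value per work-item into local memory.
        scratch[lid] = in[get_global_id(0)];
        barrier(CLK_LOCAL_MEM_FENCE);

        // Rounds of binary reduction: at each round half as many
        // work-items remain active; the rest are masked out by the test.
        for (int offset = (int)(get_local_size(0) / 2); offset > 0; offset /= 2) {
            if (lid < offset)
                scratch[lid] += scratch[lid + offset];
            // Uniform branch around the loop, so this barrier is legal.
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        // One partial result per work-group; these still have to be
        // reduced across work-groups in global memory (or on the host).
        if (lid == 0)
            group_sums[get_group_id(0)] = scratch[0];
    }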
A simple program that uses a reduction: Numerical Integration

Mathematically, we know that we can approximate the integral as a sum of rectangles. Each rectangle has a fixed width, and its height is the value of the integrand at the middle of the interval.

[Figure: the integrand plotted against X, with y-axis ticks at 0.0, 1.0, 2.0 and 4.0, and the area under the curve approximated by rectangles.]
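The identity behind this example, made explicit here (it follows from the serial code on the next slide):

\[
\pi = \int_0^1 \frac{4}{1+x^2}\,dx
\;\approx\; \sum_{i=0}^{N-1} \frac{4}{1+x_i^2}\,\Delta x,
\qquad x_i = \Bigl(i+\tfrac{1}{2}\Bigr)\Delta x,\quad \Delta x = \frac{1}{N}.
\]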
Numerical integration source code: the serial Pi program

    static long num_steps = 100000;
    double step;

    void main()
    {
        int i;
        double x, pi, sum = 0.0;
        step = 1.0 / (double) num_steps;

        for (i = 0; i < num_steps; i++) {
            x = (i + 0.5) * step;
            sum = sum + 4.0 / (1.0 + x * x);
        }
        pi = step * sum;
    }
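A sketch of how the same computation maps onto the simple reduction pattern from the earlier slides (kernel and argument names are illustrative; the course's own solution may differ):

    __kernel void pi_reduce(const int   niters,        // iterations per work-item
                            const float step,
                            __local  float* scratch,
                            __global float* group_sums)
    {
        int gid = get_global_id(0);
        int lid = get_local_id(0);

        // Each work-item accumulates its share of rectangle areas privately.
        float sum = 0.0f;
        for (int i = gid * niters; i < (gid + 1) * niters; i++) {
            float x = ((float)i + 0.5f) * step;
            sum += 4.0f / (1.0f + x * x);
        }
        scratch[lid] = sum;
        barrier(CLK_LOCAL_MEM_FENCE);

        // Work-item 0 produces this work-group's partial sum; the host
        // sums group_sums and multiplies by 'step' to get pi.
        if (lid == 0) {
            float group_sum = 0.0f;
            for (int i = 0; i < (int)get_local_size(0); i++)
                group_sum += scratch[i];
            group_sums[get_group_id(0)] = group_sum;
        }
    }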
Looking for Inspiration?

• NVIDIA’s OpenCL SDK includes multiple implementations of parallel reduction, with varying levels of optimization for the GPU: https://developer.nvidia.com/opencl