Deriving Efficient Data Movement From Decoupled Access/Execute Specifications (PowerPoint presentation)




SLIDE 1

The Queen’s Tower, Imperial College London, South Kensington, SW7

27th Jan 2008 | Ashley Brown

Lee Howes

Deriving Efficient Data Movement From Decoupled Access/Execute Specifications

Lee W. Howes, Anton Lokhmotov, Alastair F. Donaldson and Paul H. J. Kelly. Imperial College London and Codeplay Software. January 2009

SLIDE 2

Multi-core architectures

  • Require parallel programming
  • Must divide computation
  • Must communicate data
  • High-throughput computation
    – Efficient use of memory bandwidth essential

Source: AMD

SLIDE 3

Cell's hardware solution

  • Target the memory wall:
    – Distributed local memories: 256 kB each
    – Separate data movement from computation using DMA engines
  • Bulk transfers increase efficiency
  • Increased programming challenge:
    – Must write data movement code
    – Must deal with alignment constraints
  • Premature optimisation:
    – Platform independence is lost

Source: IBM

SLIDE 4

Mainstream programming models

  • No explicit support for separation of computation from data access
  • Freely mix computation and data movement
  • Complexity of compiler analysis => difficult to extract separation
  • Orthogonal issues:
    – extracting parallelism
    – creating data movement code

SLIDE 5

The proposal

  • Allow the programmer to express explicitly:
    – Separation between data communication and computation
    – Parallelism of the computation

SLIDE 6

Streams

  • Approaches the separation ideal
  • Simple kernel applied to each element of a data set
  • Each element of stream typically independent of others
    – No feedback as a parallel processing model
    – Dependencies only on input and output elements

SLIDE 7

Parallelism in stream programming

  • Independence of executions => simple inference of parallelism
  • Sliding windows of elements on inputs
    – access multiple elements
    – parallelism still predictable
  • AMD, NVIDIA use a stream model for parallel hardware

SLIDE 8

Streams? A 2D convolution filter

  • Reads region of input
  • Processes region
  • Writes single point in the output
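The read-region/process/write-point pattern described above can be sketched in plain C++ as a mean filter; the function name and layout are illustrative only, not part of the Æcute framework:

```cpp
#include <vector>

// K-radius mean filter over a w x h row-major image: every interior
// output point is the average of the (2K+1)x(2K+1) input region
// centred on it; border points are passed through unchanged.
std::vector<float> convolve2D(const std::vector<float>& in,
                              int w, int h, int K) {
    std::vector<float> out(in);  // borders copied from the input
    const float count = float((2 * K + 1) * (2 * K + 1));
    for (int y = K; y < h - K; ++y)
        for (int x = K; x < w - K; ++x) {
            float sum = 0.0f;  // read a region of the input
            for (int dy = -K; dy <= K; ++dy)
                for (int dx = -K; dx <= K; ++dx)
                    sum += in[(y + dy) * w + (x + dx)];
            out[y * w + x] = sum / count;  // write a single output point
        }
    return out;
}
```

Note that each output point depends on a whole input region, which is exactly what a pure elementwise stream model cannot express directly.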
SLIDE 9

Representing convolution as 1D streams

  • One option: flatten the 2D dataset
    – Requires multiple sliding windows or long FIFO structures
  • Mapping 2D structures to 1D streams is untidy
SLIDE 10

Representing convolution as 2D streams

  • Stanford's Brook language uses stencils on 2D shaped streams

    floats x;
    floats2 y;
    streamShape(x, 2, 32, 32);
    streamStencil(y, x, STREAM_STENCIL_CLAMP, 2, 1, -1, 1, -1);

    kernel void neighborAvg(floats2 a, out floats b) {
      b = 0.25*(a[0][1] + a[2][1] + a[1][0] + a[1][2]);
    }

SLIDE 11

Representing convolution as 2D streams

  • Stencil stream passed to kernel
  • Treated as if it is a small set of accessible elements
  • Limited addressing capabilities

    floats x;
    floats2 y;
    streamShape(x, 2, 32, 32);
    streamStencil(y, x, STREAM_STENCIL_CLAMP, 2, 1, -1, 1, -1);

    kernel void neighborAvg(floats2 a, out floats b) {
      b = 0.25*(a[0][1] + a[2][1] + a[1][0] + a[1][2]);
    }

SLIDE 12

Generalising streams

  • View streams as:
    – A kernel, executed separately on each data element
    – A simple mapping of that kernel onto the data: elementwise or moving-windowed
  • This is a simplistic separation of access from execution, hence the Decoupled Access/Execute (Æcute) model

SLIDE 13

Æcute as a generalisation of streams

  • Take a similar kernel-per-element declarative programming model
  • View in terms of an iteration space that is independent of the data sets
  • With a separate, flexible mapping to the data
  • Mapping allows clean descriptions of complicated data access patterns
  • Simpler kernel implementations with localised data sets

SLIDE 14

Execute

  • Define an iteration space (e.g. as polyhedral constraints)
  • Execute a computation kernel for each point in the iteration space
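A minimal sketch of the execute side, assuming a rectangular iteration space (the simplest polyhedral constraint set); `forEachPoint` is an illustrative name, not the framework's API:

```cpp
#include <functional>

// Run the kernel once for every point (x, y) of the rectangular
// iteration space x0 <= x < x1, y0 <= y < y1.
void forEachPoint(int x0, int y0, int x1, int y1,
                  const std::function<void(int, int)>& kernel) {
    for (int y = y0; y < y1; ++y)
        for (int x = x0; x < x1; ++x)
            kernel(x, y);  // one kernel invocation per iteration point
}
```

Because the space is described declaratively rather than as a hand-written loop nest, a runtime is free to partition or reorder the points.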

SLIDE 15

Data access

  • On each iteration, the kernel accesses a set of data elements
  • Accessed elements treated as local to the iteration
  • Eases programming of the kernel
SLIDE 16

Decoupled access/execute

  • Decouple access to remote memory from local execution
  • Separate mapping of local store to global data
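The decoupling can be sketched in three phases: stage global data into a local buffer, compute purely on the local copy, then write the result back. On Cell the staging would be a DMA into SPE local store; this hypothetical sketch stands in with `memcpy`:

```cpp
#include <vector>
#include <cstring>
#include <cstddef>

// Scale `count` elements of `global` starting at `begin`, working only
// on a local copy between the two staging transfers.
void scaleChunk(std::vector<float>& global, std::size_t begin,
                std::size_t count, float factor) {
    std::vector<float> local(count);
    // Access phase: gather the chunk into local storage.
    std::memcpy(local.data(), global.data() + begin, count * sizeof(float));
    // Execute phase: the kernel sees only the local buffer.
    for (std::size_t i = 0; i < count; ++i)
        local[i] *= factor;
    // Access phase: scatter the results back to global memory.
    std::memcpy(global.data() + begin, local.data(), count * sizeof(float));
}
```

Keeping all remote accesses in the two staging steps is what lets a runtime replace them with bulk, pipelined transfers.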
SLIDE 17

Multiple iterations

  • Decouple access and execute over multiple iterations for efficiency
  • Manually supporting this flexibility can be challenging
SLIDE 18

Add in alignment issues

  • DMAs must be adapted to correct for alignment
  • Data can often be read with alignment tweaks to recover performance
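One common alignment tweak can be sketched as follows: round a requested transfer down and up to an alignment boundary (Cell DMA favours 16-byte-aligned addresses and sizes) and remember the leading offset so the kernel can index into the padded buffer. The struct and function names here are illustrative:

```cpp
#include <cstdint>

// An aligned transfer covering a possibly-unaligned request:
// `start` and `size` are alignment-multiple, `lead` is how far into
// the padded buffer the requested bytes begin.
struct AlignedSpan {
    std::uint64_t start;
    std::uint64_t size;
    std::uint64_t lead;
};

AlignedSpan alignForDMA(std::uint64_t addr, std::uint64_t bytes,
                        std::uint64_t align = 16) {
    std::uint64_t start = addr & ~(align - 1);                        // round start down
    std::uint64_t end   = (addr + bytes + align - 1) & ~(align - 1);  // round end up
    return { start, end - start, addr - start };
}
```

Transferring slightly more data than requested is usually far cheaper than issuing an unaligned or element-by-element transfer.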

SLIDE 19

In code: The iterator

    Neighbourhood2D_Read inputPointSet(iterationSpace, input, K);
    Point2D_Write outputPointSet(iterationSpace, output);
    ...
    void kernel( const IterationSpace2D::element_iterator &eit ) {
      // compute mean
      rgb mean( 0.0f, 0.0f, 0.0f );
      for (int w = -K; w <= K; ++w) {
        for (int z = -K; z <= K; ++z) {
          mean += inputPointSet(eit, w, z);  // input[x+w][y+z]
        }
      }
      outputPointSet( eit ) = mean / ((2*K+1)*(2*K+1));
    }

SLIDE 20

In code: Use of access descriptors

    Neighbourhood2D_Read inputPointSet(iterationSpace, input, K);
    Point2D_Write outputPointSet(iterationSpace, output);
    ...
    void kernel( const IterationSpace2D::element_iterator &eit ) {
      // compute mean
      rgb mean( 0.0f, 0.0f, 0.0f );
      for (int w = -K; w <= K; ++w) {
        for (int z = -K; z <= K; ++z) {
          mean += inputPointSet(eit, w, z);  // input[x+w][y+z]
        }
      }
      outputPointSet( eit ) = mean / ((2*K+1)*(2*K+1));
    }

SLIDE 21

In code: Computation in the kernel

    Neighbourhood2D_Read inputPointSet(iterationSpace, input, K);
    Point2D_Write outputPointSet(iterationSpace, output);
    ...
    void kernel( const IterationSpace2D::element_iterator &eit ) {
      // compute mean
      rgb mean( 0.0f, 0.0f, 0.0f );
      for (int w = -K; w <= K; ++w) {
        for (int z = -K; z <= K; ++z) {
          mean += inputPointSet(eit, w, z);  // input[x+w][y+z]
        }
      }
      outputPointSet( eit ) = mean / ((2*K+1)*(2*K+1));
    }

SLIDE 22

Æcute iteration spaces

  • Define an n-dimensional iteration space
  • Specify sizes for each dimension
    – can be defined at run time
  • For example:
      IterationSpace<2> iSpace( 0, 0, 10, 10 );
  • Over which we can iterate using fairly standard syntax:
      for( IterationSpace<2>::iterator it = iSpace.begin(); ... ) {...}
  • Can treat the iterator loop much as an OpenMP blocked loop

SLIDE 23

Æcute access descriptors

  • Define a mapping from an iteration space to an array
  • Specify shape and mapping functions
  • For example:
      Region2D< Array<rgb,2>, IterationSpace<2> > inputPointSet( iSpace, data, RADIUS );
  • Which we can access using an iterator:
      inputPointSet(it, 1, 0).r = 3;

SLIDE 24

Æcute address modifiers

  • Base address of a region combines:
    – iterator address in its iteration space
    – address modifier function
  • A modifier, or modifier chain, is applied (optionally) to each access descriptor:
      Point2D< Project2D1D< 1, 0 > > inputPointSet( iSpace, data, RADIUS );
    – Projects a 2D address into a 1D address to access a 1D dataset
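A hypothetical sketch of such a modifier, assuming the template parameters of Project2D1D are per-dimension coefficients (the framework's actual semantics are not spelled out beyond the slide):

```cpp
// Maps a 2D iteration point (x, y) to the 1D address CX*x + CY*y.
// With <1, 0> the y coordinate is discarded, so a 1D dataset is
// indexed by x alone; <0, 1> would index it by y instead.
template <int CX, int CY>
struct Project2D1D {
    static int apply(int x, int y) { return CX * x + CY * y; }
};
```

Chaining such modifiers keeps non-trivial address arithmetic inside the access descriptor, out of the kernel body.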

SLIDE 25

The Æcute framework

  • Implementation of the Æcute model for data movement on the STI Cell processor
SLIDE 26

Iterating

  • PPE takes a chunk of the iteration space
    – Blocking is configurable
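The PPE-side chunking step can be sketched for a 1D range; the function name and representation are illustrative only:

```cpp
#include <vector>
#include <utility>
#include <algorithm>

// Split the iteration range [0, n) into half-open blocks of a
// configurable size; the final block may be shorter.
std::vector<std::pair<int, int>> blockRanges(int n, int blockSize) {
    std::vector<std::pair<int, int>> blocks;
    for (int start = 0; start < n; start += blockSize)
        blocks.push_back({start, std::min(start + blockSize, n)});
    return blocks;
}
```

Each block is then a self-contained unit of work that can be sent to an SPE.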

SLIDE 27

Delegation

  • Transmits chunk to appropriate SPE runtime as a message

SLIDE 28

Loading data

  • SPE loads appropriate data for the chunk into an internal buffer in each access descriptor object

SLIDE 29

Loading data

  • SPE processes one buffer set while receiving the next block to process

SLIDE 30

Loading data

  • DMA loads of the next buffers operate in parallel with computation

SLIDE 31

Loading data

  • On completion of a block, input buffers are cleared and output DMAs initiated

SLIDE 32

Advantages

  • Separation of buffering maintains simplicity
  • Double/triple buffering comes naturally when there are no data-dependent loads
  • Removes the complexity of manual software pipelining
  • Complicated addressing schemes are not precluded
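The double-buffering structure can be sketched in miniature: while one buffer is "computed", the next chunk is already staged into the other. This sequential sketch (illustrative names, assumed reduction kernel) shows only the structure that lets loads and compute overlap on real hardware:

```cpp
#include <vector>
#include <algorithm>
#include <cstddef>

// Sum a data set chunk by chunk using two alternating local buffers.
float sumInChunks(const std::vector<float>& data, std::size_t chunk) {
    std::vector<float> buf[2];
    std::size_t pos = std::min(chunk, data.size());
    int cur = 0;
    buf[cur].assign(data.begin(), data.begin() + pos);  // prime: stage first chunk
    float total = 0.0f;
    while (!buf[cur].empty()) {
        int nxt = 1 - cur;
        // "Access": stage the next chunk into the other buffer.
        std::size_t m = std::min(chunk, data.size() - pos);
        buf[nxt].assign(data.begin() + pos, data.begin() + pos + m);
        pos += m;
        // "Execute": compute on the current buffer only.
        for (float v : buf[cur]) total += v;
        buf[cur].clear();
        cur = nxt;  // swap roles
    }
    return total;
}
```

Because the kernel never touches the buffer being filled, the staging step can be replaced by an asynchronous DMA without changing the kernel.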
SLIDE 33

Non-affine addressing

  • Simple stencils are not flexible enough
  • Partitioning of the iteration space defines parallelism
  • Generating complicated addressing schemes is often necessary
    – Addressing can still be performed externally to the computation and automatically pipelined
    – Alignment may need to be handled per element if relationship inference is not possible

SLIDE 34

The bit reversal

  • As used in a radix-2 FFT:
    – Performs a complicated, but predictable, permutation of a data set
    – Input address with bits reversed => output address
  • Access descriptors can wrap complicated addressing
    – Generate DMA lists
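The permutation itself can be sketched directly (standard bit-reversal as used by radix-2 FFTs; the function names are illustrative):

```cpp
#include <cstdint>
#include <vector>

// Reverse the low `bits` bits of i: e.g. 001 -> 100 for bits = 3.
std::uint32_t reverseBits(std::uint32_t i, unsigned bits) {
    std::uint32_t r = 0;
    for (unsigned b = 0; b < bits; ++b) {
        r = (r << 1) | (i & 1);
        i >>= 1;
    }
    return r;
}

// Scatter each input element to its bit-reversed index: a predictable
// but non-affine addressing pattern of the kind discussed above.
std::vector<float> bitReversePermute(const std::vector<float>& in,
                                     unsigned bits) {
    std::vector<float> out(in.size());
    for (std::uint32_t i = 0; i < in.size(); ++i)
        out[reverseBits(i, bits)] = in[i];
    return out;
}
```

Since every target address is computable up front, an access descriptor wrapping this pattern can emit the whole transfer as a DMA list rather than per-element copies.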

SLIDE 35

Æcute performance: CTM filter

SLIDE 36

Æcute performance: Matrix/vector multiply

SLIDE 37

Æcute performance: Bit reverse

SLIDE 38

Conclusions

  • Programming model that generalises streaming
  • Declarative mapping of computation to data
  • Separate kernel implementation working on a simple data subset
  • Further work on:
    – Inference of inter-kernel dependencies
    – Merging of earlier kernel fusion work
    – Targeting different architectures: GPUs
    – Compiler support
    – Integration with the Sieve system
    – Investigating the limits of this kind of specification