Optimizing Explicit Data Transfers for Data Parallel Applications on Heterogeneous Multi-core Platforms
S. Saidi (1, 2), P. Tendulkar (1), T. Lepley (2), O. Maler (1)
(1) Verimag, (2) STMicroelectronics
HiPEAC 2012
Saidi, Tendulkar, Lepley, Maler: Data Transfers in Arch.
Outline
1. Introduction
2. Optimal Granularity
   - One Processor
   - Multiple Processors
3. Shared Data Transfers
4. Experiments on the CELL Architecture
Introduction: Motivation
How can the off-chip memory latency be reduced or hidden?
[Figure: a host CPU and a multi-core fabric (PE 0 ... PE n, memories, interconnect) connected to an off-chip memory.]
Introduction: Heterogeneous Multi-core Architectures
A powerful host processor is paired with a multi-core fabric that accelerates computationally heavy kernels.
[Figure: host CPU and multi-core fabric (PE 0 ... PE n, memories, interconnect) connected to an off-chip memory; tasks T 0, T 1 and T 2 are labelled on the platform.]
Introduction: Data Transfers
Offloadable kernels work on large data sets that are initially stored in the off-chip memory.
Algorithm:
    for i = 0 to n - 1
        Y[i] = f(X[i])
    od
[Figure: task T 0 runs on PE 0; the arrays X and Y reside in the off-chip memory.]
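The loop on this slide can be written out as a short Python sketch. The body of `f` is a placeholder chosen only for illustration (the slides do not define it), and plain lists stand in for the off-chip arrays X and Y. The point of the sketch is that each iteration depends only on its own element, so the array can be cut into pieces and processed piece by piece with the same result.

```python
def f(x):
    # Hypothetical per-element function; the slides leave f unspecified.
    return x * x


def kernel(X):
    """Data-parallel kernel from the slide: Y[i] = f(X[i]) for i = 0 .. n - 1."""
    n = len(X)
    Y = [0] * n
    for i in range(n):
        Y[i] = f(X[i])
    return Y
```

Since no iteration reads a value produced by another, applying `kernel` to two halves of an array and concatenating the results gives the same output as applying it to the whole array.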
Introduction: Data Transfers
Off-chip memory latency is high: accessing off-chip data is very costly. Running the loop directly on off-chip data means that every X[i] is read from, and every Y[i] is written to, the off-chip memory.
[Figure: task T 0 on PE 0, with Read and Write arrows between the fabric and the off-chip arrays X and Y.]
Introduction: Data Transfers
Data is transferred to a closer but smaller on-chip memory using DMA (Direct Memory Access) transfers. The arrays are moved block by block (block 0, block 1, ...).
[Figure: data block transfers between the off-chip arrays X and Y and the memory of the multi-core fabric.]
Introduction: DMA Data Transfers
s: number of array elements in one block. The array X of n elements, X[0] ... X[n - 1], is split into m = n / s blocks, block 0 ... block m - 1.
Each block is fetched, processed and written back in turn:
    i = 0
    while (i < n / s)
        Fetch(block i)          dma_get(local-buffer, block i, s)
        Compute(block i)
        Write back(block i)     dma_put(block i, local-buffer, s)
        i++
Computations and data transfers execute sequentially.
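The sequential fetch, compute, write-back scheme can be sketched in Python as follows. This is only a model of the control flow: `dma_get` and `dma_put` are hypothetical stand-ins that copy list slices, not calls to a real DMA library, `f` is a placeholder, and the sketch assumes that n is a multiple of s.

```python
def f(x):
    # Hypothetical per-element function; the slides leave f unspecified.
    return 2 * x + 1


def dma_get(local_buffer, src, start, s):
    """Simulated blocking DMA read: copy s elements of src into the local buffer."""
    local_buffer[:s] = src[start:start + s]


def dma_put(dst, start, local_buffer, s):
    """Simulated blocking DMA write: copy s elements of the local buffer into dst."""
    dst[start:start + s] = local_buffer[:s]


def run_sequential(X, s):
    """Process X in blocks of s elements using a single local buffer."""
    n = len(X)
    assert s > 0 and n % s == 0, "sketch assumes n is a multiple of s"
    Y = [0] * n
    local_buffer = [0] * s            # stands in for the small on-chip memory
    for i in range(n // s):
        dma_get(local_buffer, X, i * s, s)        # Fetch(block i)
        for j in range(s):                        # Compute(block i), in place
            local_buffer[j] = f(local_buffer[j])
        dma_put(Y, i * s, local_buffer, s)        # Write back(block i)
    return Y
```

Because each simulated transfer returns only when its copy is complete, the processor never computes while a transfer is in flight, which is the sequential behaviour described on the slide.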
Introduction: Double Buffering
Asynchronous DMA calls and double buffering:
    dma_get(local-buffer[1], block 0, s)          Fetch(block 0)
    i = 0
    while (i < (n / s) - 1)
        dma_get(local-buffer[2], block i+1, s)    Fetch(block i+1), issued in parallel with Compute(block i)
        Compute(block i)
        Write back(block i)
        i++
    Compute(block (n / s) - 1)
    Write back(block (n / s) - 1)
Computations and data transfers overlap.
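The double-buffering loop can be modelled in Python along the following lines. Several things here are assumptions made for illustration: a single worker thread plays the role of the DMA engine, `dma_get` and `dma_put` are hypothetical slice-copy helpers rather than a real DMA API, `f` is a placeholder, n is taken to be a non-zero multiple of s, and the two local buffers swap roles on every iteration (the slide shows fixed buffer indices for brevity). Python threads do not give true parallel speed-up here, so the sketch illustrates the structure of the overlap, not its performance.

```python
from concurrent.futures import ThreadPoolExecutor


def f(x):
    # Hypothetical per-element function; the slides leave f unspecified.
    return 2 * x + 1


def dma_get(local_buffer, src, start, s):
    """Simulated DMA read: copy s elements of src into the local buffer."""
    local_buffer[:s] = src[start:start + s]


def dma_put(dst, start, local_buffer, s):
    """Simulated DMA write: copy s elements of the local buffer into dst."""
    dst[start:start + s] = local_buffer[:s]


def run_double_buffered(X, s):
    """Process X in blocks of s elements, prefetching block i+1 during block i."""
    n = len(X)
    assert s > 0 and n > 0 and n % s == 0, "sketch assumes n is a non-zero multiple of s"
    m = n // s
    Y = [0] * n
    buffers = [[0] * s, [0] * s]      # the two local buffers
    with ThreadPoolExecutor(max_workers=1) as dma_engine:
        dma_get(buffers[0], X, 0, s)                  # prologue: Fetch(block 0)
        for i in range(m - 1):
            cur, nxt = buffers[i % 2], buffers[(i + 1) % 2]
            # Asynchronous Fetch(block i+1) into the buffer not being computed on.
            pending = dma_engine.submit(dma_get, nxt, X, (i + 1) * s, s)
            for j in range(s):                        # Compute(block i)
                cur[j] = f(cur[j])
            dma_put(Y, i * s, cur, s)                 # Write back(block i)
            pending.result()                          # block i+1 must have arrived
        last = buffers[(m - 1) % 2]                   # epilogue: last block
        for j in range(s):
            last[j] = f(last[j])
        dma_put(Y, (m - 1) * s, last, s)
    return Y
```

The fetch of block i+1 always targets the buffer that is not being computed on, so the two activities never touch the same memory; waiting on `pending.result()` before the next iteration is what guarantees that a block is complete before it is used.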