Integrating DMA capabilities into BLIS for on-chip data movement

Devangi Parikh, Ilya Polkovnichenko, Francisco Igual Peña, Murtaza Ali
5 Generations of TI Multicore Processors
• Keystone architecture
  – Lowers development effort
  – Speeds time to market
  – Leverages TI's investment
  – Optimal software reuse
TI 66AK2H12 SoC
• Keystone II architecture
• Cores
  – 4 ARM A15s at 1.0 GHz
    • 4 MB shared L2 cache
    • 32 GFLOPS single precision and 8 GFLOPS double precision
  – 8 C66x DSPs at 1.0 GHz
    • 64 kB L1 scratch/cache each
    • 1 MB L2 scratch/cache each
    • 128 GFLOPS single precision and 32 GFLOPS double precision
• Memory
  – 8 GB DDR3 DRAM (external)
  – 6 MB shared SRAM
• Interfaces
  – 2x Gigabit Ethernet, ~100 MB/s
  – 4x SRIO, ~400 MB/s
  – 2x Hyperlink, ~1 GB/s
Development Philosophy
• User view of acceleration
  – Embedded Linux running on the ARM
  – Standard GCC tool chain
  – Simply link to a TI-provided library with an ARM-callable API to accelerate applications using multiple ARM cores, DSP cores and processors as appropriate
  – Use TI-provided tools and examples to write new applications and libraries which use multiple ARM cores, DSP cores and processors to accelerate performance
• Using multiple cores on a single processor
  – OpenMP for shared-memory parallelization across ARM cores
  – OpenCL or OpenMP Accelerator for heterogeneous acceleration with multiple DSP cores
• Using multiple processors
  – Open MPI over Ethernet, SRIO or Hyperlink
ARM + OpenCL DSP Acceleration
• Data parallel
  – A kernel is enqueued
  – OpenCL divides it into N workgroups
  – Each workgroup is assigned a core
  – After all workgroups finish, a new kernel can be dispatched
• Task parallel
  – A task is enqueued
  – OpenCL dispatches tasks to cores
  – OpenCL can accept and dispatch more tasks asynchronously from the ARM
• OpenCL + OpenMP regions
  – A task is enqueued
  – OpenCL dispatches the task to DSP 0
  – Tasks can use additional DSP cores by entering OpenMP regions
  – A task completes before another task is dispatched
  – Note: this is a TI extension
• Example use
  – Calling existing OpenMP-based DSP code from the ARM
ARM + OpenMP Accelerator DSP Acceleration

// OpenMP Accelerator vector add
// OpenMP for loop parallelization
void ompVectorAdd(int N, float *a, float *b, float *c)
{
    #pragma omp target \
        map(to: N, a[0:N], b[0:N]) \
        map(from: c[0:N])
    {
        int i;
        #pragma omp parallel for
        for (i = 0; i < N; i++)
            c[i] = a[i] + b[i];
    }
}

• Data movement
  – to copies variables from the ARM memory to the DSP memory
  – from copies variables from the DSP memory to the ARM memory
  – TI provides special alloc and free functions to allocate DSP memory such that copies are not needed
• Calling existing DSP code from the ARM
  – Wrapping existing DSP functions with OpenMP Accelerator code is straightforward
Memory
• Shared memory visible by both the ARM and DSP
  – A portion of the 8 GB DDR3 DRAM (external)
  – The 6 MB shared SRAM
• Performance keys
  – Allocate data in the shared memory for ARM setup and DSP acceleration
  – Use clmalloc() to allocate contiguous blocks that can be efficiently transferred using DMA
• Options
  – Let the tools take care of the data movement using assign workgroup and strided copy functions
  – Manually manage the data movement using DMA (e.g., define buffers available for the DSP in OpenCL and manage the actual data movement on the DSP)
Dense Linear Algebra Philosophy
BLIS Cortex-A15 DGEMM Multicore Performance
• Peak performance: 9.6 GFLOPS
• DGEMM performance: ~8.4 GFLOPS (~87% of peak)
Recall: Memory
How can we improve this performance?
• The BLIS implementation on the DSP does not utilize the different levels of memory efficiently.
• Utilize the DMA (Direct Memory Access) capabilities of the SoC to move data in parallel with the computations.
Cache Exploitation and DMA

Cache Exploitation and DMA Details
DMA Integration Goals
• Flexible: the user or library developer must be able to select when and where to transfer data for an operation
• Transparent: the user need not be aware of the usage of the DMA, but can manage the DMA if desired
• Integrated into the control tree mechanism
Algorithmic Variants for GEMM
GEMM Control Tree Definitions
Algorithmic Variants for GEMM with DMA Integration
GEMM Control Tree Definitions with DMA Integration
Memory Buffers
Current Status of DMA Integration in GEMM
• Implemented a multithreaded prototype of the DMA control tree, with decoding in block variant 1, using memcpy instead of DMA
• Pending
  – Decoding of the DMA control tree in the other variants
  – Invoking the DMA routines
Thank you!
A special thanks to Tyler M. Smith, Field G. Van Zee and Robert van de Geijn.