Integrating DMA capabilities into BLIS for on-chip data movement - PowerPoint PPT Presentation

Integrating DMA capabilities into BLIS for on-chip data movement Devangi Parikh Ilya Polkovnichenko Francisco Igual Peña Murtaza Ali

5 Generations of TI Multicore Processors • Keystone architecture – Lowers development effort – Speeds time to market – Leverages TI’s investment – Optimal software reuse 2

TI 66AK2H12 SoC • Keystone II architecture • Cores – 4 ARM A15s at 1.0 GHz • 4 MB shared L2 cache • 32 G flops/s single precision and 8 G flops/s double precision – 8 C66x DSPs at 1.0 GHz • 64 kB L1 scratch / cache each • 1 MB L2 scratch / cache each • 128 G flops/s single precision and 32 G flops/s double precision • Memory – 8 GB DDR3 DRAM (external) – 6 MB SRAM shared • Interfaces – 2x Gigabit Ethernet ~ 100 MB/s – 4x SRIO ~ 400 MB/s – 2x Hyperlink ~ 1 GB/s 3

Development Philosophy User view TI or user provided • User view acceleration – Embedded Linux running on the ARM – Standard GCC tool chain Library API – Simply link to a TI provided library with an ARM callable API to accelerate applications using ARM ARM multiple ARM cores, DSP cores and processors 1 4 as appropriate OpenMP – Use TI provided tools and examples to write new applications and libraries which use multiple OpenCL ARM cores, DSP cores and processors to accelerate performance DSP DSP • Using multiple cores on a single processor 1 8 – OpenMP for shared memory parallelization across ARM cores Processor 1 – OpenCL or OpenMP Accelerator for Open MPI heterogeneous acceleration with multiple DSP cores • Using multiple processors – Open MPI over Ethernet, SRIO or Hyperlink Processor 180 4

ARM + OpenCL DSP Acceleration TI 66AK2H12 TI 66AK2H12 ARM subsystem ARM subsystem OpenMP OpenMP ARM 0 ARM 1 ARM 2 ARM 3 ARM 0 ARM 1 ARM 2 ARM 3 OpenCL OpenCL OpenMP DSP DSP DSP DSP DSP DSP DSP DSP DSP DSP DSP DSP DSP DSP DSP DSP 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 DSP subsystem DSP subsystem Data parallel OpenCL + OpenMP regions - A kernel is enqueued - A task is enqueued - OpenCL divides into N workgroups - OpenCL dispatches the task to DSP 0 - Each workgroup is assigned a core - Tasks can use additional DSP cores by - After all workgroups fin ish a new kernel can be entering OpenMP regions dispatched - A task completes before another task is dispatched Task parallel - Note: This is a TI extension - A task is enqueued - OpenCL dispatches tasks to cores Example use - OpenCL can accept and dispatch more tasks - Want to call existing OpenMP based DSP code 5 asynchronously from the ARM

ARM + OpenMP Accelerator DSP Acceleration // OpenMP Accelerator vector add TI 66AK2H12 // OpenMP for loop parallelization ARM subsystem void ompVectorAdd(int N, OpenMP float *a, float *b, ARM 0 ARM 1 ARM 2 ARM 3 float *c) OpenMP Accelerator { OpenMP #pragma omp target \ map(to: N, a[0:N], b[0:N]) \ map(from: c[0:N]) DSP DSP DSP DSP DSP DSP DSP DSP 0 1 2 3 4 5 6 7 { DSP subsystem int i; #pragma omp parallel for Data movement for (i = 0; i < N; i++) - to copies variables from the ARM memory to c[i] = a[i] + b[i]; } the DSP memory } - from copies variables from the DSP memory to the ARM memory - TI provides special alloc and free functions to allocate DSP memory such that copies are not needed Calling existing DSP code from the ARM - Wrapping existing DSP functions with OpenMP 6 Accelerator code is straightforward

Memory • Shared memory visible by both the TI 66AK2H12 ARM subsystem ARM and DSP 4 MB ARM shared memory – A portion of the 8GB DDR3 DRAM (external) ARM 0 ARM 1 ARM 2 ARM 3 – The 6MB SRAM shared memory 6 MB ARM and DSP shared memory • 8 GB Performance keys DRAM DSP subsystem – Allocate data in the shared memory for 1 MB 1 MB 1 MB 1 MB 1 MB 1 MB 1 MB 1 MB ARM setup and DSP acceleration L2 L2 L2 L2 L2 L2 L2 L2 – Use clmalloc() to allocate contiguous 64kB 64kB 64kB 64kB 64kB 64kB 64kB 64kB L1 L1 L1 L1 L1 L1 L1 L1 blocks that can be efficient transferred DSP DSP DSP DSP DSP DSP DSP DSP using DMA 0 1 2 3 4 5 6 7 • Options – Let the tools take care of the data movement using assign workgroup and strided copy functions – Manually manage the data movement using DMA (e.g., define buffers available for the DSP in OpenCL and manage the actual data movement on the DSP) 7

Dense Linear Algebra Philosophy 8

BLIS Cortex-A15 DGEMM Multicore Performance • Peak performance: 9.6 GFLOPS • DGEMM performance is ~ 8.4 GFLOPS (83% peak)) 9

Recall - Memory How can we improve this performance? TI 66AK2H12 ARM subsystem 4 MB ARM shared memory • The BLIS implementation on the DSP does not ARM 0 ARM 1 ARM 2 ARM 3 utilize the different levels of memory efficiently. 6 MB ARM and DSP shared memory 8 GB DRAM • Utilize the DMA (Direct Memory Access) DSP subsystem 1 MB 1 MB 1 MB 1 MB 1 MB 1 MB 1 MB 1 MB capabilities of the DMA to move data in parallel L2 L2 L2 L2 L2 L2 L2 L2 to the computations 64kB 64kB 64kB 64kB 64kB 64kB 64kB 64kB L1 L1 L1 L1 L1 L1 L1 L1 DSP DSP DSP DSP DSP DSP DSP DSP 0 1 2 3 4 5 6 7 10

Cache Exploitation and DMA 11

Cache Exploitation and DMA Details 12

DMA Integration Goals • Flexible User or library developer must be able to select when and where to transfer data for an operation • Transparent User must not be aware of the usage of the DMA, but if desired can manage the DMA • Integrated into the control tree mechanism 13

Algorithmic Variants for GEMM 14

GEMM Control Tree Definitions 15

Algorithmic Variants for GEMM with DMA Integration 16

GEMM Control Tree Definitions with DMA Integration 17

Memory Buffers 18

Current Status of DMA Integration in GEMM • Implemented multithreaded prototype of DMA Control Tree with decoding in Block Variant 1 using memcpy instead of DMA • Pending – Decoding of DMA Control Tree in other variants – Invoking DMA routines 19

Thank you! A special thanks to Tyler M. Smith Field G. Van Zee Robert van de Geijn

Integrating DMA capabilities into BLIS for on-chip data movement - PowerPoint PPT Presentation

Integrating DMA capabilities into BLIS for on-chip data movement Devangi Parikh Ilya Polkovnichenko Francisco Igual Pea Murtaza Ali 5 Generations of TI Multicore Processors Keystone architecture Lowers development effort

Blis Connor Abbott, Wendy Pan, Klint Qinami, Jason Vaccaro Motivation: Why Blis? OpenGL is

SYSC3601 Microprocessor Systems Unit 8: Direct Memory Access (DMA) Topics/Reading 1. DMA 2.

Mastering the DMA and IOMMU APIs Embedded Linux Conference Europe 2014 Dsseldorf Laurent

Integrating Problem Solving 2020 Integrating Problem Solving 2020 Integrating Problem Solving

Packing - the next BLIS Fron5er? Tze Meng Low BLIS

Presentation Topics M I C R O P R O C E S S O R 2 N D S E M Interrupt Operations DMA

Integrating DMA attacks in exploitation frameworks Rory Breuk Albert Spruyt University of

Calibration des Microroc (II) Alex, Cyril, Giom, Jean, Max 09 Mai 2011, Annecy 1 Reminder 2

Integrating LiDAR data into the Integrating LiDAR data into the workflow of cartographic workflow

Extending the BLIS Analytical Model for GPUs Elliot Binder, Claudia Kho, Doru Thom Popovici, Tze

BLIS Performs Devangi N. Parikh Science of High Performance Compu8ng The University of Texas at

Why is Data Marketers Most Valuable Asset? About the DMA Resources Why is Data Marketers

USCG 406 MHz DF Capabilities USCG 406 MHz DF Capabilities 2008 Beacon Manufacturers Workshop

Study Of Chip Breaker El-Sherbeeny, PhD 2014 Project-Group 6 TYPES ES OF F CHI HIP a)

Australian Junior Resources Blue Chip Australian Junior Resources Blue Chip Australian Junior

Final Assembly Chip Core Your final project chip consists of a core The Chip Core is

paclitaxel/carboplatin versus paclitaxel/carboplatin/maintenance letrozole versus letrozole

Thompson Sampling on Symmetric -Stable Bandits Abhimanyu Dubey and Alex Pentland Massachusetts

Compiler Development (CMPSC 401) ARM Architecture Janyl Jumadinova April 4, 2019 Janyl

Next generation mobile rootkits Hack In Paris 2013 leveldown security Thomas Roth Hi!

ARM Assembler Strings Strings p. 1/16 Characters or Strings A string is a sequence of

LEADING COLLABORAT ION IN THE ARM ECOSYSTEM Linaro workshop Open Source HPC Collaboration on

Multi-Armed Bandit: Learning in Dynamic Systems with Unknown Models Qing Zhao Department of

Load-reserve / Store-conditional on POWER and ARM Peter Sewell (slides from Susmit Sarkar) 1