

  1. CAM: Constraint-Aware Application Mapping for Embedded Systems. Luis A. Bathen, Nikil D. Dutt. CASA '10, 11/5/2010.

  2. Outline: Introduction & Motivation; CAM Overview; Memory-aware Macro-Pipelining; Customized Security Policy Generation; Related Work; Conclusion.

  3. Outline: Introduction & Motivation; CAM Overview; Memory-aware Macro-Pipelining; Customized Security Policy Generation; Related Work; Conclusion.

  4. Software/Hardware Co-Design. Given an existing application, designers can either design a customized platform (dedicated logic, custom memory hierarchy and communication architecture), take an existing platform and efficiently map the application onto it (data allocation and task mapping), or start with an existing platform and customize it to satisfy the requirements (adding custom blocks and reusing components). In this presentation we will focus on the application mapping process on CMPs. [Slide figures: a bus-based CMP with CPU cores, per-core SPMs, DMA, off-chip memory, and a controller/scheduler; and a custom JPEG2000 hardware pipeline with dispatcher/collector, DWT with iDMA, BPC/BAC blocks, and FIFOs over AMBA 2.0.]

  5. Target Platforms (Chip Multiprocessors). Multiple low-power RISC cores, well suited for applications with high levels of parallelism; DMA and SPM support; bus-based systems are still the most commonly used. [Slide figure: a bus-based CMP with per-core instruction caches and SPMs, DMA, and shared RAM.]

  6. Motivation. The typical mapping process: define the platform (C/C++ application, CMP with cores, SPMs, DMA, off-chip memory); apply loop optimizations (e.g., iteration partitioning, unrolling, tiling); generate the input task graph for the scheduler; define the task mapping and schedule (what do we care about: energy, performance?); define the data placement (over time and size); and simulate/verify (ISS, CMP ISS). The whole process depends on the available resources. [Slide figures: a sample task graph T1-T5, a two-CPU schedule, and a data-placement timeline of each task's data sets.]
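As a purely illustrative sketch of what this flow passes between steps, the C fragment below models an annotated task graph: the scheduler fills in the CPU and time slot for each task, and the data-placement step decides where each task's data sets live. The structure, field names, and numbers are my own assumptions, not the authors' data structures.

    /* Hypothetical annotated task graph for the mapping flow sketched above. */
    #include <stdio.h>
    #include <stddef.h>

    enum mem_space { OFF_CHIP, SPM };

    struct data_set { size_t bytes; enum mem_space where; };

    struct task {
        const char *name;        /* e.g. "T2.1" after loop partitioning     */
        int pred;                /* index of a predecessor task, -1 if none */
        int cpu;                 /* filled in by the mapping step           */
        int start, end;          /* filled in by the scheduling step        */
        struct data_set data;    /* filled in by the data-placement step    */
    };

    int main(void) {
        /* Tiny two-task graph in the spirit of the slide: T1 -> T2.1. */
        struct task graph[] = {
            { "T1",   -1, 0, 0, 4, {  64 * 1024, SPM      } },
            { "T2.1",  0, 1, 4, 9, { 512 * 1024, OFF_CHIP } },
        };
        for (size_t i = 0; i < sizeof graph / sizeof graph[0]; i++)
            printf("%s on CPU%d [%d,%d), %zu bytes in %s\n",
                   graph[i].name, graph[i].cpu, graph[i].start, graph[i].end,
                   graph[i].data.bytes,
                   graph[i].data.where == SPM ? "SPM" : "off-chip memory");
        return 0;
    }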

  7. Motivation (Cont.). This dependence shows the need to evaluate different optimizations, schedules, and placements for power and performance in a quick yet accurate fashion. [Slide figure: the same flow repeated on a four-CPU CMP, with the task graph, mapping/schedule, data placement, and simulate/verify steps.]

  8. Outline: Introduction & Motivation; CAM Overview; Memory-aware Macro-Pipelining; Customized Security Policy Generation; Related Work; Conclusion.

  9. CAM: Constraint-Aware Application Mapping for Embedded Systems. CAM takes a C/C++ application and produces memory-aware placements, schedules, and policies that balance three interacting concerns. Security (policy generation, selective enforcement): being very secure might mean being very power hungry/slow, it limits the type of security mechanisms, and generic solutions ignore the application. Power (data partitioning and distribution, data reuse): we need to efficiently utilize memory resources, voltage/frequency scaling affects performance, and multiprocessor support is limited. Performance (task/kernel partitioning, macro-pipelining, early execution edges): we want to fully utilize compute resources and increase parallelism, but increased parallelism brings increased vulnerabilities.

  10. CAM Overview. Front end: define the CMP template and pre-process the application (CFG extraction, task graph generation, input model generation). Middle end: task graph augmentation (task decomposition, data reuse analysis, early execution edges) followed by memory-aware macro-pipelining; this can end up with massive task graphs. Back end: performance model generation and a check against the energy and performance constraints; if they are not met, the flow goes back and, for example, increases the degree of loop unrolling or changes the tile size. It is a very tightly coupled process. [Slide figure: an example task graph (C1, K2, K3, C4, K5, C6) mapped onto three CPUs, with kernel data (K2D, K3D, K5D) placed in the SPMs of a bus-based CMP.]
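To make the "tightly coupled" loop concrete, here is a minimal, self-contained sketch (my own illustration, not the authors' tool) of the front-end/middle-end/back-end iteration this slide describes: build a mapping, estimate energy and performance, and if the constraints are not met, turn a knob such as the loop unrolling degree and try again. The cost model and constants are invented for illustration only.

    #include <stdio.h>
    #include <stdbool.h>

    struct knobs   { int unroll; int tile; };      /* tuning knobs the back end adjusts */
    struct mapping { double energy, latency; };    /* what the middle end hands back    */

    /* Front end + middle end collapsed into one stub: pretend more unrolling
     * exposes more parallelism (lower latency) at some energy cost.          */
    static struct mapping build_mapping(struct knobs k) {
        struct mapping m = { 1.0 + 0.05 * k.unroll, 10.0 / k.unroll };
        return m;
    }

    /* Back end: check the designer's energy and performance constraints. */
    static bool meets_constraints(struct mapping m) {
        return m.latency <= 2.0 && m.energy <= 1.5;
    }

    int main(void) {
        struct knobs k = { 1, 128 };
        for (int iter = 0; iter < 8; iter++) {
            struct mapping m = build_mapping(k);
            printf("unroll=%d latency=%.2f energy=%.2f\n", k.unroll, m.latency, m.energy);
            if (meets_constraints(m)) { puts("constraints met"); return 0; }
            k.unroll *= 2;   /* "Nope, let's see if increasing unrolling helps" */
        }
        puts("no feasible mapping with these knobs");
        return 1;
    }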

  11. Outline: Introduction & Motivation; CAM Overview; Memory-aware Macro-Pipelining (ESTImedia '08, '09); Customized Security Policy Generation; Related Work; Conclusion.

  12. Application Domain Example (JPEG2000). JPEG2000 supports multiple levels of data parallelism: the task set T = {t_1, ..., t_n} can be refined into finer-grained tasks t_1 ... t_mn, and each task runs the DWT, quantization (Quant), and EBCOT kernels in sequence.
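The sketch below (mine, not from the slides) simply enumerates the task set this slide illustrates: each tile t_i spawns a dependent DWT -> Quant -> EBCOT chain, while different tiles are independent of each other, which gives both pipeline parallelism across the stages and data parallelism across the tiles.

    #include <stdio.h>

    int main(void) {
        const char *stage[] = { "DWT", "Quant", "EBCOT" };
        enum { N_TILES = 4, N_STAGES = 3 };

        for (int t = 1; t <= N_TILES; t++)         /* independent tiles t_1 .. t_n   */
            for (int s = 0; s < N_STAGES; s++)     /* dependent stages within a tile */
                printf("%s(t%d)%s", stage[s], t,
                       s + 1 < N_STAGES ? " -> " : "\n");
        return 0;
    }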

  13. Inter-kernel Reuse Opportunities. We target our approach at data-intensive streaming applications with task-level and data-level parallelism; examples include the macroblock level in H.264 and the component, tile, and code-block levels in JPEG2000. Inter-kernel data reuse opportunities are often ignored, and cache-based systems are not well suited to these types of applications. The JPEG2000 front-end kernels on the slide illustrate the reuse chain (dcls produces what mct consumes, and mct produces what tiling consumes):

    // Slide excerpt: the arrays and sizes (B, G, R, Yr, Ur, Vr, tY, tU, tV,
    // width, height, m, n, tw, th, info) are globals declared elsewhere;
    // pow() and ceil() come from <math.h>.

    void dcls() {
        // input: B, G, R   output: level-shifted B, G, R
        for (i = 0; i < width; i++)
            for (j = 0; j < height; j++) {
                B[i][j] = B[i][j] - pow(2, info->siz - 1);
                G[i][j] = G[i][j] - pow(2, info->siz - 1);
                R[i][j] = R[i][j] - pow(2, info->siz - 1);
            }
    }

    void mct() {
        // input: B, G, R   output: Yr, Ur, Vr
        for (i = 0; i < width; i++)
            for (j = 0; j < height; j++) {
                Yr[i][j] = ceil((float)(R[i][j] + (2 * G[i][j]) + B[i][j]) / 4);
                Ur[i][j] = B[i][j] - G[i][j];
                Vr[i][j] = R[i][j] - G[i][j];
            }
    }

    void tiling() {
        // input: Yr, Ur, Vr   output: n x tY, tU, tV
        for (i = 0; i < m; i += tw)
            for (j = 0; j < n; j += th) {
                for (k = 0; k < tw; k++)
                    for (l = 0; l < th; l++) {
                        tY[k][l] = Yr[i+k][j+l];
                        tU[k][l] = Ur[i+k][j+l];
                        tV[k][l] = Vr[i+k][j+l];
                    }
                yCoeff = dwt(tY);
                yQ = quant(yCoeff);
                ebcot(yQ);
                // ...
            }
    }

  14. Access Patterns and Data Requirements. Our proposal: take kernels that produce large data streams and decompose them into smaller kernels producing smaller data streams. The problem: data is read in and written out by each task, and we cannot keep ALL of it in SPM to pass to the next task. Task/kernel data requirements: DCLS consumption = production = 3 MB; MCT the same as DCLS, 3 MB; tiling consumes the same as MCT but produces 3 tiles at a time of 128x128 pixels (16 KB each), for a total of 16 x 3 tiles. [Slide figure: address-versus-iteration access patterns for the kernels.]
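As an illustration of the proposal (again my own sketch under assumed sizes, not the authors' code), the fragment below streams a component through SPM one 16 KB tile at a time instead of keeping the full ~3 MB intermediate arrays around: a decomposed producer hands each tile straight to the downstream tile-granularity kernels, and two SPM buffers alternate so the next tile can be fetched while the current one is processed.

    #include <stdio.h>
    #include <string.h>

    #define TILE_W 128
    #define TILE_H 128
    enum { N_TILES = 16 };                     /* 16 tiles per component, as on the slide */

    typedef struct { unsigned char px[TILE_H][TILE_W]; } tile_t;   /* 16 KB, SPM-resident */

    static tile_t spm_buf[2];                  /* double buffer placed in scratchpad memory */

    /* Stand-ins for the DMA engine and the tile-granularity kernels. */
    static void dma_fetch_tile(tile_t *dst, int idx)  { memset(dst, idx, sizeof *dst); }
    static void dwt_tile(tile_t *t)   { (void)t; }
    static void quant_tile(tile_t *t) { (void)t; }
    static void ebcot_tile(tile_t *t) { (void)t; }

    int main(void) {
        for (int i = 0; i < N_TILES; i++) {
            tile_t *cur = &spm_buf[i & 1];     /* alternate buffers: the DMA can prefetch
                                                  tile i+1 while tile i is being consumed */
            dma_fetch_tile(cur, i);
            dwt_tile(cur);
            quant_tile(cur);
            ebcot_tile(cur);
        }
        printf("streamed %d tiles of %zu bytes each through SPM\n",
               (int)N_TILES, sizeof(tile_t));
        return 0;
    }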
