Multi-platform Automatic Parallelization and Power Reduction by OSCAR Compiler

Hironori Kasahara
Professor, Dept. of Computer Science & Engineering
Director, Advanced Multicore Processor Research Institute
Waseda University, Tokyo, Japan
IEEE Computer Society Board of Governors
IEEE Computer Society Multicore STC Chair
URL: http://www.kasahara.cs.waseda.ac.jp/
OSCAR Parallelizing Compiler

Goals: improve effective performance, cost-performance, and software productivity, and reduce power consumption.

- Multigrain Parallelization: coarse-grain parallelism among loops and subroutines, and near-fine-grain parallelism among statements, in addition to loop parallelism.
- Data Localization: automatic data management for distributed shared memory, cache, and local memory.
- Data Transfer Overlapping: data transfers are overlapped with computation using Data Transfer Controllers (DMAs), organized by Data Localization Groups (dlg0-dlg3).
- Power Reduction: reduction of consumed power by compiler-controlled DVFS and power gating, with hardware support.

(Figure: a macro task graph partitioned into data localization groups dlg0-dlg3.)
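The data-transfer-overlapping idea above can be sketched as software double buffering: while the CPU computes on one buffer, the next chunk is fetched into the other. This is a minimal sketch in plain C, with `memcpy` standing in for an asynchronous DMA transfer; the function names and chunk sizes are illustrative, not part of the OSCAR-generated code.

```c
#include <string.h>

#define CHUNK 4
#define N     16

/* Simulated "DMA": in real OSCAR-generated code this would be an
 * asynchronous Data Transfer Controller transfer overlapped with
 * computation on the other buffer. */
static void dma_fetch(int *dst, const int *src, int n) {
    memcpy(dst, src, n * sizeof(int));
}

/* Process N elements using two alternating chunk buffers:
 * chunk c+1 is (conceptually) fetched while chunk c is summed. */
long process_double_buffered(const int *data) {
    int buf[2][CHUNK];
    long sum = 0;
    dma_fetch(buf[0], data, CHUNK);            /* prime buffer 0 */
    for (int c = 0; c < N / CHUNK; c++) {
        int cur = c & 1;
        if ((c + 1) * CHUNK < N)               /* prefetch next chunk */
            dma_fetch(buf[!cur], data + (c + 1) * CHUNK, CHUNK);
        for (int i = 0; i < CHUNK; i++)        /* compute on current */
            sum += buf[cur][i];
    }
    return sum;
}
```

With a real DMA engine the prefetch returns immediately, so the fetch of chunk c+1 and the summation of chunk c proceed concurrently.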
Multicore Program Development Using OSCAR API V2.0

Input: a sequential application program in Fortran or C (consumer electronics, automobiles, medical, scientific computation, etc.).

The Waseda OSCAR Parallelizing Compiler performs:
- coarse grain task parallelization
- data localization
- DMAC data transfer
- power reduction using DVFS and clock/power gating

and generates a parallelized Fortran or C program annotated with OSCAR API directives. The OSCAR API targets homogeneous and/or heterogeneous multicores and manycores, with directives for thread generation, memory management, data transfer using DMA, and power management. Manual parallelization and power reduction using the API are also possible.

For heterogeneous targets, the user adds "hint" directives before a loop or a function to specify that it is executable by an accelerator and how many clock cycles it takes; an accelerator compiler or the user supplies the accelerator code.

Code generation paths:
- Homogeneous multicore code: the API Analyzer translates the directives, and an existing sequential compiler from vendor A builds a low-power executable for homogeneous multicores, including SMP servers.
- Heterogeneous multicore code: the API Analyzer plus an existing sequential compiler from vendor B build a low-power executable for heterogeneous multicores (e.g., with Accelerators 1 and 2).
- Server code: an OpenMP compiler builds executables for shared-memory servers.

The resulting executables run on various multicores from Hitachi, Renesas, NEC, Fujitsu, Toshiba, Denso, Olympus, Mitsubishi, eSOL, CATS, and Gaio, and on three university servers. The API Analyzer is available from Waseda.

OSCAR: Optimally Scheduled Advanced Multiprocessor. API: Application Program Interface.
Model-Based Designed Engine Control on V850 Multicore with Denso

Although parallel processing of engine control on multicore has so far been very difficult, Denso and Waseda achieved a 1.95 times speedup on a 2-core V850 multicore processor: hard real-time automobile engine control by multicore (2 cores vs. 1 core).
Parallelizing Handwritten Engine Control Programs on Multicore Processors

- Current automotive crankshaft program
  - Developed by TOYOTA Motor Corp.
  - About 300,000 lines
  - Difficult to parallelize:
    - too fine granularity
    - many conditional branches and small basic blocks, but no parallelizable loops
  - Minimizing run-time overhead and improving parallelism are necessary
- Current product compilers cannot parallelize it; current accelerators are not applicable
- Approach: automatic parallelization of a crankshaft program using multigrain parallelization in the OSCAR Compiler
- Result: performance improvement and efficient multithreaded program development

2013/04/19 Cool Chips XVI 5
Analysis of Coarse Grain Parallelism by OSCAR Compiler

The compiler decomposes a program into coarse grain tasks, or macro tasks (MTs):
1. BB (Basic Block)
2. RB (Repetition Block, or loop)
3. SB (Subroutine Block, or function)

It then generates the MFG (Macro Flow Graph), representing control flow and data dependencies (edge types: data dependency, control flow, conditional branch), and from it the MTG (Macro Task Graph), which exposes coarse grain parallelism through Earliest Executable Conditions.

(Figure: a macro-flow graph and the corresponding macro-task graph.)
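The MTG idea can be illustrated with a toy graph. The sketch below, in C, keeps only data dependencies (the full Earliest Executable Condition also incorporates control dependencies from conditional branches) and computes each macro task's earliest start step; tasks with the same step can run in parallel. The six-task graph is invented for illustration.

```c
#define NTASKS 6

/* Toy macro task graph: dep[i][j] != 0 means MT j must finish
 * before MT i can start (data dependencies only; the real Earliest
 * Executable Condition also handles conditional branches). */
static const int dep[NTASKS][NTASKS] = {
    /* MT0 */ {0,0,0,0,0,0},
    /* MT1 */ {1,0,0,0,0,0},   /* depends on MT0 */
    /* MT2 */ {1,0,0,0,0,0},   /* depends on MT0 */
    /* MT3 */ {0,1,0,0,0,0},   /* depends on MT1 */
    /* MT4 */ {0,1,1,0,0,0},   /* depends on MT1 and MT2 */
    /* MT5 */ {0,0,0,1,1,0},   /* depends on MT3 and MT4 */
};

/* Earliest start "step" of a task: one more than the latest
 * predecessor's step. Tasks sharing a step are parallel. */
int earliest_step(int t) {
    int step = 0;
    for (int j = 0; j < NTASKS; j++)
        if (dep[t][j]) {
            int s = earliest_step(j) + 1;
            if (s > step) step = s;
        }
    return step;
}
```

Here MT1 and MT2 land on the same step, as do MT3 and MT4, which is exactly the coarse grain parallelism the MTG makes visible.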
Coarse Grain Task Parallelization of the Hand-written Engine Control Program

- Loop parallelization: there are no parallelizable loops in the engine control code.
- Fine grain parallelization: each BB is very low cost (less than 100 clock cycles), and branches prevent compilers from exploiting statement-level parallelism.
- Coarse grain parallelization: utilize the parallelism between SBs and BBs.

(Figure: MTG of the crankshaft program.)
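Running independent SBs/BBs on different cores can be sketched with OpenMP, which the OSCAR flow can emit for shared-memory targets. This is a hypothetical sketch: the two task bodies are placeholders standing in for independent macro tasks of a control program, not Toyota's actual code.

```c
/* Two hypothetical, mutually independent macro tasks
 * (placeholders for independent SBs/BBs of the control program). */
static int task_sensor_filter(int raw)  { return raw / 2; }
static int task_angle_update(int angle) { return (angle + 30) % 360; }

/* Independent coarse grain tasks may run on different cores.
 * With OpenMP enabled, each section can execute on its own thread;
 * without it, the pragmas are ignored and the code runs serially
 * with the same result. */
int control_step(int raw, int angle, int *filtered, int *next_angle) {
    #pragma omp parallel sections
    {
        #pragma omp section
        *filtered = task_sensor_filter(raw);
        #pragma omp section
        *next_angle = task_angle_update(angle);
    }
    return *filtered + *next_angle;
}
```

The key property is that neither section reads what the other writes, so the schedule is free to place them on separate cores.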
Static Task Scheduling

- Dynamic task scheduling: prevents traceability and adds run-time overhead.
- Static task scheduling: guarantees real-time constraints, ensures traceability, and minimizes run-time overhead.

However, BBs containing branches cannot be assigned statically: static task scheduling can be applied only if the MTG has data dependencies alone, and the compiler cannot know at compile time whether a branch is taken. The compiler therefore fuses tasks, hiding conditional branches inside macro tasks in the MFG (Macro Task Fusion), to avoid dynamic task scheduling.

(Figure: MFG of a sample program.)
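Macro task fusion can be pictured with a small C sketch, using invented task bodies. Before fusion, the scheduler would see a branch selecting between two macro tasks, forcing a dynamic decision; after fusion, the branch is internal to one task whose external interface has only data dependencies, so the task can be assigned to a core at compile time.

```c
/* Two hypothetical successor macro tasks of a conditional branch. */
static int mt_fast_path(int x) { return x * 2; }
static int mt_slow_path(int x) { return x * 3 + 1; }

/* After macro task fusion: the branch and both successor tasks are
 * fused into one macro task. Externally this task has only data
 * dependencies (input x, output return value), so a static schedule
 * can place it on a core regardless of which path is taken. */
int fused_task(int x) {
    if (x % 2 == 0)
        return mt_fast_path(x);
    else
        return mt_slow_path(x);
}
```

The cost is that the fused task's execution time varies with the branch outcome, which the static scheduler must bound by the longer path.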
Analysis of a Crankshaft Program Using Macro Task Fusion

Before macro task fusion, MTs cannot be scheduled at compile time. After fusion, static scheduling becomes possible, but sb4 and block5 account for over 90% of the whole execution time, so there is not enough parallelism.

(Figures: MTG of the crankshaft program before and after macro task fusion.)
MTG of the Crankshaft Program Using Inline Expansion and Duplicating If-statements

Before restructuring, the critical path (CP) accounts for over 99% of the whole execution time. After inline expansion and if-statement duplication, the CP accounts for about 60%: the CP share was reduced from 99% to 60%, successfully increasing coarse grain parallelism.

(Figures: MTG of the crankshaft program before and after restructuring.)
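If-statement duplication is the restructuring that breaks one branch-guarded region into independent tasks. A minimal sketch with invented task bodies: the original form serializes both computations behind a single branch, while the duplicated form gives each computation its own copy of the condition, removing the control dependency between them.

```c
static int work_a(int x) { return x + 10; }
static int work_b(int x) { return x * 2;  }

/* Before restructuring: one branch guards both computations, so
 * they form a single sequential region on the critical path. */
int original(int cond, int x, int *a, int *b) {
    if (cond) { *a = work_a(x); *b = work_b(x); }
    else      { *a = x;         *b = x;         }
    return *a + *b;
}

/* After duplicating the if: each computation carries its own copy
 * of the branch, so the two regions no longer depend on each other
 * and become independent macro tasks for the static scheduler. */
int restructured(int cond, int x, int *a, int *b) {
    /* task 1 (may be assigned to core 0) */
    if (cond) *a = work_a(x); else *a = x;
    /* task 2 (may be assigned to core 1, in parallel with task 1) */
    if (cond) *b = work_b(x); else *b = x;
    return *a + *b;
}
```

The transformation duplicates a cheap condition test to expose parallelism, which is a good trade when the guarded bodies dominate the critical path.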
Evaluation Environment: Embedded Multicore Processor RPX

- SH-4A, 648 MHz x 8 cores
- As a first step, only two SH-4A cores are used, because the target dual-core processors are currently under design for next-generation automobiles.
Evaluation of the Crankshaft Program on the Multicore Processor

1 core: execution time 0.57 us, speedup 1.00.
2 cores: execution time 0.37 us, speedup 1.54.

- A 1.54 times speedup is attained on the RPX, even though the program has no loops, only many conditional branches and small basic blocks, and is therefore difficult to parallelize.
- This result shows the potential of multicore processors for engine control programs.
Performance of the OSCAR Compiler on an Intel Core i7 Notebook PC

Environment: CPU Intel Core i7 3720QM (quad-core), 32 GB DDR3-SODIMM PC3-12800, Ubuntu 12.04 LTS; baseline Intel Compiler Ver. 14.0.

Benchmarks: SPEC95 su2cor, hydro2d, mgrid, turb3d, and an AAC Encoder.

(Figure: per-benchmark speedup ratios for the Intel Compiler and the OSCAR Compiler; the largest OSCAR speedup shown is 4.12.)

- The OSCAR Compiler outperforms the Intel Compiler by about 2.0 times on average.
Parallel Processing of a JPEG XR Encoder on TILEPro64

Multimedia applications: AAC Encoder, JPEG XR Encoder, Optical Flow Calculation.

Compilation flow: sequential C source code -> OSCAR Compiler -> parallelized C program with the OSCAR API, applying (1) OSCAR parallelization and (2) cache allocation setting -> API Analyzer + sequential compiler -> parallelized executable binary for TILEPro64.

Local cache optimization: the parallel data structure (tile) on the heap is allocated to the local cache.

Result: a 55x speedup on 64 cores with the compiler's cache allocation, versus a 28x speedup with the default cache allocation (1x on 1 core).

(Figure: speedup chart for the JPEG XR Encoder, and the TILEPro64 8x8 tile array with four memory controllers and I/O.)
Automatic Parallelization of Face Detection on Manycore, High-end, and PC Servers

Platforms (compiler): TILEPro64 (gcc), SR16k Power7 8 cores x 4 CPUs x 4 nodes (xlc), rs440 Intel Xeon 8 cores x 4 CPUs (icc).

(Figure: speedup ratio vs. number of cores, for 1, 2, 4, 8, and 16 cores on the three machines.)

- The OSCAR compiler gives an 11.55 times speedup on 16 cores against 1 core on the SR16000 Power7 high-end server.
92 Times Speedup against Sequential Processing for the GMS Earthquake Wave Propagation Simulation on Hitachi SR16000 (Power7-based 128-core Linux SMP)

(Figure: speedup of OSCAR-parallelized code against sequential processing for 1 to 128 processor elements, reaching 92x at 128 PEs.)