Filip Blagojević, Costin Iancu, Katherine Yelick, Matthew Curtis-Maury, Dimitrios S. Nikolopoulos, Benjamin Rose (presented by Rajesh Nishtala). For questions please email upc@lbl.gov
• Heterogeneous architectures are high-performance, cost-effective, and power-effective
  • Cell BE, FPGA, GPGPU, Larrabee, Rapport Kilocore
• Execution model considered: off-loading
  • Enables relatively easy and efficient porting of existing applications
  • Achieves high performance and utilization of these architectures
• Off-loading requires efficient PPE-SPE communication and synchronization
  • Otherwise accelerators accumulate idle time
• Co-scheduling policies are required for higher utilization
• Efficient chip utilization requires multigrain parallelism: PPE oversubscription is required for performance (RAxML)
• Parallelization balance depends on application characteristics (1-N PPE-SPE)
• Efficient PPE-SPE synchronization is required
  • The number of off-loads in an application is large (>100,000)
• Linux is unaware of the off-loading execution
• Cell SDK synchronization primitives:
  • Callbacks - register an event handler to respond to SPE requests
  • Interrupt mailboxes - the PPE process performs a blocking call to read a mailbox
  • Busy-wait - the (S/P)PE polls on a shared memory variable

Busy waiting:

  Wait_for_SPE(sync_t *flag){
    while(!*flag);
  }

[Figure: execution timeline under busy waiting; PPE processes P1-P3 each spin for a full OS quantum while SPEs S1-S3 sit idle.]
• Busy-wait with yielding: Yield-If-Not-Ready (YNR)
• Cooperative scheduling: Slack-Minimizing Event-Driven Scheduler (SLED)

Yield if Not Ready:

  Wait_for_SPE(sync_t *flag){
    while(!*flag){
      sched_yield();
    }
  }

Ideal (SLED):

  SLED_Wait_for_SPE(){
    while(not_done){
      determine ready SPE;
      yield_to(ready);
    }
  }

[Figure: execution timelines comparing YNR (a PPE process yields whenever its SPE is not ready) with the ideal SLED schedule.]

• Work-Stealing
• YNR and SLED use a 1-1 PPE-SPE mapping; WS uses an any-any mapping
SLED: PPE with multiple kernel processes.

[Figure: ready-to-run list connecting SPE1-SPE8 to PPE processes 1-8 via yield_to(pid); the PPE schedules whichever process is ready to run.]

• Task scheduling: yield_to(pid) system call
• Signaling and task selection through a shared memory data structure
• Evaluated both user- and kernel-level interfaces and implementations
• Evaluated list- and array-based data structures
  • Trade-off between ordering and fast access
  • List - FIFO maximizes utilization but requires mutual exclusion
  • Array - no ordering but avoids synchronization
• Design:
  • Split array data structure
  • Processes pinned to PPE h/w contexts
  • Static partitioning of SPEs across PPE h/w contexts
• Evaluated kernel- and user-level placement (SPE idle time per off-load):
  • User level: 7 us
  • Kernel level: 9 us
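The split-array design can be sketched as follows. This is an assumed layout for illustration (the slide gives only the design points): one ready flag per SPE, statically partitioned across PPE hardware contexts, so each SPE writes only its own slot and each PPE context scans only the slots it owns, with no mutual exclusion.

```c
#include <stdatomic.h>

#define NUM_SPES     8
#define NUM_PPE_CTX  2
#define SPES_PER_CTX (NUM_SPES / NUM_PPE_CTX)

/* One ready flag per SPE; SPE i writes only ready[i], so no lock
 * is needed (the array trades FIFO ordering for lock-free access). */
static atomic_int ready[NUM_SPES];

/* SPE side (simulated): mark the task of SPE spe_id complete. */
void spe_signal(int spe_id) {
    atomic_store_explicit(&ready[spe_id], 1, memory_order_release);
}

/* PPE context `ctx` scans only its static partition of the array and
 * returns the first ready SPE, consuming its signal, or -1 if none.
 * Note there is no ordering guarantee among ready SPEs. */
int poll_partition(int ctx) {
    int base = ctx * SPES_PER_CTX;
    for (int i = base; i < base + SPES_PER_CTX; i++) {
        if (atomic_exchange_explicit(&ready[i], 0, memory_order_acq_rel))
            return i;
    }
    return -1;
}
```

The static SPE-to-context partitioning is what removes the need for synchronization: two PPE contexts never race on the same slot.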
• No kernel support and no system calls
• User-level work stealing using BUPC
• Work pool resides in memory shared between processes (any-any PPE-SPE mapping)
• Signaling mechanism identical to SLED; no pinning requirements

[Figure: an off-loaded task can be served by any PPE process; PPE processes 1-8 share the pool of SPE tasks.]
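The any-any work pool can be sketched in the spirit of this scheme. The real implementation uses BUPC shared arrays across processes; the plain-C11 version below (single producer, multiple stealers over one bounded array) is an assumption-laden illustration, not the paper's code.

```c
#include <stdatomic.h>

#define POOL_SIZE 64

/* Shared-memory work pool: any PPE process may claim the next task. */
typedef struct {
    int task_ids[POOL_SIZE];
    atomic_int head;   /* index of the next task to claim */
    atomic_int tail;   /* one past the last enqueued task */
} work_pool;

/* Producer: publish a task (single producer assumed for brevity). */
void pool_push(work_pool *p, int task_id) {
    int t = atomic_load_explicit(&p->tail, memory_order_relaxed);
    p->task_ids[t % POOL_SIZE] = task_id;
    atomic_store_explicit(&p->tail, t + 1, memory_order_release);
}

/* Any PPE process: steal the next task, or return -1 if empty.
 * The CAS on head is the only synchronization between stealers. */
int pool_steal(work_pool *p) {
    for (;;) {
        int h = atomic_load_explicit(&p->head, memory_order_acquire);
        if (h >= atomic_load_explicit(&p->tail, memory_order_acquire))
            return -1;  /* pool is empty */
        if (atomic_compare_exchange_weak(&p->head, &h, h + 1))
            return p->task_ids[h % POOL_SIZE];
    }
}
```

Because claiming a task needs only an atomic operation on shared memory, no kernel support, system calls, or process pinning are involved, matching the bullet points above.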
• Microbenchmarks:
  • Multiple PPE processes off-load SPE tasks of various lengths
  • Evaluated kernel- and user-level implementations
  • Evaluated multiple signaling data structure designs
  • Evaluated SDK synchronization primitives
• Bioinformatics applications that generate phylogenetic trees (PBPI, RAxML)
  • Evolutionary history among a set of species
  • Computationally expensive, NP-hard algorithms
• RAxML (Maximum Likelihood) - master-worker, embarrassingly parallel
  • Work unit is multiple (three) loops
  • Communication between master and worker only after a unit is completed
  • Multiple code entry points into an SPE code module (work stealing requires significant re-engineering)
• PBPI (Parallel Bayesian Phylogenetic Inference) - data parallel
  • Three main loops off-loaded separately
  • ALL_TO_ALL communication after each loop body
• Whole loop body off-loaded in both applications
• Over 95% of execution time spent on SPEs in both applications
PPE-SPE synchronization overhead and SPE utilization:
• "MBOX" - synchronization via interrupt mailboxes: 19 us
• "PPE" - yielding overhead with oversubscription (6 processes on PPE): 14 us
• "YNR" < "PPE" due to congestion
• Average latency: Work-Steal 3 us, SLED 7 us, YNR 10 us
• Faster synchronization leads to improved SPE utilization
[Figure: RAxML and PBPI speedups of SLED-Kernel, SLED-User, and Work-Steal over YNR, plotted against DNA sequence length; higher is better.]

• PBPI:
  • Work stealing improves performance by up to 21% (12% on average)
  • SLED improves performance by up to 10% (5% on average)
• RAxML:
  • Work stealing produces a 23% slowdown for this workload (the work-migration implementation is not full Continuation Passing Style)
  • SLED improves performance by up to 5% (3% on average)
• For very short tasks (length < context switch), YNR always performs best
[Figure: histogram of task distribution by task length (bins from 0-10 us up to 1000-10000 us) for RAxML and PBPI at DNA sequence lengths 105, 1176, and 3063.]

• Left-hand y-axis: reduction in SPE idle time relative to YNR
• IDLE (right-hand y-axis): time SPEs spend waiting for work, as a percentage of total execution time with YNR
• SLED reduces SPE idle time by up to 20% for both applications
• UPC Work-Steal reduces SPE idle time by up to 30% for PBPI
• With these optimizations the SPEs are still idle 20% of the time (PBPI) and 10% (RAxML)
• Cell execution models:
  • Off-loading:
    • P. Bellens et al., "CellSs: A Programming Model for the Cell BE Architecture"
    • M. de Kruijf and K. Sankaralingam, "MapReduce for the Cell B.E. Architecture"
  • Memory:
    • K. Fatahalian et al., "Sequoia: Programming the Memory Hierarchy"
    • M. Monteyne, "RapidMind Multi-core Development Platform"
  • Streaming:
    • Kudlur & Mahlke
  • Shared memory model:
    • Eichenberger et al., "Optimizing Compiler for the Cell Processor"
  • SPE micro-kernels:
    • Mohamed F. Ahmed et al., "SPENK: Adding Another Level of Parallelism on the Cell Broadband Engine"
• Many application studies:
  • F. Petrini et al., "Challenges in Mapping Graph Exploration Algorithms on Advanced Multicore Processors"
  • D. Bader et al., "On the Design and Analysis of Irregular Algorithms on the Cell Processor: A Case Study on List Ranking"
  • Jayram M. N., "Brain Circuit Bottom-Up Engine Simulation and Acceleration on Cell BE, for Vision Applications"
• Efficient execution on accelerators requires careful scheduling of disjoint parallelism
• Current support for synchronization among heterogeneous cores is not sufficient (callbacks, mailboxes, busy-wait)
• The cooperative scheduling strategies explored improve performance
• The impact of co-scheduling will increase as contention on the general-purpose core increases (e.g., a Cell blade with 8 SPEs instead of a PS3 with 6 SPEs)
• The ratio of task length to scheduling overhead is likely to remain constant in future architecture revisions