
Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs. Michael Gordon, William Thies, and Saman Amarasinghe, Massachusetts Institute of Technology. ASPLOS, October 2006, San Jose, CA. http://cag.csail.mit.edu/streamit


  1. Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs. Michael Gordon, William Thies, and Saman Amarasinghe, Massachusetts Institute of Technology. ASPLOS, October 2006, San Jose, CA. http://cag.csail.mit.edu/streamit

  2. Multicores Are Here!
     [Chart: number of cores per chip (up to 512) versus year (1970 to beyond 2005) for processors including the 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Itanium 1/2, Athlon, Power4, PA-8800, PExtreme, Power6, Yonah, Opteron, Opteron 4P, Xeon MP, Tanglewood, Xbox360, Broadcom 1480, Niagara, Cell, Raw, Cavium Octeon, Raza XLR, Intel Tflops, Cisco CSR-1, Picochip PC102, and Ambric AM2045; core counts rise sharply in the most recent designs.]

  3. Multicores Are Here!
     Uniprocessors: C is the common machine language. For uniprocessors, C was:
     • Portable
     • High performance
     • Composable
     • Malleable
     • Maintainable
     [Same cores-per-chip chart as the previous slide.]

  4. Multicores Are Here!
     What is the common machine language for multicores?
     [Same cores-per-chip chart as the previous slides.]

  5. Common Machine Languages
     Uniprocessors:
     • Common properties: single flow of control, single memory image
     • Differences, and the compiler phase that hides each one: register file (register allocation), ISA (instruction selection), functional units (instruction scheduling)
     • von Neumann languages represent the common properties and abstract away the differences
     Multicores:
     • Common properties: multiple flows of control, multiple local memories
     • Differences: number and capabilities of cores, communication model, synchronization model
     • Need common machine language(s) for multicores

  6. Streaming as a Common Machine Language
     • Regular and repeating computation
     • Independent filters with explicit communication
       – Segregated address spaces and multiple program counters
     • Natural expression of parallelism:
       – Producer/consumer dependencies
       – Enables powerful, whole-program transformations
     [Diagram: an FM radio stream graph: AtoD, FMDemod, a Scatter feeding three channels (LPF 1-3, each followed by HPF 1-3), a Gather, an Adder, and a Speaker. A StreamIt sketch of this shape follows below.]
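To connect the diagram to source code, here is a minimal StreamIt-style sketch of a graph with this shape. It is an illustration, not code from the talk: the filter names follow the diagram, every work body is a trivial stub (a real FM radio would do actual demodulation and per-band filter math), and the Scatter/Gather pair is written as a splitjoin with a duplicate splitter and a roundrobin joiner.

    void->void pipeline FMRadioSketch {
        add AtoD();
        add FMDemod();
        add splitjoin {                    // Scatter ... Gather
            split duplicate;               // every channel sees every sample
            for (int i = 1; i <= 3; i++) {
                add pipeline {
                    add LPF(i);
                    add HPF(i);
                };
            }
            join roundrobin;               // interleave the three channel outputs
        };
        add Adder();
        add Speaker();
    }

    void->float filter AtoD {
        work push 1 { push(0.0); }                  // stub: would sample the A/D converter
    }

    float->float filter FMDemod {
        work push 1 pop 1 { push(pop()); }          // stub demodulator
    }

    float->float filter LPF(int band) {
        work push 1 pop 1 { push(pop()); }          // stub low-pass for this band
    }

    float->float filter HPF(int band) {
        work push 1 pop 1 { push(pop()); }          // stub high-pass for this band
    }

    float->float filter Adder {
        work push 1 pop 3 {
            push(pop() + pop() + pop());            // sum one sample from each channel
        }
    }

    float->void filter Speaker {
        work pop 1 { println(pop()); }              // stub: print instead of audio output
    }

Because all communication happens on the filters' input and output tapes and each filter keeps its own state and program counter, the compiler sees the whole graph and can reassign filters to cores freely, which is the property the rest of the talk exploits.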

  7. Types of Parallelism
     • Task parallelism
       – Parallelism explicit in the algorithm
       – Between filters without a producer/consumer relationship
     • Data parallelism
       – Peel iterations of a stateless filter and place them within a scatter/gather pair (fission)
       – Can't parallelize filters with state
     • Pipeline parallelism
       – Between producers and consumers
       – Stateful filters can be parallelized
     [Diagram: a stream graph with a scatter/gather pair; the independent branches between them are labeled "Task".]

  8. Types of Parallelism
     • Task parallelism
       – Parallelism explicit in the algorithm
       – Between filters without a producer/consumer relationship
     • Data parallelism
       – Between iterations of a stateless filter
       – Place within a scatter/gather pair (fission; see the sketch below)
       – Can't parallelize filters with state
     • Pipeline parallelism
       – Between producers and consumers
       – Stateful filters can be parallelized
     [Diagram: the same graph annotated with all three kinds: "Data" across the copies of a fissed filter inside a scatter/gather pair, "Task" between independent branches, and "Pipeline" between successive producers and consumers.]
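To make fission concrete, here is a minimal StreamIt-style sketch; it is an illustration, not from the talk, and Scale, Counter, and Printer are made-up filters used only so the program is self-contained. The stateless Scale filter is replicated four ways inside a splitjoin: the roundrobin splitter deals successive iterations to different copies, and the roundrobin joiner restores their original order.

    void->void pipeline FissionSketch {
        add Counter();
        add FissedScale(2.5, 4);           // 4-way fission, e.g. for a 4-core target
        add Printer();
    }

    // A stateless filter: each output depends only on the items popped in
    // that execution, so different iterations may run on different cores.
    float->float filter Scale(float c) {
        work push 1 pop 1 {
            push(c * pop());
        }
    }

    // The data-parallel (fissed) version: N copies of Scale wrapped in a
    // scatter/gather pair.
    float->float splitjoin FissedScale(float c, int N) {
        split roundrobin;                  // deal one item to each copy in turn
        for (int i = 0; i < N; i++) {
            add Scale(c);
        }
        join roundrobin;                   // gather results back in order
    }

    void->float filter Counter {
        float x;
        init { x = 0; }
        work push 1 { push(x); x = x + 1; }
    }

    float->void filter Printer {
        work pop 1 { println(pop()); }
    }

The same rewrite would be incorrect for a stateful filter, because each copy would see only every fourth update of the state; that is why the slide restricts fission to stateless filters.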

  9. Types of Parallelism
     Traditionally:
     • Task parallelism: thread (fork/join) parallelism
     • Data parallelism: data-parallel loop (forall)
     • Pipeline parallelism: usually exploited in hardware
     [Diagram: the same annotated stream graph as the previous slide.]

  10. Problem Statement
     Given:
     – A stream graph with a compute and communication estimate for each filter
     – The computation and communication resources of the target machine
     Find:
     – A schedule of execution for the filters that best utilizes the available parallelism to fit the machine resources

  11. Our 3-Phase Solution
     Coarsen Granularity → Data Parallelize → Software Pipeline
     1. Coarsen: fuse stateless sections of the graph
     2. Data parallelize: parallelize stateless filters
     3. Software pipeline: parallelize stateful filters
     Compiling to a 16-core architecture gives an 11.2x mean throughput speedup over a single core.
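A small worked count, with made-up numbers rather than figures from the talk, shows why coarsening comes first: take a pipeline of four stateless filters. Fissing each filter 16 ways directly yields 64 filter instances and four scatter/gather pairs that all communicate every iteration. Fusing the four filters into one coarse stateless filter first, and then fissing 16 ways, yields only 16 instances and a single scatter/gather pair: the same total compute per steady state, but far less communication and synchronization.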

  12. Outline
     • StreamIt language overview
     • Mapping to multicores
       – Baseline techniques
       – Our 3-phase solution

  13. The StreamIt Project
     • Applications: DES and Serpent [PLDI 05], MPEG-2 [IPDPS 06], SAR, DSP benchmarks, JPEG, ...
     • Programmability: StreamIt language (CC 02), Teleport Messaging (PPOPP 05), programming environment in Eclipse (P-PHEC 05)
     • Domain-specific optimizations: linear analysis and optimization (PLDI 03), optimizations for bit streaming (PLDI 05), linear state space analysis (CASES 05)
     • Architecture-specific optimizations: compiling for communication-exposed architectures (ASPLOS 02), phased scheduling (LCTES 03), cache-aware optimization (LCTES 05), load-balanced rendering (Graphics Hardware 05)
     [Diagram: compiler flow. A StreamIt program passes through the front-end to annotated Java, which goes either to the simulator (a Java library) or through stream-aware optimizations to the uniprocessor, cluster, Raw, and IBM X10 backends, which emit C/C++, MPI-like C/C++, per-tile C plus streaming message code, and streaming X10 runtime code, respectively.]

  14. Model of Computation
     • Synchronous dataflow [Lee '92]
       – Graph of autonomous filters
       – Communicate via FIFO channels
     • Static I/O rates
       – Compiler decides on an order of execution (schedule)
       – Static estimation of computation
     [Diagram: an A/D filter feeding a Band Pass filter, then a Duplicate splitter feeding four Detect filters, each driving an LED.]
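As a small worked example of what static rates buy the compiler (the numbers are illustrative, not from the slides): if filter A pushes 3 items per execution and its downstream consumer B pops 2 items per execution, the steady-state schedule runs A twice and B three times per cycle, since 2 x 3 = 3 x 2 = 6 items then cross the channel with no net buffer growth. The compiler solves this balance for every channel at compile time, which is what lets it schedule and lay out the whole graph statically.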

  15. Example StreamIt Filter
     [Diagram: the FIR filter peeks at a sliding window of N input items and pushes one output for each input it pops.]
     float → float filter FIR (int N, float[N] weights) {
         work push 1 pop 1 peek N {
             float result = 0;
             for (int i = 0; i < N; i++) {
                 result += weights[i] * peek(i);
             }
             pop();
             push(result);
         }
     }
     This filter is stateless: it carries no mutable state from one execution to the next.

  16. Example StreamIt Filter
     [Diagram: the same FIR filter, now with an internal weights field.]
     float → float filter FIR (int N) {
         float[N] weights;
         work push 1 pop 1 peek N {
             float result = 0;
             weights = adaptChannel(weights);
             for (int i = 0; i < N; i++) {
                 result += weights[i] * peek(i);
             }
             pop();
             push(result);
         }
     }
     This version is stateful: weights is a filter field that each execution updates and the next execution reads, so iterations are no longer independent and the filter cannot be fissed.

  17. StreamIt Language Overview
     • StreamIt is a novel language for streaming
       – Exposes parallelism and communication
       – Architecture independent
       – Modular and composable: simple structures are composed to create complex graphs
       – Malleable: change program behavior with small modifications
     [Diagram: the hierarchical stream constructs: a filter; a pipeline, whose stages may be any StreamIt language construct; a splitjoin (splitter, parallel computation, joiner); and a feedback loop (joiner, body, splitter).]

  18. Outline
     • StreamIt language overview
     • Mapping to multicores
       – Baseline techniques
       – Our 3-phase solution

  19. Baseline 1: Task Parallelism
     • Inherent task parallelism between the two processing pipelines
     • Task-parallel model:
       – Only parallelize explicit task parallelism
       – Fork/join parallelism
     • Executing this on a 2-core machine gives ~2x speedup over a single core
     • What about 4, 16, 1024, ... cores?
     [Diagram: a splitter feeding two identical pipelines (BandPass, Compress, Process, Expand, BandStop), a joiner, and a final Adder.]

  20. Evaluation: Task Parallelism
     Target: the Raw microprocessor, with 16 in-order, single-issue cores with D$ and I$ and 16 memory banks, each bank with DMA, evaluated on a cycle-accurate simulator.
     [Chart: throughput normalized to single-core StreamIt for each of the 12 benchmarks (BitonicSort, ChannelVocoder, DCT, DES, FFT, Filterbank, FMRadio, Serpent, TDE, MPEG2Decoder, Vocoder, Radar) and their geometric mean.]
     The parallelism is not matched to the target, and neither is the synchronization.

  21. Baseline 2: Fine-Grained Data Parallelism
     • Each of the filters in the example is stateless
     • Fine-grained data-parallel model:
       – Fiss each stateless filter N ways (N is the number of cores)
       – Remove the scatter/gather if possible
     • We can introduce data parallelism
       – Example: 4 cores
     • Each fission group occupies the entire machine
     [Diagram: the task-parallel example after fission: every stateless filter is replicated four ways inside its own splitter/joiner pair.]
