
Architectural Specialization for Inter-Iteration Loop Dependence Patterns

Shreesha Srinath, Berkin Ilbeyi, Mingxing Tan, Gai Liu, Zhiru Zhang, Christopher Batten. Computer Systems Laboratory, School of Electrical and Computer Engineering, Cornell University.


  1. Architectural Specialization for Inter-Iteration Loop Dependence Patterns
     Shreesha Srinath, Berkin Ilbeyi, Mingxing Tan, Gai Liu, Zhiru Zhang, Christopher Batten
     Computer Systems Laboratory, School of Electrical and Computer Engineering, Cornell University
     47th Int’l Symp. on Microarchitecture, Dec 2014

  2-4. Motivation
     Outline: Motivation, XLOOPS ISA, XLOOPS Compiler, XLOOPS Microarchitecture, Evaluation.
     [Figure: design space with axes Energy Efficiency (Tasks per Joule) and Performance (Tasks per Second). Labeled design points: General Purpose Processor, More Flexible Accelerator, Less Flexible Accelerator, Custom ASIC. Annotation along the trend: Flexibility vs. Specialization. The intermediate build (slide 3) carries the label "Golden Triangle".]

  5-8. Motivation: Loop Dependence Pattern Specialization
     [Figure: a loop drawn as Iteration 0, 1, 2, 3, ..., n-1, each executing inst0, inst1, inst2, inst3, ..., branch.]
     Specialization approaches by the dependences they target:
       • Intra-Iteration: Micro-op Fusion, ASIPs, CCA
       • Inter-Iteration: Vector, GPU, HELIX-RC
       • Both: DySER, Qs-Cores, BERET
     Key Challenge: Creating HW/SW abstractions that are flexible and enable performance-portable execution

  9-13. Motivation: Explicit Loop Specialization (XLOOPS)
     Key Idea 1: Expose fine-grained parallelism by elegantly encoding inter-iteration loop dependence patterns in the ISA
     Key Idea 2: Single-ISA heterogeneous architecture with a new execution paradigm supporting traditional, specialized, and adaptive execution
       • Traditional Execution
       • Specialized Execution
       • Adaptive Execution
     [Figure: block diagram with GPP, Lane Manager, Lanes, Mem XBar, and L1 Data Cache.]

  14-15. Overview (slide 15 repeats this slide to open the XLOOPS ISA section)
     1. XLOOPS ISA
          loop:
            lw       r2, 0(rA)
            lw       r3, 0(rB)
            ...
            addiu.xi rA, 4
            addiu.xi rB, 4
            addiu    r1, r1, 1
            xloop.uc r1, rN, loop
     2. XLOOPS Compiler
          #pragma xloops ordered
          for (i = 0; i < N; i++)
            A[i] = A[i] * A[i-K];
          ...
          #pragma xloops atomic
          for (i = 0; i < N; i++) {
            B[ A[i] ]++;
            D[ C[i] ]++;
          }
     3. XLOOPS Microarchitecture
          [Figure: OoO GPP, Lane Manager, Lanes, Mem XBar, L1 Data Cache.]
     4. Evaluation
          [Figure: chart with an axis running from 0 to 2.5; no other labels survive in the transcript.]
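
The two pragmas on the overview slide can be read as statements about which iteration orders are safe. The C sketch below illustrates that reading under stated assumptions and is not the XLOOPS compiler's behavior: it assumes the `ordered` loop's only cross-iteration dependence is the read of `A[i-K]`, that the loop starts at `i = K` so the read stays in bounds (the slide starts at 0), and that `atomic` means each iteration's updates appear indivisible. All function names and the `reverse` flag are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Reference: the ordered loop, run strictly in program order.
 * Assumption: starts at i = K so that A[i - K] stays in bounds. */
void ordered_reference(int *A, size_t n, size_t K) {
    for (size_t i = K; i < n; i++)
        A[i] = A[i] * A[i - K];
}

/* Same loop in blocks of at most K iterations. Iteration i only reads
 * A[i - K], which is either an untouched input or was finalized by an
 * earlier block, so iterations inside one block are independent. They
 * are visited in reverse here to make that visible. */
void ordered_blocked(int *A, size_t n, size_t K) {
    assert(K > 0); /* a distance of 0 would give an empty block */
    for (size_t start = K; start < n; start += K) {
        size_t end = (n - start < K) ? n : start + K;
        for (size_t i = end; i-- > start; )
            A[i] = A[i] * A[i - K];
    }
}

/* Atomic pattern: if each iteration's two increments are indivisible,
 * the final counts do not depend on the order iterations are visited.
 * `reverse` (hypothetical) only changes the visiting order. */
void atomic_histogram(const int *A, const int *C, int *B, int *D,
                      size_t n, int reverse) {
    for (size_t k = 0; k < n; k++) {
        size_t i = reverse ? n - 1 - k : k;
        B[A[i]]++;
        D[C[i]]++;
    }
}
```

The blocked version suggests why a dependence distance of K still leaves up to K iterations free to overlap, and the histogram shows why only atomicity, not ordering, matters for the second loop.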

  16-18. XLOOPS ISA: XLOOPS Instruction Set Extensions
     XLOOP Instruction
          xloop.{d}.{c} rI, rN, L
       • {d}: Data Dependence
       • {c}: Control Dependence
       • rI: Induction Variable
       • rN: Loop Bound
       • L: Loop Label
     Example
          xloop.uc.fb r2, r3, 0x8000
       • uc: Unordered Concurrent
       • fb: Fixed Bound
     Cross-Iteration Instructions
          addiu.xi rX, imm
          addu.xi  rX, rT
       Variables that can be computed as linear functions of the induction variable
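
The note on cross-iteration instructions says the updated register is a linear function of the induction variable. A minimal C sketch of what that implies, assuming `addiu.xi rX, imm` adds `imm` to `rX` once per iteration and that iterations are numbered from 0; the function names are hypothetical and 32-bit wraparound arithmetic is assumed:

```c
#include <stdint.h>

/* Sequential view: apply the per-iteration increment `iters` times. */
uint32_t xi_sequential(uint32_t base, uint32_t imm, uint32_t iters) {
    uint32_t rx = base;
    for (uint32_t i = 0; i < iters; i++)
        rx += imm; /* one addiu.xi per iteration */
    return rx;
}

/* Closed form: the value rX holds at the start of iteration i.
 * No earlier iteration has to run to obtain it. */
uint32_t xi_closed_form(uint32_t base, uint32_t imm, uint32_t i) {
    return base + imm * i;
}
```

Because the closed form needs only the starting value and the iteration number, any iteration could obtain its own `rA` or `rB` pointer without waiting for the previous one, which is what makes such variables harmless to concurrent execution.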

  19-20. XLOOPS ISA: Unordered Concurrent
     Element-wise Vector Multiplication
          for ( i=0; i<N; i++ )
            C[i] = A[i] * B[i];
     RISC ISA
          loop:
            lw    r2, 0(rA)
            lw    r3, 0(rB)
            mul   r4, r2, r3
            sw    r4, 0(rC)
            addiu rA, rA, 4
            addiu rB, rB, 4
            addiu rC, rC, 4
            addiu r1, r1, 1
            bne   r1, rN, loop
     [Figure (slide 20): Iteration 0, 1, 2, 3 drawn side by side, each running inst0, inst1, inst2, inst3, ... and ending in xloop.uc.]
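
The side-by-side iteration figure suggests that iterations of this loop do not depend on one another. A small C sketch of that property, assuming a simple round-robin assignment of iterations to lanes; the lane model and the function names are illustrative assumptions, not the XLOOPS microarchitecture:

```c
#include <stddef.h>

/* Reference: the loop from the slide, in program order. */
void vvmul(const int *A, const int *B, int *C, size_t n) {
    for (size_t i = 0; i < n; i++)
        C[i] = A[i] * B[i];
}

/* Hypothetical lane model: lane l handles iterations l, l + lanes, ...
 * Lanes are visited in reverse to show that the result does not depend
 * on which lane runs first. Expects lanes >= 1. */
void vvmul_lanes(const int *A, const int *B, int *C, size_t n, size_t lanes) {
    for (size_t l = lanes; l-- > 0; )
        for (size_t i = l; i < n; i += lanes)
            C[i] = A[i] * B[i];
}
```

Both functions write the same values to `C` because each iteration touches only its own element of `A`, `B`, and `C`.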
