
Explicit Loop Specialization & Polymorphic Hardware Specialization
Christopher Batten and Zhiru Zhang
Computer Systems Laboratory, School of Electrical and Computer Engineering, Cornell University
June 2015

[Figure: design space for hardware specialization. Horizontal axis: Performance (Tasks per Second). Vertical axis: Energy Efficiency (Tasks per Joule). Axis annotation: Flexibility vs. Specialization. Labeled points: General Purpose Processor, Programmable Accel, App Specific w/ HLS. Labeled regions: Embedded Architectures, High-Performance Architectures. Bounds: Design Power Constraint, Design Performance Constraint.]

    void ordered_merge( int* out, int* in_0, int* end_0,
                        int* in_1, int* end_1 )
    {
      while ( (in_0 != end_0) && (in_1 != end_1) ) {
        *out++ = ( *in_0 < *in_1 ) ? *in_0 : *in_1;
        ++in_0; ++in_1;
      }
    }

[Figure: two block diagrams for running this loop, labeled Programmable Accelerator and Application-Specific HLS. Component labels: GPP, Lane Manager, LD, ST, Mem Xbar, L1 Memory System.]

XLOOPS: Explicit Loop Specialization [MICRO'14]

    #pragma xloops unordered
    for ( i=0; i<N; i++ )
      C[i] = A[i] * B[i];

    loop:
      lw       r2, 0(rA)
      lw       r3, 0(rB)
      mul      r4, r2, r3
      sw       r4, 0(rC)
      addiu.xi rA, 4
      addiu.xi rB, 4
      addiu.xi rC, 4
      addiu    r1, r1, 1
      xloop.uc r1, rN, loop

Loop kinds listed on the slide:
◮ Unordered
◮ Atomic
◮ Ordered-Through-Registers
◮ Ordered-Through-Memory
◮ Fixed vs Dynamic Bound

[Figure: XLOOPS microarchitecture. GPP: GPR RF (32 × 32b, 2r2w), SLFU, LLFU. Lane Management Unit: DBN. Lanes 0, 1, 3, each with IDQ, Inst Buf (128×), Lane RF (24 × 32b, 2r2w), CIB (8×), SLFU, LSQ (16×). Memory: D$ Request/Response Crossbar, L1 I$ 16 KB, L1 D$ 16 KB, L2 Request and Response Crossbars. Additional label: Adaptive Execution.]
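The loop kinds named on this slide can be illustrated with ordinary C++ loops. The following is a minimal sketch, not code from the talk: the function names are hypothetical, the assignment of each loop to a category is an interpretation of the category names, and only the `unordered` pragma appears in the source (it is shown in a comment so the sketch builds with any compiler).

```cpp
#include <cstddef>
#include <vector>

// Unordered: no iteration depends on another, so any order is legal.
// The slide annotates this loop with "#pragma xloops unordered".
// Assumes b has at least as many elements as a.
std::vector<int> elementwise_mul(const std::vector<int>& a,
                                 const std::vector<int>& b) {
  std::vector<int> c(a.size());
  for (std::size_t i = 0; i < a.size(); i++)
    c[i] = a[i] * b[i];
  return c;
}

// Atomic (assumed reading): iterations may run in any order, but each
// update to a shared bin must happen as one indivisible step.
std::vector<int> histogram(const std::vector<std::size_t>& keys,
                           std::size_t bins) {
  std::vector<int> counts(bins, 0);
  for (std::size_t i = 0; i < keys.size(); i++)
    counts[keys[i] % bins]++;
  return counts;
}

// Ordered through registers (assumed reading): the scalar `sum` carries a
// value from one iteration to the next.
std::vector<int> running_sum(const std::vector<int>& a) {
  std::vector<int> out(a.size());
  int sum = 0;
  for (std::size_t i = 0; i < a.size(); i++) {
    sum += a[i];
    out[i] = sum;
  }
  return out;
}

// Ordered through memory (assumed reading): iteration i reads a[i - 1],
// which the previous iteration wrote.
void running_sum_in_place(std::vector<int>& a) {
  for (std::size_t i = 1; i < a.size(); i++)
    a[i] += a[i - 1];
}

// Dynamic bound (assumed reading): the loop body can extend the range it
// iterates over, here by appending new work items.
std::vector<int> halve_until_one(std::vector<int> work) {
  for (std::size_t i = 0; i < work.size(); i++) {
    int v = work[i];  // copy before push_back may reallocate
    if (v > 1) work.push_back(v / 2);
  }
  return work;
}
```

Only the first loop is taken from the slide; the others are stand-ins chosen because their dependence patterns are easy to see.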

XLOOPS Energy-Efficiency vs. Performance Results

[Figure: three scatter plots of Normalized Energy Efficiency against Normalized Performance. Panels: In-order+LPSU vs. In-order Core; OOO 2-way+LPSU vs. OOO 2-Way; OOO 4-way+LPSU vs. OOO 4-Way.]

◮ XLOOPS vs. Simple Core: Similar energy efficiency, higher power
◮ XLOOPS vs. OOO 2-way: Higher energy efficiency, mixed power
◮ XLOOPS vs. OOO 4-way: Higher energy efficiency, lower power
◮ Adaptive execution trades energy efficiency for performance
◮ Profiling and migration cause minimal performance degradation
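The power comparisons in these bullets follow from the two plotted quantities. Performance is tasks per second and energy efficiency is tasks per joule, so power (joules per second) is their ratio. A minimal sketch of that arithmetic, with hypothetical function names and no data taken from the plots:

```cpp
// Power relative to the baseline, given performance and energy efficiency
// that are both normalized to the same baseline.
// (tasks / second) / (tasks / joule) = joules / second.
double normalized_power(double norm_performance, double norm_efficiency) {
  return norm_performance / norm_efficiency;
}

// True when a design draws less power than its baseline, which is the case
// whenever its efficiency gain exceeds its performance gain.
bool lower_power_than_baseline(double norm_performance,
                               double norm_efficiency) {
  return normalized_power(norm_performance, norm_efficiency) < 1.0;
}
```

For instance, an illustrative design with 2.0× the performance at 1.0× the energy efficiency draws 2.0× the power, which matches the "similar energy efficiency, higher power" pattern.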

[Figure: the design-space plot of Energy Efficiency (Tasks per Joule) against Performance (Tasks per Second), with labeled points General Purpose Processor, Programmable Accel, and App Specific w/ HLS, now with PolyHS added as a further point.]

PolyHS: Polymorphic Hardware Specialization

◮ Software engineers also want to create specialized yet flexible pieces of software to improve code efficiency and reduce design complexity.
◮ Software engineers develop carefully crafted libraries of algorithms and data structures that are composable and polymorphic over the types of values and/or functors.

    template < typename Itr0, typename Itr1,
               typename Itr2, typename Cmp >
    void ordered_merge( Itr0 out, Itr1 in_0, Itr1 end_0,
                        Itr2 in_1, Itr2 end_1, Cmp cmp )
    {
      while ( (in_0 != end_0) && (in_1 != end_1) ) {
        *out++ = ( cmp( *in_0, *in_1 ) ) ? *in_0 : *in_1;
        ++in_0; ++in_1;
      }
    }

How can we systematically (and automatically?) generate hardware specialization at design time that supports compile-time polymorphism?
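To make the compile-time polymorphism concrete, the sketch below instantiates the slide's template twice: once over raw `int` arrays with `std::less`, and once over a `std::list` and a `std::vector` of strings with a length-comparing lambda. The helper names `merge_ints` and `merge_strings` and all input values are illustrative assumptions, not part of the talk. Note that the loop as written advances both inputs on every iteration, so each output element is the one `cmp` selects from the current pair.

```cpp
#include <functional>
#include <iterator>
#include <list>
#include <string>
#include <vector>

// Template from the slide, reproduced so the example is self-contained.
template <typename Itr0, typename Itr1, typename Itr2, typename Cmp>
void ordered_merge(Itr0 out, Itr1 in_0, Itr1 end_0,
                   Itr2 in_1, Itr2 end_1, Cmp cmp) {
  while ((in_0 != end_0) && (in_1 != end_1)) {
    *out++ = (cmp(*in_0, *in_1)) ? *in_0 : *in_1;
    ++in_0; ++in_1;
  }
}

// Instantiation 1: pointer iterators over ints, standard less-than functor.
std::vector<int> merge_ints() {
  int a[] = {1, 8, 3};
  int b[] = {4, 2, 9};
  std::vector<int> out;
  ordered_merge(std::back_inserter(out), a, a + 3, b, b + 3,
                std::less<int>());
  return out;
}

// Instantiation 2: list and vector iterators over strings, with a
// user-supplied functor that compares by length.
std::vector<std::string> merge_strings() {
  std::list<std::string> a = {"fig", "banana"};
  std::vector<std::string> b = {"pear", "kiwi"};
  std::vector<std::string> out;
  ordered_merge(std::back_inserter(out), a.begin(), a.end(),
                b.begin(), b.end(),
                [](const std::string& x, const std::string& y) {
                  return x.size() < y.size();
                });
  return out;
}
```

The compiler generates a separate specialized function for each combination of iterator and functor types, which is the software analogue of the specialized-yet-flexible hardware the slide asks about.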

PolyHS Methodology

[Figure: Poly-HS toolflow diagram. Column headings: HW Toolflow, SW Toolflow. Boxes: Library of Algorithms and Data Structures, Architecture Description, Application code, Poly-HS Synthesis, Poly-HS Compilation, Software Stubs, Application Binary, Library of RTL for Poly-ASUs/DSUs, Poly-HS Architecture Full-Chip RTL and Simulators, Poly-HS Run-Time System, Standard ASIC CAD Toolflow, Poly-HS Chip.]

PolyHS Architecture

[Figure: Poly-HS Chip block diagram. Labels: Poly-Tile, Poly Tile, GPP, CP, Poly ASU, Poly DSU, Config Crossbar, Memory Crossbar, On-Chip Networks, L1 I$, L1 Data Cache, Memory Network Interface.]

Carefully designed iterator-based abstraction enables composition of algorithm and data-structure specialization units.

Specialization units configured with metadata describing data types and functors.

