Explicit Loop Specialization & Polymorphic Hardware Specialization
Christopher Batten and Zhiru Zhang
Computer Systems Laboratory, School of Electrical and Computer Engineering
Cornell University
June 2015
[Figure: design-space plot of Energy Efficiency (Tasks per Joule) versus Performance (Tasks per Second), bounded by a Design Power Constraint and a Design Performance Constraint. Regions: Embedded Architectures, High-Performance Architectures. Design points along the Flexibility vs. Specialization axis: General Purpose Processor, Programmable Accel, App Specific w/ HLS.]
    void ordered_merge( int* out, int* in_0, int* end_0,
                        int* in_1, int* end_1 )
    {
      while ( (in_0 != end_0) && (in_1 != end_1) ) {
        *out++ = ( *in_0 < *in_1 ) ? *in_0 : *in_1;
        ++in_0; ++in_1;
      }
    }

[Figure: two hardware targets for this loop, Programmable Accelerator and Application-Specific HLS. Recoverable labels: Lane Manager, GPP, LD, ST, Mem Xbar, L1 Memory System.]
XLOOPS: Explicit Loop Specialization [MICRO'14]

    #pragma xloops unordered
    for ( i=0; i<N; i++ )
      C[i] = A[i] * B[i];

    loop:
      lw       r2, 0(rA)
      lw       r3, 0(rB)
      mul      r4, r2, r3
      sw       r4, 0(rC)
      addiu.xi rA, 4
      addiu.xi rB, 4
      addiu.xi rC, 4
      addiu    r1, r1, 1
      xloop.uc r1, rN, loop

Loop patterns listed on the slide: Unordered Atomic, Ordered-Through-Registers, Ordered-Through-Memory, Fixed vs Dynamic Bound

[Figure: XLOOPS microarchitecture. Recoverable labels: GPP, GPR RF (32 × 32b, 2r2w), SLFU, LLFU, Lane Management Unit, DBN, Adaptive Execution, Lanes 0, 1, 3, and per lane an IDQ, Inst Buf (128×), Lane RF (24 × 32b, 2r2w), CIB (8×), SLFU, and LSQ (16×). Memory system: D$ Request/Response Crossbar, L1 I$ 16 KB, L1 D$ 16 KB, L2 Request and Response Crossbars.]
XLOOPS Energy-Efficiency vs. Performance Results

[Figure: three scatter plots of Normalized Energy Efficiency versus Normalized Performance. Panels: In-order+LPSU vs. In-order Core, OOO 2-way+LPSU vs. OOO 2-Way, OOO 4-way+LPSU vs. OOO 4-Way.]

◮ XLOOPS vs. Simple Core: Similar energy efficiency, higher power
◮ XLOOPS vs. OOO 2-way: Higher energy efficiency, mixed power
◮ XLOOPS vs. OOO 4-way: Higher energy efficiency, lower power
◮ Adaptive execution trades energy efficiency for performance
◮ Profiling and migration cause minimal performance degradation
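The power comparisons in these bullets follow from the two plotted axes. With energy efficiency measured in tasks per joule and performance in tasks per second, as on the design-space plot, power is their ratio. The worked numbers below are illustrative assumptions, not values read from the plots.

```latex
\[
P \;=\; \frac{\text{performance (tasks/s)}}{\text{energy efficiency (tasks/J)}}
\quad [\mathrm{J/s} = \mathrm{W}]
\qquad\Longrightarrow\qquad
\frac{P_{\text{XLOOPS}}}{P_{\text{baseline}}}
  \;=\; \frac{\text{normalized performance}}{\text{normalized energy efficiency}}
\]
% Illustrative only: normalized performance 2.0 at normalized efficiency 1.0
% gives 2.0 / 1.0 = 2.0x the baseline power ("similar efficiency, higher power").
% Normalized performance 1.5 at normalized efficiency 2.0
% gives 1.5 / 2.0 = 0.75x the baseline power ("higher efficiency, lower power").
```

A point above the diagonal of a panel (efficiency gain larger than performance gain) therefore draws less power than its baseline, and a point below it draws more.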
[Figure: the design-space plot of Energy Efficiency (Tasks per Joule) versus Performance (Tasks per Second), now with a PolyHS design point added alongside General Purpose Processor, Programmable Accel, and App Specific w/ HLS.]
PolyHS: Polymorphic Hardware Specialization

◮ Software engineers also want to create specialized yet flexible pieces of software to improve code efficiency and reduce design complexity.
◮ Software engineers develop carefully crafted libraries of algorithms and data structures that are composable and polymorphic over the types of values and/or functors.

    template < typename Itr0, typename Itr1,
               typename Itr2, typename Cmp >
    void ordered_merge( Itr0 out, Itr1 in_0, Itr1 end_0,
                        Itr2 in_1, Itr2 end_1, Cmp cmp )
    {
      while ( (in_0 != end_0) && (in_1 != end_1) ) {
        *out++ = ( cmp( *in_0, *in_1 ) ) ? *in_0 : *in_1;
        ++in_0; ++in_1;
      }
    }

How can we systematically (and automatically?) generate hardware specialization at design time that supports compile-time polymorphism?
PolyHS Methodology

[Figure: PolyHS toolflow diagram with a HW Toolflow and a SW Toolflow. Recoverable labels: Library of Algorithms and Data Structures, Architecture Description, Application code, Poly-HS Synthesis, Poly-HS Compilation, Software Stubs, Library of RTL for Poly-ASUs/DSUs, Poly-HS Architecture Full-Chip RTL and Simulators, Poly-HS Run-Time System, Application Binary, Standard ASIC CAD Toolflow, Poly-HS Chip.]
PolyHS Architecture

[Figure: a Poly-HS Chip made of GPPs and Poly Tiles connected by On-Chip Networks, with one Poly-Tile expanded. Recoverable labels inside the Poly-Tile: GPP, CP, Config Crossbar, Crossbar, Poly ASU, Poly DSU, Memory Crossbar, L1 I$, L1 Data Cache, Memory Network Interface.]

Carefully designed iterator-based abstraction enables composition of algorithm and data-structure specialization units

Specialization units configured with metadata describing data types and functors