Explicit Loop Specialization & Polymorphic Hardware Specialization
Christopher Batten and Zhiru Zhang
Computer Systems Laboratory, School of Electrical and Computer Engineering, Cornell University
June 2015
[Figure: Flexibility vs. Specialization. Axes: Performance (Tasks per Second) and Energy Efficiency (Tasks per Joule). Labels: General Purpose Processor, Programmable Accel, App Specific w/ HLS, High-Performance Architectures, Embedded Architectures, Design Power Constraint, Design Performance Constraint.]
Cornell University Christopher Batten 2 / 10
void ordered_merge( int* out, int* in_0, int* end_0,
                    int* in_1, int* end_1 )
{
  while ( (in_0 != end_0) && (in_1 != end_1) ) {
    *out++ = ( *in_0 < *in_1 ) ? *in_0 : *in_1;
    ++in_0; ++in_1;
  }
}
[Figure: three design points side by side: General-Purpose Processor (GPP), Application-Specific HLS, and Programmable Accelerator. Diagram labels: LD, ST, GPP, Lane Manager, Mem Xbar, L1 Memory System.]
XLOOPS: Explicit Loop Specialization [MICRO’14]
#pragma xloops unordered
for ( i=0; i<N; i++ )
  C[i] = A[i] * B[i];

loop:
  lw       r2, 0(rA)
  lw       r3, 0(rB)
  mul      r4, r2, r3
  sw       r4, 0(rC)
  addiu.xi rA, 4
  addiu.xi rB, 4
  addiu.xi rC, 4
  addiu    r1, r1, 1
  xloop.uc r1, rN, loop
Loop patterns: Unordered, Atomic, Ordered-Through-Registers, Ordered-Through-Memory; Fixed vs. Dynamic Bound
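To make the loop patterns named above concrete, here is a minimal C++ sketch with one plain loop per pattern. The function names and the choice of kernels are illustrative assumptions, not taken from the talk, and the comments describe the kind of inter-iteration dependence each pattern name suggests rather than the exact XLOOPS semantics.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unordered: no iteration reads a value written by another iteration.
std::vector<int> elementwise_mul(const std::vector<int>& a, const std::vector<int>& b) {
  std::vector<int> c(a.size());
  for (std::size_t i = 0; i < a.size(); i++) c[i] = a[i] * b[i];
  return c;
}

// Ordered through registers: the scalar `sum` is carried across iterations.
std::vector<int> running_sum(const std::vector<int>& a) {
  std::vector<int> out(a.size());
  int sum = 0;
  for (std::size_t i = 0; i < a.size(); i++) { sum += a[i]; out[i] = sum; }
  return out;
}

// Ordered through memory: iteration i reads v[i-1], written by iteration i-1.
void prefix_in_place(std::vector<int>& v) {
  for (std::size_t i = 1; i < v.size(); i++) v[i] += v[i - 1];
}

// Atomic: iterations may run in any order, but each update to a shared bin
// must appear indivisible (a sequential loop satisfies this trivially).
std::vector<int> histogram(const std::vector<std::size_t>& keys, std::size_t bins) {
  std::vector<int> h(bins, 0);
  for (std::size_t k : keys) h[k % bins]++;
  return h;
}

// Dynamic bound: the loop body can extend the bound (work.size()) as it runs.
// Assumes root < adj.size().
std::size_t count_reachable(const std::vector<std::vector<std::size_t>>& adj,
                            std::size_t root) {
  std::vector<bool> seen(adj.size(), false);
  std::vector<std::size_t> work = {root};
  seen[root] = true;
  for (std::size_t i = 0; i < work.size(); i++) {
    for (std::size_t n : adj[work[i]]) {
      if (!seen[n]) { seen[n] = true; work.push_back(n); }
    }
  }
  return work.size();
}
```

Only the first kernel corresponds directly to the annotated example on the slide; the others are stand-ins chosen to show the dependence a compiler or programmer would have to identify before annotating a loop.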
[Figure: XLOOPS microarchitecture block diagram. Labels: GPP; GPR RF (32 × 32b, 2r2w); L1 I$ 16 KB; L1 D$ 16 KB; L2 Request and Response Crossbars; D$ Request/Response Crossbar; LLFU; Lane Management Unit; DBN; lanes (Lane 1, Lane 3 labeled) with Lane RF (24 × 32b, 2r2w), Inst Buf 128×, SLFU, IDQ, CIB 8×, LSQ 16×.]
Adaptive Execution
XLOOPS Energy-Efficiency vs. Performance Results
[Figure: three panels plotting Normalized Energy Efficiency against Normalized Performance: In-order+LPSU vs. In-order Core; OOO 2-way+LPSU vs. OOO 2-Way; OOO 4-way+LPSU vs. OOO 4-Way.]
◮ XLOOPS vs. Simple Core: Similar energy efficiency, higher power
◮ XLOOPS vs. OOO 2-way: Higher energy efficiency, mixed power
◮ XLOOPS vs. OOO 4-way: Higher energy efficiency, lower power
◮ Adaptive execution trades energy efficiency for performance
◮ Profiling and migration cause minimal performance degradation
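The power statements in these bullets follow from the two plotted quantities: performance is tasks per second and energy efficiency is tasks per joule, so power (joules per second) is their ratio, and the same holds for values normalized to a common baseline. A minimal sketch, where the helper name is hypothetical and the sample values are made up for illustration (they are not results from the talk):

```cpp
#include <cassert>

// Normalized power implied by a point on an energy-efficiency vs. performance plot.
// (tasks/second) / (tasks/joule) = joules/second = watts.
double normalized_power(double normalized_performance, double normalized_efficiency) {
  return normalized_performance / normalized_efficiency;
}

// Illustrative only: a point with 2.0x performance at 1.0x efficiency draws 2.0x power
// ("similar energy efficiency, higher power").
double example_higher_power() { return normalized_power(2.0, 1.0); }

// Illustrative only: a point with 1.5x performance at 3.0x efficiency draws 0.5x power
// ("higher energy efficiency, lower power").
double example_lower_power() { return normalized_power(1.5, 3.0); }
```

Points above the diagonal where efficiency equals performance therefore use less power than the baseline, and points below it use more.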
[Figure: Flexibility vs. Specialization, now including PolyHS. Axes: Performance (Tasks per Second) and Energy Efficiency (Tasks per Joule). Labels: General Purpose Processor, Programmable Accel, App Specific w/ HLS, PolyHS, High-Performance Architectures, Embedded Architectures, Design Power Constraint, Design Performance Constraint.]
PolyHS: Polymorphic Hardware Specialization
◮ Software engineers also want to create specialized yet flexible pieces of software to improve code efficiency and reduce design complexity.
◮ Software engineers develop carefully crafted libraries of algorithms and data structures that are composable and polymorphic over the types of values and/or functors.
template < typename Itr0, typename Itr1, typename Itr2, typename Cmp >
void ordered_merge( Itr0 out, Itr1 in_0, Itr1 end_0,
                    Itr2 in_1, Itr2 end_1, Cmp cmp )
{
  while ( (in_0 != end_0) && (in_1 != end_1) ) {
    *out++ = ( cmp( *in_0, *in_1 ) ) ? *in_0 : *in_1;
    ++in_0; ++in_1;
  }
}
How can we systematically (and automatically?) generate hardware specialization at design time that supports compile-time polymorphism?
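As one illustration of the compile-time polymorphism in question, the templated ordered_merge can be instantiated with different iterator types, value types, and functors without changing its body. The sketch below repeats the loop from the slide and adds two helper functions whose names and inputs are hypothetical, chosen only to show two distinct instantiations.

```cpp
#include <cassert>
#include <functional>
#include <list>
#include <vector>

// Loop from the slide. Both inputs advance every iteration, so one element
// is emitted per input pair.
template < typename Itr0, typename Itr1, typename Itr2, typename Cmp >
void ordered_merge( Itr0 out, Itr1 in_0, Itr1 end_0,
                    Itr2 in_1, Itr2 end_1, Cmp cmp )
{
  while ( (in_0 != end_0) && (in_1 != end_1) ) {
    *out++ = ( cmp( *in_0, *in_1 ) ) ? *in_0 : *in_1;
    ++in_0; ++in_1;
  }
}

// Instantiation 1 (hypothetical): int elements, a vector and a list as inputs,
// std::less as the functor.
std::vector<int> demo_int_less() {
  std::vector<int> a = {1, 5, 9};
  std::list<int>   b = {2, 4, 10};
  std::vector<int> out(3);
  ordered_merge(out.begin(), a.begin(), a.end(), b.begin(), b.end(), std::less<int>());
  return out;
}

// Instantiation 2 (hypothetical): double elements, raw pointers as input
// iterators, std::greater as the functor.
std::vector<double> demo_double_greater() {
  const double a[] = {1.5, 2.0};
  const double b[] = {0.5, 3.0};
  std::vector<double> out(2);
  ordered_merge(out.begin(), a, a + 2, b, b + 2, std::greater<double>());
  return out;
}
```

Each call makes the compiler generate a separate specialized function. A hardware flow that supports this style would need an analogous way to cover many (iterator, value type, functor) combinations.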
PolyHS Methodology
[Figure: PolyHS methodology, showing a HW Toolflow and a SW Toolflow. Labels: Library of Algorithms and Data Structures (code); Poly-HS Synthesis; Library of RTL for Poly-ASUs/DSUs; Software Stubs; Architecture Description; Poly-HS Architecture; Full-Chip RTL and Simulators; Standard ASIC CAD Toolflow; Poly-HS Chip; Application (code); Poly-HS Compilation; Application Binary; Poly-HS Run-Time System.]
PolyHS Architecture
[Figure: PolyHS architecture in two views, Poly-HS Chip and Poly-Tile. Labels: GPP, Poly Tile, On-Chip Networks, L1 I$, Poly ASU, Poly DSU, Config Crossbar, Memory Crossbar, L1 Data Cache, CP, Memory Network Interface.]
Carefully designed iterator-based abstraction enables composition of algorithm and data-structure specialization units.
Specialization units configured with metadata describing data types and functors.
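One way to picture a unit "configured with metadata describing data types and functors" is a software model in which the element type and comparison are selected by a run-time descriptor instead of by template parameters. Everything in this sketch is a hypothetical illustration: the UnitConfig descriptor, the enum values, the fixed 4-byte element width, and the helper names are assumptions and do not describe the actual PolyHS interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

static_assert(sizeof(float) == 4, "sketch assumes 4-byte float");

// Hypothetical metadata: which data type and which functor the unit uses.
enum class ElemType { Int32, Float32 };
enum class CmpOp    { Less, Greater };
struct UnitConfig { ElemType type; CmpOp cmp; };

// Interpret two 4-byte elements according to the configuration and compare them.
bool compare(const UnitConfig& cfg, const std::uint8_t* a, const std::uint8_t* b) {
  if (cfg.type == ElemType::Int32) {
    std::int32_t x, y;
    std::memcpy(&x, a, 4); std::memcpy(&y, b, 4);
    return cfg.cmp == CmpOp::Less ? x < y : x > y;
  }
  float x, y;
  std::memcpy(&x, a, 4); std::memcpy(&y, b, 4);
  return cfg.cmp == CmpOp::Less ? x < y : x > y;
}

// Model of one merge unit: the same pairwise selection loop as ordered_merge,
// operating on untyped bytes and steered by the configuration.
std::vector<std::uint8_t> merge_unit(const UnitConfig& cfg,
                                     const std::vector<std::uint8_t>& in_0,
                                     const std::vector<std::uint8_t>& in_1) {
  const std::size_t w = 4;  // element width in bytes (assumed fixed here)
  std::vector<std::uint8_t> out;
  for (std::size_t i = 0; i + w <= in_0.size() && i + w <= in_1.size(); i += w) {
    const std::uint8_t* src = compare(cfg, &in_0[i], &in_1[i]) ? &in_0[i] : &in_1[i];
    out.insert(out.end(), src, src + w);
  }
  return out;
}

// Helpers to move typed values in and out of the byte-level model.
template <typename T>
std::vector<std::uint8_t> pack(const std::vector<T>& v) {
  std::vector<std::uint8_t> bytes(v.size() * sizeof(T));
  if (!v.empty()) std::memcpy(bytes.data(), v.data(), bytes.size());
  return bytes;
}
template <typename T>
std::vector<T> unpack(const std::vector<std::uint8_t>& bytes) {
  std::vector<T> v(bytes.size() / sizeof(T));
  if (!v.empty()) std::memcpy(v.data(), bytes.data(), v.size() * sizeof(T));
  return v;
}
```

The design point this highlights is that a single datapath serves several type and functor combinations, with the choice moved from compile time to a small configuration record.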