"Everything should be made as simple as possible, but not simpler" - Albert Einstein
LogCA: A High-Level Performance Model for Hardware Accelerators
Muhammad Shoaib Bin Altaf*, David A. Wood
University of Wisconsin-Madison
* Now at AMD Research, Austin TX
Executive Summary
• Accelerators do not always perform as expected
• Crucial for programmers and architects to understand the factors which affect performance
• Simple analytical models beneficial early in the design stage
• Our proposal: LogCA
– High-level performance model
– Help identify design bottlenecks and possible optimizations
• Validation across variety of on-chip and off-chip accelerators
• Two retrospective case studies demonstrate the usefulness of the model
Outline • Motivation • LogCA • Results • Conclusion 3
Why Need a Model?
“An accelerator is a separate architectural substructure ... that is architected using a different set of objectives than the base processor, ... the accelerator is tuned to provide HIGHER PERFORMANCE ... than with the general-purpose base hardware”
S. Patel and W. Hwu. Accelerator Architectures. Micro 2008
[Figures: M7: Next Generation SPARC, Hot Chips 26, 2014; Power8, Hot Chips 25, 2013]
Why a Model?
Encryption algorithm on UltraSPARC T2
[Figure: Time (ms) vs. Block Size (Bytes) for Host and Accelerator, log scale, lower is better. The break-even point separates the region where the host outperforms from the region where the accelerator outperforms.]
Amdahl’s Law for Accelerators
Why a Model?
Advanced Encryption Standard (AES)
[Figure: Speedup vs. Offloaded Data (Bytes) for UltraSPARC T2, SPARC T4, and GPU, log scale, higher is better, with the break-even points marked.]
Running the same kernel, accelerators can have different break-even points
The Performance Model
• Inspired by LogP [CACM 1996]
• Abstract accelerator using five parameters
– L Latency: Cycles to move data
– o Overhead: Setup cost
– g Granularity: Size of the off-loaded data
– C Computational index: Amount of work done per byte of data
– A Acceleration: Speedup ignoring overheads
• Sixth parameter γ generalizes to kernels with non-linear complexity
[Figure: Host and Accelerator connected by an Interface]
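The Performance Model
• Execution w/o an accelerator
– $T_0(g) = C_0(g)$
• Execution with one accelerator
– $T_1(g) = o_1(g) + L_1(g) + C_1(g)$
– $C_1(g) = C_0(g) / A$
[Figure: timeline comparing $T_0(g) = C_0(g)$ on the host with $T_1(g)$, split into $o_1(g)$, $L_1(g)$, and $C_1(g)$; the difference is the gain. Host and Accelerator are connected by an Interface.]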
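As a minimal sketch, the two execution-time equations can be evaluated numerically. The functional forms are assumptions not stated on the slide: a constant overhead $o_1(g) = o$, a latency that is either constant or proportional to $g$, and a host compute time $C_0(g) = C \cdot g^{\gamma}$. The function names and the sample parameter values are hypothetical.

```python
def host_time(g, C, gamma=1.0):
    """T_0(g) = C_0(g); assumes C_0(g) = C * g**gamma."""
    return C * g ** gamma


def accel_time(g, L, o, C, A, gamma=1.0, latency_scales_with_g=False):
    """T_1(g) = o_1(g) + L_1(g) + C_1(g), with C_1(g) = C_0(g) / A."""
    # Assumption: latency is either a constant L or grows as L * g.
    latency = L * g if latency_scales_with_g else L
    return o + latency + host_time(g, C, gamma) / A


def speedup(g, L, o, C, A, gamma=1.0, latency_scales_with_g=False):
    """Speedup(g) = T_0(g) / T_1(g)."""
    return host_time(g, C, gamma) / accel_time(
        g, L, o, C, A, gamma, latency_scales_with_g)


# Illustrative values only; they do not describe any real accelerator.
params = dict(L=100.0, o=1000.0, C=10.0, A=20.0)
for g in (16, 1024, 65536, 2 ** 24):
    print(g, round(speedup(g, **params), 3))
```

With a constant latency the printed speedup stays below 1 at small granularities and approaches A as g grows.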
Granularity independent latency
• Captures the effect of granularity on speedup
• Speedup bounded by acceleration
– $\lim_{g \to \infty} \mathrm{Speedup}(g) = A$
• Overheads dominate at smaller granularities
– $\mathrm{Speedup}(g)\big|_{g=1} = \frac{C}{o + L + C/A} < \frac{C}{o + L}$
[Figure: Speedup(g) vs. Granularity (Bytes), log scale, with A as the upper bound]
Amdahl’s law for Accelerators
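The speedup bound and the small-granularity behaviour on this slide follow from the model equations. The derivation below is a sketch that assumes (this is not stated on the slide) a constant overhead $o$, a constant latency $L$, and a linear host compute time $C_0(g) = C\,g$:

```latex
\begin{align*}
\mathrm{Speedup}(g) &= \frac{T_0(g)}{T_1(g)}
  = \frac{C\,g}{o + L + \dfrac{C\,g}{A}} \\[4pt]
\lim_{g \to \infty} \mathrm{Speedup}(g)
  &= \lim_{g \to \infty} \frac{C}{\dfrac{o + L}{g} + \dfrac{C}{A}} = A \\[4pt]
\mathrm{Speedup}(1) &= \frac{C}{o + L + \dfrac{C}{A}} < \frac{C}{o + L}
\end{align*}
```

At large $g$ the fixed cost $o + L$ is amortized, so only $A$ limits the speedup. At $g = 1$ the fixed cost dominates the denominator.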
Performance Metrics
• Right amount of off-loaded data?
• Inspired by vector machine metrics $N_{1/2}$, $N_v$
• $g_1$: Granularity for a speedup of 1
– $g_1$ is essentially independent of acceleration
– Identify complexity of the interface: a small $g_1$ indicates a simple interface, a large $g_1$ a complex interface
• $g_{A/2}$: Granularity for a speedup of $A/2$
– Increasing A also increases $g_{A/2}$
[Figure: Speedup vs. Granularity (Bytes), log scale, marking $g_1$ at a speedup of 1 and $g_{A/2}$ at a speedup of $A/2$, with A as the upper bound]
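A short sketch of how the two metrics behave, assuming a linear kernel with constant overhead and constant latency, so that Speedup(g) = C·g / (o + L + C·g/A). Under that assumption both metrics have closed forms. The function names and parameter values are hypothetical.

```python
def g_1(L, o, C, A):
    """Granularity where speedup reaches 1: C*g = o + L + C*g/A."""
    return (o + L) * A / (C * (A - 1.0))


def g_half_A(L, o, C, A):
    """Granularity where speedup reaches A/2: C*g/A = o + L."""
    return A * (o + L) / C


# Illustrative values only: sweep the acceleration A, keep L, o, C fixed.
for A in (10.0, 100.0, 1000.0):
    print(A, round(g_1(100.0, 1000.0, 10.0, A), 2),
          g_half_A(100.0, 1000.0, 10.0, A))
```

In this sketch `g_1` barely moves as A grows by 100x, because the factor A/(A - 1) is close to 1, while `g_half_A` grows in proportion to A.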
Granularity dependent latency
• Speedup bounded by computational intensity C/L
– $\lim_{g \to \infty} \mathrm{Speedup}(g) < \frac{C}{L}$ (linear algorithms)
• Speedup for sub-linear algorithms asymptotically decreases with the increase in granularity
[Figures: Speedup(g) vs. Granularity (Bytes), log scale, for linear and sub-linear algorithms, with A as the upper bound]
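Both behaviours can be reproduced with a small numeric sweep. This is a sketch under stated assumptions: latency proportional to the offloaded data, $L_1(g) = L \cdot g$, constant overhead, and host compute time $C \cdot g^{\gamma}$, with $\gamma = 1$ for a linear kernel and $\gamma = 0.5$ standing in for a sub-linear one. All values are illustrative.

```python
def speedup(g, L, o, C, A, gamma):
    """Speedup with latency proportional to g (assumed L_1(g) = L * g)."""
    host = C * g ** gamma                # T_0(g) = C_0(g)
    return host / (o + L * g + host / A)  # T_0(g) / T_1(g)


L, o, C, A = 2.0, 1000.0, 10.0, 20.0     # hypothetical values; C/L = 5
grans = [2 ** k for k in range(4, 33, 4)]

linear = [speedup(g, L, o, C, A, gamma=1.0) for g in grans]
sublinear = [speedup(g, L, o, C, A, gamma=0.5) for g in grans]

for g, lin, sub in zip(grans, linear, sublinear):
    print(f"g={g:>12d}  linear={lin:7.3f}  sub-linear={sub:9.6f}")
```

The linear column levels off at C / (L + C/A), which is below C/L, while the sub-linear column peaks and then falls toward zero because the latency term grows faster than the work.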
Granularity dependent latency
• Computational intensity must be greater than 1 to achieve any speedup
– $\frac{C}{L} \geq 1$
• Computational intensity should be greater than peak performance to achieve a speedup of A/2
– $\frac{C}{L} \geq A$
[Figure: Speedup vs. Granularity (Bytes), marking $g_1$ and $g_{A/2}$ and the speedup levels 1, A/2, and A]
Performance metrics help programmers early in the design cycle
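Both conditions can be checked against the asymptotic speedup. The sketch below assumes (not stated on the slide) a linear kernel, $C_0(g) = C\,g$, a constant overhead $o$, and a latency proportional to granularity, $L_1(g) = L\,g$:

```latex
\begin{align*}
\lim_{g \to \infty} \mathrm{Speedup}(g)
  &= \lim_{g \to \infty} \frac{C\,g}{o + L\,g + \dfrac{C\,g}{A}}
   = \frac{C}{L + \dfrac{C}{A}} \\[4pt]
\frac{C}{L + C/A} > 1
  &\iff C\left(1 - \frac{1}{A}\right) > L
   \iff \frac{C}{L} > \frac{A}{A - 1} \\[4pt]
\frac{C}{L + C/A} \geq \frac{A}{2}
  &\iff 2C \geq A\,L + C
   \iff \frac{C}{L} \geq A
\end{align*}
```

Since $A/(A-1) > 1$, a computational intensity above 1 is necessary for any speedup, and the threshold approaches 1 as $A$ grows. Reaching $A/2$ requires $C/L$ to be at least $A$.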
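Bottleneck Analysis using LogCA
• A parameter is treated as a bottleneck if a 10X change in it yields at least a 20% performance gain
• Helps focus on performance bottlenecks
[Figure: Speedup vs. Granularity (Bytes), log scale, with curves LogCA, L_0.1x, o_0.1x, C_10x, and A_10x. Regions along the granularity axis are labeled with the bottleneck parameters (oC, oCA, A), and the bounds A and C/L are marked.]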
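The sensitivity test can be sketched in a few lines. The code assumes a linear kernel with constant overhead and constant latency, follows the figure legend in scaling L and o by 0.1x and C and A by 10x, and reads the 20% figure as a minimum relative gain in speedup. The function names and parameter values are hypothetical.

```python
def speedup(g, L, o, C, A):
    """Linear kernel, granularity-independent latency (assumed forms)."""
    return C * g / (o + L + C * g / A)


def bottlenecks(g, L, o, C, A, factor=10.0, threshold=0.20):
    """Return the parameters whose 10x improvement lifts speedup by >= 20%."""
    base = speedup(g, L, o, C, A)
    variants = {
        "L": dict(L=L / factor, o=o, C=C, A=A),   # L_0.1x
        "o": dict(L=L, o=o / factor, C=C, A=A),   # o_0.1x
        "C": dict(L=L, o=o, C=C * factor, A=A),   # C_10x
        "A": dict(L=L, o=o, C=C, A=A * factor),   # A_10x
    }
    return [name for name, p in variants.items()
            if speedup(g, **p) / base - 1.0 >= threshold]


# Illustrative values only.
for g in (16, 4096, 2 ** 20):
    print(g, bottlenecks(g, L=100.0, o=1000.0, C=10.0, A=20.0))
```

With these illustrative values the flagged set shifts from overhead and computational index at small granularity to acceleration alone at large granularity, the same progression the figure labels oC, oCA, A.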
Experimental Methodology
• Fixed-function and general-purpose accelerators
– Cryptographic accelerators on SPARC architectures
– Discrete and integrated GPUs
• Kernels with varying complexities
– Encryption, Hashing, Matrix Multiplication, FFT, Search, Radix Sort
• Retrospective case studies
– Cryptographic interface in SPARC architectures
– Memory interface in GPUs
Case Study I
Cryptographic Interface in the SPARC Architecture
[Figure labels: UltraSPARC T2, PCIe Crypto Accelerator, SPARC T3, SPARC T4 engine, SPARC T4 instructions]
Conclusion
• Simple models effective in predicting performance of accelerators
• Proposed a high-level performance model for hardware accelerators
• These models help programmers and architects visually identify bottlenecks and suggest optimizations
• Performance metrics for programmers in deciding the right amount of offloaded data
• Limitations include inability to model resource contention, caches, and irregular memory access patterns
Questions? Source: http://www.medarcade.com/ 19