LogCA: A High-Level Performance Model for Hardware Accelerators

"Everything should be made as simple as possible, but not simpler" (Albert Einstein)
LogCA: A High-Level Performance Model for Hardware Accelerators
Muhammad Shoaib Bin Altaf* and David A. Wood
University of Wisconsin-Madison
* Now at AMD Research, Austin, TX

Executive Summary
• Accelerators do not always perform as expected
• It is crucial for programmers and architects to understand the factors that affect performance
• Simple analytical models are beneficial early in the design stage
• Our proposal: LogCA
  – High-level performance model
  – Helps identify design bottlenecks and possible optimizations
• Validation across a variety of on-chip and off-chip accelerators
• Two retrospective case studies demonstrate the usefulness of the model

Outline
• Motivation
• LogCA
• Results
• Conclusion

Why a Model?
"An accelerator is a separate architectural substructure ... that is architected using a different set of objectives than the base processor, ... the accelerator is tuned to provide HIGHER PERFORMANCE ... than with the general-purpose base hardware"
S. Patel and W. Hwu. Accelerator Architectures. Micro 2008
Examples: M7: Next Generation SPARC, Hotchips-26, 2014; Power8, Hotchips-25, 2013

Why a Model?
Encryption algorithm on UltraSPARC T2
[Figure: Time (ms) vs. Block Size (Bytes) for Host and Accelerator; lower is better. A break-even point separates the region where the host outperforms from the region where the accelerator outperforms.]
Amdahl's Law for Accelerators

Why a Model?
Advanced Encryption Standard (AES)
[Figure: Speedup vs. Offloaded Data (Bytes) for UltraSPARC T2, SPARC T4, and GPU, with the break-even points marked; higher is better.]
Running the same kernel, accelerators can have different break-even points

The Performance Model
• Inspired by LogP [CACM 1996]
• Abstracts an accelerator using five parameters
  – L, Latency: cycles to move data
  – o, Overhead: setup cost
  – g, Granularity: size of the off-loaded data
  – C, Computational index: amount of work done per byte of data
  – A, Acceleration: speedup ignoring overheads
• A sixth parameter, γ, generalizes the model to kernels with non-linear complexity
[Figure: Host and Accelerator connected by an Interface]

The Performance Model
• Execution without an accelerator
  – $T_0(g) = C_0(g)$
• Execution with one accelerator
  – $T_1(g) = o_1(g) + L_1(g) + C_1(g)$, where $C_1(g) = \frac{C_0(g)}{A}$
[Figure: timeline comparing $T_0(g) = C_0(g)$ on the host with $T_1(g) = o_1(g) + L_1(g) + C_1(g)$ through the interface; the difference is the gain]
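The two timing equations above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slide does not give the functional forms, so the sketch assumes $C_0(g) = C \cdot g^{\gamma}$, a constant overhead $o_1(g) = o$, and a latency that is either constant ($L$) or proportional to the data size ($L \cdot g$). The function names and the numeric parameter values are hypothetical.

```python
def t0(g, C, gamma=1.0):
    """Host-only time T0(g) = C0(g); assumed here to be C * g**gamma."""
    return C * g ** gamma


def t1(g, L, o, C, A, gamma=1.0, latency_per_byte=False):
    """Accelerated time T1(g) = o1(g) + L1(g) + C1(g), with C1(g) = C0(g) / A."""
    # Assumption: latency is either a constant L or grows linearly as L * g.
    latency = L * g if latency_per_byte else L
    return o + latency + t0(g, C, gamma) / A


def speedup(g, L, o, C, A, gamma=1.0, latency_per_byte=False):
    """Speedup(g) = T0(g) / T1(g)."""
    return t0(g, C, gamma) / t1(g, L, o, C, A, gamma, latency_per_byte)


# Made-up parameter values, chosen only to show the shape of the curve.
params = dict(L=100.0, o=1000.0, C=10.0, A=20.0)
for g in (16, 256, 4096, 65536):
    print(g, round(speedup(g, **params), 3))
```

With constant latency the printed speedup rises with g and stays below A, which is the behavior the following slides analyze.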

Granularity independent latency
• Captures the effect of granularity on speedup
• Speedup bounded by acceleration
  – $\lim_{g \to \infty} \mathrm{Speedup}(g) = A$
• Overheads dominate at smaller granularities
  – $\mathrm{Speedup}(g)\big|_{g=1} = \frac{C}{o + L + \frac{C}{A}} < \frac{C}{o + L}$
[Figure: Speedup(g) vs. Granularity (Bytes), with the curve approaching A]
Amdahl's law for Accelerators
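A short worked derivation of both bounds, as a sketch. It assumes a linear kernel with $C_0(g) = C\,g$ and constant $o_1(g) = o$ and $L_1(g) = L$; these functional forms are assumptions, not stated on the slide.

```latex
\begin{align*}
\mathrm{Speedup}(g) &= \frac{T_0(g)}{T_1(g)}
  = \frac{C g}{o + L + \frac{C g}{A}}
  = \frac{A}{1 + \frac{A\,(o + L)}{C g}} \\
\lim_{g \to \infty} \mathrm{Speedup}(g) &= A
  \qquad \text{(the overhead term vanishes as } g \text{ grows)} \\
\mathrm{Speedup}(g)\Big|_{g=1} &= \frac{C}{o + L + \frac{C}{A}} < \frac{C}{o + L}
  \qquad \text{(dropping the positive term } C/A \text{ shrinks the denominator)}
\end{align*}
```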

Performance Metrics
• Right amount of off-loaded data?
• Inspired by vector machine metrics $N_{1/2}$, $N_v$
• $g_1$: Granularity for a speedup of 1
  – $g_1$ is essentially independent of acceleration
  – Identifies the complexity of the interface: a small $g_1$ indicates a simple interface, a large $g_1$ a complex interface
• $g_{A/2}$: Granularity for a speedup of $A/2$
  – Increasing A also increases $g_{A/2}$
[Figure: Speedup vs. Granularity (Bytes), marking $g_1$ at a speedup of 1 and $g_{A/2}$ at a speedup of $A/2$, below the bound A]
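For the simplest case, both metrics have closed forms, which makes the two claims above easy to check. The sketch below assumes a linear kernel ($C_0(g) = C\,g$) with constant overhead and latency, so that Speedup(g) = C g / (o + L + C g / A); solving for a speedup of 1 and of A/2 gives the two helper functions. The function names and parameter values are hypothetical.

```python
def speedup(g, L, o, C, A):
    """Speedup for a linear kernel with constant overhead o and latency L (assumed form)."""
    return C * g / (o + L + C * g / A)


def g_break_even(L, o, C, A):
    """g_1: solve C*g = o + L + C*g/A for g. Requires A > 1."""
    return A * (o + L) / (C * (A - 1.0))


def g_half_peak(L, o, C, A):
    """g_{A/2}: solve C*g = (A/2) * (o + L + C*g/A) for g."""
    return A * (o + L) / C


# Made-up values: only A changes between the two rows.
for A in (20.0, 200.0):
    p = dict(L=100.0, o=1000.0, C=10.0, A=A)
    print(A, round(g_break_even(**p), 1), round(g_half_peak(**p), 1))
```

Under these assumptions $g_1 = \frac{A}{A-1} \cdot \frac{o + L}{C}$ barely moves once A is large, while $g_{A/2} = A \cdot \frac{o + L}{C}$ grows in proportion to A.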

Granularity dependent latency
• Speedup bounded by computational intensity C/L
  – $\lim_{g \to \infty} \mathrm{Speedup}(g) < \frac{C}{L}$ (linear algorithms)
• Speedup for sub-linear algorithms asymptotically decreases as granularity increases
[Figure: two plots of Speedup(g) vs. Granularity (Bytes), one for linear and one for sub-linear algorithms, each with the bound A marked]
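A small numerical sketch of both bullets. It assumes latency grows linearly with the data size, $L_1(g) = L\,g$, a constant overhead $o$, and $C_0(g) = C\,g^{\gamma}$ with $\gamma = 1$ for a linear kernel and $\gamma = 0.5$ as an arbitrary sub-linear example. All names and values are hypothetical.

```python
def speedup(g, L, o, C, A, gamma=1.0):
    """Speedup with granularity-dependent latency L*g (assumed form)."""
    work = C * g ** gamma
    return work / (o + L * g + work / A)


params = dict(L=1.0, o=1000.0, C=50.0, A=20.0)  # made-up values, C/L = 50

# Linear kernel: speedup levels off at C / (L + C/A), which is below both C/L and A.
for g in (1e2, 1e4, 1e6, 1e9):
    print("linear    ", g, round(speedup(g, gamma=1.0, **params), 3))

# Sub-linear kernel: data movement eventually outgrows the work, so speedup falls.
for g in (1e2, 1e4, 1e6, 1e8):
    print("sub-linear", g, round(speedup(g, gamma=0.5, **params), 4))
```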

Granularity dependent latency
• Computational intensity must be greater than 1 to achieve any speedup
  – $\frac{C}{L} \geq 1$
• Computational intensity should be greater than peak performance to achieve $A/2$
  – $\frac{C}{L} \geq A$
[Figure: Speedup vs. Granularity (Bytes), marking $g_1$ and $g_{A/2}$ against the levels 1, A/2 and A]
Performance metrics help programmers early in the design cycle
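One way to see where the two thresholds come from, as a sketch. It assumes a linear kernel with $C_0(g) = C\,g$, latency $L_1(g) = L\,g$, a constant overhead $o$, and $A > 1$; the slide does not spell out these forms.

```latex
\begin{align*}
\lim_{g \to \infty} \mathrm{Speedup}(g)
  &= \lim_{g \to \infty} \frac{C g}{o + L g + \frac{C g}{A}}
   = \frac{C}{L + \frac{C}{A}} = \frac{A\,C}{A\,L + C} \\
\frac{A\,C}{A\,L + C} \geq \frac{A}{2}
  &\iff 2C \geq A\,L + C \iff \frac{C}{L} \geq A \\
\frac{A\,C}{A\,L + C} \geq 1
  &\iff C\,(A - 1) \geq A\,L \iff \frac{C}{L} \geq \frac{A}{A - 1} > 1
\end{align*}
```

Because the speedup only approaches this limit from below, these are necessary conditions on $C/L$ under the stated assumptions.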

Bottleneck Analysis using LogCA
• 10x change in a parameter → 20% performance gain
• Helps focus on performance bottlenecks
[Figure: Speedup vs. Granularity (Bytes) showing the LogCA curve together with curves for L_0.1x, o_0.1x, C_10x and A_10x, the bounds A and C/L, and regions annotated with the limiting parameters (oC, oCA, A)]
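A minimal sketch of this kind of sensitivity analysis. It reads the first bullet as a heuristic: improve one parameter by 10x (L and o reduced to 0.1x, C and A raised to 10x, matching the curve labels) and flag it as a bottleneck if speedup improves by at least 20%. That reading, the assumed model form with latency $L\,g$, and all names and values are assumptions for illustration.

```python
def speedup(g, L, o, C, A, gamma=1.0):
    """Assumed model: work C*g**gamma, constant overhead o, latency L*g."""
    work = C * g ** gamma
    return work / (o + L * g + work / A)


def bottlenecks(g, params, factor=10.0, threshold=0.20):
    """Return the parameters whose 10x improvement gains at least 20% speedup at g."""
    base = speedup(g, **params)
    flagged = []
    for name in ("L", "o", "C", "A"):
        trial = dict(params)
        # L and o are costs, so improving them means shrinking them;
        # C and A improve when they grow.
        if name in ("L", "o"):
            trial[name] = params[name] / factor
        else:
            trial[name] = params[name] * factor
        gain = speedup(g, **trial) / base - 1.0
        if gain >= threshold:
            flagged.append(name)
    return flagged


params = dict(L=1.0, o=1000.0, C=50.0, A=20.0)  # made-up values
for g in (1, 1e3, 1e6):
    print(g, bottlenecks(g, params))
```

With these made-up values the overhead and computational index matter at small granularities, while latency, computational index and acceleration matter at large ones.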

Experimental Methodology
• Fixed-function and general-purpose accelerators
  – Cryptographic accelerators on SPARC architectures
  – Discrete and integrated GPUs
• Kernels with varying complexities
  – Encryption, Hashing, Matrix Multiplication, FFT, Search, Radix Sort
• Retrospective case studies
  – Cryptographic interface in SPARC architectures
  – Memory interface in GPUs

Case Study I
Cryptographic Interface in the SPARC Architecture
[Figure labels: UltraSPARC T2, SPARC T3, SPARC T4 engine, SPARC T4 instructions, PCIe Crypto Accelerator]

Conclusion
• Simple models are effective in predicting the performance of accelerators
• Proposed a high-level performance model for hardware accelerators
• The model helps programmers and architects visually identify bottlenecks and suggests optimizations
• Performance metrics help programmers decide the right amount of offloaded data
• Limitations include the inability to model resource contention, caches, and irregular memory access patterns

Questions?
Source: http://www.medarcade.com/
