Architectural Support for Speculative Precomputation
Dean Tullsen, UCSD (on sabbatical at UPC)
Background -- Three Types of Helper Threads
- Cache Prefetching
- Branch Precomputation
- Other
- What architectural support you need/want depends on what your helper thread is doing
Speculative Precomputation Tutorial
Background – Helper Threads as Non-traditional Parallelism
- Traditional parallelism: we use extra threads/processors to offload computation. Threads divide up the execution stream.
- Helper threads: extra threads are used to speed up computation without necessarily off-loading any of the original computation.
- Primary advantage: nearly any code, no matter how inherently serial, can benefit from parallelization.
Traditional Parallelism
[diagram: threads 1-4 divide up the execution stream]
Helper Thread Parallelism
[diagram: threads 1-4; helper threads run alongside the main thread without dividing its work]
Speculative Precomputation
[diagram: a trigger instruction spawns a thread that prefetches the delinquent load, hiding the memory latency]
Other Helper Thread Models
- For a description of helper threads that do not derive their code from the original thread, see:
- Chappell, Stark, Kim, Reinhardt, Patt, "Simultaneous Subordinate Microthreading (SSMT)," ISCA 26
Helper Thread Model
[diagram: at the trigger point a thread is spawned and live-ins are copied; the helper prefetches the delinquent load, hiding the memory latency]
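The model above can be sketched as a toy simulation (hypothetical and heavily simplified; the names and addresses are illustrative): a trigger point spawns a p-slice that copies live-in values, recomputes the delinquent load's address, and touches it early so the main thread's later access hits in the shared cache.

```python
# Hypothetical sketch of the helper-thread prefetch model: a trigger
# spawns a p-slice that copies live-ins and touches the delinquent
# load's address early, turning the main thread's miss into a hit.

cache = set()  # addresses currently resident in the (shared) cache


def p_slice(live_ins):
    """Helper thread: recompute the address and prefetch it."""
    addr = live_ins["base"] + live_ins["offset"]  # from copied live-ins
    cache.add(addr)  # prefetch: bring the line in as a side effect


def main_thread():
    hits = 0
    base = 0x1000
    for i in range(4):
        # Trigger point: spawn the helper well before the load.
        p_slice({"base": base, "offset": i * 64})
        # ... main thread does other work, hiding memory latency ...
        addr = base + i * 64  # delinquent load
        hits += addr in cache
    return hits
```

In hardware the spawn, live-in copy, and prefetch all happen concurrently with the main thread; this sequential sketch only shows the data flow.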
Cache Prefetching Architectural Support – Minimum
- None. The cache is a shared structure; one thread can bring in a cache line that is needed by another as a side effect.
[diagram: helper issues Load A/Prefetch A; the main thread's later Load A hits]
Cache Prefetching Architectural Support – Useful
- fast thread spawns
- support for live-in transfer
- automatic triggering
- directed prefetches
- thread management
- thread creation!
- retention of computation
[diagram: helper-thread timeline -- trigger point, copy live-ins, prefetch, delinquent load]
Branch Precomputation Architectural Support – Minimum
- Although the branch predictor is (possibly) shared, depending on branch side effects is ineffective.
[diagram: helper and main thread both execute BEQ R1, R2, label]
Branch Precomputation Architectural Support – Minimum
- ISA support
- Outcome storage
- Correlator
- Ability to override the branch predictor
Branch Precomputation Architectural Support – Useful
- fast thread spawns
- support for live-in transfer
- automatic triggering
- thread management
- thread creation
- retention of computation
Arch Support for Other Types of Helper Threads
- Access to hardware structures
  - branch predictor, BTB
  - trace cache
  - TLB
  - caches
- Triggering
  - branch predictor, BTB
  - value profiler
  - trace cache
  - TLB
  - caches
Outline -- Architectural Support for Helper Threads
- Branch Correlation Support
- Dynamic Speculative Precomputation -- Arch Support for Creation and Management of Helper Threads
- Register Integration -- Arch Support for Reuse of Values in Helper Threads
Prediction/Branch Correlation
- This material is heavily based on:
- Zilles, Sohi, "Execution-Based Prediction Using Speculative Slices," ISCA 28
The Problem
- Even assuming the ability to match instructions in the helper thread with branch PCs in the main thread, we still must correlate dynamic instances of the helper thread's predictions with dynamic instances of the branch in the main thread.
Overall Solution
- Tagged Prediction Queue
[diagram: per-branch queues, each tagged with a PC and holding a FIFO of taken/not-taken (T/NT) predictions]
Challenges
- Re-ordering predictions produced out of order
  - allocate entries at fetch of the prediction-generating instruction
- Main Thread (MT) mis-speculation recovery
  - consume at fetch of the MT branch, free at commit
- Late predictions
  - MT must still consume empty entries, possibly establishing correlation with an in-flight prediction
- Conditionally-executed branches
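The allocate/fill/consume/free discipline above can be sketched in software (a hypothetical model, not the ISCA 28 implementation): entries are allocated in program order at fetch of the prediction-generating instruction, filled when it executes (possibly out of order), consumed at fetch of the main-thread branch even if still empty, and freed only at commit so that a squash can un-consume them.

```python
# Hypothetical sketch of one tagged prediction queue and its
# ordering rules: allocate at helper fetch, fill at helper execute,
# consume at main-thread branch fetch, free only at commit.

class TaggedPredictionQueue:
    def __init__(self, pc):
        self.pc = pc          # tag: the branch PC this queue predicts
        self.entries = []     # oldest first

    def allocate(self):
        # fetch of the prediction-generating helper instruction
        self.entries.append({"pred": None, "consumed": False})

    def fill(self, idx, taken):
        # helper instruction executes; fills may arrive out of order
        self.entries[idx]["pred"] = taken

    def consume(self):
        # main-thread branch fetch: take the oldest unconsumed entry,
        # even if it is still empty (the prediction is late/in flight)
        for e in self.entries:
            if not e["consumed"]:
                e["consumed"] = True
                return e["pred"]
        return None

    def commit(self):
        # free the oldest entry only at commit of the MT branch
        if self.entries:
            self.entries.pop(0)

    def squash(self):
        # MT mis-speculation recovery: un-consume uncommitted entries
        for e in self.entries:
            e["consumed"] = False
```

Freeing only at commit is what makes recovery cheap: a squash just clears the consumed bits and re-fetch re-correlates with the same entries.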
Conditionally Executed Branches
- Issue: helper threads typically contain no control flow (except maybe a single loop back), and thus will generate a prediction for every iteration.
[diagram: main-thread CFG with nodes A-F; the predicted branch is only reached on some paths]
Conditionally Executed Branches
- We do not want to introduce control flow into the slice to conditionally consume predictions.
- Key: since the helper thread produces a prediction every iteration, we just consume one every iteration.
- Zilles and Sohi used fetch PCs to determine when a prediction should be killed (consumed). Explicit instructions in the main thread could also be used.
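The one-per-iteration rule can be sketched as follows (a hypothetical model: one prediction is dequeued every iteration, and it is simply killed when control flow skips the branch):

```python
# Hypothetical sketch: the helper produces one prediction per loop
# iteration; the main thread consumes exactly one per iteration,
# killing it unused when control flow skips the branch.

def run(iterations, branch_executed, helper_preds):
    queue = list(helper_preds)   # one prediction per iteration
    used = []
    for i in range(iterations):
        pred = queue.pop(0)      # always consume one per iteration
        if branch_executed[i]:
            used.append(pred)    # branch fetched: prediction is used
        # else: the prediction is killed without being used
    return used
```

For example, if the branch executes only on iterations 0 and 2, `run(3, [True, False, True], ["T", "NT", "T"])` uses only the first and third predictions, and the helper and main thread stay in lockstep without any control flow in the slice.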
Conditionally Executed Branches
[diagram: CFG nodes A-F annotated with a slice kill on the path that skips the branch and a loop iteration kill on the back edge; tagged prediction queue (PC tag, T/NT entries)]
Outline -- Architectural Support for Helper Threads
- Branch Correlation Support
- Dynamic Speculative Precomputation -- Arch Support for Creation and Management of Helper Threads
- Register Integration -- Arch Support for Reuse of Values in Helper Threads
Why Create Threads in Hardware – Why Dynamic Speculative Precomputation?
- SW Speculative Precomputation provides significant speedup, but
  - requires offline program analysis
  - creates threads for a fixed number of thread contexts
  - does not target existing code
  - produces platform-specific code
- A completely hardware-based version will use dynamic program analysis via back-end instruction analyzers.
Example SMT Processor Pipeline
[diagram: per-thread PCs feed a monolithic ICache; instructions pass through register renaming into a centralized instruction queue, then to execution units, data cache, and register file, with per-thread re-order buffers]
Modified Pipeline
- √ Identify delinquent loads
- Construct P-slices
- Spawn and manage P-slices
[diagram: the SMT pipeline augmented with a Delinquent Load Identification Table (DLIT)]
Delinquent Load Identification Table
- Identifies PCs of the program's delinquent loads
- Entries are allocated to PCs of loads that missed in the L2 cache on their last execution
  - first-come, first-served
- Each entry tracks the load's average behavior
- After 128K total instructions, the load is evaluated for delinquency
- Summary: finds delinquent loads
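The DLIT policy above can be sketched as a small simulation (hypothetical; the table size and the miss-rate threshold used to judge delinquency are assumptions, not values from the tutorial):

```python
# Hypothetical sketch of the DLIT: entries are allocated first-come,
# first-served to PCs of loads that missed in the L2 on their last
# execution; each entry averages the load's behavior, and after an
# evaluation window the load is judged for delinquency.

DLIT_SIZE = 8                 # assumed number of table entries
MISS_THRESHOLD = 0.10         # assumed delinquency criterion


class DLIT:
    def __init__(self):
        self.table = {}       # pc -> [executions, l2_misses]

    def observe_load(self, pc, l2_miss):
        if pc not in self.table:
            # allocate only on an L2 miss, first-come first-served
            if not l2_miss or len(self.table) >= DLIT_SIZE:
                return
            self.table[pc] = [0, 0]
        self.table[pc][0] += 1
        self.table[pc][1] += l2_miss

    def evaluate(self):
        # called after the 128K-instruction window: report delinquent PCs
        return [pc for pc, (n, m) in self.table.items()
                if n and m / n >= MISS_THRESHOLD]
```

A load that misses often enough within the window is reported as delinquent and becomes a candidate for p-slice construction; loads that never miss never occupy an entry.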
Modified Pipeline
- √ Identify delinquent loads
- √ Construct P-slices
- Spawn and manage P-slices
[diagram: the pipeline now includes both the DLIT and a Retired Instruction Buffer (RIB)]
Retired Instruction Buffer
- Constructs p-slices to prefetch delinquent loads
- Buffers information on an in-order run of committed instructions
- Comparable to a trace cache fill unit
- FIFO structure
- RIB normally idle (> 99% of the time)
- We'll spend more time on this.
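The core of p-slice construction from the RIB is a backward dependence walk, which can be sketched as follows (a hypothetical model with an assumed instruction format of `(pc, dest_reg, src_regs)`; the real RIB buffers richer per-instruction information):

```python
# Hypothetical sketch of p-slice construction from a RIB: walk the
# FIFO of retired instructions backward from the delinquent load,
# keeping only the instructions that produce the load's (transitive)
# source registers.

def construct_p_slice(rib, delinquent_pc):
    # find the most recent retired instance of the delinquent load
    idx = max(i for i, (pc, _, _) in enumerate(rib)
              if pc == delinquent_pc)
    needed = set(rib[idx][2])          # the load's source registers
    p_slice = [rib[idx]]
    for pc, dest, srcs in reversed(rib[:idx]):
        if dest in needed:             # producer of a needed value
            needed.discard(dest)       # satisfied by this producer
            needed.update(srcs)        # now need its sources instead
            p_slice.append((pc, dest, srcs))
    p_slice.reverse()                  # emit in program order
    return p_slice                     # remaining `needed` = live-ins
```

Registers still in `needed` when the walk ends are exactly the live-ins that must be copied from the main thread when the p-slice is spawned.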
Modified Pipeline
- √ Identify delinquent loads
- √ Construct P-slices
- √ Spawn and manage P-slices
[diagram: the pipeline with the DLIT, the RIB, and a Slice Information Table (SIT)]