AXCIS: Accelerating Architectural Exploration using Canonical Instruction Segments
Rose Liu & Krste Asanović
Computer Architecture Group, MIT CSAIL

Simulation for Large Design Space Exploration
• Large design space studies explore thousands of processor designs
  • Identify those that minimize costs and maximize performance
• Speed vs. accuracy tradeoff
  • Maximize simulation speedup while maintaining sufficient accuracy to identify interesting design points for later detailed simulation
[Figure: CPI vs. area, with Pareto-optimal designs on the curve]

Reduce Simulated Instructions: Sampling
• Perform detailed microarchitectural simulation during sample points & functional warming between sample points
  • SimPoints [ASPLOS, 2002], SMARTS [ISCA, 2003]
• Use efficient checkpoint techniques to reduce simulation time to minutes
  • TurboSMARTS [SIGMETRICS, 2005], Biesbrouck [HiPEAC, 2005]
[Figure: sample points are simulated in detail]

Reduce Simulated Instructions: Statistical Simulation
• Generate a short synthetic trace (with statistical properties similar to the original workload) for simulation
  • Eeckhout [ISCA, 2004], Oskin [ISCA, 2000], Nussbaum [PACT, 2001]
[Figure: Stage 1: Program → Execution-Driven Profiling → Statistical Image; Stage 2: Synthetic Trace Generation → Synthetic Trace; Stage 3: Synthetic Trace + Config → Simulation → IPC]

AXCIS Framework
Stage 1 (performed once): Program & Inputs → Dynamic Trace Compressor → CIST (Canonical Instruction Segment Table)
• The CIST is machine independent except for branch predictor and cache organizations
• The CIST stores all information needed for performance analysis
Stage 2: CIST + Configs → AXCIS Performance Model → IPC1, IPC2, IPC3
• Configs describe in-order superscalars:
  • Issue width
  • # of functional units
  • # of cache primary-miss tags
  • Latencies
  • Branch penalty

In-Order Superscalar Machine Model
[Figure: pipeline diagram; configurable parameters are shown in parentheses]
• Fetch: Branch Pred. (size & penalty), Blocking Icache (org. & latency)
• Issue (issue width)
• Functional units: ALU, FPU, LSU, ... (latency, number of units)
• Non-blocking Dcache (organization & latency, # primary miss tags)
• Memory (latency)
• Completion

Stage 1: Dynamic Trace Compression
[Figure: AXCIS framework. Stage 1 (performed once): Program & Inputs → Dynamic Trace Compressor → CIST; Stage 2: CIST + Configs → AXCIS Performance Model → IPC1, IPC2, IPC3]

Instruction Segments
• An instruction segment captures all performance-critical information associated with a dynamic instruction

  Ins    Events (dcache, icache, bpred)
  addq   (--, hit, correct)
  ldq    (miss, hit, correct)    start of instruction segment
  subq   (--, hit, correct)
  stq    (miss, hit, correct)    defining instruction

Dynamic Trace Compression
• Program behavior repeats due to loops and repeated function calls
• Multiple different dynamic instruction segments can have the same behavior (canonically equivalent) regardless of the machine configuration
• Compress the dynamic trace by storing in a table:
  • 1 copy of each type of segment
  • How often we see it in the dynamic trace

Canonical Instruction Segment Table

The table is built one dynamic instruction at a time. Events are (dcache, icache, bpred); each segment is listed with its defining instruction last.

  #  Ins    Events                  Segment                            CIST update
  1  addq   (--, hit, correct)      Int_ALU                            new entry, Freq 1
  2  ldq    (miss, hit, correct)    Int_ALU, Load_Miss                 new entry, Freq 1
  3  addq   (--, hit, correct)      Load_Miss, Int_ALU                 new entry, Freq 1
  4  ldq    (miss, hit, correct)    Int_ALU, Load_Miss                 Freq 1 → 2
  5  subq   (--, hit, correct)      Load_Miss, Int_ALU                 Freq 1 → 2
  6  stq    (miss, hit, correct)    Load_Miss, Int_ALU, Store_Miss     new entry, Freq 1

Resulting CIST:

  Freq  Segment
  1     Int_ALU
  2     Int_ALU, Load_Miss
  2     Load_Miss, Int_ALU
  1     Load_Miss, Int_ALU, Store_Miss

Total ins: 6
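The table-building walk above can be sketched in a few lines of Python. This is a simplified illustration, not the AXCIS compressor itself. It assumes a segment is just the run of instruction types from the farthest producer to the defining instruction, and the dependence distances in `TRACE` are assumed values chosen to match the segments on the slides. The real canonical form also records cache and branch events and dependence distances.

```python
from collections import Counter

# Dynamic trace from the slides: (instruction type, assumed distance to the
# farthest producer). The distances are illustrative assumptions.
TRACE = [
    ("Int_ALU", 0),     # addq
    ("Load_Miss", 1),   # ldq
    ("Int_ALU", 1),     # addq
    ("Load_Miss", 1),   # ldq
    ("Int_ALU", 1),     # subq
    ("Store_Miss", 2),  # stq
]

def build_cist(trace):
    """Count how often each segment occurs; the defining instruction is last."""
    cist = Counter()
    for i, (_, dist) in enumerate(trace):
        start = max(0, i - dist)
        segment = tuple(kind for kind, _ in trace[start:i + 1])
        cist[segment] += 1
    return cist

cist = build_cist(TRACE)
for segment, freq in cist.items():
    print(freq, ", ".join(segment))
```

Six dynamic instructions collapse into four table entries, and the frequencies still sum to the total instruction count.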

Stage 2: AXCIS Performance Model
[Figure: AXCIS framework. Stage 1 (performed once): Program & Inputs → Dynamic Trace Compressor → CIST; Stage 2: CIST + Config → AXCIS Performance Model → IPC]

AXCIS Performance Model
• Calculates IPC using a single linear dynamic programming pass over the CIST entries
• Total work is proportional to the # of CIST entries

\[
\mathrm{IPC} = \frac{\text{Total Ins}}{\text{Total Cycles}} = \frac{\text{Total Ins}}{\text{Total Ins} + \text{Total Effective Stalls}}
\]

\[
\text{Total Effective Stalls} = \sum_{i=1}^{\text{CIST Size}} \mathrm{Freq}(i) \cdot \mathrm{EffectiveStalls}(\mathrm{DefiningIns}(i))
\]

\[
\mathrm{EffectiveStalls} = \max\bigl(\mathrm{stalls}(\text{DataHazards}),\ \mathrm{stalls}(\text{StructuralHazards}),\ \mathrm{stalls}(\text{ControlFlowHazards})\bigr)
\]
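As a rough sketch of the IPC formula, the function below takes (Freq, EffectiveStalls) pairs, one per CIST entry. The function name and the numbers in `example_entries` are illustrative assumptions, not results from the presentation.

```python
def ipc_from_cist(entries):
    """entries: iterable of (freq, effective_stalls) pairs, one per CIST entry."""
    entries = list(entries)
    total_ins = sum(freq for freq, _ in entries)
    total_stalls = sum(freq * stalls for freq, stalls in entries)
    # Total cycles = total instructions + total effective stalls.
    return total_ins / (total_ins + total_stalls)

# Hypothetical entries: stalls of -1 mean "issues with the previous instruction".
example_entries = [(1, 0), (2, 2), (2, 99), (1, -1)]
example_ipc = ipc_from_cist(example_entries)
print(f"IPC = {example_ipc:.4f}")
```

The loop visits each table entry once, so the cost depends on the CIST size rather than the length of the dynamic trace.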

Performance Model Calculations
For each defining instruction:
• Calculate its effective stalls & its corresponding microarchitecture state snapshot
• Follow dependencies to look up the effective stalls & state of other instructions in previous entries

  Freq  Segment                           Stalls (defining instruction)
  1     Int_ALU                           0
  2     Int_ALU, Load_Miss                2
  2     Load_Miss, Int_ALU                99
  1     Load_Miss, Int_ALU, Store_Miss    ???

Total ins: 6

In the last entry, the stalls (99) and state of Int_ALU are looked up in the previous segment; the stalls and state of Store_Miss (??? / ???) are calculated.

Stall Cycles From Data Hazards

Input configuration:

  Ins Type     Latency (cycles)
  Int_ALU      3
  Load_Miss    100

CIST entry being evaluated (Freq 1):

  Ins           Stalls   State
  Load_Miss     …
  Int_ALU       99
  Store_Miss    ???      ???

• Use data dependencies (e.g. RAW) to detect data hazards
• Stalls(DataHazards)
  = MAX(-1, Latency(producer = Load_Miss) - DepDist - EffectiveStalls(IntermediateIns = Int_ALU))
  = MAX(-1, 100 - 2 - 99)
  = -1 stalls (can issue with previous instruction)
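The data-hazard formula can be written as a small function. This is a hedged sketch: the function and parameter names are made up for illustration, and summing the stalls over several intermediate instructions is an assumption, since the slide shows only one intermediate instruction.

```python
def data_hazard_stalls(producer_latency, dep_dist, intermediate_stalls):
    """Stall cycles caused by a RAW dependence on one producer.

    intermediate_stalls: effective stalls of the instructions between the
    producer and the consumer (assumed to add up).
    """
    return max(-1, producer_latency - dep_dist - sum(intermediate_stalls))

# Slide example: Store_Miss depends on Load_Miss (latency 100) at distance 2,
# with Int_ALU (99 effective stalls) in between.
store_stalls = data_hazard_stalls(100, 2, [99])
print(store_stalls)
```

The floor of -1 encodes the best case for an in-order machine: the consumer issues in the same cycle as the previous instruction.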

Stall Cycles From Structural Hazards

CIST entry being evaluated (Freq 1):

  Ins           Stalls   Microarchitectural State
  Load_Miss     …
  Int_ALU       99
  Store_Miss    ???      ???

• CISTs record special dependencies to capture all possible structural hazards across all configurations
• The AXCIS performance model follows these special dependencies to find the necessary microarchitectural states to:
  1. Determine if a structural hazard exists & the number of stall cycles until it is resolved
  2. Derive the microarchitectural state after issuing the current defining instruction

Stall Cycles From Control Flow Hazards

CIST entry being evaluated (Freq 1):

  Ins           Icache   Branch Pred.
  Load_Miss     …        …
  Int_ALU       …        …
  Store_Miss    hit      correct & not taken

• Control flow events directly map to stall cycles

  Icache   Bpred                          Stalls
  Hit      Incorrect & taken/not taken    Mispred penalty
  Hit      Correct & taken                0
  Hit      Correct & not taken            -1
  Miss     Incorrect & taken/not taken    Memory latency + mispred penalty
  Miss     Correct & taken                Memory latency
  Miss     Correct & not taken            Memory latency - 1
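Because control flow events map directly to stall cycles, the table reduces to a short lookup function. The sketch below uses hypothetical function and parameter names, and the latency and penalty values in the usage lines are illustrative.

```python
def control_flow_stalls(icache_hit, bpred_correct, taken,
                        memory_latency, mispred_penalty):
    """Map (icache, bpred) events to stall cycles, following the slide's table."""
    # An icache miss adds the memory latency to every case.
    base = 0 if icache_hit else memory_latency
    if not bpred_correct:
        return base + mispred_penalty
    # Correct & taken costs nothing extra; correct & not taken can issue
    # together with the previous instruction (-1).
    return base if taken else base - 1

# Store_Miss on the slide: icache hit, prediction correct & not taken.
print(control_flow_stalls(True, True, False, memory_latency=100, mispred_penalty=5))
```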

Lossless Compression Scheme
• Lossless compression scheme (perfect accuracy):
  • Compress two segments if they always experience the same stall cycles regardless of the machine configuration
  • Impractical to implement within the Dynamic Trace Compressor
[Figure: two segments compared. One is addq (--, hit, correct), ldiq (--, hit, correct), stq (miss, hit, correct); the other is addq (--, hit, correct), stq (miss, hit, correct). Annotation: ldiq always issues with addq]

Three Compression Schemes
• Instruction Characteristics Based Compression:
  • Compress segments that "look" alike (i.e. have the same length, instruction types, dependence distances, branch and cache behaviors)
• Limit Configurations Based Compression:
  • Compress segments whose defining instructions have the same instruction types, stalls and microarchitectural state under the 2 configurations simulated during trace compression
• Relaxed Limit Configurations Based Compression:
  • Relaxed version of the limit-based scheme: does not compare microarchitectural state
  • Improves compression at the cost of accuracy
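One way to picture the three schemes is as three different equality keys over the same segments. The sketch below is an assumption-laden illustration: the `Segment` fields, the key functions, and the state labels such as "s0" are all hypothetical and only mirror the criteria named in the bullets above.

```python
from collections import Counter
from typing import NamedTuple, Tuple

class Segment(NamedTuple):
    ins_types: Tuple[str, ...]                 # defining instruction is last
    dep_dists: Tuple[int, ...]                 # dependence distances
    events: Tuple[Tuple[str, str, str], ...]   # (dcache, icache, bpred) per ins
    limit_stalls: Tuple[int, int]              # stalls under the 2 limit configs
    limit_states: Tuple[str, str]              # state labels under the 2 configs

def key_characteristics(s):
    # Same length, instruction types, dependence distances, and events.
    return (s.ins_types, s.dep_dists, s.events)

def key_limit(s):
    # Same defining instruction type, stalls, and state under both configs.
    return (s.ins_types[-1], s.limit_stalls, s.limit_states)

def key_relaxed(s):
    # Like key_limit, but the microarchitectural state is not compared.
    return (s.ins_types[-1], s.limit_stalls)

def compress(segments, key):
    """Return a frequency table with one entry per distinct key."""
    return Counter(key(s) for s in segments)

ALU = ("--", "hit", "correct")
LDM = ("miss", "hit", "correct")
segments = [
    Segment(("Int_ALU", "Load_Miss"), (1,), (ALU, LDM), (2, 0), ("s0", "s1")),
    Segment(("Load_Miss", "Load_Miss"), (1,), (LDM, LDM), (2, 0), ("s0", "s2")),
    Segment(("Load_Miss", "Int_ALU", "Load_Miss"), (2,), (LDM, ALU, LDM),
            (2, 0), ("s0", "s1")),
]
sizes = {k.__name__: len(compress(segments, k))
         for k in (key_characteristics, key_limit, key_relaxed)}
print(sizes)
```

With these made-up segments, each looser key yields a smaller table, which is the compression-versus-accuracy trade described in the last bullet.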
