Energy-efficient & High-performance Instruction Fetch using a Block-aware ISA
Ahmad Zmily and Christos Kozyrakis
Electrical Engineering Department, Stanford University
Motivation
- Processor front-end engine
  - Performs control-flow prediction & instruction fetch
  - Sets the upper limit for performance: cannot execute faster than you can fetch
- However, energy efficiency is also important
  - Dense servers
  - Same processor core in server and notebook chips
  - Environmental concerns
- Focus of this paper: can we build front-ends that achieve both goals?
Ahmad Zmily, ISLPED'05
The Problem
- Front-end detractors
  - Instruction cache misses
  - Multi-cycle instruction cache accesses
  - Control-flow mispredictions & pipeline flushing
- The cost for a 4-way superscalar processor
  - 48% performance loss
  - 21% increase in total energy consumption
[Chart: performance and energy loss (0-50%) for an imperfect predictor, an imperfect I-cache, and both combined]
BLISS
- A block-aware instruction set architecture
  - Decouples control-flow prediction from instruction fetching
  - Allows software to help with hardware challenges
- Talk outline
  - BLISS overview: instruction set and front-end microarchitecture
  - BLISS opportunities: performance optimizations, energy optimizations
  - Experimental results: 14% performance improvement, 16% total energy improvement
  - Conclusions
BLISS Instruction Set
[Diagram: text segment of a conventional ISA (instructions only) vs. the BLISS ISA (block descriptors + instructions)]
- Explicit basic block descriptors (BBDs)
  - Stored separately from instructions in the text segment
  - Describe control flow and identify the associated instructions
- Execution model
  - PC always points to a BBD, not to instructions
  - Atomic execution of basic blocks
32-bit Descriptor Format
- Type: type of terminating branch
  - Fall-through, jump, jump register, forward/backward branch, call, return, ...
- Offset: displacement for PC-relative branches and jumps
  - Offset to the target descriptor
- Length: number of instructions in the basic block
  - 0 to 15 instructions
  - Longer basic blocks use multiple descriptors
- Instruction pointer: address of the first instruction in the block
  - Remaining bits come from the TLB
- Hints: optional compiler-generated hints
  - This study: branch hints for biased taken/not-taken branches
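The descriptor layout above can be sketched as a bit-packing exercise. The talk only fixes the total width (32 bits) and the length range (0-15 instructions); the individual field widths below (4-bit type, 2-bit hints, 8-bit offset, 4-bit length, 14-bit instruction pointer) are assumptions for illustration, not the paper's encoding.

```python
# Pack/unpack a 32-bit BLISS basic-block descriptor.
# Field widths are illustrative assumptions; only the 32-bit total
# and the 4-bit length field (0-15 instructions) come from the talk.
TYPE_BITS, HINT_BITS, OFF_BITS, LEN_BITS, PTR_BITS = 4, 2, 8, 4, 14
assert TYPE_BITS + HINT_BITS + OFF_BITS + LEN_BITS + PTR_BITS == 32

def pack_bbd(btype, hints, offset, length, iptr):
    assert 0 <= length <= 15          # longer blocks use multiple BBDs
    word = btype
    for field, bits in ((hints, HINT_BITS), (offset, OFF_BITS),
                        (length, LEN_BITS), (iptr, PTR_BITS)):
        word = (word << bits) | (field & ((1 << bits) - 1))
    return word

def unpack_bbd(word):
    fields = []
    for bits in (PTR_BITS, LEN_BITS, OFF_BITS, HINT_BITS, TYPE_BITS):
        fields.append(word & ((1 << bits) - 1))
        word >>= bits
    iptr, length, offset, hints, btype = fields
    return btype, hints, offset, length, iptr
```

The low-order instruction-pointer bits name the block's first instruction; as the slide notes, the remaining address bits come from the TLB.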
BLISS Code Example
- Example program in C source code:
      numeqz = 0;
      for (i = 0; i < N; i++)
        if (a[i] == 0) numeqz++;
        else foo();
- Counts the number of zeros in array a
- Calls foo() for each non-zero element
BLISS Code Example (continued)
- Conventional MIPS assembly:
          addu  r4,r0,r0
      L1: lw    r6,0(r1)
          bneqz r6,L2
          addui r4,r4,1
          j     L3
      L2: jal   FOO
      L3: addui r1,r1,4
          bneq  r1,r2,L1
- BLISS block descriptors (type, target, length):
      BBD1: FT , ---- , 1
      BBD2: B_F, BBD4 , 2
      BBD3: J  , BBD5 , 1
      BBD4: JAL, FOO  , 0
      BBD5: B_B, ---- , 2
- All jump instructions are redundant
- Several branches can be folded into arithmetic instructions
  - The branch offset is encoded in the descriptors
BLISS Decoupled Front-End
[Diagram: PC -> BB-cache -> basic-block queue (BBQ) -> I-cache -> Decode, with I-cache prefetch launched from the BBQ on a miss]
- Basic-block descriptor cache (BB-cache) replaces the BTB
- Basic-block queue (BBQ) decouples prediction from the instruction cache
- One extra pipeline stage to access the BB-cache
BLISS Decoupled Front-End
- BB-cache hit
  - Push the descriptor & predicted target into the BBQ; instructions are fetched and executed later (decoupling)
  - Continue fetching from the predicted BBD address
BLISS Decoupled Front-End
- BB-cache miss
  - Wait for a refill from the L2 cache; calculate the 32-bit instruction pointer & target on refill
  - The back-end only stalls when the BBQ and IQ are drained
BLISS Decoupled Front-End
- Control-flow misprediction
  - Flush the pipeline, including the BBQ and IQ
  - Restart from the correct BBD address
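The decoupling described on these slides can be sketched as a toy timing model: a prediction engine pushes one descriptor per cycle into a bounded BBQ, while the fetch engine drains it at the I-cache's pace. The queue depth, the in-order single-port timing, and the function names are illustrative assumptions, not the paper's simulator.

```python
from collections import deque

BBQ_DEPTH = 8   # assumed queue depth for this sketch

def run_front_end(blocks, icache_latency):
    """blocks: list of (bbd_addr, length); returns cycles to fetch all.

    Prediction runs ahead, filling the BBQ, so a multi-cycle I-cache
    slows fetch without stalling prediction (the decoupling above).
    """
    bbq = deque()
    cycles = fetched = i = 0
    busy = 0                        # cycles left on current I-cache access
    while fetched < len(blocks):
        # Prediction engine: one BBD per cycle while the BBQ has room.
        if i < len(blocks) and len(bbq) < BBQ_DEPTH:
            bbq.append(blocks[i]); i += 1
        # Fetch engine: start a new I-cache access when idle.
        if busy == 0 and bbq:
            bbq.popleft(); busy = icache_latency
        if busy:
            busy -= 1
            if busy == 0:
                fetched += 1
        cycles += 1
    return cycles
```

With a 1-cycle I-cache the model retires one block per cycle; with a 3-cycle I-cache the prediction engine simply runs ahead and the BBQ absorbs the gap, which is the behavior the hit/miss/misprediction slides describe.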
Performance Optimizations (1)
- The I-cache is not on the critical path for speculation
  - BBDs provide branch types and offsets
  - A multi-cycle I-cache does not affect prediction accuracy
  - The BBQ decouples prediction from instruction fetching: latency is only visible on mispredictions
- I-cache misses can be tolerated
  - The BBQ provides an early view into the instruction stream
  - Guided instruction prefetch
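The guided prefetch idea can be sketched as follows: since each queued descriptor names its first instruction and length, the exact I-cache lines the fetch engine will need are known in advance. The line size, instruction width, and helper names are assumptions for this example.

```python
LINE_BYTES = 32     # assumed I-cache line size
INSN_BYTES = 4      # assumed fixed instruction width (MIPS-like)

def lines_for_block(iptr, length):
    """Cache lines touched by a basic block of `length` instructions."""
    first = iptr // LINE_BYTES
    last = (iptr + max(length, 1) * INSN_BYTES - 1) // LINE_BYTES
    return range(first, last + 1)

def prefetch_from_bbq(bbq, resident):
    """Lines to prefetch for queued blocks not already in the I-cache.

    bbq: iterable of (iptr, length); resident: set of cached line ids.
    This is the BBQ's "early view" of the instruction stream.
    """
    wanted = []
    for iptr, length in bbq:
        for line in lines_for_block(iptr, length):
            if line not in resident and line not in wanted:
                wanted.append(line)
    return wanted
```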
Performance Optimizations (2)
- Judicious use and training of the predictor
  - All PCs refer to basic-block boundaries
  - No predictor access for fall-through or jump blocks
  - Selective use of the hybrid predictor for different types of blocks (if branch hints are used)
- Better target prediction
  - No cold misses for PC-relative branch targets
  - 36% fewer pipeline flushes with BLISS
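The selective-predictor policy above reduces to a small decision rule: the descriptor's type (and an optional bias hint) determines whether the hybrid direction predictor is consulted at all. The type names and hint encoding below are assumptions for this sketch.

```python
# Block types whose outcome needs no direction prediction: either the
# next block is fixed (fall-through, jump, call, return) or only a
# target prediction is involved (jump register).
NO_DIRECTION = {"fall_through", "jump", "jump_reg", "call", "return"}

def needs_direction_predictor(bbd_type, hint=None):
    """Decide whether to access (and train) the hybrid predictor."""
    if bbd_type in NO_DIRECTION:
        return False              # outcome known from the BBD itself
    if hint in ("biased_taken", "biased_not_taken"):
        return False              # compiler hint replaces the lookup
    return True                   # conditional branch: consult predictor
```

Skipping the lookup for non-branch blocks both saves predictor energy and avoids polluting the predictor's training, which is what the slide means by "judicious use and training."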
Front-End Energy Optimizations (1)
- Access only the necessary words in the I-cache
  - The length of each basic block is known
  - Use segmented word-lines
- Serial access of tags and data in the I-cache
  - Reduces the energy of an associative I-cache
- Single data-block read
  - The increase in latency is tolerated by decoupling
- Merged I-cache accesses
  - For blocks in the BBQ that access the same cache lines
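The merging optimization on this slide can be sketched by counting line reads: when consecutive queued blocks touch the same I-cache line, one access serves them all. The 32-byte lines and 4-byte instructions are illustrative assumptions.

```python
LINE_BYTES = 32     # assumed I-cache line size
INSN_BYTES = 4      # assumed fixed instruction width

def merged_accesses(bbq):
    """Count I-cache line reads needed for the queued blocks,
    merging consecutive blocks that fall in the same line."""
    accesses = []
    for iptr, length in bbq:
        start = iptr // LINE_BYTES
        end = (iptr + max(length, 1) * INSN_BYTES - 1) // LINE_BYTES
        for line in range(start, end + 1):
            if not accesses or accesses[-1] != line:
                accesses.append(line)   # new line read
            # else: merged with the preceding access, no extra read
    return len(accesses)
```

For example, two short blocks at byte offsets 0x00 and 0x08 share one 32-byte line, so a third block in the next line makes two reads total rather than three.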
Front-End Energy Optimizations (2)
- Judicious use and training of the predictor
  - All PCs refer to basic-block boundaries
  - No predictor access for fall-through or jump blocks
  - Selective use of the hybrid predictor for different types of blocks (if branch hints are used)
- Energy saved on mispredicted instructions
  - Due to better target and direction prediction
  - The saving is across the whole processor pipeline: 15% of energy is wasted on mispredicted instructions
Evaluation Methodology
- 4-way superscalar processor
  - Out-of-order execution, two-level cache hierarchy
  - Simulated with the Simplescalar & Wattch toolsets
  - SPEC CPU2000 benchmarks with reference datasets
- Comparison: fetch-target-buffer architecture (FTB) [Reinman et al.]
  - Similar to BLISS but a pure hardware implementation
  - Hardware creates and caches block and hyperblock descriptors
  - Similar performance and energy optimizations applied
- BLISS code generation
  - Binary translation from MIPS executables
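The binary-translation step can be sketched as a single pass over the decoded instruction stream: cut a block at every control-flow instruction or at the 15-instruction length limit, and emit one descriptor per block. This is a simplification (real translation must also cut blocks at branch targets and fix up offsets), and the instruction representation is an assumption.

```python
# Control-flow opcodes that terminate a basic block (MIPS-like names).
BRANCH_OPS = {"j", "jal", "jr", "beq", "bne", "bneq", "bneqz", "ret"}
MAX_LEN = 15   # the descriptor's 4-bit length field caps blocks at 15

def build_bbds(insns):
    """insns: list of (addr, opcode) pairs in program order.
    Returns one (type, first_addr, length) tuple per basic block."""
    bbds, start, count = [], None, 0
    for addr, op in insns:
        if start is None:
            start = addr
        count += 1
        if op in BRANCH_OPS or count == MAX_LEN:
            btype = op if op in BRANCH_OPS else "fall_through"
            bbds.append((btype, start, count))
            start, count = None, 0
    if count:
        bbds.append(("fall_through", start, count))
    return bbds
```

On the loop body from the earlier code-example slide, this pass would group the load and its conditional branch into one descriptor and start a new block after each jump, mirroring the BBD1-BBD5 structure shown there.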
Front-End Parameters

                      Base                  FTB                   BLISS
    Fetch width       4 instructions        1 fetch block         1 basic block
    Target predictor  BTB: 1K entries,      FTB: 1K entries,      BB-cache: 1K entries,
                      4-way, 1-cycle        4-way, 1-cycle        4-way, 1-cycle,
                      access                access                8 entries per line
    Decoupling queue  --                    8 entries             8 entries
    I-cache latency   2-cycle pipelined     3-cycle pipelined     3-cycle pipelined

- The BTB, FTB, and BB-cache have exactly the same capacity
Performance
[Chart: % IPC improvement over base for FTB, BLISS, and BLISS-Hints on gzip, vortex, twolf, mesa, equake, and the average; BLISS-Hints peaks at 38%]
- Consistent performance advantage for BLISS
  - 14% average improvement over base
  - 9% average improvement over FTB
- Sources of performance improvement
  - 36% reduction in pipeline flushes compared to base
  - 10% reduction in I-cache misses due to prefetching
FTB vs. BLISS
[Chart: fetch IPC vs. commit IPC for FTB and BLISS on gzip, vortex, twolf, mesa, equake, and the average]
- FTB achieves higher fetch IPC
  - Optimistic, large blocks are needed to facilitate block creation
  - But they lead to overspeculation & predictor interference: bad for performance and energy
- BLISS achieves higher commit IPC
  - Blocks are defined by software
  - Always available in L2 on a miss; no need to recreate them
  - But no hyperblocks: suboptimal for only 1 SPEC benchmark (vortex)
Front-End Energy
[Chart: % front-end energy savings for FTB, BLISS, and BLISS-Hints on gzip, vortex, twolf, mesa, equake, and the average]
- 65% energy reduction in the front-end
  - 40% in the instruction cache
  - 12% in the predictors
  - 13% in the BTB/BB-cache
- Approximately 13% of total chip energy is in the front-end
  - The I-cache, predictors, and BTB are big SRAMs
Total Chip Energy
[Chart: % total energy savings for FTB, BLISS, and BLISS-Hints on gzip, vortex, twolf, mesa, equake, and the average; BLISS-Hints peaks at 32%]
- Total energy = front-end + back-end + all caches
- BLISS leads to 16% total energy savings over base
  - Front-end savings + savings from fewer mispredictions
  - FTB leads to 9% savings
- ED2P (energy-delay-squared product) comparison, appropriate for high-end chips
  - BLISS offers an 83% improvement over base
  - FTB is limited to a 35% improvement