SLIDE 1

ProfileMe: Hardware-Support for Instruction-Level Profiling on Out-of-Order Processors

Jeffrey Dean, Jamey Hicks, Carl Waldspurger, William Weihl, George Chrysos

Digital Equipment Corporation

SLIDE 2

Motivation

  • Identify Performance Bottlenecks

– especially unpredictable dynamic stalls, e.g. cache misses, branch mispredicts
– complex out-of-order processors make this difficult

  • Guide Optimizations

– help programmers understand and improve code
– automatic, profile-driven optimizations

  • Profile Production Workloads

– low overhead
– transparent
– profile whole system

SLIDE 3

Outline

  • Obtaining Instruction-Level Information
  • ProfileMe

– sample instructions, not events
– sample interactions via paired sampling

  • Potential Applications of Profile Data

  • Future Work
  • Conclusions
SLIDE 4

Existing Instruction-Level Sampling

  • Use Hardware Event Counters

– small set of software-loadable counters
– each counts a single event at a time, e.g. dcache miss
– counter overflow generates interrupt

  • Advantages

– low overhead vs. simulation and instrumentation
– transparent vs. instrumentation
– complete coverage, e.g. kernel, shared libs, etc.

  • Effective on In-Order Processors

– analysis computes execution frequency
– heuristics identify possible reasons for stalls
– example: DIGITAL’s Continuous Profiling Infrastructure

SLIDE 5

Problems with Event-Based Counters

  • Can’t Simultaneously Monitor All Events
  • Limited Information About Events

– “event has occurred”, but no additional context, e.g. cache miss latencies, recent execution path, ...

  • Blind Spots in Non-Interruptible Code
  • Key Problem: Imprecise Attribution

– interrupt delivers restart PC, not the PC that caused the event
– problem worse on out-of-order processors

SLIDE 6

Problem: Imprecise Attribution

  • Experiment

– monitor data loads
– loop: single load + hundreds of nops

  • In-Order Processor

– Alpha 21164
– skew; large peak
– analysis plausible

  • Out-of-Order Processor

– Intel Pentium Pro
– skew and smear
– analysis hopeless

[Chart: sample count vs. instruction position for the load loop; on the 21164 the load’s samples form one large peak (782 samples) skewed from the load, while on the Pentium Pro samples smear across many instructions.]
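The skew-and-smear effect can be illustrated with a toy simulation (this is an illustrative model, not the paper’s measurement): assume the interrupt for a load event delivers a restart PC displaced from the load by a fixed pipeline skew, plus a random displacement on an out-of-order machine. All parameter values here are made up.

```python
import random
from collections import Counter

def sample_restart_pcs(n_samples, loop_len=100, load_pc=0, skew=6, smear=0):
    """Toy model of event-counter attribution: each interrupt reports a
    restart PC displaced from the load's true PC.
    skew  - fixed displacement (pipeline drain on an in-order machine)
    smear - extra random displacement (out-of-order retirement)
    """
    counts = Counter()
    for _ in range(n_samples):
        offset = skew + random.randint(0, smear)
        counts[(load_pc + offset) % loop_len] += 1
    return counts

random.seed(0)
in_order = sample_restart_pcs(1000, skew=6, smear=0)       # one sharp, skewed peak
out_of_order = sample_restart_pcs(1000, skew=6, smear=40)  # smeared over many PCs
```

With smear disabled, every sample lands on a single (wrong) PC; with smear enabled, the same 1000 samples scatter over dozens of PCs, which is why the paper calls per-event analysis on out-of-order machines hopeless.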

SLIDE 7

Outline

  • Obtaining Instruction-Level Information
  • ProfileMe

– sample instructions, not events
– sample interactions via paired sampling

  • Potential Applications of Profile Data

  • Future Work
  • Conclusions
SLIDE 8

ProfileMe: Instruction-Centric Profiling

[Diagram: pipeline stages fetch → map → issue → exec → retire. A fetch counter overflow tags an instruction (“tag!”); as it flows past the icache, branch predictor, arithmetic units, and dcache, internal processor registers capture its PC, effective address, icache/dcache miss flags, branch mispredict flag, branch history, stage latencies, and done/retired status (“capture!”); an interrupt then delivers the record to software.]

SLIDE 9

Instruction-Level Statistics

  • PC + Retire Status → execution frequency
  • PC + Cache Miss Flag → cache miss rates
  • PC + Branch Mispredict → mispredict rates
  • PC + Event Flag → event rates
  • PC + Branch Direction → edge frequencies
  • PC + Branch History → path execution rates
  • PC + Latency → instruction stalls
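As a rough sketch of how such samples might be reduced to per-PC statistics (the record fields below are illustrative, not the actual ProfileMe register layout):

```python
from collections import defaultdict

def aggregate(samples):
    """Aggregate ProfileMe-style instruction samples into per-PC counts.
    Each sample is a dict of flags captured for one profiled instruction;
    field names here are assumptions for illustration."""
    stats = defaultdict(lambda: {"samples": 0, "retired": 0,
                                 "dcache_misses": 0, "mispredicts": 0})
    for s in samples:
        st = stats[s["pc"]]
        st["samples"] += 1
        st["retired"] += s["retired"]
        st["dcache_misses"] += s["dcache_miss"]
        st["mispredicts"] += s["mispredict"]
    return stats

samples = [
    {"pc": 0x1000, "retired": 1, "dcache_miss": 1, "mispredict": 0},
    {"pc": 0x1000, "retired": 1, "dcache_miss": 0, "mispredict": 0},
    {"pc": 0x1004, "retired": 0, "dcache_miss": 0, "mispredict": 1},
]
stats = aggregate(samples)
miss_rate = stats[0x1000]["dcache_misses"] / stats[0x1000]["samples"]
```

Because instructions are sampled at random, each per-PC ratio (miss rate, mispredict rate, retire rate) is a statistical estimate of the true rate.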
SLIDE 10

Example: Retire Count Convergence

[Chart: Estimate / Actual ratio vs. Number of Retired Samples (N); the estimate converges toward 1, with accuracy ∝ 1/√N.]
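The 1/√N behavior can be checked with a small Monte Carlo sketch (parameter values are illustrative): estimate the frequency of a Bernoulli event from N random samples and watch the average error roughly halve as N quadruples.

```python
import random

def estimate_error(p, n, trials=2000, seed=1):
    """Average absolute error of a frequency estimate built from n
    random samples of a Bernoulli(p) event."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        hits = sum(rng.random() < p for _ in range(n))
        total += abs(hits / n - p)
    return total / trials

# Quadrupling the sample count should roughly halve the error (1/sqrt(N)).
e_250 = estimate_error(0.3, 250)
e_1000 = estimate_error(0.3, 1000)
```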

SLIDE 11

Identifying True Bottlenecks

  • ProfileMe: Detailed Data for Single Instruction
  • In-Order Processors

– ProfileMe PC + latency data identifies stalls
– stalled instructions back up the pipeline

  • Out-of-Order Processors

– explicitly designed to mask stall latency, e.g. dynamic reordering, speculative execution
– stall does not necessarily imply bottleneck

  • Example: Does This Stall Matter?

load r1, …
add  …, r1, …        (average latency: 35.0 cycles)
… other instructions …

SLIDE 12

Issue: Need to Measure Concurrency

  • Interesting Concurrency Metrics

– retired instructions per cycle
– issue slots wasted while an instruction is in flight
– pipeline stage utilization

How to Measure Concurrency?

  • Special-Purpose Hardware

– some metrics difficult to measure, e.g. need retire/abort status

  • Sample Potentially-Concurrent Instructions

– aggregate info from pairs of samples
– statistically estimate metrics

SLIDE 13

Paired Sampling

  • Sample Two Instructions

– may be in-flight simultaneously
– replicate ProfileMe hardware, add intra-pair distance

  • Nested Sampling

– sample window around first profiled instruction
– randomly select second profiled instruction
– statistically estimate frequency for F(first, second)

[Diagram: instruction stream over time; the second sample is chosen within a window of −W to +W instructions around the first, so the pair may or may not overlap in flight.]
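A minimal sketch of nested paired sampling, under a toy pipeline model (one instruction issues per cycle and stays in flight a fixed number of cycles; all names and numbers are assumptions, not the hardware’s behavior): pick the first instruction at random, pick the second uniformly within ±W, and estimate the overlap frequency F(first, second).

```python
import random

def paired_samples(n_pairs, W, issue, retire, n_instrs, seed=2):
    """Nested paired sampling sketch: the first profiled instruction i is
    chosen at random; the second, j, is chosen uniformly within +/- W
    instructions of i. Returns the estimated probability that the two
    were in flight together. issue/retire map instruction index -> cycle."""
    rng = random.Random(seed)
    overlaps = 0
    for _ in range(n_pairs):
        i = rng.randrange(W, n_instrs - W)
        j = i + rng.choice([d for d in range(-W, W + 1) if d != 0])
        # Two instructions overlap if their [issue, retire) intervals intersect.
        if issue[i] < retire[j] and issue[j] < retire[i]:
            overlaps += 1
    return overlaps / n_pairs

# Toy pipeline: one issue per cycle, each instruction in flight 4 cycles.
n = 1000
issue = {k: k for k in range(n)}
retire = {k: k + 4 for k in range(n)}
p_overlap = paired_samples(10000, W=8, issue=issue, retire=retire, n_instrs=n)
```

In this toy model only pairs within 3 instructions of each other overlap, so the estimate converges to 6/16 = 0.375; real hardware would report overlap directly from the replicated ProfileMe registers.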

SLIDE 14

Other Uses of Paired Sampling

  • Path Profiling

– two PCs close in time can identify execution path
– identify control flow, e.g. indirect branches, calls, traps

  • Direct Latency Measurements

– data load-to-use
– loop iteration cost

SLIDE 15

Outline

  • Obtaining Instruction-Level Information
  • ProfileMe

– sample instructions, not events
– sample interactions via paired sampling

  • Potential Applications of Profile Data

  • Future Work
  • Conclusions
SLIDE 16

Exploiting Profile Data

  • Latencies and Concurrency

– identify and understand bottlenecks
– improved scheduling, code generation

  • Cache Miss Data

– code stream rearrangement
– guide prefetching, instruction scheduling

  • Miss Addresses

– inform OS page mapping policies
– data reorganization

  • Branch History, PC Pairs

– identify common execution paths
– trace scheduling

SLIDE 17

Example: Path Profiles

  • Experiment

– intra-procedural path reconstruction
– control-flow merges
– SPECint95 data

  • Execution Counts

– most likely path based on frequency

  • History Bits

– path consistent with global branch history

  • History + Pairs

– path must contain both PCs in pair

[Chart: % Single Correct Path vs. Path Length (1–11 branches) for the three reconstruction methods.]
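A toy illustration of history-based path reconstruction (not the paper’s algorithm; the CFG and block names are invented): follow the sampled global branch history bits from every block and keep the paths that end at the sampled PC.

```python
def paths_matching_history(cfg, end, history):
    """Candidate control-flow paths consistent with a sampled PC and its
    global branch history bits (oldest bit first). cfg maps basic block
    -> (not_taken_successor, taken_successor)."""
    candidates = []
    for start in cfg:
        block, path = start, [start]
        for bit in history:          # replay each recorded branch direction
            block = cfg[block][bit]
            path.append(block)
        if block == end:             # keep paths ending at the sampled PC
            candidates.append(path)
    return candidates

# Diamond CFG: A branches to B (not taken) or C (taken); both rejoin at D,
# which loops back to A (taken) or exits to E.
cfg = {"A": ("B", "C"), "B": ("D", "D"), "C": ("D", "D"),
       "D": ("E", "A"), "E": ("E", "E")}
paths = paths_matching_history(cfg, end="D", history=(1, 0))
```

Here a two-bit history uniquely determines the path A → C → D; adding a second sampled PC from paired sampling prunes the candidate set further when history alone is ambiguous.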

SLIDE 18

Future Work

  • Analyze Production Systems
  • Develop New Analyses for ProfileMe Data

– “cluster” samples using events, branch history
– reconstruct frequently-occurring pipeline states

  • Explore Automatic Optimizations

– better scheduling, prefetching, code and data layout
– inform OS policies

  • ProfileMe for Memory System Transactions

– can sample memory behavior not visible from the processor
– sample cache sharing and interference

SLIDE 19

Related Work

  • Westcott & White (IBM Patent)

– collects latency and some event info for instructions
– only for retired instructions
– only when an instruction is assigned a particular inum, which can introduce bias into samples


  • Specialized Hardware Mechanisms

– CML Buffer (Bershad et al.): locations of frequent misses
– Informing Loads (Horowitz et al.): status bit to allow SW to react to cache misses
– can often obtain similar info by analyzing ProfileMe data

SLIDE 20

Conclusions

  • ProfileMe: “Sample Instructions, Not Events”

– provides wealth of instruction-level information
– paired sampling reveals dynamic interactions
– modest hardware cost
– useful for in-order processors, essential for out-of-order

  • Improvements Over Imprecise Event Counters

– precise attribution
– no blind spots
– improved event collection, e.g. branch history, concurrency, correlated events

SLIDE 21

Further Information

DIGITAL’s Continuous Profiling Infrastructure project: http://www.research.digital.com/SRC/dcpi
