SLIDE 1

ProfileMe: Hardware-Support for Instruction-Level Profiling on Out-of-Order Processors

Jeffrey Dean, Jamey Hicks, Carl Waldspurger, William Weihl, George Chrysos

Digital Equipment Corporation

SLIDE 2

Motivation

  • Identify Performance Bottlenecks

– especially unpredictable dynamic stalls, e.g. cache misses, branch mispredicts
– complex out-of-order processors make this difficult

  • Guide Optimizations

– help programmers understand and improve code
– automatic, profile-driven optimizations

  • Profile Production Workloads

– low overhead
– transparent
– profile whole system

SLIDE 3

Outline

  • Obtaining Instruction-Level Information
  • ProfileMe

– sample instructions, not events
– sample interactions via paired sampling

  • Potential Applications of Profile Data

  • Future Work
  • Conclusions
SLIDE 4

Existing Instruction-Level Sampling

  • Use Hardware Event Counters

– small set of software-loadable counters
– each counts a single event at a time, e.g. dcache miss
– counter overflow generates interrupt

  • Advantages

– low overhead vs. simulation and instrumentation
– transparent vs. instrumentation
– complete coverage, e.g. kernel, shared libs, etc.

  • Effective on In-Order Processors

– analysis computes execution frequency
– heuristics identify possible reasons for stalls
– example: DIGITAL’s Continuous Profiling Infrastructure

SLIDE 5

Problems with Event-Based Counters

  • Can’t Simultaneously Monitor All Events
  • Limited Information About Events

– “event has occurred”, but no additional context, e.g. cache miss latencies, recent execution path, ...

  • Blind Spots in Non-Interruptible Code
  • Key Problem: Imprecise Attribution

– interrupt delivers restart PC, not the PC that caused the event
– problem worse on out-of-order processors

SLIDE 6

Problem: Imprecise Attribution

  • Experiment

– monitor data loads
– loop: single load + hundreds of nops

  • In-Order Processor

– Alpha 21164
– skew; large peak
– analysis plausible

  • Out-of-Order Processor

– Intel Pentium Pro
– skew and smear
– analysis hopeless

[Chart: sample count vs. instruction position for the load loop; on the 21164 the load’s samples form one large peak (782 samples) skewed from the load, while on the Pentium Pro samples smear across many instructions.]
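The skew-and-smear effect can be illustrated with a toy simulation (this is an illustrative model, not the paper’s measurement): assume the interrupt for a load event delivers a restart PC displaced from the load by a fixed pipeline skew, plus a random displacement on an out-of-order machine. All parameter values here are made up.

```python
import random
from collections import Counter

def sample_restart_pcs(n_samples, loop_len=100, load_pc=0, skew=6, smear=0):
    """Toy model of event-counter attribution: each interrupt reports a
    restart PC displaced from the load's true PC.
    skew  - fixed displacement (pipeline drain on an in-order machine)
    smear - extra random displacement (out-of-order retirement)
    """
    counts = Counter()
    for _ in range(n_samples):
        offset = skew + random.randint(0, smear)
        counts[(load_pc + offset) % loop_len] += 1
    return counts

random.seed(0)
in_order = sample_restart_pcs(1000, skew=6, smear=0)       # one sharp, skewed peak
out_of_order = sample_restart_pcs(1000, skew=6, smear=40)  # smeared over many PCs
```

With smear disabled, every sample lands on a single (wrong) PC; with smear enabled, the same 1000 samples scatter over dozens of PCs, which is why the paper calls per-event analysis on out-of-order machines hopeless.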

SLIDE 7

Outline

  • Obtaining Instruction-Level Information
  • ProfileMe

– sample instructions, not events
– sample interactions via paired sampling

  • Potential Applications of Profile Data

  • Future Work
  • Conclusions
SLIDE 8

ProfileMe: Instruction-Centric Profiling

[Diagram: pipeline stages fetch → map → issue → exec → retire. A fetch counter overflow tags an instruction (“tag!”); as it flows past the icache, branch predictor, arithmetic units, and dcache, internal processor registers capture its PC, effective address, icache/dcache miss flags, branch mispredict flag, branch history, stage latencies, and done/retired status (“capture!”); an interrupt then delivers the record to software.]

SLIDE 9

Instruction-Level Statistics

  • PC + Retire Status → execution frequency
  • PC + Cache Miss Flag → cache miss rates
  • PC + Branch Mispredict → mispredict rates
  • PC + Event Flag → event rates
  • PC + Branch Direction → edge frequencies
  • PC + Branch History → path execution rates
  • PC + Latency → instruction stalls
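As a rough sketch of how such samples might be reduced to per-PC statistics (the record fields below are illustrative, not the actual ProfileMe register layout):

```python
from collections import defaultdict

def aggregate(samples):
    """Aggregate ProfileMe-style instruction samples into per-PC counts.
    Each sample is a dict of flags captured for one profiled instruction;
    field names here are assumptions for illustration."""
    stats = defaultdict(lambda: {"samples": 0, "retired": 0,
                                 "dcache_misses": 0, "mispredicts": 0})
    for s in samples:
        st = stats[s["pc"]]
        st["samples"] += 1
        st["retired"] += s["retired"]
        st["dcache_misses"] += s["dcache_miss"]
        st["mispredicts"] += s["mispredict"]
    return stats

samples = [
    {"pc": 0x1000, "retired": 1, "dcache_miss": 1, "mispredict": 0},
    {"pc": 0x1000, "retired": 1, "dcache_miss": 0, "mispredict": 0},
    {"pc": 0x1004, "retired": 0, "dcache_miss": 0, "mispredict": 1},
]
stats = aggregate(samples)
miss_rate = stats[0x1000]["dcache_misses"] / stats[0x1000]["samples"]
```

Because instructions are sampled at random, each per-PC ratio (miss rate, mispredict rate, retire rate) is a statistical estimate of the true rate.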
SLIDE 10

Example: Retire Count Convergence

[Chart: Estimate / Actual ratio vs. Number of Retired Samples (N); the estimate converges toward 1, with accuracy ∝ 1/√N.]
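The 1/√N behavior can be checked with a small Monte Carlo sketch (parameter values are illustrative): estimate the frequency of a Bernoulli event from N random samples and watch the average error roughly halve as N quadruples.

```python
import random

def estimate_error(p, n, trials=2000, seed=1):
    """Average absolute error of a frequency estimate built from n
    random samples of a Bernoulli(p) event."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        hits = sum(rng.random() < p for _ in range(n))
        total += abs(hits / n - p)
    return total / trials

# Quadrupling the sample count should roughly halve the error (1/sqrt(N)).
e_250 = estimate_error(0.3, 250)
e_1000 = estimate_error(0.3, 1000)
```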

SLIDE 11

Identifying True Bottlenecks

  • ProfileMe: Detailed Data for Single Instruction
  • In-Order Processors

– ProfileMe PC + latency data identifies stalls
– stalled instructions back up the pipeline

  • Out-of-Order Processors

– explicitly designed to mask stall latency, e.g. dynamic reordering, speculative execution
– stall does not necessarily imply bottleneck

  • Example: Does This Stall Matter?

load r1, …
add  …, r1, …        (average latency: 35.0 cycles)
… other instructions …

SLIDE 12

Issue: Need to Measure Concurrency

  • Interesting Concurrency Metrics

– retired instructions per cycle
– issue slots wasted while an instruction is in flight
– pipeline stage utilization

How to Measure Concurrency?

  • Special-Purpose Hardware

– some metrics difficult to measure, e.g. need retire/abort status

  • Sample Potentially-Concurrent Instructions

– aggregate info from pairs of samples
– statistically estimate metrics

SLIDE 13

Paired Sampling

  • Sample Two Instructions

– may be in-flight simultaneously
– replicate ProfileMe hardware, add intra-pair distance

  • Nested Sampling

– sample window around first profiled instruction
– randomly select second profiled instruction
– statistically estimate frequency for F(first, second)

[Diagram: instruction stream over time; the second sample is chosen within a window of −W to +W instructions around the first, so the pair may or may not overlap in flight.]
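A minimal sketch of nested paired sampling, under a toy pipeline model (one instruction issues per cycle and stays in flight a fixed number of cycles; all names and numbers are assumptions, not the hardware’s behavior): pick the first instruction at random, pick the second uniformly within ±W, and estimate the overlap frequency F(first, second).

```python
import random

def paired_samples(n_pairs, W, issue, retire, n_instrs, seed=2):
    """Nested paired sampling sketch: the first profiled instruction i is
    chosen at random; the second, j, is chosen uniformly within +/- W
    instructions of i. Returns the estimated probability that the two
    were in flight together. issue/retire map instruction index -> cycle."""
    rng = random.Random(seed)
    overlaps = 0
    for _ in range(n_pairs):
        i = rng.randrange(W, n_instrs - W)
        j = i + rng.choice([d for d in range(-W, W + 1) if d != 0])
        # Two instructions overlap if their [issue, retire) intervals intersect.
        if issue[i] < retire[j] and issue[j] < retire[i]:
            overlaps += 1
    return overlaps / n_pairs

# Toy pipeline: one issue per cycle, each instruction in flight 4 cycles.
n = 1000
issue = {k: k for k in range(n)}
retire = {k: k + 4 for k in range(n)}
p_overlap = paired_samples(10000, W=8, issue=issue, retire=retire, n_instrs=n)
```

In this toy model only pairs within 3 instructions of each other overlap, so the estimate converges to 6/16 = 0.375; real hardware would report overlap directly from the replicated ProfileMe registers.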

SLIDE 14

Other Uses of Paired Sampling

  • Path Profiling

– two PCs close in time can identify execution path
– identify control flow, e.g. indirect branches, calls, traps

  • Direct Latency Measurements

– data load-to-use
– loop iteration cost

SLIDE 15

Outline

  • Obtaining Instruction-Level Information
  • ProfileMe

– sample instructions, not events
– sample interactions via paired sampling

  • Potential Applications of Profile Data

  • Future Work
  • Conclusions
SLIDE 16

Exploiting Profile Data

  • Latencies and Concurrency

– identify and understand bottlenecks
– improved scheduling, code generation

  • Cache Miss Data

– code stream rearrangement
– guide prefetching, instruction scheduling

  • Miss Addresses

– inform OS page mapping policies
– data reorganization

  • Branch History, PC Pairs

– identify common execution paths
– trace scheduling

SLIDE 17

Example: Path Profiles

  • Experiment

– intra-procedural path reconstruction
– control-flow merges
– SPECint95 data

  • Execution Counts

– most likely path based on frequency

  • History Bits

– path consistent with global branch history

  • History + Pairs

– path must contain both PCs in pair

[Chart: % Single Correct Path vs. Path Length (1–11 branches) for the three reconstruction methods.]
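A toy illustration of history-based path reconstruction (not the paper’s algorithm; the CFG and block names are invented): follow the sampled global branch history bits from every block and keep the paths that end at the sampled PC.

```python
def paths_matching_history(cfg, end, history):
    """Candidate control-flow paths consistent with a sampled PC and its
    global branch history bits (oldest bit first). cfg maps basic block
    -> (not_taken_successor, taken_successor)."""
    candidates = []
    for start in cfg:
        block, path = start, [start]
        for bit in history:          # replay each recorded branch direction
            block = cfg[block][bit]
            path.append(block)
        if block == end:             # keep paths ending at the sampled PC
            candidates.append(path)
    return candidates

# Diamond CFG: A branches to B (not taken) or C (taken); both rejoin at D,
# which loops back to A (taken) or exits to E.
cfg = {"A": ("B", "C"), "B": ("D", "D"), "C": ("D", "D"),
       "D": ("E", "A"), "E": ("E", "E")}
paths = paths_matching_history(cfg, end="D", history=(1, 0))
```

Here a two-bit history uniquely determines the path A → C → D; adding a second sampled PC from paired sampling prunes the candidate set further when history alone is ambiguous.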

SLIDE 18

Future Work

  • Analyze Production Systems
  • Develop New Analyses for ProfileMe Data

– “cluster” samples using events, branch history
– reconstruct frequently-occurring pipeline states

  • Explore Automatic Optimizations

– better scheduling, prefetching, code and data layout
– inform OS policies

  • ProfileMe for Memory System Transactions

– can sample memory behavior not visible from the processor
– sample cache sharing and interference

SLIDE 19

Related Work

  • Westcott & White (IBM Patent)

– collects latency and some event info for instructions
– only for retired instructions
– only when an instruction is assigned a particular inum, which can introduce bias into samples


  • Specialized Hardware Mechanisms

– CML Buffer (Bershad et al.): locations of frequent misses
– Informing Loads (Horowitz et al.): status bit to allow SW to react to cache misses
– can often obtain similar info by analyzing ProfileMe data

SLIDE 20

Conclusions

  • ProfileMe: “Sample Instructions, Not Events”

– provides wealth of instruction-level information
– paired sampling reveals dynamic interactions
– modest hardware cost
– useful for in-order processors, essential for out-of-order

  • Improvements Over Imprecise Event Counters

– precise attribution
– no blind spots
– improved event collection, e.g. branch history, concurrency, correlated events

SLIDE 21

Further Information

DIGITAL’s Continuous Profiling Infrastructure project: http://www.research.digital.com/SRC/dcpi
