Performance Monitoring of Diverse Computer Systems
Joseph M. Lancaster, Roger D. Chamberlain
Dept. of Computer Science and Engineering, Washington University in St. Louis
{lancaster, roger}@wustl.edu
Research supported by NSF grant CNS-0720667
HPEC 2008
- Run correctly: do not deadlock, meet hard real-time deadlines
- Run fast: high throughput / low latency, low rate of soft deadline misses
Infrastructure should help us debug when an application runs incorrectly or slowly.
- Increasingly common in HPEC systems
- e.g. Mercury, XtremeData, DRC, Nallatech, ClearSpeed
(Diagram: a chip multiprocessor (CMP) with multiple cores alongside an FPGA containing application logic and an embedded microprocessor)
- App deployed using all four components
(Diagram: application deployed across CMP cores, an FPGA, and a GPU)
(Diagram: candidate diverse components: chip multiprocessors with multiple cores, FPGA logic, a GPU with x256 cores, and the Cell processor)
- Large performance gains realized
- Power efficient compared to a CMP alone
- Requires knowledge of the individual architectures/languages
- Components operate independently: a distributed system with separate memories and clocks
Tool support for these systems is insufficient:
- Many architectures lack tools for monitoring and validation
- Tools for different architectures are not integrated
- Ad hoc solutions prevail
Solution: runtime performance monitoring and validation for diverse systems!
Outline:
- Introduction
- Runtime performance monitoring
- Frame monitoring
- User-guidance
- Natural fit for diverse HPEC systems
- Dataflow model:
  - Composed of blocks and edges
  - Blocks compute concurrently
  - Data flows along edges
- Languages: StreamIt, Streams-C, X
(Diagram: example dataflow graph with blocks A, B, C, D)
(Diagram: the dataflow graph (blocks A, B, C, D) and a diverse platform with an FPGA, two CMP cores, and a GPU)
(Diagram: blocks A, B, C, D mapped onto the FPGA, the two CMP cores, and the GPU; a sketch of one possible description follows)
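As a rough illustration (this is not the Auto-Pipe X language or any of the listed stream languages, just an assumed C++ sketch), a block/edge topology and one possible device mapping like the one in the diagram might be described as:

```cpp
// Hypothetical sketch only: illustrative names, not a real stream-programming API.
#include <map>
#include <string>
#include <utility>
#include <vector>

enum class Device { FPGA, CMP_CORE1, CMP_CORE2, GPU };

struct Topology {
    std::vector<std::string> blocks;                         // compute blocks
    std::vector<std::pair<std::string, std::string>> edges;  // data flows between blocks
    std::map<std::string, Device> mapping;                   // block -> device assignment
};

int main() {
    Topology app;
    app.blocks  = {"A", "B", "C", "D"};
    app.edges   = {{"A", "B"}, {"A", "C"}, {"B", "D"}, {"C", "D"}};
    // One possible mapping onto the diverse platform from the diagram.
    app.mapping = {{"A", Device::FPGA},
                   {"B", Device::CMP_CORE1},
                   {"C", Device::CMP_CORE2},
                   {"D", Device::GPU}};
    return 0;
}
```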
Programming model    Strategy                                Tools / Environments
Shared memory        Execution profiling                     gprof, Valgrind, PAPI
Message passing      Execution profiling, message logging    TAU, mpiP, PARAVER
Stream programming   Simulation                              StreamIt [MIT], StreamC [Stanford], Streams-C [LANL], Auto-Pipe [WUSTL]
Limitations for diverse systems:
- No universal PC (program counter) or architecture
- No shared memory
- Different clocks
- Communication latency and bandwidth
Simulation is a useful first step, but:
- Models can abstract away system details
- Too slow for large datasets
- HPEC applications are growing in complexity
Need to monitor the deployed, running app:
- Measure actual performance of the system
- Validate performance on large, real-world datasets
- Report more than just aggregate statistics: capture rare events
- Quantify measurement impact where possible: overhead due to sampling, communication, etc.
- Measure runtime performance efficiently: low overhead, high accuracy
- Validate performance on real datasets
- Increase developer productivity
- Monitor edges / queues
  - Find bottlenecks in the app
  - Do they change over time?
  - Computation or communication?
- Measure latency between two points (see the sketch below)
(Diagram: pipeline of blocks 1-6 with latency measured between two tap points)
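A minimal sketch of measuring latency between two tap points, assuming both taps share a clock domain and events can be matched in FIFO order; the type and member names are illustrative, not the actual monitor API:

```cpp
// Sketch: difference the timestamps of the i-th event at an upstream tap and
// the matching event at a downstream tap to get per-item latency.
#include <chrono>
#include <cstdint>
#include <deque>
#include <iostream>

using Clock = std::chrono::steady_clock;

struct LatencyTap {
    std::deque<Clock::time_point> upstream;  // timestamps recorded at point 1
    uint64_t count = 0;
    double sum_us = 0.0;

    void record_upstream() { upstream.push_back(Clock::now()); }

    void record_downstream() {
        if (upstream.empty()) return;        // no matching upstream event yet
        auto t0 = upstream.front();
        upstream.pop_front();
        sum_us += std::chrono::duration<double, std::micro>(Clock::now() - t0).count();
        ++count;
    }

    double mean_latency_us() const { return count ? sum_us / count : 0.0; }
};

int main() {
    LatencyTap tap;
    tap.record_upstream();    // a data item enters the monitored segment
    tap.record_downstream();  // the same item leaves the segment
    std::cout << "mean latency (us): " << tap.mean_latency_us() << "\n";
}
```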
- Interconnects are a precious resource
  - Monitoring uses the same interconnects as the application
  - Stay below the bandwidth constraint
  - Keep perturbation low
(Diagram: monitor agents embedded next to the application code on the CMP and the application logic on the FPGA, reporting to a monitor server on a CPU core)
- Understand measurement perturbation
- Dedicate compute resources when possible
- Aggressively reduce the amount of performance meta-data stored and transmitted
  - Utilize compression in both the time resolution and the fidelity of data values
- Use knowledge from the user: let them specify their performance expectations / measurements
- Use a CMP core as the monitor server
  - Monitors other cores for performance information
  - Processes data from agents (e.g. FPGA, GPU)
  - Combines hardware and software information for a global view
- Use logical clocks to synchronize events (see the sketch below)
- Dedicate unused FPGA area to monitoring
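One way the server could order events from agents with separate physical clocks is a Lamport-style logical clock; the sketch below is an assumption about how that could look, not the presented implementation:

```cpp
// Sketch: one logical clock per agent (CMP core, FPGA, GPU). Ticks on local
// events and merges on message receipt give the monitor server a consistent
// happened-before ordering without synchronized physical clocks.
#include <algorithm>
#include <cstdint>
#include <iostream>

struct LogicalClock {
    uint64_t time = 0;

    uint64_t local_event() { return ++time; }       // tick on a local event

    uint64_t on_receive(uint64_t sender_time) {     // merge when a record arrives
        time = std::max(time, sender_time) + 1;
        return time;
    }
};

int main() {
    LogicalClock fpga_agent, server_core;
    uint64_t t_send = fpga_agent.local_event();      // FPGA agent emits a record
    uint64_t t_recv = server_core.on_receive(t_send);
    std::cout << "send=" << t_send << " recv=" << t_recv << "\n";  // recv > send
}
```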
Outline:
- Introduction
- Runtime performance monitoring
- Frame monitoring
- User-guidance
- A frame summarizes performance over a period of the execution
- Maintains some temporal information
- Captures system performance anomalies
(Animation: frames 1 through 9 accumulating along the execution timeline)
- Each frame reports one performance metric
- Frame size can be dynamic
  - Dynamic bandwidth budget
  - Low-variance data / application phases
  - Trade temporal granularity for lower perturbation
- Frames from different agents will likely be unsynchronized and of different sizes
- Monitor server presents the user with a consistent global view of performance
(A sketch of a frame accumulator follows.)
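A minimal sketch of a frame accumulator, assuming one scalar metric per frame and a flush when the window fills; the names and the flush policy are illustrative only:

```cpp
// Sketch: a frame accumulates one metric over a window of samples and is
// flushed when the window fills. A real agent could also grow the window in
// low-variance phases or shrink it under a tighter bandwidth budget.
#include <cstdint>
#include <iostream>

struct Frame {
    uint64_t start_sample = 0, n = 0;
    double sum = 0.0, sum_sq = 0.0;

    void add(double x) { sum += x; sum_sq += x * x; ++n; }
    double mean() const { return n ? sum / n : 0.0; }
    double variance() const { return n ? sum_sq / n - mean() * mean() : 0.0; }
};

struct FrameMonitor {
    Frame cur;
    uint64_t sample = 0;
    uint64_t max_frame = 1024;   // frame size; dynamic in the real system

    void record(double value) {
        cur.add(value);
        ++sample;
        if (cur.n >= max_frame) flush();
    }

    void flush() {
        if (cur.n == 0) return;  // nothing to report
        std::cout << "frame [" << cur.start_sample << ", " << sample << ") "
                  << "mean=" << cur.mean() << " var=" << cur.variance() << "\n";
        cur = Frame{};
        cur.start_sample = sample;
    }
};

int main() {
    FrameMonitor mon;
    for (int i = 0; i < 4096; ++i) mon.record(100.0 + (i % 7));  // synthetic metric
    mon.flush();                                                  // flush any partial frame
}
```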
Outline:
- Introduction
- Runtime performance monitoring
- Frame monitoring
- User-guidance
- Why user guidance?
- Related work: Performance Assertions for Mobile Devices [Lenecevicius'06]
  - Validates user performance assertions on a multi-threaded embedded CPU
- Our system enables validation of performance expectations across diverse architectures
1. Measurement
- User specifies a set of "taps" for an agent
  - Taps can be placed on an edge or an input queue
  - The agent then records events on each tap
- Supported measurements for a tap:
  - Average value + standard deviation
  - Min or max value
  - Histogram of values
  - Outliers (based on a parameter)
- Basic arithmetic and logical operators on taps:
  - Arithmetic: add, subtract, multiply, divide
  - Logic: and, or, not
(A sketch of such a tap accumulator follows.)
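A sketch of the per-tap measurements listed above (running mean/stddev, min/max, a histogram, and outlier counting); the structure and parameter choices are assumptions, not the agent's real implementation:

```cpp
// Sketch only: a per-tap accumulator. The histogram range and the outlier
// rule (a simple threshold) are assumed parameters for illustration.
#include <algorithm>
#include <array>
#include <cmath>
#include <cstdint>
#include <limits>

struct Tap {
    uint64_t n = 0;
    double sum = 0.0, sum_sq = 0.0;                    // for mean + stddev
    double min = std::numeric_limits<double>::max();
    double max = std::numeric_limits<double>::lowest();
    std::array<uint64_t, 16> hist{};                   // 16 equal-width bins
    double range;                                      // histogram covers [0, range)
    double outlier_threshold;                          // user-supplied parameter
    uint64_t outliers = 0;

    Tap(double hist_range, double threshold)
        : range(hist_range), outlier_threshold(threshold) {}

    void record(double x) {
        ++n;
        sum += x;
        sum_sq += x * x;
        min = std::min(min, x);
        max = std::max(max, x);
        double clamped = std::clamp(x, 0.0, range - 1e-9);  // out-of-range values go to edge bins
        hist[static_cast<size_t>(clamped / range * hist.size())]++;
        if (x > outlier_threshold) ++outliers;
    }

    double mean() const { return n ? sum / n : 0.0; }
    double stddev() const {
        return n ? std::sqrt(std::max(0.0, sum_sq / n - mean() * mean())) : 0.0;
    }
};

int main() {
    Tap queue_depth_tap(1000.0, 500.0);   // histogram over [0, 1000), outliers above 500
    queue_depth_tap.record(120.0);
    queue_depth_tap.record(130.0);
    queue_depth_tap.record(900.0);        // counted as an outlier
    return 0;
}
```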
Measurement context (refined step by step):
- What is the throughput of block A?
- What is the throughput of block A when it is not data starved?
- What is the throughput of block A when it is not starved for data and there is no downstream congestion?
(Diagram: block A with its measurement context feeding the runtime monitor; a sketch of the gated measurement follows)
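A sketch of how the refined question could be measured: the throughput tap is gated by a context that excludes cycles where block A is starved or its downstream is congested. The queue probes and names are assumed for illustration:

```cpp
// Sketch: count block A's outputs only while the measurement context holds
// (input queue non-empty and downstream queue has free space).
#include <cstdint>
#include <iostream>

struct ContextualThroughput {
    uint64_t items = 0;          // outputs produced while the context held
    uint64_t active_cycles = 0;  // cycles/samples where the context held

    // Called once per monitored cycle/sample by the agent.
    void sample(bool produced_output, uint64_t in_queue_len, uint64_t out_queue_free) {
        bool context = (in_queue_len > 0) && (out_queue_free > 0);  // not starved, not congested
        if (!context) return;
        ++active_cycles;
        if (produced_output) ++items;
    }

    double throughput() const {
        return active_cycles ? static_cast<double>(items) / active_cycles : 0.0;
    }
};

int main() {
    ContextualThroughput tp;
    tp.sample(true, 4, 2);   // counted: input available, downstream has room
    tp.sample(false, 0, 2);  // ignored: block A is data starved
    std::cout << "items per active cycle: " << tp.throughput() << "\n";
}
```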
1. Measurement
- Set of "taps" for an agent to count, histogram, or perform simple logical operations on
- Taps can be an edge or an input queue
2. Performance assertion
- User describes their performance expectations of an application as assertions
- Runtime monitor validates these assertions by collecting measurements and evaluating logical expressions
  - Arithmetic operators: +, -, *, /
  - Logical operators: and, or, not
  - Annotations: t (event timestamp), L (queue length)
- Throughput: "at least 100 A.Input events will be produced in any period of 1001 time units"
  t(A.Input[i+100]) - t(A.Input[i]) ≤ 1001
- Latency: "A.Output is generated no more than 125 time units after A.Input"
  t(A.Output[i]) - t(A.Input[i]) ≤ 125
- Queue bound: "A.InQueue never exceeds 100 elements"
  L(A.InQueue[i]) ≤ 100
(A sketch of checking these assertions follows.)
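A sketch of checking these three assertions online from timestamped tap events, using the t (timestamp) and L (queue length) annotations informally; the checker interface is assumed, not the actual runtime monitor:

```cpp
// Sketch: online checks for the throughput, latency, and queue-bound
// assertions above, driven by events reported from the taps.
#include <cstdint>
#include <deque>
#include <iostream>

struct AssertionChecker {
    std::deque<uint64_t> input_ts;  // t(A.Input[i]) for the last 100 events
    bool violated = false;

    // Throughput: at least 100 A.Input events in any window of 1001 time units.
    void on_input(uint64_t t) {
        input_ts.push_back(t);
        if (input_ts.size() > 100) {
            uint64_t t_old = input_ts.front();   // t(A.Input[i])
            input_ts.pop_front();
            if (t - t_old > 1001) violated = true;  // t(A.Input[i+100]) - t(A.Input[i]) > 1001
        }
    }

    // Latency: t(A.Output[i]) - t(A.Input[i]) <= 125.
    void on_output(uint64_t t_out, uint64_t t_in) {
        if (t_out - t_in > 125) violated = true;
    }

    // Queue bound: L(A.InQueue[i]) <= 100.
    void on_queue_sample(uint64_t length) {
        if (length > 100) violated = true;
    }
};

int main() {
    AssertionChecker chk;
    for (uint64_t i = 0; i < 200; ++i) chk.on_input(i * 10);  // one input every 10 time units
    chk.on_output(130, 20);                                    // latency 110, within bound
    chk.on_queue_sample(42);                                   // well under the queue bound
    std::cout << (chk.violated ? "assertion violated\n" : "all assertions hold\n");
}
```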