Experiences and Lessons Learned with a Portable Interface to Hardware Performance Counters
Jack Dongarra, Kevin London, Shirley Moore, Philip Mucci, Daniel Terpstra, Haihang You, and Zhou Min
April 26, 2003, IPDPS/PADTAD 2003

Tools for Performance Evaluation
» Timing and performance evaluation has been an art
    » Resolution of the clock
    » Issues about cache effects
    » Different systems
    » Can be cumbersome and inefficient with traditional tools
» Situation about to change
    » Almost all high performance processors include hardware performance counters.
    » Some are easy to access, others not available to users.
    » On most platforms the APIs, if they exist, are not appropriate for the end user or well documented.

Hardware Counters
» Small number of registers dedicated for performance monitoring functions
    – AMD Athlon, 4 counters
    – Pentium <= III, 2 counters
    – Pentium IV, 18 counters
    – IA64, 4 counters
    – Alpha 21x64, 2 counters
    – Power 3, 8 counters
    – Power 4, 8 counters
    – UltraSparc II, 2 counters
    – MIPS R14K, 2 counters

» PAPI is a proposed "standard" cross-platform interface to hardware performance counters.
» PAPI provides two APIs to access the underlying performance counter hardware:
    » A low-level interface designed for tool developers and expert users, and
    » A high-level interface for application engineers.

PAPI Implementation
[Figure: layered architecture, top to bottom. Tools; Portable Layer (PAPI High Level, PAPI Low Level); Machine Specific Layer (PAPI Machine Dependent Substrate, Kernel Extension, Operating System, Hardware Performance Counters).]

PAPI Preset Events
» Proposed standard set of event names deemed most relevant for application performance tuning
» Exact standardization of the semantics not possible
    » e.g., IBM's FMA
» PAPI supports approximately 100 preset events.
» Mapped to native events on a given platform
» Preset events are mappings from symbolic names to machine specific definitions for a particular hardware event.
    » Example: PAPI_TOT_CYC
» PAPI also supports presets that may be derived from multiple underlying hardware metrics.
    » Example: PAPI_L1_DCM
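
The idea of a preset as a mapping from a symbolic name to one or more machine-specific events can be sketched in plain C. This is an illustration only, not PAPI's internal data structure: the `preset_map` type, the `eval_preset` and `is_derived` helpers, and the counter slot numbers are all hypothetical, and the sketch assumes a derived preset is simply the sum of its native counters, which is only one possible derivation.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NATIVE 4  /* hypothetical limit on native events per preset */

/* Hypothetical table entry: one symbolic preset name mapped to the
 * platform counters that implement it. Not PAPI's real layout. */
typedef struct {
    const char *name;              /* symbolic preset name */
    size_t      n_native;          /* 1 = direct mapping, >1 = derived */
    size_t      slot[MAX_NATIVE];  /* indices into the raw counter array */
} preset_map;

/* Evaluate a preset from raw counter readings. A derived preset is
 * assumed here to be the sum of its native counters. */
long long eval_preset(const preset_map *p, const long long *raw, size_t n_raw)
{
    long long total = 0;
    assert(p != NULL && raw != NULL && p->n_native <= MAX_NATIVE);
    for (size_t i = 0; i < p->n_native; i++) {
        assert(p->slot[i] < n_raw);  /* slot must refer to a real counter */
        total += raw[p->slot[i]];
    }
    return total;
}

/* Whether this table entry combines more than one native counter. */
int is_derived(const preset_map *p)
{
    return p->n_native > 1;
}
```

Under these assumptions, an entry such as PAPI_TOT_CYC would use a single slot, while a total-miss preset on a platform that only counts data and instruction misses separately would use two slots and report their sum.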

Sample Preset Listing
    > tests/avail
    Test case 8: Available events and hardware information.
    -------------------------------------------------------------------------
    Vendor string and code   : GenuineIntel (-1)
    Model string and code    : Celeron (Mendocino) (6)
    CPU revision             : 10.000000
    CPU Megahertz            : 366.504944
    -------------------------------------------------------------------------
    Name         Code        Avail  Deriv  Description (Note)
    PAPI_L1_DCM  0x80000000  Yes    No     Level 1 data cache misses
    PAPI_L1_ICM  0x80000001  Yes    No     Level 1 instruction cache misses
    PAPI_L2_DCM  0x80000002  No     No     Level 2 data cache misses
    PAPI_L2_ICM  0x80000003  No     No     Level 2 instruction cache misses
    PAPI_L3_DCM  0x80000004  No     No     Level 3 data cache misses
    PAPI_L3_ICM  0x80000005  No     No     Level 3 instruction cache misses
    PAPI_L1_TCM  0x80000006  Yes    Yes    Level 1 cache misses
    PAPI_L2_TCM  0x80000007  Yes    No     Level 2 cache misses
    PAPI_L3_TCM  0x80000008  No     No     Level 3 cache misses
    PAPI_CA_SNP  0x80000009  No     No     Requests for a snoop
    PAPI_CA_SHR  0x8000000a  No     No     Requests for shared cache line
    PAPI_CA_CLN  0x8000000b  No     No     Requests for clean cache line
    PAPI_CA_INV  0x8000000c  No     No     Requests for cache line inv.
    . . .
http://icl.cs.utk.edu/projects/papi/files/html_man/papi_presets.html

Support for Native Events
» PAPI supports native events:
    » An event countable by the CPU can be counted even if there is no matching preset PAPI event.
    » The developer uses the same API as when setting up a preset event, but a CPU-specific bit pattern is used instead of the PAPI event definition.

High-level Interface
» Meant for application programmers wanting coarse-grained measurements
» As easy to use as SGI IRIX perfex calls
    » a command-line interface to the R10000 hardware performance counters
» Requires no setup code
» Restrictions:
    » Allows only PAPI presets
    » Not thread safe
    » Only aggregate counters

High-level API Calls
» PAPI_flops(float *rtime, float *ptime, long_long *flpins, float *mflops)
    » Wallclock time, process time, FP ins since start,
    » Mflop/s since last call
» PAPI_num_counters()
    » Returns the number of available counters
» PAPI_start_counters(int *cntrs, int alen)
    » Start counters
» PAPI_stop_counters(long_long *vals, int alen)
    » Stop counters and put counter values in array
» PAPI_accum_counters(long_long *vals, int alen)
    » Accumulate counters into array and reset
» PAPI_read_counters(long_long *vals, int alen)
    » Copy counter values into array and reset counters

Low-level Interface
» Increased efficiency and functionality over the high level PAPI interface
» Approximately 60 functions
» Thread-safe (SMP, OpenMP, Pthreads)
» Supports both preset and native events

Low-level Functionality
» API Calls for:
    » Counter multiplexing
    » SVR4 compatible profiling
    » Processor information
    » Address space information
    » Accurate and low latency timing functions
    » Hardware event inquiry functions
    » Eventset management functions
    » Static and dynamic memory information
    » Simple locking operations
    » Callbacks on user defined overflow threshold
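
The difference between the read and accumulate calls listed under High-level API Calls (copy then reset, versus add then reset) can be modeled with a small stand-alone C sketch. It makes no PAPI calls: the `sim_counters` type and the `sim_advance`, `sim_read`, and `sim_accum` functions are hypothetical stand-ins that only mimic the semantics stated on the slide, so the example runs on any machine.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_EVENTS 2  /* hypothetical number of events being counted */

/* Hypothetical stand-in for a set of running hardware counters. */
typedef struct {
    long long hw[NUM_EVENTS];
} sim_counters;

/* Simulate program execution: each counter advances by delta[i]. */
void sim_advance(sim_counters *c, const long long *delta)
{
    assert(c != NULL && delta != NULL);
    for (size_t i = 0; i < NUM_EVENTS; i++)
        c->hw[i] += delta[i];
}

/* "Read" semantics: copy the counter values into vals, then reset. */
void sim_read(sim_counters *c, long long *vals)
{
    assert(c != NULL && vals != NULL);
    for (size_t i = 0; i < NUM_EVENTS; i++) {
        vals[i] = c->hw[i];
        c->hw[i] = 0;
    }
}

/* "Accumulate" semantics: add the counter values to vals, then reset. */
void sim_accum(sim_counters *c, long long *vals)
{
    assert(c != NULL && vals != NULL);
    for (size_t i = 0; i < NUM_EVENTS; i++) {
        vals[i] += c->hw[i];
        c->hw[i] = 0;
    }
}
```

In this model, a caller that wants per-region counts would use the read form after each region, while a caller that wants a running total across several regions would use the accumulate form with the same array each time.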

PAPI 2.3.4 Release
April 14, 2003
» Platforms
    » IBM PPC604, 604e, Power 3, Power4, AIX 5
    » Intel x86/Linux, Windows, including Pentium IV
    » Sun UltraSparc I/II/III
    » SGI MIPS R10K/R12K/R14K
    » Compaq Alpha 21164/21264 with DADD/DCPI
    » Itanium/Itanium2 Linux
    » Cray T3E
» Enhancements
    » Static/dynamic memory info
    » IA64 hardware profiling and sampling
    » Misc bug fixes
» Sample Tools
    » Perfometer
    » Trapper
    » Dynaprof

Design and Implementation Experiences
» Success of community-based open source development effort
    » Parallel Tools Consortium http://www.ptools.org/
» Tradeoffs between ease-of-use and increased functionality and features
» Operating system support
» Interfacing to third-party tools
» Data interpretation and accuracy issues
» Efficiency and scalability issues

Operating System Support
» Perfctr kernel patch by Mikael Pettersson required for Linux/x86
    » Kernel modification has met resistance from some system administrators
    » Effort underway to get perfctr into mainstream Linux release
» Vendor cooperation has been good (in most cases)
    » Register level operations code provided by Cray
    » IBM pmtoolkit included in AIX 5
    » Perfmon library from Hewlett-Packard for Itanium/Itanium2 Linux
    » DADD (Dynamic Access to DCPI Data) extension to DCPI from Hewlett-Packard for Alpha Tru64 UNIX

Tools
» Tools developed by the PAPI project
    » Dynaprof
    » Perfometer
» Third-party tools
    » HPCView (Rice University)
    » SvPablo (University of Illinois)
    » TAU (University of Oregon)
    » Vampir 3.x (Pallas)
    » VProf (Sandia National Lab)
    » Others (see PAPI home page)

Dynaprof
» A portable tool to dynamically instrument serial and parallel programs for the purpose of performance analysis
» Simple and intuitive command line interface like GDB
» Java/Swing GUI
» Instrumentation is done through the run-time insertion of function calls to specially developed performance probes.
» Avoiding source-code instrumentation and recompilation
» Avoiding perturbation of compiler optimizations
» Providing complete language independence
» Built on DynInst and DPCL
    » IBM and Maryland

Dynaprof GUI Screenshot
[Figure: Dynaprof GUI screenshot]

Perfometer Screenshot
[Figure: Perfometer screenshot]

HPCView Screenshot
[Figure: HPCView screenshot]

SvPablo from UIUC
• Source based instrumentation of loops and function calls for Fortran and C
• Profiling statistics based on time and/or hardware counter data
• Supports serial, MPI, and OpenMP programs
• Freely available

Vampir 3.x from Pallas
http://www.pallas.com/e/products/vampir/index.htm

Data Accuracy Issues
» Act of measuring perturbs the system being measured
    » Extra instructions
    » Cache pollution
    » Servicing interrupts
» PC sampling can be inaccurate on out-of-order processors with speculative execution.
» Solutions:
    » PAPI is being redesigned to keep its runtime overhead and memory footprint as small as possible.
    » Hardware support for interrupt handling and profiling (e.g., event address registers) is being used where available.
    » Work by Pat Teller at University of Texas-El Paso on validation of hardware counter data using microbenchmarks
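
Two of the accuracy concerns above, extra instructions added by the measurement itself and validation against microbenchmarks, can be illustrated with a generic C sketch. It is not the methodology used by PAPI or by the validation work cited on the slide: the helper names `min_overhead`, `corrected_count`, and `relative_error` are hypothetical. The sketch assumes the overhead is estimated as the smallest count seen around an empty measured region, and that the microbenchmark's expected count is known by construction.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest count observed around an empty region: a simple estimate of
 * the fixed cost added by the measurement calls themselves. */
long long min_overhead(const long long *empty_runs, size_t n)
{
    assert(empty_runs != NULL && n > 0);
    long long best = empty_runs[0];
    for (size_t i = 1; i < n; i++) {
        if (empty_runs[i] < best)
            best = empty_runs[i];
    }
    return best;
}

/* Remove the estimated overhead from a raw count, clamping at zero. */
long long corrected_count(long long raw, long long overhead)
{
    return raw > overhead ? raw - overhead : 0;
}

/* Relative error of a measured count against the count a microbenchmark
 * is expected to produce by construction. */
double relative_error(long long measured, long long expected)
{
    assert(expected != 0);
    double diff = (double)(measured - expected);
    double base = (double)expected;
    if (diff < 0)
        diff = -diff;
    if (base < 0)
        base = -base;
    return diff / base;
}
```

A subtraction like this only addresses the extra instructions; it does not account for cache pollution or interrupt servicing, which depend on what the measured code is doing.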