Performance analysis with Periscope
M. Gerndt, V. Petkov, Y. Oleynik, S. Benedict
Technische Universität München
September 2010
Outline
  Motivation
  Periscope architecture
  Periscope performance analysis model
  Performance analysis strategies in Periscope
  Periscope GUI
Motivation
  Common performance analysis procedure on Power6 systems
    Use Tprof to pinpoint time-consuming subroutines
    Use Xprofiler to understand the call graph; mpitrace for MPI communication
    Use hpmcount (libhpm) to measure hardware counters
  Problem
    Routine, error-prone and time-consuming
    Requires deep hardware knowledge
    Mostly a post-development process
    Hard to map bottlenecks to their source code location
  Solution
    Automate the performance analysis
    Integrate parallel application development and performance analysis within the same IDE
Periscope
  Iterative online analysis
    Measurements are configured, obtained and evaluated on the fly
    No need to store trace files
  Distributed architecture
    Reduced network overhead
    Analysis performed by multiple distributed hierarchical agents
  Automatic bottleneck search
    Based on performance optimization experts' knowledge
    Single-node performance on Intel Itanium2, IBM Power6, x86
    MPI communication
    OpenMP performance
  Enhanced Eclipse-based GUI
  Instrumentation: Fortran, C/C++; MPI / OpenMP / Hybrid
Distributed architecture
  [Figure: architecture diagram. Labels: Eclipse-based GUI, Graphical User Interface, Analysis control, Interactive frontend, Agents network, Monitoring Request Interface, Application]
[Figure: analysis cycle]
  Components: Instrumented Application, Monitoring, Analysis Agents, GUI
  Flow labels: Start, Candidate Properties, Requests, Performance Measurements, Raw performance data, Analysis, Proven Properties, Refinement (Location, Precision), Final Properties, Report
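The cycle named by the figure labels (candidate properties, measurement requests, proven properties, refinement, final properties) can be sketched as a small loop. This is only an illustration under stated assumptions: `analyze`, `measure`, `refine`, the threshold and the region names are hypothetical and are not Periscope's API, and treating a proven property without further refinements as final is an assumption of this sketch.

```python
def analyze(candidates, measure, refine, threshold=0.1, max_iterations=10):
    """Iteratively evaluate candidate properties (hypothetical sketch).

    measure(c) returns a severity for candidate c; refine(c) returns more
    specific candidates, e.g. nested code regions, or an empty list.
    """
    final = []
    for _ in range(max_iterations):
        if not candidates:
            break
        # Request measurements only for the current candidates.
        raw = {c: measure(c) for c in candidates}
        # A candidate is proven when its severity exceeds the threshold.
        proven = [c for c in candidates if raw[c] > threshold]
        next_candidates = []
        for prop in proven:
            refined = refine(prop)
            if refined:
                next_candidates.extend(refined)  # search deeper next round
            else:
                final.append(prop)  # nothing left to refine
        candidates = next_candidates
    return final


# Made-up severities per code region and a made-up region nesting.
severities = {"main": 0.5, "main/solve": 0.45, "main/io": 0.02,
              "main/solve/loop1": 0.4}
children = {"main": ["main/solve", "main/io"],
            "main/solve": ["main/solve/loop1"]}

final = analyze(["main"],
                measure=lambda region: severities.get(region, 0.0),
                refine=lambda region: children.get(region, []))
```

Because each round measures only the regions that survived the previous one, the amount of data requested stays small, which matches the slide's point that no trace files need to be stored.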
Automatic search for bottlenecks
  Automation based on formalized expert knowledge
  Efficient search algorithms (strategies)
  Performance property
    Condition
    Confidence
    Severity
  Performance analysis strategies
    Itanium2 Stall Cycle Analysis
    IBM POWER6 Single Core Performance Analysis
    MPI Communication Pattern Analysis
    Generic Memory Strategy
    OpenMP-based Performance Analysis
    Scalability Analysis (OpenMP codes)
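A performance property with its three parts (condition, confidence, severity) might look as follows. This is a minimal sketch, not Periscope's property classes: the class name, the 20% threshold and the counter values are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class StallCyclesProperty:
    """Hypothetical property: a region loses too many cycles to stalls."""
    region: str
    total_cycles: int
    stall_cycles: int
    threshold: float = 0.2  # assumed cutoff, not a Periscope default

    def severity(self) -> float:
        # Fraction of the region's cycles that were lost to stalls.
        if self.total_cycles == 0:
            return 0.0
        return self.stall_cycles / self.total_cycles

    def condition(self) -> bool:
        # The property holds when the severity exceeds the threshold.
        return self.severity() > self.threshold

    def confidence(self) -> float:
        # Assumption: counters are measured exactly, so confidence is full.
        return 1.0


props = [StallCyclesProperty("loop_a", 1000, 400),
         StallCyclesProperty("loop_b", 1000, 50)]

# Report only properties whose condition holds, most severe first.
found = sorted((p for p in props if p.condition()),
               key=lambda p: p.severity(), reverse=True)
```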
POWER6 Single Core Performance Properties
  Hot spot of the application
  Cycles lost due to cache misses
  Average amount of cycles lost per L1 miss
  High L1 demand load miss rate
  High L2 demand load miss rate
  High L3 demand load miss rate
  Cycles lost due to address translation misses
  Cycles lost due to store instructions
  Cycles lost due to floating-point instruction inefficiencies
  Cycles lost due to integer multiplications and divisions
  Cycles lost due to no instruction to dispatch
Itanium2 Stall Cycle Properties
  IA64 Pipeline Stall Cycles
  Stalls due to pipeline flush
  Stalls due to branch misprediction flush
  Stalls due to exception flush
  Stalls due to floating point exceptions or L1D TLB misses
  Stalls due to Flush to zero or SIR stalls
  Stalls due to L1D TLB misses
  ...
  Stalls due to waiting for data delivery to register
  Stalls due to waiting for integer register
  Stalls due to waiting for integer results
  Stalls due to waiting for FP register
  Stalls due to waiting for integer loads
  L3 misses dominate data access
  L2 misses
  L3 misses
  Stalls due to register stack engine
MPI Communication Patterns Analysis
  Automatic detection of wait patterns
  Measurement on the fly
  No tracing required!
  [Figure: processes p1 and p2 with MPI_Recv and MPI_Send]
MPI Performance Properties
  Excessive MPI time in receive due to late sender
  Excessive MPI time due to late root in broadcast
  Excessive MPI time in root due to late process in reduce
  Excessive MPI time in ... (1xN, Nx1, 1x1, NxN)
  Excessive MPI time due to many small messages
  Excessive MPI communication time
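The late-sender property in the list above can be illustrated with a short calculation. This sketch assumes that the time at which each side entered its MPI call is available; how Periscope obtains this on the fly is not described in the slides, and the function name and timestamps are hypothetical.

```python
def late_sender_wait(send_enter, recv_enter, recv_exit):
    """Time a receiver spent blocked before the matching send started.

    All arguments are timestamps in seconds on a common clock (assumed).
    """
    if send_enter <= recv_enter:
        return 0.0  # sender was on time; any wait has another cause
    # Never attribute more than the receive call actually lasted.
    return min(send_enter, recv_exit) - recv_enter


# Made-up example: the receiver enters MPI_Recv at t=1.0 s, the sender
# only reaches MPI_Send at t=3.5 s, and the receive completes at t=4.0 s.
wait = late_sender_wait(send_enter=3.5, recv_enter=1.0, recv_exit=4.0)

# Severity as a share of the analyzed phase (10 s assumed here).
severity = wait / 10.0
```

Expressing the wait as a share of the phase time makes it comparable with the other properties, which is one plausible way to rank "excessive MPI time" findings.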
OpenMP-based Performance Properties
  Searches for OpenMP-based performance problems in a single step
  Properties are divided into four major domains
    Startup and Shutdown Overhead
    Load Imbalance in OpenMP regions
      Parallel region
      Parallel loop
      Explicit barrier
      Parallel sections: not enough sections, uneven sections
    Sequential computation in parallel regions
      Master region
      Single region
      Ordered loop
    OpenMP Synchronization properties
      Critical section overhead property
      Frequent atomic property
  OpenMP Tasking analysis under development
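As an illustration of the load imbalance domain, one common way to quantify imbalance in a parallel region is to compare the slowest thread with the average. The formula and the per-thread times below are assumptions for this sketch; the slides do not state how Periscope computes the severity.

```python
def load_imbalance(thread_times):
    """Fraction of the region's duration that threads idle on average.

    thread_times holds the busy time of each thread in one parallel
    region; the region lasts as long as its slowest thread.
    """
    longest = max(thread_times)
    if longest <= 0:
        return 0.0
    mean = sum(thread_times) / len(thread_times)
    return (longest - mean) / longest


# Made-up measurements: three threads finish in 4 s, one needs 8 s.
imbalanced = load_imbalance([4.0, 4.0, 4.0, 8.0])
balanced = load_imbalance([2.0, 2.0])
```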
Scalability Analysis (OpenMP codes)
  Identifies the OpenMP code regions that do not scale well
  Scalability Analysis is done by the frontend
  No need to manually configure the runs and find the speedup!
  [Figure: frontend workflow; loop labels in the figure: 2, n]
    Frontend initialization
    Frontend.run()
      i. Starts application
      ii. Starts analysis agents
      iii. Receives found properties
    After the runs: extracts information from the found properties
    Does Scalability Analysis
    Exports the Properties
    GUI-based Analysis
Scalability Analysis Properties
  Meta-properties
    Properties occurring in all configurations
    Property with increasing severity across the configurations
  Speedup-based properties
    Linear speedup
    Superlinear speedup
    Linear speedup failed for the first time
    Speedup decreasing
  Exp. specific properties
    Code region with the lowest speedup
    Low speedup based on a threshold
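The speedup-based properties can be sketched from the execution times of one code region across several thread counts. This is an illustration only: the function names, the 10% tolerance around ideal speedup and the timings are assumptions, not values taken from Periscope.

```python
def speedups(times):
    """Map thread count to speedup relative to the smallest thread count."""
    base_time = times[min(times)]
    return {n: base_time / t for n, t in sorted(times.items())}


def classify(times, tolerance=0.1):
    """Derive speedup-based findings for one region (hypothetical sketch)."""
    base_n = min(times)
    report = {"linear_failed_at": None, "superlinear": [], "decreasing": []}
    previous = None
    for n, s in speedups(times).items():
        ideal = n / base_n
        if s > ideal * (1 + tolerance):
            report["superlinear"].append(n)
        elif s < ideal * (1 - tolerance) and report["linear_failed_at"] is None:
            # First configuration where linear speedup is no longer reached.
            report["linear_failed_at"] = n
        if previous is not None and s < previous:
            report["decreasing"].append(n)
        previous = s
    return report


# Made-up execution times in seconds for 1, 2, 4 and 8 threads.
region_times = {1: 100.0, 2: 52.0, 4: 30.0, 8: 32.0}
report = classify(region_times)
```

With these numbers the region scales almost linearly up to 2 threads, first misses linear speedup at 4 threads, and gets slower at 8 threads.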
Graphical User Interface
  Integrates with the Eclipse Development Platform
    Open-source, extensible and very popular IDE
    Supports different programming languages: C/C++, Fortran, etc.
  Uses the Eclipse Parallel Tools Platform (PTP), which provides a higher-level abstraction of the underlying parallel system
  Designed to combine
    Performance measurement functionality of Periscope
    Advanced IDE functions like code indexing, refactoring, etc.
  Features
    Multi-functional table to display the detected bottlenecks
    Outline of the instrumented code regions
    Clustering techniques to get classes of similarly behaving processes
    Supports both local and remote projects
    Higher-level configuration and execution of performance experiments
Graphical User Interface
  [Screenshot. Labeled views: Source code view, SIR outline view, Project view, Properties view]
Periscope GUI: Properties Table
  Simple and clean tree-based overview
  Multi-level grouping
  Complex data filtering
  Multiple criteria sorting algorithm
  Navigation from the properties to their source code location
Periscope GUI: Instrumentation Outline
  Resembles the code outline view of the Eclipse C/C++ Development Tooling
  Outlines the instrumented code regions and their nesting
  Shows the number of properties in each region
  Assists code navigation
  Filters the displayed properties
Periscope GUI: RDT and EFS
  Eclipse File System (EFS)
    Abstracts the underlying file system details
    Any supported file system can be used: remote projects using SSH/FTP/DStore, local, Zip, etc.
    Source files of the analyzed application reside only on the remote system; no need for synchronization
  Remote Development Tools (RDT)
    Part of the Eclipse Parallel Tools Platform (PTP) project
    Remote compilation
    Remote indexing
    Currently supports only C/C++ applications
Periscope GUI: Experiment Configuration
  External Tools Framework (ETFw)
    Part of the Eclipse Parallel Tools Platform (PTP) project
  More convenient environment using ETFw's Profile launch configuration
    No terminal access needed
    Higher-level configuration and automation possible
Clustering support
  Properties summarization
  Metaproperties
  Identify hidden behavior
  Weka workbench in the GUI: Waikato Environment for Knowledge Analysis
  Uses K-Means algorithm
  Groups properties based on CPU distribution and code region
  Results shown in a table view similar to the properties view
  Done also on the fly in the hierarchy
  [Figure: Properties 1-3 grouped into clusters. Cluster 1 CPUs: 7-10,16; Cluster 2 CPUs: 2-3,5,11,13-14; Cluster 3 CPUs: 1,4,6,12,15]
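The grouping of CPUs into clusters can be illustrated with a small K-Means sketch. The GUI uses the Weka workbench for this; the code below is a plain-Python stand-in, not Weka's API. The feature vectors (severity of two properties per CPU), the initial centroids and the helper names are assumptions made for illustration. `format_ranges` reproduces the compact CPU notation seen in the figure, such as 7-10,16.

```python
def dist2(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda k: dist2(p, centroids[k]))
            for p in points]


def kmeans(points, centroids, iterations=20):
    """Plain K-Means with caller-supplied initial centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        labels = assign(points, centroids)
        updated = []
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                updated.append(tuple(sum(c) / len(members)
                                     for c in zip(*members)))
            else:
                updated.append(centroids[k])  # keep an empty cluster's centroid
        if updated == centroids:
            break  # converged
        centroids = updated
    return assign(points, centroids)


def format_ranges(ids):
    """Render sorted integer ids compactly, e.g. [7, 8, 9, 10, 16] as 7-10,16."""
    ids = sorted(ids)
    parts = []
    start = prev = ids[0]
    for i in ids[1:]:
        if i != prev + 1:
            parts.append(f"{start}-{prev}" if prev > start else str(start))
            start = i
        prev = i
    parts.append(f"{start}-{prev}" if prev > start else str(start))
    return ",".join(parts)


# Made-up data: severity of two properties on CPUs 1..6.
cpus = [1, 2, 3, 4, 5, 6]
points = [(0.9, 0.1), (0.85, 0.15), (0.1, 0.8),
          (0.12, 0.82), (0.88, 0.1), (0.1, 0.9)]
labels = kmeans(points, centroids=[points[0], points[2]])

clusters = {}
for cpu, label in zip(cpus, labels):
    clusters.setdefault(label, []).append(cpu)
summary = {label: format_ranges(members) for label, members in clusters.items()}
```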
Thank you for your attention!
  Current version 1.3 (New BSD License)
  Available under: http://www.lrr.in.tum.de/periscope/Download
  Supported architectures
    SGI Altix 4700 Itanium2
    IBM Power575 POWER6
    x86-based architectures
    BlueGene/P under development
  Further information
    Periscope web page: http://www.lrr.in.tum.de/periscope