VIRTUAL INSTITUTE – HIGH PRODUCTIVITY SUPERCOMPUTING

Insightful Automatic Performance Modeling
Alexandru Calotoiu (1), Torsten Hoefler (2), Martin Schulz (3), Sergei Shudler (1), and Felix Wolf (1)
(1) TU Darmstadt, (2) ETH Zürich, (3) Lawrence Livermore National Laboratory
Sponsors
Virtual Institute – High Productivity Supercomputing
Association of HPC programming tool builders.
Mission:
• Development of portable programming tools that assist programmers in diagnosing programming errors and optimizing the performance of their applications
• Integration of these tools
• Organization of training events designed to teach the application of these tools
• Organization of academic workshops to facilitate the exchange of ideas on tool development and to promote young scientists
www.vi-hps.org
Motivation: latent scalability bugs
[Figure: execution time as a function of system size]
Learning objectives
• Performance modeling background
• Automatic performance modeling with Extra-P
  • How it works
  • When it doesn't work
• Practical experiences with
  • Prepared examples
  • Your own data
Talk structure
• Introduction
• Background
• Automatic performance modeling
  • Theory
    • Performance Model Normal Form (PMNF)
    • Assumptions & limitations
  • Practice
    • Workflow
    • Model refinement
    • Examples
• Case studies
• Discussion
Introduction
Outline
• Performance analysis methods
• Analytical performance modeling
• Automatic performance modeling
• Scalability validation framework
Spectrum of performance analysis methods
[Figure: spectrum ranging from benchmark over full simulation and model simulation to analytical model, trading the number of parameters against model error]
Scaling model
• Represents a performance metric as a function of the number of processes
• Provides insight into the program behavior at scale
[Figure: measured time in seconds with a fitted scaling model, plotted against the number of processes from 2^9 to 2^13]
Pitfalls
Intuition is not enough: 2.95·log2(p) + 0.0871·p vs. 12.06·p
Analytical performance modeling
• Identify kernels
  • Parts of the program that dominate its performance at larger scales
  • Identified via small-scale tests and intuition
• Create models
  • Laborious process
  • Still confined to a small community of skilled experts
Disadvantages:
• Time consuming
• Danger of overlooking unscalable code
Examples:
• Hoisie et al.: Performance and scalability analysis of teraflop-scale parallel architectures using multidimensional wavefront applications. International Journal of High Performance Computing Applications, 2000
• Bauer et al.: Analysis of the MILC Lattice QCD Application su3_rmd. CCGrid, 2012
Automatic performance modeling with Extra-P
• Input: performance measurements M_i, ..., M_j of the instrumented program (e.g., main, foo, bar, compute)
• Output: human-readable performance models of all functions (e.g., t = c1·log(p) + c2)
Automatic performance modeling with Extra-P
• Input: performance measurements (profiles) of all functions at a series of scales, e.g., p1 = 128, p2 = 256, p3 = 512, p4 = 1,024, p5 = 2,048, p6 = 4,096
• Output: ranking of all functions by their predicted cost at a target scale p_t or by their asymptotic behavior, e.g., 1. foo, 2. compute, 3. main, 4. bar, ...
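To make the input/output relationship concrete, here is a minimal sketch (not Extra-P itself; all model coefficients and the target scale are invented for illustration) of how per-function models could be evaluated and ranked at a target scale p_t:

    # Hypothetical per-function models t(p); coefficients are made up for illustration.
    import math

    models = {
        "foo":     lambda p: 0.03 * p * math.log2(p),
        "compute": lambda p: 0.8 * math.log2(p) + 0.001 * p,
        "main":    lambda p: 5.0 + 0.2 * math.log2(p),
        "bar":     lambda p: 7.5,                      # constant cost
    }

    p_target = 2**20   # target scale p_t chosen by the user
    ranking = sorted(models.items(), key=lambda kv: kv[1](p_target), reverse=True)
    for rank, (name, model) in enumerate(ranking, start=1):
        print(f"{rank}. {name}: predicted {model(p_target):.1f} s at p = {p_target}")

Ranking by asymptotic behavior instead of a fixed p_t would simply replace the sort key with a comparison of the models' leading terms.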
Requirements modeling
• Model the program's requirements separately: computation (FLOPS, loads, stores) and communication (P2P, collective), in addition to time
• Disagreement between requirement and time models may be indicative of wait states
Algorithm engineering
[Figure courtesy of Peter Sanders, KIT]
How to validate scalability in practice?
• Small text-book example — expectation: a verifiable analytical expression, e.g., #FLOPS = n^2·(2n − 1)
• Real application — expectation: an asymptotic complexity, e.g., #FLOPS = O(n^2.8074)
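Although the slide does not name the kernel, the two flop counts match dense n×n matrix multiplication (an assumption on my part): the text-book triple loop produces n^2 output elements, each requiring n multiplications and n−1 additions, while the asymptotic bound corresponds to Strassen's algorithm with exponent log2 7 ≈ 2.8074:

    \#\mathrm{FLOPS}_{\text{text-book}} = n^2\,\bigl(n + (n-1)\bigr) = n^2\,(2n-1),
    \qquad
    \#\mathrm{FLOPS}_{\text{Strassen}} = O\bigl(n^{\log_2 7}\bigr) = O\bigl(n^{2.8074}\bigr)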
Scalability evaluation framework
• Inputs: an expectation (plus an optional deviation limit) and performance measurements of the benchmark
• Workflow: search space generation → model generation → scaling model and divergence model
• Use cases: initial validation, comparing alternatives, regression testing
Shudler et al.: Exascaling Your Library: Will Your Implementation Meet Your Expectations? ICS, 2015
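One plausible reading of the comparison step, sketched below with invented models: the generated scaling model is set against the expectation, and the implementation is flagged when their ratio exceeds the optional deviation limit at a target scale. Function names, coefficients, and the limit are illustrative only, not taken from the framework.

    # Illustrative sketch of the validation idea, not the framework's actual code.
    import math

    expectation = lambda p: 2.0 * math.log2(p)              # expected behavior: O(log p)
    generated   = lambda p: 1.8 * math.log2(p) + 0.002 * p  # model generated from measurements

    def check(p_target, deviation_limit=2.0):
        """Compare the generated model with the expectation at the target scale."""
        divergence = generated(p_target) / expectation(p_target)
        return divergence <= deviation_limit, divergence

    ok, divergence = check(p_target=2**20)
    print(f"divergence at target scale: {divergence:.1f} -> {'ok' if ok else 'expectation violated'}")

The same kind of check can serve the three use cases on the slide: validating an initial implementation, comparing alternative implementations, and catching scalability regressions between versions.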
Theory
Outline
• Goal: scaling trends
• Model generation
  • Performance Model Normal Form (PMNF)
  • Statistical quality control & confidence intervals
• Assumptions & limitations of the method
Automatic performance modeling with Extra-P
• Input: performance measurements M_i, ..., M_j of all functions in the instrumented program (main, foo, bar, compute, ...)
• Output: human-readable performance models of all functions (e.g., t = c1·log(p) + c2)
Primary focus on scaling trend
[Figure: a common performance analysis chart as shown in a paper]
Ranking: 1. F2, 2. F1, 3. F3
Primary focus on scaling trend
[Figure: the actual measurement under laboratory conditions, compared with the common chart from the paper]
Ranking: 1. F2, 2. F1, 3. F3
Primary focus on scaling trend
[Figure: production reality]
Ranking: 1. F2, 2. F1, 3. F3
Model building blocks
Typical computation and communication scaling behaviors of well-known kernels:
• LU: t(p) ~ c
• FFT: t(p) ~ c and t(p) ~ log2(p)
• Naïve N-body: t(p) ~ p
• Samplesort: t(p) ~ p^2 and t(p) ~ p^2·log2^2(p)
• ...
Performance model normal form

f(p) = \sum_{k=1}^{n} c_k \cdot p^{i_k} \cdot \log_2^{j_k}(p),   with n ∈ ℕ, c_k ∈ ℝ, i_k ∈ I, j_k ∈ J, and I, J ⊂ ℚ

Example: n = 1, I = {0, 1, 2}, J = {0, 1} yields the candidate models
c1,  c1·log(p),  c1·p,  c1·p·log(p),  c1·p^2,  c1·p^2·log(p)
Performance model normal form

f(p) = \sum_{k=1}^{n} c_k \cdot p^{i_k} \cdot \log_2^{j_k}(p)

Allowing up to two terms with I = {0, 1, 2} and J = {0, 1} expands the search space to the single-term models
c1,  c1·log(p),  c1·p,  c1·p·log(p),  c1·p^2,  c1·p^2·log(p)
plus all pairwise combinations:
c1 + c2·log(p),  c1 + c2·p,  c1 + c2·p·log(p),  c1 + c2·p^2,  c1 + c2·p^2·log(p),
c1·log(p) + c2·p,  c1·log(p) + c2·p·log(p),  c1·log(p) + c2·p^2,  c1·log(p) + c2·p^2·log(p),
c1·p + c2·p·log(p),  c1·p + c2·p^2,  c1·p + c2·p^2·log(p),
c1·p·log(p) + c2·p^2,  c1·p·log(p) + c2·p^2·log(p),  c1·p^2 + c2·p^2·log(p)
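The following sketch illustrates the enumerate-and-fit idea behind the PMNF search space. It is not Extra-P's implementation, which applies more careful statistical model selection and iterative refinement; the measurements below are invented.

    # Minimal PMNF-style model search: enumerate candidate term combinations,
    # fit each by linear least squares, and keep the best-fitting model.
    import itertools
    import numpy as np

    I = [0, 1, 2]   # exponents of p
    J = [0, 1]      # exponents of log2(p)

    def column(p, i, j):
        return p**i * np.log2(p)**j

    def fit(p, t, terms):
        A = np.column_stack([column(p, i, j) for (i, j) in terms])
        coeffs, *_ = np.linalg.lstsq(A, t, rcond=None)
        rss = float(np.sum((A @ coeffs - t)**2))   # residual sum of squares
        return coeffs, rss

    # Invented measurements: run time at increasing process counts
    p = np.array([128, 256, 512, 1024, 2048, 4096], dtype=float)
    t = np.array([4.1, 4.9, 5.8, 6.9, 8.1, 9.4])

    all_terms = list(itertools.product(I, J))   # (i, j) pairs; (0, 0) is the constant term
    best = None
    for n in (1, 2):                            # models with one or two terms
        for terms in itertools.combinations(all_terms, n):
            coeffs, rss = fit(p, t, terms)
            if best is None or rss < best[2]:
                best = (terms, coeffs, rss)

    terms, coeffs, rss = best
    print("best terms (i, j):", terms, " coefficients:", coeffs)

Selecting purely by residual, as done here, always favors the more flexible two-term candidates; Extra-P guards against that with statistical quality control (see the following slides), which this sketch omits.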
Weak vs. strong scaling
• Wall-clock time is not necessarily monotonically increasing under strong scaling
• Strong-scaling behavior is harder to capture in a model automatically
• Different invariants require different reductions across processes

               Weak scaling               Strong scaling
Invariant      Problem size per process   Overall problem size
Model target   Wall-clock time            Accumulated time
Reduction      Maximum / average          Sum
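As a tiny illustration of the reduction row in the table (numbers invented): under weak scaling one would typically reduce per-process wall-clock times with a maximum or average, whereas under strong scaling one accumulates them into a sum.

    # Invented per-process times for one function, in seconds.
    per_process_times = [3.9, 4.1, 4.0, 4.3]

    weak_scaling_target   = max(per_process_times)   # wall-clock time (or use the average)
    strong_scaling_target = sum(per_process_times)   # accumulated time across all processes

    print(weak_scaling_target, strong_scaling_target)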