Update on the Performance-Modeling Tool Extra-P Felix Wolf, TU Darmstadt
Acknowledgement • David Beckingsale • Alexandru Calotoiu • Christopher W. Earl • Torsten Hoefler • Kashif Ilyas • Ian Karlin • Daniel Lorenz • Patrick Reisert • Martin Schulz • Sergei Shudler • Andreas Vogel 7/10/18 | Department of Computer Science | Laboratory for Parallel Programming | Felix Wolf | 2
Latent scalability bugs
[Figure: wall time as a function of system size]
Motivation
Performance model = formula that expresses relevant performance metrics as a function of one or more execution parameters
Manual creation challenging
• Identify kernels: incomplete coverage
• Create models: laborious, difficult
[Figure: time [s] vs. processes ($2^9$ to $2^{13}$), with the model $3 \cdot 10^{-4} p^2 + c$]
Automatic empirical performance modeling
Small-scale measurements → generation of candidate models and selection of best fit → one model per kernel
Performance model normal form (PMNF):
$f(p) = \sum_{k=1}^{n} c_k \cdot p^{i_k} \cdot \log_2^{j_k}(p)$
Candidate models:
$c_1 + c_2 \cdot p$
$c_1 + c_2 \cdot p^2$
$c_1 + c_2 \cdot \log(p)$
$c_1 + c_2 \cdot p \cdot \log(p)$
$c_1 + c_2 \cdot p^2 \cdot \log(p)$
$c_1 \cdot \log(p) + c_2 \cdot p$
$c_1 \cdot \log(p) + c_2 \cdot p \cdot \log(p)$
$c_1 \cdot \log(p) + c_2 \cdot p^2$
$c_1 \cdot \log(p) + c_2 \cdot p^2 \cdot \log(p)$
$c_1 \cdot p + c_2 \cdot p \cdot \log(p)$
$c_1 \cdot p + c_2 \cdot p^2$
$c_1 \cdot p + c_2 \cdot p^2 \cdot \log(p)$
$c_1 \cdot p \cdot \log(p) + c_2 \cdot p^2$
$c_1 \cdot p \cdot \log(p) + c_2 \cdot p^2 \cdot \log(p)$
$c_1 \cdot p^2 + c_2 \cdot p^2 \cdot \log(p)$
Resulting models, $t = f(p)$:
Kernel [2 of 40]      Model [s]
sweep → MPI_Recv      $4.03 \sqrt{p}$
sweep                 $582.19$
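The generate-and-select step can be sketched in a few lines. The sketch below assumes two-term hypotheses built from I = {0, 1, 2} and J = {0, 1}, which yields the 15 candidates listed on the slide, and it selects the winner by residual sum of squares. The selection criterion, the function names, and the synthetic measurements are illustrative assumptions, not Extra-P's actual implementation.

```python
import itertools
import math


def term(p, i, j):
    """One PMNF term without its coefficient: p^i * log2(p)^j."""
    return p ** i * math.log2(p) ** j


def fit_pair(ps, ts, e1, e2):
    """Least-squares fit of c1*term(e1) + c2*term(e2) via the 2x2 normal equations."""
    a = [term(p, *e1) for p in ps]
    b = [term(p, *e2) for p in ps]
    saa = sum(x * x for x in a)
    sab = sum(x * y for x, y in zip(a, b))
    sbb = sum(y * y for y in b)
    sat = sum(x * t for x, t in zip(a, ts))
    sbt = sum(y * t for y, t in zip(b, ts))
    det = saa * sbb - sab * sab
    c1 = (sat * sbb - sbt * sab) / det
    c2 = (saa * sbt - sab * sat) / det
    rss = sum((c1 * x + c2 * y - t) ** 2 for x, y, t in zip(a, b, ts))
    return rss, c1, c2


def best_model(ps, ts, I=(0, 1, 2), J=(0, 1)):
    """Fit every two-term hypothesis and return the one with the smallest RSS."""
    exponents = list(itertools.product(I, J))
    hypotheses = list(itertools.combinations(exponents, 2))
    fits = [(fit_pair(ps, ts, e1, e2), (e1, e2)) for e1, e2 in hypotheses]
    (rss, c1, c2), terms = min(fits, key=lambda f: f[0][0])
    return terms, (c1, c2), rss, len(hypotheses)


# Synthetic small-scale measurements (assumed): t = 5 + 2 * p * log2(p)
ps = [4, 8, 16, 32, 64]
ts = [5 + 2 * p * math.log2(p) for p in ps]
terms, coeffs, rss, n_hypotheses = best_model(ps, ts)
print(n_hypotheses, terms, coeffs)
```

Each exponent pair (i, j) stands for the term p^i · log2^j(p), so the pair (0, 0) is the constant term. Solving the normal equations directly keeps the sketch dependency-free; a production fit would more likely use a numerically robust least-squares routine.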
Extra-P 3.0
• GUI improvements, better stability, additional features
• Tutorials available through VI-HPS and upon request
http://www.scalasca.org/software/extra-p/download.html
Recent developments
1. Performance models with multiple parameters
2. Automatic configuration of the search space
3. Segmented models
4. Iso-efficiency modeling
5. Lightweight requirements engineering for co-design
Models with more than one parameter
$f(x_1, \ldots, x_m) = \sum_{k=1}^{n} c_k \prod_{l=1}^{m} x_l^{i_{kl}} \cdot \log_2^{j_{kl}}(x_l)$
$n = 3$, $m = 3$, $I = \{0/4, 1/4, \ldots, 12/4\}$, $J = \{0, 1, 2\}$
Search space explosion
• Total number of hypotheses to search: 34,786,300,841,019
• Too slow for any practical purpose
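The hypothesis count on this slide can be reproduced with a short calculation. The counting rule below is an assumption, not something the slide states: each parameter contributes |I| · |J| exponent choices, a compound term picks one choice per parameter, and a hypothesis is an unordered set of n distinct compound terms. Under that rule the result matches the slide's figure exactly.

```python
from math import comb

I = [k / 4 for k in range(13)]   # exponents 0/4, 1/4, ..., 12/4
J = [0, 1, 2]                    # exponents of the logarithm
m = 3                            # number of parameters
n = 3                            # number of terms per hypothesis

choices_per_parameter = len(I) * len(J)        # 13 * 3 = 39
compound_terms = choices_per_parameter ** m    # 39^3 = 59319
hypotheses = comb(compound_terms, n)           # unordered sets of n terms

rate = 300_000                                 # hypotheses per second, from the slides
years = hypotheses / rate / (365.25 * 24 * 3600)

print(hypotheses)        # 34786300841019
print(round(years, 1))   # a bit over three and a half years per model
```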
Search space reduction through heuristics
• Hierarchical search
– Assumes that the best multi-parameter model is a combination of the best single-parameter hypotheses for each parameter
• Modified golden section search
– Speeds up the single-parameter search by ordering the hypothesis space and then using a variant of binary search to find the model in logarithmic rather than linear time
Calotoiu et al.
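The idea behind the second heuristic, searching an ordered hypothesis space in logarithmic time, can be illustrated with a plain binary search over a unimodal error sequence. This is a simplified stand-in, not the modified golden section search itself. It assumes the fit error first falls and then rises along the ordered exponents, and it uses hypothetical single-term hypotheses c · p^α with synthetic data.

```python
def single_term_rss(ps, ts, alpha):
    """RSS of the best least-squares fit of c * p^alpha to the measurements."""
    xs = [p ** alpha for p in ps]
    c = sum(x * t for x, t in zip(xs, ts)) / sum(x * x for x in xs)
    return sum((c * x - t) ** 2 for x, t in zip(xs, ts))


def unimodal_argmin(error, n):
    """Index of the minimum of error(0..n-1), assuming errors fall, then rise."""
    lo, hi = 0, n - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if error(mid) < error(mid + 1):
            hi = mid          # minimum lies at mid or to its left
        else:
            lo = mid + 1      # minimum lies to the right of mid
    return lo


exponents = [k / 4 for k in range(13)]     # ordered hypothesis space
ps = [2, 4, 8, 16, 32]
ts = [3 * p ** 1.5 for p in ps]            # synthetic measurements (assumed)

evaluated = {}                             # cache to count distinct fits


def error(idx):
    if idx not in evaluated:
        evaluated[idx] = single_term_rss(ps, ts, exponents[idx])
    return evaluated[idx]


best = unimodal_argmin(error, len(exponents))
print(exponents[best], len(evaluated), "of", len(exponents), "hypotheses fitted")
```

Only the hypotheses probed by the search are fitted, which is where the saving over a linear scan comes from. The approach breaks down if the error is not unimodal along the chosen ordering.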
Search space reduction
• Assuming 300,000 hypotheses searched per second (this is optimistic)
• 3-parameter models: $n = 3$, $m = 3$, $I = \{0/4, 1/4, \ldots, 12/4\}$, $J = \{0, 1, 2\}$
Search strategy                                    Hypotheses searched    Throughput
Exhaustive search                                  34,786,300,841,019     ~1 model / 3.5 years
Hierarchical search                                27,929                 ~11 models / second
Hierarchical + modified golden section search      590                    ~508 models / second
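The throughput figures for the reduced search spaces follow directly from the assumed search rate. A minimal check, using only the numbers shown on the slide:

```python
rate = 300_000                      # hypotheses searched per second (slide assumption)
reduced_search_spaces = [27_929, 590]

# Models per second = hypotheses per second / hypotheses per model
models_per_second = [rate / size for size in reduced_search_spaces]

for size, mps in zip(reduced_search_spaces, models_per_second):
    print(f"{size} hypotheses per model -> ~{round(mps)} models / second")
```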
Evaluation with synthetic data (100,000 models with two parameters)
• Exhaustive search: 107 hours
• Heuristics: 1.5 hours
[Figure: distribution of generated models [%] for both approaches across three categories: optimal model identified, lead-order term identified, lead-order term not identified]
Evaluation with application data
[Figure: distribution of generated models [%] for Blast (full), Blast (partial), CloverLeaf, and Kripke across three categories: identical models, lead-order terms identical, different lead-order terms]
Case study – Kripke
• Neutron transport proxy code
• Three parameters considered
– Process count – p
– Number of directions – d
– Number of groups – g
Expected behavior
SweepSolver (main computation kernel)
• Expectation – Performance depends on problem size: $t \sim d \cdot g$
• Actual model: $t = 5 + d \cdot g + 0.005 \cdot \sqrt[3]{p} \cdot d \cdot g$
MPI_Testany (main communication kernel: 3D wave-front communication pattern)
• Expectation – Performance depends on cubic root of process count: $t \sim \sqrt[3]{p}$
• Actual model: $t = 7 + \sqrt[3]{p} + 0.005 \cdot \sqrt[3]{p} \cdot d \cdot g$
Kernels must wait on each other: smaller compounded effect discovered
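To see how small the compounded effect is compared with the expected terms, the two models can be evaluated directly. The sketch assumes the cube roots sit on every p in the slide's formulas (the radicals are lost in the extracted text), and the sample values for p, d, and g are hypothetical.

```python
def cbrt(p):
    """Cube root of the process count."""
    return p ** (1 / 3)


def sweep_solver(p, d, g):
    """Slide model for SweepSolver: 5 + d*g + 0.005 * cbrt(p) * d * g."""
    return 5 + d * g + 0.005 * cbrt(p) * d * g


def mpi_testany(p, d, g):
    """Slide model for MPI_Testany: 7 + cbrt(p) + 0.005 * cbrt(p) * d * g."""
    return 7 + cbrt(p) + 0.005 * cbrt(p) * d * g


def compound_term(p, d, g):
    """The term shared by both kernels."""
    return 0.005 * cbrt(p) * d * g


# Hypothetical configurations: vary the process count at fixed d and g
d, g = 32, 64
for p in (512, 4096, 32768):
    share = compound_term(p, d, g) / sweep_solver(p, d, g)
    print(p, round(sweep_solver(p, d, g), 1), round(mpi_testany(p, d, g), 1),
          f"compound share of SweepSolver: {share:.1%}")
```

The shared term grows with both the process count and the problem size, which is consistent with the slide's remark that the kernels must wait on each other.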
How to find good PMNF parameters?
Option (1): Rely on default parameters
→ But what if they don't fit the problem?
Option (2): Try those parameters that you expect to fit
→ Requires prior expertise! Also, what if your expectation is wrong?
Option (3): Try very large sets I, J
→ Requires more resources (especially bad for multiple parameters)!
Option (4): Let Extra-P automatically refine the search space based on previous results.
Simplified PMNF
• Use only constant and “lead order” term
• Want to find values for c₀, c₁, α, and β, such that model error is minimized
• c₀ and c₁ are determined by regression
• What about α and β?
Simplified PMNF
We define four slices:
• β = 0, α = ?
• β = 1, α = ?
• β = 2, α = ?
• α = 0, β = ?
Goal: Unimodal error distribution along each slice
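A slice-wise search can be sketched as follows. The sketch assumes the simplified model has the form c₀ + c₁ · p^α · log2^β(p), determines c₀ and c₁ by linear regression for each trial exponent, and minimizes the residual sum of squares along each slice with a standard golden section search. The search ranges, the error measure, and the synthetic data are assumptions made for illustration; the refinement Extra-P actually performs may differ.

```python
import math


def fit(ps, ts, alpha, beta):
    """Regress c0 + c1 * p^alpha * log2(p)^beta; return (rss, c0, c1)."""
    xs = [p ** alpha * math.log2(p) ** beta for p in ps]
    n = len(ps)
    mx, mt = sum(xs) / n, sum(ts) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxt = sum((x - mx) * (t - mt) for x, t in zip(xs, ts))
    c1 = sxt / sxx if sxx > 0 else 0.0     # degenerate case: term is constant
    c0 = mt - c1 * mx
    rss = sum((c0 + c1 * x - t) ** 2 for x, t in zip(xs, ts))
    return rss, c0, c1


def golden_section(f, lo, hi, tol=1e-4):
    """Minimize a unimodal function f on [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c, d = b - inv_phi * (b - a), a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:                        # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:                              # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return (a + b) / 2


def search_slices(ps, ts, alpha_range=(0.0, 3.0), beta_range=(0.0, 2.0)):
    """Search the four slices and return (rss, alpha, beta) of the best one."""
    results = []
    for beta in (0, 1, 2):                 # slices with fixed beta
        alpha = golden_section(lambda a: fit(ps, ts, a, beta)[0], *alpha_range)
        results.append((fit(ps, ts, alpha, beta)[0], alpha, beta))
    beta = golden_section(lambda b: fit(ps, ts, 0.0, b)[0], *beta_range)
    results.append((fit(ps, ts, 0.0, beta)[0], 0.0, beta))   # slice alpha = 0
    return min(results)


# Synthetic measurements (assumed): t = 10 + 2 * p^1.5 * log2(p)
ps = [4, 8, 16, 32, 64, 128, 256]
ts = [10 + 2 * p ** 1.5 * math.log2(p) for p in ps]
rss, alpha, beta = search_slices(ps, ts)
print(round(alpha, 3), beta)
```

Golden section search only finds the true minimum when the error along the slice is unimodal, which is why the slide states unimodality as the goal.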
Evaluation
Data from previous case studies
• Sweep3D
• MILC
• UG4
• MPI collective operations
• BLAST
• Kripke
• 5–9 points available
• Last data point (largest p) not used for modeling, but to evaluate prediction accuracy
Results
• 4453 models
• 49% remain unchanged
• 39% get better
• 12% get worse
• Mean relative prediction error down from 45.7% to 13.0%
• Improvements in every individual case study
Reisert et al.
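The hold-out protocol described above, where the model is fitted on all points except the largest p and then scored on that point, can be sketched as follows. The definition of relative error as |predicted − measured| / |measured| is an assumption, and the linear fit merely stands in for whatever model generator is being evaluated.

```python
def fit_linear(ps, ts):
    """Least-squares fit of c0 + c1 * p; returns the model as a callable."""
    n = len(ps)
    mp, mt = sum(ps) / n, sum(ts) / n
    c1 = sum((p - mp) * (t - mt) for p, t in zip(ps, ts)) / sum((p - mp) ** 2 for p in ps)
    c0 = mt - c1 * mp
    return lambda p: c0 + c1 * p


def holdout_relative_error(ps, ts, fit):
    """Fit on all but the last point (largest p) and score the prediction there."""
    model = fit(ps[:-1], ts[:-1])
    predicted = model(ps[-1])
    return abs(predicted - ts[-1]) / abs(ts[-1])


# Hypothetical measurements that actually grow quadratically
ps = [1, 2, 3, 4, 5]
ts = [p ** 2 for p in ps]
error = holdout_relative_error(ps, ts, fit_linear)
print(f"relative prediction error at p = {ps[-1]}: {error:.0%}")   # 20%
```

A model class that does not match the true growth shows up as a large error at the held-out point, which is what makes this protocol useful for comparing search-space configurations.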
Segmented behavior
[Figure: runtime vs. number of processors (p). First behaviour: $p^2$. Second behaviour: $30 + p$. Model predicted by Extra-P: $\log_2^2(p)$]
Divide data into subsets
[Figure: runtime vs. number of processors (p), with the measurements following $p^2$ and then $30 + p$, divided into overlapping subsets (Subset 1, Subset 2, Subset 3, ..., Subset 6)]
Model each subset and compute nRSS
• Heterogeneous subsets yield high nRSS
[Figure: normalized RSS values per subset]
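The subset idea can be illustrated with a sliding window over the measurements. The sketch makes several assumptions that the slides do not specify: a window of five consecutive points, a hypothesis set reduced to c₀ + c₁ · p and c₀ + c₁ · p², nRSS defined as the square root of the RSS divided by the mean measurement in the window, and an arbitrary threshold of 0.05. The data reuse the two behaviours from the figure, p² followed by 30 + p.

```python
import math


def fit_rss(xs, ts):
    """RSS of the least-squares fit c0 + c1 * x."""
    n = len(xs)
    mx, mt = sum(xs) / n, sum(ts) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    c1 = sum((x - mx) * (t - mt) for x, t in zip(xs, ts)) / sxx
    c0 = mt - c1 * mx
    return sum((c0 + c1 * x - t) ** 2 for x, t in zip(xs, ts))


def window_nrss(ps, ts, size=5):
    """Best-fit normalized RSS for every window of consecutive points."""
    values = []
    for start in range(len(ps) - size + 1):
        wp, wt = ps[start:start + size], ts[start:start + size]
        # Best of the two hypotheses: linear in p, or linear in p^2
        rss = min(fit_rss(wp, wt), fit_rss([p ** 2 for p in wp], wt))
        values.append(math.sqrt(rss) / (sum(wt) / size))
    return values


# Segmented synthetic data: p^2 up to p = 6, then 30 + p
ps = list(range(1, 13))
ts = [p ** 2 if p <= 6 else 30 + p for p in ps]

nrss = window_nrss(ps, ts)
threshold = 0.05                      # assumed cut-off for "high" nRSS
heterogeneous = [i for i, v in enumerate(nrss) if v > threshold]
print([round(v, 3) for v in nrss])
print("windows spanning the change point:", heterogeneous)
```

Windows that lie entirely within one behaviour are fitted almost perfectly, while windows that straddle the change point cannot be fitted well by any single hypothesis and therefore stand out.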