A simple Concept for the Performance Analysis of Cluster-Computing

H. Kredel¹, S. Richling², J.P. Kruse³, E. Strohmaier⁴, H.G. Kruse¹
¹ IT-Center, University of Mannheim, Germany
² IT-Center, University of Heidelberg, Germany
³ Institute of Geosciences, Goethe University Frankfurt, Germany
⁴ Future Technology Group, LBNL, Berkeley, USA
ISC'13, Leipzig, 18 June 2013

Outline
◮ Introduction
◮ Performance Model
◮ Applications
    ◮ Scalar-Product of Vectors
    ◮ Matrix Multiplication
    ◮ Linpack
    ◮ TOP500
◮ Conclusions

Introduction
Motivation
◮ Sophisticated mathematical models for performance analysis cannot keep up with rapid hardware development.
◮ There is a lack of reliable rules of thumb to estimate the size and performance of clusters.
Goals
◮ Development of a simple and transparent model.
◮ Restriction to few parameters describing hardware and software.
◮ Using speed-up as a dimensionless metric.
◮ Finding the optimal size of a cluster for a given application.
◮ Validation of the results by modeling of standard kernels.

Related Work
◮ Roofline model for multi-cores (Williams et al. 2009)
◮ Performance models by Hockney:
    ◮ Model with few hardware and software parameters, focus on benchmark runtimes and performance (Hockney 1987, Hockney & Jesshope 1988)
    ◮ Model based on similarities to fluid dynamics (Hockney 1995)
◮ Performance models by Numrich:
    ◮ Based on Newton's classical mechanics (Numrich 2007)
    ◮ Based on dimensional analysis (Numrich 2008)
    ◮ Based on the Pi theorem (Numrich 2010)
◮ Linpack performance model (Luszczek & Dongarra 2011)
◮ Performance model based on a stochastic approach (Kruse 2009, Kredel et al. 2010)
◮ Performance model for interconnected clusters (Kredel et al. 2012)

Model Parameters
Hardware Parameters
◮ $p$: number of processing units (PUs), numbered $1, 2, 3, 4, \dots, p$
◮ $l_k^{peak}$: theoretical peak performance of each PU, $k = 1, \dots, p$
◮ $b_c$: bandwidth of the network
Software Parameters
◮ $\#op$: total number of arithmetic operations
◮ $\#b$: total number of bytes involved
◮ $\#x$: total number of bytes communicated between the PUs

Distribution of the work load ($\#op$, $\#b$)
Homogeneous case
◮ Distribution of operations $\#op$ into shares $o_1, o_2, \dots, o_p$:
  $$o_k = \#op / p \quad (\text{or } \omega_k = 1/p)$$
◮ Distribution of data $\#b$ into shares $d_1, d_2, \dots, d_p$:
  $$d_k = \#b / p \quad (\text{or } \delta_k = 1/p)$$

Distribution of the work load ($\#op$, $\#b$)
Heterogeneous case → additional parameters ($\omega_k$, $\delta_k$)
◮ Distribution of operations $\#op$ into shares $o_1, o_2, \dots, o_p$:
  $$o_k = \omega_k \cdot \#op \quad \text{with} \quad \sum_{k=1}^{p} \omega_k = 1$$
◮ Distribution of data $\#b$ into shares $d_1, d_2, \dots, d_p$:
  $$d_k = \delta_k \cdot \#b \quad \text{with} \quad \sum_{k=1}^{p} \delta_k = 1$$
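The two distribution rules can be illustrated with a minimal Python sketch. The function name `distribute` and the example work load and weights are assumptions chosen for illustration; they are not taken from the slides.

```python
import math

def distribute(total_ops, total_bytes, omega, delta):
    """Split the work load (#op, #b) into per-PU shares o_k and d_k.

    omega and delta are the weights omega_k and delta_k; each list must sum to 1.
    """
    if len(omega) != len(delta):
        raise ValueError("omega and delta need one weight per PU")
    for name, weights in (("omega", omega), ("delta", delta)):
        if not math.isclose(sum(weights), 1.0, abs_tol=1e-9):
            raise ValueError(f"{name} weights must sum to 1")
    ops = [w * total_ops for w in omega]     # o_k = omega_k * #op
    data = [w * total_bytes for w in delta]  # d_k = delta_k * #b
    return ops, data

p = 4
# Homogeneous case: omega_k = delta_k = 1/p (hypothetical work load).
homo_ops, homo_data = distribute(2.0e9, 1.6e10, [1 / p] * p, [1 / p] * p)
# Heterogeneous case: hypothetical weights for the operations.
hetero_ops, hetero_data = distribute(2.0e9, 1.6e10, [0.4, 0.3, 0.2, 0.1], [0.25] * p)
print(homo_ops)
print(hetero_ops)
```

The normalisation check mirrors the constraints $\sum_k \omega_k = 1$ and $\sum_k \delta_k = 1$.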

Performance Indicators
Primary performance measure
◮ $t$: total time to process the work load ($\#op$, $\#b$)
Derived performance measures
◮ Performance: $l(p) = \dfrac{\#op}{t}$
◮ Speed-up (dimensionless): $S = \dfrac{l(p)}{l(1)}$
Goal: Speed-up as a function of
◮ total work load ($\#op$, $\#b$) [Flop, Byte]
◮ work distribution ($\omega_k$, $\delta_k$)
◮ communication requirements $\#x$ [Byte]
◮ hardware parameters ($p$, $l_k^{peak}$, $b_c$) [-, Flop/s, Byte/s]

Total execution time
Computation time
$$t_r = \max\{\, t_1(o_1, d_1), \dots, t_p(o_p, d_p) \,\} \simeq \frac{o_k}{l_k} \ge \frac{o_k}{l_k^{peak}}$$
Communication time
$$t_c \simeq \frac{\#x}{b_c}$$
Total execution time
$$t \simeq t_r + t_c, \qquad t \ge \frac{o_k}{l_k^{peak}} + \frac{\#x}{b_c}$$
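The lower bound on the total time can be evaluated directly. The following Python sketch assumes that the slowest PU determines the computation time; the function name and the numerical values are hypothetical.

```python
def time_lower_bound(ops_per_pu, peak_per_pu, bytes_exchanged, bandwidth):
    """Lower bound t >= max_k(o_k / l_k^peak) + #x / b_c."""
    # Computation: the slowest PU sets the pace (taken at peak performance).
    t_r = max(o / l for o, l in zip(ops_per_pu, peak_per_pu))
    # Communication: all exchanged bytes pass through a network of bandwidth b_c.
    t_c = bytes_exchanged / bandwidth
    return t_r + t_c

# Hypothetical example: 4 PUs with 5e8 operations each at 8 GFlop/s,
# 1.5e8 bytes exchanged over a 1.5 GByte/s network.
t_min = time_lower_bound([5.0e8] * 4, [8.0e9] * 4, 1.5e8, 1.5e9)
print(t_min)
```

Computation and communication are simply added here, as in the slide; overlapping the two is not modelled.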

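Total execution time
$$t \ge \frac{\omega_k \cdot \#op}{l_k^{peak}} + \frac{\#x}{b_c} = \frac{\omega_k \cdot \#op}{l_k^{peak}} \left( 1 + \frac{1}{\omega_k} \cdot \frac{\#b}{\#op} \cdot \frac{\#x}{\#b} \cdot \frac{l_k^{peak}}{b_c} \right)$$
$$t \ge \frac{\omega_k \cdot \#op}{l_k^{peak}} \left( 1 + \frac{1}{x_k} \right)$$
One dimensionless parameter for "hardware + software":
$$x_k = \omega_k \cdot \frac{a \cdot r}{a_k^*}$$
◮ $a = \dfrac{\#op}{\#b}$: computational intensity of the software [Flop/Byte]
◮ $a_k^* = \dfrac{l_k^{peak}}{b_c}$: "computational intensity" of the hardware [Flop/Byte]
◮ $r = \dfrac{\#b}{\#x}$: "inverse communication intensity" [-]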

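Performance and Speed-up
Performance
$$l = \frac{\#op}{t} \le \frac{l_k^{peak}}{\omega_k} \cdot \frac{x_k}{1 + x_k}$$
Speed-up
$$S = \frac{l(p)}{l(1)} = \frac{l_k(\omega_k < 1)}{l_k(\omega_k = 1)} = \frac{1 + x_k(\omega_k = 1)}{1 + \omega_k \cdot x_k(\omega_k = 1)}$$
$$x_k(\omega_k = 1) = \frac{a \cdot r}{a_k^*} = \frac{a \cdot b_c}{l_k^{peak}} \cdot r = \frac{a \cdot b_c^0}{l_k^{peak}} \cdot \frac{b_c}{b_c^0} \cdot r = \hat{x}_k \cdot z \cdot r$$
General case with $\omega_k = \omega(k, p) / p$:
$$S = \frac{1 + \hat{x}_k \cdot r \cdot z}{1 + \omega(k, p) \cdot \hat{x}_k \cdot r \cdot z / p}$$
Homogeneous case with $\omega(k, p) = 1$:
$$S = \frac{1 + \hat{x} \cdot r \cdot z}{1 + \hat{x} \cdot r \cdot z / p}$$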
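As a minimal sketch, the dimensionless parameter and the homogeneous speed-up can be computed from the raw model parameters. The function names are hypothetical; the example values are the scalar-product parameters and bwGRiD hardware numbers that appear elsewhere in this talk, for an assumed configuration of n = 10^6 and p = 100.

```python
def x_parameter(n_ops, n_bytes, n_exchanged, l_peak, b_c, omega=1.0):
    """x_k = omega_k * a * r / a*_k for one processing unit."""
    a = n_ops / n_bytes        # computational intensity of the software
    a_star = l_peak / b_c      # "computational intensity" of the hardware
    r = n_bytes / n_exchanged  # "inverse communication intensity"
    return omega * a * r / a_star

def speedup_homogeneous(n_ops, n_bytes, n_exchanged, p, l_peak, b_c):
    """S = (1 + x) / (1 + x / p), with x evaluated at omega_k = 1."""
    x1 = x_parameter(n_ops, n_bytes, n_exchanged, l_peak, b_c, omega=1.0)
    return (1.0 + x1) / (1.0 + x1 / p)

# Assumed example: scalar product with n = 1e6 on p = 100 PUs
# (#op = 2n, #b = 2n*w, #x = p*w with w = 8 bytes).
n, p, w = 10**6, 100, 8
x1 = x_parameter(2 * n, 2 * n * w, p * w, l_peak=8.0e9, b_c=1.5e9)
s = speedup_homogeneous(2 * n, 2 * n * w, p * w, p, l_peak=8.0e9, b_c=1.5e9)
print(x1, s)
```

Passing $\#x$ as a plain number keeps the function generic; any dependence of $\#x$ on $p$ has to be resolved by the caller.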

Application-oriented Analysis
Application characterized by problem size $n$.
Software Parameters
◮ $\#op \to \#op(n)$
◮ $\#b \to \#b(n)$
◮ $\#x \to \#x(n, p)$
Analysis of the performance of a homogeneous cluster
$$l \le p \, l^{peak} \frac{x}{x + 1} = l^{peak} \cdot \frac{y \cdot r(n, p)}{1 + y \cdot r(n, p) / p}$$
with
$$x = \hat{x} \cdot z \cdot r(n, p) / p = y \cdot r(n, p) / p \simeq y \cdot c(n) \cdot \frac{1}{d(p) \, p},$$
that is, $y = \hat{x} \cdot z$ and $r(n, p) \simeq c(n) / d(p)$.
◮ Number of PUs $p_{1/2}$ necessary to reach half of the maximum performance of all $p$ PUs:
  $$l(p_{1/2}) = \tfrac{1}{2} \, p_{1/2} \, l^{peak} \;\to\; y \cdot r(n, p_{1/2}) = p_{1/2}$$
◮ Number of PUs $p_{max}$ to obtain the maximum of the performance:
  $$\frac{dl}{dp} = 0 \;\to\; p_{max}^2 \cdot d'(p_{max}) = y \cdot c(n) = \hat{x} \cdot z \cdot c(n)$$
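Both characteristic cluster sizes can also be located numerically. The sketch below scans integer values of p; it assumes that x = y·r(p)/p decreases with p, and it uses r(p) = 1/√p (the form implied by the matrix-multiplication parameters of this talk) with a hypothetical value y = 1000. Function names are illustrative only.

```python
import math

def performance(p, y, r, l_peak=1.0):
    """Upper bound l(p) = l_peak * y*r(p) / (1 + y*r(p)/p)."""
    yr = y * r(p)
    return l_peak * yr / (1.0 + yr / p)

def p_half(y, r, p_limit=10_000):
    """Smallest integer p with x = y*r(p)/p <= 1, i.e. y*r(p) <= p."""
    for p in range(1, p_limit + 1):
        if y * r(p) / p <= 1.0:
            return p
    return None  # not reached within the scanned range

def p_max(y, r, p_limit=10_000):
    """Integer p in [1, p_limit] that maximizes the performance bound."""
    return max(range(1, p_limit + 1), key=lambda p: performance(p, y, r))

def r_sqrt(p):
    # r(p) = c / d(p) with c = 1 and d(p) = sqrt(p)
    return 1.0 / math.sqrt(p)

y = 1000.0  # hypothetical value of y = x_hat * z
print(p_half(y, r_sqrt), p_max(y, r_sqrt))
```

For this choice of r the conditions can be solved by hand: $y \cdot r(p_{1/2}) = p_{1/2}$ gives $p_{1/2} = y^{2/3}$, and $p_{max}^2 \cdot d'(p_{max}) = y$ with $d(p) = \sqrt{p}$ gives $p_{max} = (2y)^{2/3}$, which the scan should reproduce up to rounding to an integer.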

Compute resources for the simulations
bwGRiD Cluster

Site            Nodes
Mannheim          140   (interconnected with Heidelberg to a single cluster)
Heidelberg        140   (interconnected with Mannheim to a single cluster)
Karlsruhe         140
Stuttgart         420
Tübingen          140
Ulm/Konstanz      280   (joint cluster with Konstanz)
Freiburg          140
Esslingen         180
Total            1580

[Figure: map of the bwGRiD sites Mannheim, Heidelberg, Karlsruhe, Stuttgart, Esslingen, Tübingen, Ulm, and Freiburg, with Frankfurt and München shown for orientation.]

bwGRiD – Hardware
Node Configuration
◮ 2 Intel Xeon CPUs, 2.8 GHz (each CPU with 4 cores)
◮ 16 GB memory
◮ 140 GB hard drive (since January 2009)
◮ InfiniBand network (20 Gbit/sec)
Hardware parameters for our model
◮ $l^{peak} = 8$ GFlop/sec (for one core)
◮ $b_c = 1.5$ GByte/sec (node-to-node)
◮ $b_c^0 = 1.0$ GByte/sec (reference bandwidth)

Scalar-Product of two Vectors
$$(u, v) = \sum_k u_k \cdot v_k$$
Software Parameters
◮ $\#op = 2n - 1 \simeq 2n$ if $n \gg 1$
◮ $\#b = 2 n w$
◮ $\#x = p \, w = 8p$
Speed-up
$$S = \frac{1 + x}{1 + x / p} \quad \text{with} \quad x = \frac{3}{64} \cdot \frac{n}{p}$$
Simulations
◮ Vector sizes up to $n = 10^7$
◮ 20 runs for each configuration ($p$, $n$)
◮ Speed-up calculated from mean run-times
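The theoretical speed-up curve for the scalar product can be tabulated with a few lines of Python. This is only a sketch of the closed form on this slide; the function names and the selection of p values are assumptions, and the vector sizes are the ones used in the simulations.

```python
def x_scalar_product(n, p):
    """x = (3/64) * n / p, the slide's value for the bwGRiD parameters."""
    return 3.0 * n / (64.0 * p)

def speedup_scalar_product(n, p):
    """S = (1 + x) / (1 + x / p)."""
    x = x_scalar_product(n, p)
    return (1.0 + x) / (1.0 + x / p)

# Tabulate the model for the simulated vector sizes and a few cluster sizes.
for n in (10**5, 5 * 10**5, 10**6, 10**7):
    row = [round(speedup_scalar_product(n, p), 1) for p in (1, 50, 100, 200, 400)]
    print(n, row)
```

The factor 3/64 follows from $a = 1/w$, $r = 2n/p$, and $a^* = l^{peak}/b_c$ with $w = 8$ bytes, $l^{peak} = 8$ GFlop/sec, and $b_c = 1.5$ GByte/sec.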

Speed-up for Scalar Product
[Figure: speed-up S(p) versus number of PUs p (up to 500) for the scalar product with vector sizes n = 10^5, 5 × 10^5, 10^6, and 10^7; experimental and theoretical curves for each n.]

Matrix Multiplication
$A_{n \times n} \cdot B_{n \times n} = C_{n \times n}$ on a $\sqrt{p} \times \sqrt{p}$ processor grid
Software Parameters
◮ $\#op = 2n^3 - n^2 \simeq 2n^3$
◮ $\#b = 2 n^2 w$
◮ $\#x = 2 n^2 \sqrt{p} \left( 1 - \dfrac{1}{\sqrt{p}} \right) w \simeq 2 n^2 w \sqrt{p}$
Speed-up
$$S = \frac{1 + x}{1 + x / p} \quad \text{with} \quad x = \frac{3}{2048} \cdot \frac{n}{\sqrt{p}}$$
Simulations
◮ Matrix sizes up to $n = 40000$
◮ Cannon's algorithm
◮ Runs with 8 and 4 cores per node

Speed-up for Matrix Multiplication

Linpack
Solution of $Ax = b$
Software Parameters
◮ $\#op = \tfrac{2}{3} n^3$
◮ $\#b = 2 n^2 \cdot w$
◮ $\#x = 3\alpha \left( 1 + \dfrac{\log_2 p}{12} \right) n^2 \cdot w$
Speed-up
$$S \sim \frac{1 + x}{1 + x / p} \quad \text{with} \quad x = \frac{n}{128} \quad \text{and} \quad \alpha = 1/3$$
Simulations
◮ Matrix sizes up to 40000.
◮ Smaller $\alpha$ would lead to better fits for small $p$.

Speed-up for Linpack

Linpack on bwGRiD
Half of peak performance at:
$$p_{1/2} = \frac{y}{3\alpha} = \frac{n}{128}$$
Maximum performance at:
$$p_{max} = \frac{24 \cdot \ln 2}{128} \cdot n = 24 \ln(2) \, p_{1/2}$$
Region with 'good' performance for $n = 10000$:
$$p = [\, p_{1/2}, \, p_{max} \,] = [\, 80, \, 1300 \,]$$
Maximum performance
$$l_{max} \sim l^{peak} \cdot \frac{y}{3\alpha} \cdot \frac{9}{10}$$
$l_{max} = 560$ GFlop/sec for $n = 10000$
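The quoted numbers for n = 10000 can be reproduced from the closed forms on this slide. The Python sketch below fixes α = 1/3 and l^peak = 8 GFlop/sec as given in the talk; the function name is hypothetical.

```python
import math

ALPHA = 1.0 / 3.0  # value of alpha used on the Linpack slide
L_PEAK = 8.0e9     # peak performance per core in Flop/s

def linpack_estimates(n):
    """Return (p_half, p_max, l_max) from the closed forms for bwGRiD."""
    y = n / 128.0
    p_half = y / (3.0 * ALPHA)                       # half of peak performance
    p_max = 24.0 * math.log(2.0) * p_half            # maximum performance
    l_max = L_PEAK * y / (3.0 * ALPHA) * 9.0 / 10.0  # in Flop/s
    return p_half, p_max, l_max

p_half, p_max, l_max = linpack_estimates(10_000)
print(round(p_half), round(p_max), round(l_max / 1e9, 1))
```

For n = 10000 this evaluates to roughly 78 PUs, 1300 PUs, and 562.5 GFlop/sec, which the slide rounds to [80, 1300] and 560 GFlop/sec.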

TOP500
Maximum performance
$$l_{max} = \frac{n \cdot b_c}{3w} \cdot \frac{9}{10}$$
In TOP500 list: $l_{max} \to R_{max}$ and $n \to N_{max}$
Bandwidth $b_c$ not in the list.
Derive Effective Bandwidth
$$b_c^{eff} = \frac{R_{max}}{N_{max}} \cdot 3w \cdot \frac{10}{9}$$
Analyze which parameter predicts ranking best
◮ first 100 systems
◮ excluding systems with accelerators and missing $N_{max}$
◮ comparison with single core performance $l^{peak} = R_{max} / p_{max}$
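The effective bandwidth is the inversion of the maximum-performance formula, which a short Python sketch makes explicit. The function names are hypothetical, w = 8 bytes is assumed as in the scalar-product slide, and the example input is a round trip through the bwGRiD value b_c = 1.5 GByte/sec rather than an actual TOP500 entry.

```python
def l_max(n, b_c, w=8):
    """Maximum Linpack performance l_max = n * b_c / (3w) * 9/10 in Flop/s."""
    return n * b_c / (3.0 * w) * 9.0 / 10.0

def effective_bandwidth(r_max, n_max, w=8):
    """Derived bandwidth b_c^eff = R_max / N_max * 3w * 10/9 in Byte/s."""
    return r_max / n_max * 3.0 * w * 10.0 / 9.0

# Round trip: predict l_max for n = 10000 and recover the bandwidth from it.
r_max = l_max(10_000, 1.5e9)
b_eff = effective_bandwidth(r_max, 10_000)
print(r_max / 1e9, b_eff / 1e9)
```

Applied to a TOP500 entry, `r_max` and `n_max` would be the listed R_max (converted to Flop/s) and N_max.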

TOP500 – November 2011
[Figure: Linpack performance per core l^th [GFlop/sec] (blue) and derived effective bandwidth b_c^eff [GByte/sec] (red), plotted against rank in the TOP500 list (Nov. 2011) for the analyzed systems among the first 100.]

TOP500 – November 2012
[Figure: Linpack performance per core l^th [GFlop/sec] (blue) and derived effective bandwidth b_c^eff [GByte/sec] (red), plotted against rank in the TOP500 list (November 2012) for the analyzed systems among the first 100.]

Conclusions
◮ Developed a performance model which integrates the characteristics of hardware and software with a few parameters.
◮ Model provides simple formulae for performance and speed-up.
◮ Results compare reasonably well with simulations of standard applications.
◮ Model allows estimation of the optimal size of a cluster for a given class of applications.
◮ Model allows estimation of the maximum performance for a given class of applications.
◮ Identified effective bandwidth as a key performance indicator for Linpack (TOP500) on compute clusters.
◮ Future work:
    ◮ Analysis of inhomogeneous clusters with asymmetric load distribution
    ◮ Further applications: sparse matrix-vector operations and FFT
