A simple Concept for the Performance Analysis of Cluster-Computing

H. Kredel (1), S. Richling (2), J.P. Kruse (3), E. Strohmaier (4), H.G. Kruse (1)

(1) IT-Center, University of Mannheim, Germany
(2) IT-Center, University of Heidelberg, Germany
(3) Institute of Geosciences, Goethe University Frankfurt, Germany
(4) Future Technology Group, LBNL, Berkeley, USA

ISC'13, Leipzig, 18 June 2013
Outline
◮ Introduction
◮ Performance Model
◮ Applications
◮ Scalar-Product of Vectors
◮ Matrix Multiplication
◮ Linpack
◮ TOP500
◮ Conclusions
Introduction

Motivation
◮ Sophisticated mathematical models for performance analysis cannot keep up with rapid hardware development.
◮ There is a lack of reliable rules of thumb to estimate the size and performance of clusters.

Goals
◮ Development of a simple and transparent model.
◮ Restriction to few parameters describing hardware and software.
◮ Using speed-up as a dimensionless metric.
◮ Finding the optimal size of a cluster for a given application.
◮ Validation of the results by modeling of standard kernels.
Related Work
◮ Roofline model for multi-cores (Williams et al. 2009)
◮ Performance models by Hockney:
  ◮ Model with few hardware and software parameters, focus on benchmark runtimes and performance (Hockney 1987, Hockney & Jesshope 1988)
  ◮ Model based on similarities to fluid dynamics (Hockney 1995)
◮ Performance models by Numrich:
  ◮ Based on Newton's classical mechanics (Numrich 2007)
  ◮ Based on dimensional analysis (Numrich 2008)
  ◮ Based on the Pi theorem (Numrich 2010)
◮ Linpack performance model (Luszczek & Dongarra 2011)
◮ Performance model based on a stochastic approach (Kruse 2009, Kredel et al. 2010)
◮ Performance model for interconnected clusters (Kredel et al. 2012)
Model Parameters

Hardware Parameters
[Diagram: processing units numbered 1, 2, 3, 4, ..., p, each with peak performance $l^{peak}$]
  $p$            number of processing units (PUs)
  $l_k^{peak}$   theoretical peak performance of each PU, $k = 1, \ldots, p$
  $b_c$          bandwidth of the network

Software Parameters
  $\#op$   total number of arithmetic operations
  $\#b$    total number of bytes involved
  $\#x$    total number of bytes communicated between the PUs
Distribution of the work load $(\#op, \#b)$

Homogeneous case
• Distribution of operations $\#op$ over the PUs as $o_1, o_2, o_3, o_4, \ldots, o_p$:
  $o_k = \#op / p$ (or $\omega_k = 1/p$)
• Distribution of data $\#b$ over the PUs as $d_1, d_2, d_3, d_4, \ldots, d_p$:
  $d_k = \#b / p$ (or $\delta_k = 1/p$)
Distribution of the work load $(\#op, \#b)$

Heterogeneous case → additional parameters $(\omega_k, \delta_k)$
• Distribution of operations $\#op$ over the PUs as $o_1, o_2, o_3, o_4, \ldots, o_p$:
  $o_k = \omega_k \cdot \#op$ with $\sum_{k=1}^{p} \omega_k = 1$
• Distribution of data $\#b$ over the PUs as $d_1, d_2, d_3, d_4, \ldots, d_p$:
  $d_k = \delta_k \cdot \#b$ with $\sum_{k=1}^{p} \delta_k = 1$
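The two distribution rules can be sketched in a few lines of Python. The function name `distribute` and the example shares are illustrative assumptions, not part of the original slides; the homogeneous case is simply the special case $\omega_k = \delta_k = 1/p$.

```python
import math

def distribute(num_ops, num_bytes, omega, delta):
    """Split a work load (#op, #b) over the PUs using shares omega_k and delta_k.

    Hypothetical helper: checks that each set of shares sums to 1, as the
    model requires, and returns the lists (o_k) and (d_k).
    """
    if not math.isclose(sum(omega), 1.0) or not math.isclose(sum(delta), 1.0):
        raise ValueError("shares omega_k and delta_k must each sum to 1")
    ops = [w * num_ops for w in omega]      # o_k = omega_k * #op
    data = [d * num_bytes for d in delta]   # d_k = delta_k * #b
    return ops, data

# Homogeneous case: p = 4 PUs, omega_k = delta_k = 1/p.
p = 4
print(distribute(1000, 8000, [1 / p] * p, [1 / p] * p))

# Heterogeneous case with illustrative shares.
print(distribute(1000, 8000, [0.5, 0.25, 0.25], [0.5, 0.25, 0.25]))
```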
Performance Indicators

Primary performance measure
  $t$   total time to process the work load $(\#op, \#b)$

Derived performance measures
  Performance:                $l(p) = \frac{\#op}{t}$
  Speed-up (dimensionless):   $S = \frac{l(p)}{l(1)}$

Goal: Speed-up as a function of
◮ total work load $(\#op, \#b)$ [Flop, Byte]
◮ work distribution $(\omega_k, \delta_k)$
◮ communication requirements $\#x$ [Byte]
◮ hardware parameters $(p, l_k^{peak}, b_c)$ [-, Flop/s, Byte/s]
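As a minimal sketch of how these indicators follow from measured run times: the function names and the timings below are hypothetical and serve only as illustration. Since $\#op$ is the same for 1 and $p$ PUs, the speed-up reduces to the ratio of the mean run times.

```python
from statistics import mean

def performance(num_ops, run_times):
    """Performance l = #op / t, using the mean of the measured run times."""
    return num_ops / mean(run_times)

def measured_speedup(num_ops, times_p, times_1):
    """Speed-up S = l(p) / l(1) from run times on p PUs and on 1 PU."""
    return performance(num_ops, times_p) / performance(num_ops, times_1)

# Hypothetical timings in seconds, for illustration only (not measured data).
times_1 = [10.0, 10.2, 9.8]
times_4 = [2.6, 2.5, 2.4]
print(measured_speedup(2e7, times_4, times_1))
```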
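Total execution time

Computation time
$$t_r = \max\{t_1(o_1, d_1), \ldots, t_p(o_p, d_p)\} \simeq \frac{o_k}{l_k} \geq \frac{o_k}{l_k^{peak}}$$
Here $k$ denotes the PU that determines the maximum.

Communication time
$$t_c \simeq \frac{\#x}{b_c}$$

Total execution time
$$t \simeq t_r + t_c, \qquad t \geq \frac{o_k}{l_k^{peak}} + \frac{\#x}{b_c}$$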
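The lower bound on the execution time can be evaluated directly. This is a sketch under the model's assumptions (computation and communication do not overlap, every PU runs at peak); the function name and the sample values are illustrative.

```python
def time_lower_bound(ops_per_pu, peak_per_pu, bytes_exchanged, bandwidth):
    """Lower bound t >= max_k(o_k / l_k^peak) + #x / b_c.

    ops_per_pu      -- list of o_k [Flop]
    peak_per_pu     -- list of l_k^peak [Flop/s]
    bytes_exchanged -- #x [Byte]
    bandwidth       -- b_c [Byte/s]
    """
    t_r = max(o / l for o, l in zip(ops_per_pu, peak_per_pu))  # slowest PU
    t_c = bytes_exchanged / bandwidth                          # communication
    return t_r + t_c

# Two PUs at 8 GFlop/s each, 1.5 GByte exchanged over a 1.5 GByte/s network.
print(time_lower_bound([8e9, 4e9], [8e9, 8e9], 1.5e9, 1.5e9))
```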
Total execution time

$$t \geq \frac{\omega_k \cdot \#op}{l_k^{peak}} + \frac{\#x}{b_c} = \frac{\omega_k \cdot \#op}{l_k^{peak}} \left( 1 + \frac{l_k^{peak}}{b_c} \cdot \frac{\#b}{\omega_k \cdot \#op} \cdot \frac{\#x}{\#b} \right)$$

$$t \geq \frac{\omega_k \cdot \#op}{l_k^{peak}} \left( 1 + \frac{1}{x_k} \right)$$

One dimensionless parameter for "hardware + software":
$$x_k = \frac{\omega_k \cdot a \cdot r}{a_k^*}$$

  $a = \frac{\#op}{\#b}$                computational intensity of the software [Flop/Byte]
  $a_k^* = \frac{l_k^{peak}}{b_c}$      "computational intensity" of the hardware [Flop/Byte]
  $r = \frac{\#b}{\#x}$                 "inverse communication intensity" [-]
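A short numerical check can confirm that the bound written with $x_k$ equals the direct sum of computation and communication time. The function names and the sample values are assumptions chosen for illustration.

```python
def x_parameter(omega_k, num_ops, num_bytes, num_exchanged, l_peak, b_c):
    """Dimensionless parameter x_k = omega_k * a * r / a_k^*."""
    a = num_ops / num_bytes          # software intensity [Flop/Byte]
    a_star = l_peak / b_c            # hardware intensity [Flop/Byte]
    r = num_bytes / num_exchanged    # inverse communication intensity [-]
    return omega_k * a * r / a_star

def time_bound_direct(omega_k, num_ops, num_exchanged, l_peak, b_c):
    """t >= omega_k * #op / l_peak + #x / b_c."""
    return omega_k * num_ops / l_peak + num_exchanged / b_c

def time_bound_via_x(omega_k, num_ops, l_peak, x_k):
    """t >= (omega_k * #op / l_peak) * (1 + 1 / x_k)."""
    return omega_k * num_ops / l_peak * (1 + 1 / x_k)

# Illustrative values: 4 equal shares, 2 GFlop of work, 16 GByte of data,
# 32 MByte exchanged, 8 GFlop/s per PU, 1.5 GByte/s network.
args = dict(omega_k=0.25, num_ops=2e9, l_peak=8e9)
x_k = x_parameter(0.25, 2e9, 1.6e10, 3.2e7, 8e9, 1.5e9)
print(time_bound_direct(num_exchanged=3.2e7, b_c=1.5e9, **args))
print(time_bound_via_x(x_k=x_k, **args))
```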
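Performance and Speed-up

Performance
$$l = \frac{\#op}{t} \leq \frac{l_k^{peak}}{\omega_k} \cdot \frac{x_k}{1 + x_k}$$

Speed-up
$$S = \frac{l(p)}{l(1)} = \frac{l_k(\omega_k < 1)}{l_k(\omega_k = 1)} = \frac{1 + x_k(\omega_k = 1)}{1 + \omega_k \cdot x_k(\omega_k = 1)}$$

$$x_k(\omega_k = 1) = \frac{a \cdot r}{a_k^*} = \frac{a \cdot b_c}{l_k^{peak}} \cdot r = \frac{a \cdot b_c^0}{l_k^{peak}} \cdot \frac{b_c}{b_c^0} \cdot r = \hat{x}_k \cdot z \cdot r$$

General case with $\omega_k = \omega(k, p) / p$:
$$S = \frac{1 + \hat{x}_k \cdot r \cdot z}{1 + \omega(k, p) \cdot \hat{x}_k \cdot r \cdot z / p}$$

Homogeneous case with $\omega(k, p) = 1$:
$$S = \frac{1 + \hat{x} \cdot r \cdot z}{1 + \hat{x} \cdot r \cdot z / p}$$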
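The speed-up formula is easy to explore numerically. The sketch below uses a hypothetical function name and treats $\hat{x}$, $r$ and $z$ as plain inputs. Two limits follow directly from the formula: $S \to p$ when $\hat{x} r z \gg p$ (computation dominates), and $S \to 1$ when $\hat{x} r z \ll 1$ (communication dominates).

```python
def model_speedup(x_hat, r, z, p, omega=1.0):
    """S = (1 + x_hat*r*z) / (1 + omega*x_hat*r*z / p).

    omega = 1 gives the homogeneous case; other values of omega(k, p)
    describe the general case.
    """
    x = x_hat * r * z
    return (1 + x) / (1 + omega * x / p)

for p in (1, 8, 64, 512):
    # Illustrative input: x_hat * r * z = 1000 for every p.
    print(p, model_speedup(x_hat=1000.0, r=1.0, z=1.0, p=p))
```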
Application-oriented Analysis

Application characterized by problem size $n$.

Software Parameters
  $\#op \to \#op(n)$
  $\#b \to \#b(n)$
  $\#x \to \#x(n, p)$

Analysis of the performance of a homogeneous cluster
$$l \leq p \, l^{peak} \cdot \frac{x}{x + 1} = l^{peak} \cdot \frac{y \cdot r(n, p)}{1 + y \cdot r(n, p) / p}$$
with
$$x = \hat{x} \cdot z \cdot r(n, p) / p = y \cdot r(n, p) / p \simeq y \cdot \frac{c(n)}{d(p)} \cdot \frac{1}{p}$$

◮ Number of PUs $p_{1/2}$ necessary to reach half of the maximum performance of all $p$ PUs:
$$l(p_{1/2}) = \tfrac{1}{2} \, p_{1/2} \, l^{peak} \;\to\; y \cdot r(n, p_{1/2}) = p_{1/2}$$
◮ Number of PUs $p_{max}$ to obtain the maximum of the performance:
$$\frac{dl}{dp} = 0 \;\to\; p_{max}^2 \cdot d'(p_{max}) = y \cdot c(n) = \hat{x} \cdot z \cdot c(n)$$
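Both conditions are scalar root-finding problems, so they can be solved numerically for any $d(p)$. The following is a minimal sketch using bisection; the helper names, the search interval and the illustrative choice $d(p) = \sqrt{p}$ with $c(n) = 1$ are assumptions. For that choice the closed forms are $p_{1/2} = y^{2/3}$ and $p_{max} = (2y)^{2/3}$.

```python
import math

def bisect(f, lo, hi, tol=1e-9):
    """Find a root of f in [lo, hi], assuming f(lo) and f(hi) differ in sign."""
    f_lo = f(lo)
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid   # root lies in the upper half
        else:
            hi = mid                # root lies in the lower half
    return 0.5 * (lo + hi)

def p_half(y, c, d, p_hi=1e9):
    """Solve y * c / d(p) = p, i.e. y * r(n, p_half) = p_half."""
    return bisect(lambda p: y * c / d(p) - p, 1.0, p_hi)

def p_max(y, c, d_prime, p_hi=1e9):
    """Solve p^2 * d'(p) = y * c, the condition dl/dp = 0."""
    return bisect(lambda p: p * p * d_prime(p) - y * c, 1.0, p_hi)

# Illustrative choice: d(p) = sqrt(p), c(n) = 1, y = 1000.
y = 1000.0
print(p_half(y, 1.0, math.sqrt))
print(p_max(y, 1.0, lambda p: 0.5 / math.sqrt(p)))
```

Bisection is used here only because it needs no derivative of the condition itself and is robust as long as the bracket contains exactly one sign change.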
Compute resources for the simulations

bwGRiD Cluster

  Site            Nodes
  Mannheim          140   (interconnected to a single cluster)
  Heidelberg        140   (interconnected to a single cluster)
  Karlsruhe         140
  Stuttgart         420
  Tübingen          140
  Ulm/Konstanz      280   (joint cluster with Konstanz)
  Freiburg          140
  Esslingen         180
  Total            1580

[Map of the bwGRiD sites; labels: Frankfurt, Mannheim, Heidelberg, Karlsruhe, Stuttgart, Esslingen, Tübingen, Ulm, München, Freiburg]
bwGRiD – Hardware

Node Configuration
◮ 2 Intel Xeon CPUs, 2.8 GHz (each CPU with 4 Cores)
◮ 16 GB Memory
◮ 140 GB hard drive (since January 2009)
◮ InfiniBand Network (20 Gbit/sec)

Hardware parameters for our model
  $l^{peak} = 8$ GFlop/sec (for one core)
  $b_c = 1.5$ GByte/sec (node-to-node)
  $b_c^0 = 1.0$ GByte/sec (reference bandwidth)
Scalar-Product of two Vectors

$$(u, v) = \sum_k u_k \cdot v_k$$

Software Parameters
  $\#op = 2n - 1 \simeq 2n$ if $n \gg 1$
  $\#b = 2 n w$
  $\#x = p \, w = 8p$

Speed-up
$$S = \frac{1 + x}{1 + x / p} \quad \text{with} \quad x = \frac{3}{64} \cdot \frac{n}{p}$$

Simulations
◮ Vector sizes up to $n = 10^7$
◮ 20 runs for each configuration $(p, n)$
◮ Speed-up calculated from mean run-times
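The constant $3/64$ can be reproduced from the software parameters above and the bwGRiD hardware parameters ($l^{peak} = 8$ GFlop/sec, $b_c = 1.5$ GByte/sec, $w = 8$ bytes per word). The sketch below, with a hypothetical function name, computes $x = a \cdot r / a^*$ from those inputs and evaluates the theoretical speed-up.

```python
def scalar_product_speedup(n, p, w=8, l_peak=8e9, b_c=1.5e9):
    """Theoretical speed-up of the scalar product on p PUs (homogeneous case)."""
    num_ops = 2 * n            # #op ~ 2n
    num_bytes = 2 * n * w      # #b = 2nw
    num_exchanged = p * w      # #x = pw
    a = num_ops / num_bytes            # software intensity
    r = num_bytes / num_exchanged      # inverse communication intensity
    a_star = l_peak / b_c              # hardware intensity
    x = a * r / a_star                 # equals (3/64) * n / p for the defaults
    return (1 + x) / (1 + x / p)

for n in (10**5, 10**6, 10**7):
    print(n, [round(scalar_product_speedup(n, p), 1) for p in (1, 64, 256, 512)])
```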
Speed-up for Scalar Product
[Figure: speed-up $S(p)$ versus number of PUs $p$ (up to 500) for the scalar product with size $n$; experimental and theoretical curves for $n = 10^5$, $5 \times 10^5$, $10^6$ and $10^7$.]
Matrix Multiplication

$A_{n \times n} \cdot B_{n \times n} = C_{n \times n}$ on a $\sqrt{p} \times \sqrt{p}$ processor grid

Software Parameters
  $\#op = 2n^3 - n^2 \simeq 2n^3$
  $\#b = 2 n^2 w$
  $\#x = 2 n^2 \sqrt{p} \left( 1 - \frac{1}{\sqrt{p}} \right) w \simeq 2 n^2 w \sqrt{p}$

Speed-up
$$S = \frac{1 + x}{1 + x / p} \quad \text{with} \quad x = \frac{3}{2048} \cdot \frac{n}{\sqrt{p}}$$

Simulations
◮ Matrix sizes up to $n = 40000$
◮ Cannon's algorithm
◮ Runs with 8 and 4 cores per node
Speed-up for Matrix Multiplication
Linpack

Solution of $Ax = b$

Software Parameters
  $\#op = \frac{2}{3} n^3$
  $\#b = 2 n^2 \cdot w$
  $\#x = 3\alpha \left( 1 + \frac{\log_2 p}{12} \right) n^2 \cdot w$

Speed-up
$$S \sim \frac{1 + x}{1 + x / p} \quad \text{with} \quad x = \frac{n}{128} \quad \text{and} \quad \alpha = 1/3$$

Simulations
◮ Matrix sizes up to 40000.
◮ Smaller $\alpha$ would lead to better fits for small $p$.
Speed-up for Linpack
Linpack on bwGRiD

Half of peak performance at:
$$p_{1/2} = \frac{y}{3\alpha} = \frac{n}{128}$$

Maximum performance at:
$$p_{max} = \frac{24 \cdot \ln 2}{128} \cdot n = 24 \ln(2) \, p_{1/2}$$

Region with 'good' performance for $n = 10000$:
$$p = [p_{1/2}, p_{max}] = [80, 1300]$$

Maximum performance
$$l_{max} \simeq l^{peak} \cdot \frac{y}{3\alpha} \cdot \frac{9}{10}$$
$l_{max} = 560$ GFlop/sec for $n = 10000$
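The quoted numbers for $n = 10000$ can be reproduced from these formulas. This is a minimal sketch with a hypothetical function name, using $y = n/128$, $\alpha = 1/3$ and $l^{peak} = 8$ GFlop/sec as stated on the slides.

```python
import math

def linpack_estimates(n, l_peak=8.0, alpha=1 / 3):
    """Return (p_half, p_max, l_max) for Linpack on bwGRiD.

    l_peak is given in GFlop/s, so l_max is returned in GFlop/s as well.
    """
    y = n / 128                              # y for the bwGRiD parameters
    p_half = y / (3 * alpha)                 # half of peak performance
    p_max = 24 * math.log(2) * p_half        # maximum performance
    l_max = 0.9 * l_peak * y / (3 * alpha)   # l_max ~ (9/10) l_peak y / (3 alpha)
    return p_half, p_max, l_max

p_half, p_max, l_max = linpack_estimates(10000)
print(round(p_half), round(p_max), round(l_max, 1))
```

For $n = 10000$ this gives values close to 78, 1300 and 562.5 GFlop/sec, which the slide rounds to 80, 1300 and 560 GFlop/sec.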
TOP500

Maximum performance
$$l_{max} = \frac{n \cdot b_c}{3w} \cdot \frac{9}{10}$$
In the TOP500 list: $l_{max} \to R_{max}$ and $n \to N_{max}$. The bandwidth $b_c$ is not in the list.

Derive effective bandwidth
$$b_c^{eff} = \frac{R_{max}}{N_{max}} \cdot 3w \cdot \frac{10}{9}$$

Analyze which parameter predicts ranking best
◮ first 100 systems
◮ excluding systems with accelerators and missing $N_{max}$
◮ comparison with single core performance $l^{peak} = R_{max} / p_{max}$
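The effective bandwidth is a one-line computation once $R_{max}$ and $N_{max}$ are known. The sketch below uses a hypothetical function name and assumes $w = 8$ bytes per word. Instead of real TOP500 entries, it checks the formula by a round trip: the model's $l_{max}$ for $n = 10000$ and $b_c = 1.5$ GByte/sec is fed back in and should return the same bandwidth.

```python
def effective_bandwidth(r_max, n_max, w=8):
    """b_c^eff = (R_max / N_max) * 3w * 10/9.

    r_max -- Linpack performance R_max [Flop/s]
    n_max -- problem size N_max
    w     -- bytes per word (assumed 8, i.e. double precision)
    Returns the effective bandwidth in Byte/s.
    """
    return r_max / n_max * 3 * w * 10 / 9

# Round trip with the model itself (not TOP500 data).
n, b_c, w = 10000, 1.5e9, 8
l_max = n * b_c / (3 * w) * 9 / 10
print(l_max, effective_bandwidth(l_max, n, w))
```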
TOP500 – November 2011
[Figure: Linpack performance per core $l^{th}$ [GFlop/sec] (blue) and derived effective bandwidth $b_c^{eff}$ [GByte/sec] (red) versus rank in the TOP500 list (Nov. 2011).]
TOP500 – November 2012
[Figure: Linpack performance per core $l^{th}$ [GFlop/sec] (blue) and derived effective bandwidth $b_c^{eff}$ [GByte/sec] (red) versus rank in the TOP500 list (November 2012).]
Conclusions
◮ Developed a performance model which integrates the characteristics of hardware and software with a few parameters.
◮ Model provides simple formulae for performance and speed-up.
◮ Results compare reasonably well with simulations of standard applications.
◮ Model allows estimation of the optimal size of a cluster for a given class of applications.
◮ Model allows estimation of the maximum performance for a given class of applications.
◮ Identified effective bandwidth as a key performance indicator for Linpack (TOP500) on compute clusters.
◮ Future work:
  ◮ Analysis of inhomogeneous clusters with asymmetric load distribution
  ◮ Further applications: Sparse matrix-vector operations and FFT