Requirement Models for Co-Design
Alexandru Calotoiu
Dagstuhl Seminar | 23.10.2017
Department of Computer Science | Laboratory for Parallel Programming
Automatic empirical modeling
• Input: performance measurements of an instrumented program (e.g., main() calling foo(), bar(), and compute())
• Model generator
• Output: human-readable performance models of all functions (e.g., $t = c_1 \cdot \log(p) + c_2$)
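The workflow above can be illustrated with a minimal sketch that fits the example model $t = c_1 \cdot \log(p) + c_2$ by ordinary least squares. This is not the model generator from the talk: the function name `fit_log_model`, the base-2 logarithm, and the synthetic measurements (generated from a known model rather than measured) are all assumptions made for illustration.

```python
import math

def fit_log_model(process_counts, times):
    """Least-squares fit of t = c1 * log2(p) + c2; returns (c1, c2)."""
    xs = [math.log2(p) for p in process_counts]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_t = sum(times) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxt = sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, times))
    c1 = sxt / sxx
    c2 = mean_t - c1 * mean_x
    return c1, c2

# Synthetic "measurements" generated from t = 3 * log2(p) + 5 (not real data).
process_counts = [2, 4, 8, 16, 32]
times = [3.0 * math.log2(p) + 5.0 for p in process_counts]

c1, c2 = fit_log_model(process_counts, times)
print(f"t = {c1:.2f} * log2(p) + {c2:.2f}")
```

Because the synthetic data follow the model exactly, the fit recovers the generating coefficients; real measurements would carry noise and need a quality measure to compare candidate models.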
Complexity building blocks
• LU: communication $t(p) \sim c$, computation $t(p) \sim c$
• FFT: communication $t(p) \sim \log_2(p)$, computation $t(p) \sim c$
• Naïve N-body: communication $t(p) \sim p$, computation $t(p) \sim p$
• …
• Samplesort: communication $t(p) \sim p^2 \log_2^2(p)$, computation $t(p) \sim p^2$
Performance model normal form
$$f(p) = \sum_{k=1}^{n} c_k \cdot p^{i_k} \cdot \log_2^{j_k}(p)$$
with $n \in \mathbb{N}$, $i_k \in I$, $j_k \in J$, and $I, J \subset \mathbb{Q}$
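A minimal sketch of how a model in this normal form could be evaluated. Representing each term as a `(c_k, i_k, j_k)` tuple, the function name `evaluate_model`, and the example coefficients are illustrative assumptions, not part of the original slides.

```python
import math

def evaluate_model(terms, p):
    """Evaluate f(p) = sum_k c_k * p**i_k * log2(p)**j_k.

    `terms` is a list of (c_k, i_k, j_k) tuples (hypothetical representation).
    """
    return sum(c * p ** i * math.log2(p) ** j for c, i, j in terms)

# Hypothetical model with n = 2 terms: f(p) = 5 + 3 * log2(p)
model = [(5.0, 0, 0), (3.0, 0, 1)]
value = evaluate_model(model, 1024)
print(value)  # 35.0, since log2(1024) = 10
```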
Creating search spaces
$f(p) = \sum_{k=1}^{n} c_k \cdot p^{i_k} \cdot \log_2^{j_k}(p)$ with $n = 1$, $I = \{0, 1, 2\}$, $J = \{0, 1\}$
Candidate models:
• $c_1$
• $c_1 \cdot \log(p)$
• $c_1 \cdot p$
• $c_1 \cdot p \cdot \log(p)$
• $c_1 \cdot p^2$
• $c_1 \cdot p^2 \cdot \log(p)$
Creating search spaces
$f(p) = \sum_{k=1}^{n} c_k \cdot p^{i_k} \cdot \log_2^{j_k}(p)$ with $n = 2$, $I = \{0, 1, 2\}$, $J = \{0, 1\}$
Candidate models:
• $c_1 + c_2 \cdot \log(p)$
• $c_1 + c_2 \cdot p$
• $c_1 + c_2 \cdot p \cdot \log(p)$
• $c_1 + c_2 \cdot p^2$
• $c_1 + c_2 \cdot p^2 \cdot \log(p)$
• $c_1 \cdot \log(p) + c_2 \cdot p$
• $c_1 \cdot \log(p) + c_2 \cdot p \cdot \log(p)$
• $c_1 \cdot \log(p) + c_2 \cdot p^2$
• $c_1 \cdot \log(p) + c_2 \cdot p^2 \cdot \log(p)$
• $c_1 \cdot p + c_2 \cdot p \cdot \log(p)$
• $c_1 \cdot p + c_2 \cdot p^2$
• $c_1 \cdot p + c_2 \cdot p^2 \cdot \log(p)$
• $c_1 \cdot p \cdot \log(p) + c_2 \cdot p^2$
• $c_1 \cdot p \cdot \log(p) + c_2 \cdot p^2 \cdot \log(p)$
• $c_1 \cdot p^2 + c_2 \cdot p^2 \cdot \log(p)$
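The two search spaces above can be enumerated mechanically. The following is a sketch under the assumption that a hypothesis is a set of n distinct (i, j) exponent pairs drawn from I × J; the function names are hypothetical.

```python
from itertools import combinations, product

def candidate_terms(I, J):
    """All (i, j) exponent pairs, each standing for p**i * log2(p)**j."""
    return list(product(I, J))

def hypotheses(I, J, n):
    """All models built from exactly n distinct terms of the search space."""
    return list(combinations(candidate_terms(I, J), n))

I = [0, 1, 2]
J = [0, 1]
single_term = hypotheses(I, J, 1)
two_term = hypotheses(I, J, 2)
print(len(single_term), len(two_term))  # 6 15
```

The counts match the slides: 6 single-term candidates and 15 two-term candidates, the latter being the number of ways to choose 2 of the 6 terms.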
Case study – HOMME
Core of the Community Atmospheric Model (CAM)
• Spectral element dynamical core on a cubed sphere grid
Models $t = f(p)$ in seconds for 3 of 194 kernels, trained with $p_i \le 43\text{k}$, and predictive error at $p_t = 130\text{k}$:
• Box_rearrange->MPI_Reduce: $3.63 \cdot 10^{-6} \cdot p \cdot \sqrt{p} + 7.21 \cdot 10^{-13} \cdot p^3$, predictive error 30.34%
• Vlaplace_sphere_wk: $24.44 + 2.26 \cdot 10^{-7} \cdot p^2$, predictive error 4.28%
• Compute_and_apply_rhs: $49.09$, predictive error 0.83%
Case study – HOMME
[Figure: time (s) versus number of processes ($2^{10}$ to $2^{22}$) for MPI_Reduce, vlaplace_sphere_wk, and compute_and_apply_rhs, with the training range and the prediction range marked.]
Multi-parameter performance modeling
• Parameters: process count (p), problem size (n), hardware configuration, algorithm configuration
• Metrics: execution time (t), floating point operations, bytes sent and received
• Model: $t = f(p) \cdot g(n)$ OR $t = f(p) + g(n)$ OR …
Extended performance model normal form
$$f(x_1, \ldots, x_m) = \sum_{k=1}^{n} c_k \prod_{l=1}^{m} x_l^{i_{kl}} \cdot \log_2^{j_{kl}}(x_l)$$
with $n, m \in \mathbb{N}$, $i_{kl} \in I$, $j_{kl} \in J$, and $I, J \subset \mathbb{Q}$
Possible parameter interactions
• Constant: $c_1$
• Single parameter: $c_1 + c_2 \cdot x_1$
• Additive: $c_1 + c_2 \cdot x_1 + c_3 \cdot x_2 + c_4 \cdot x_3$
• Multiplicative: $c_1 + c_2 \cdot x_1 \cdot x_2 \cdot x_3$
• Several options: $c_1 + c_2 \cdot x_1 \cdot x_3 + c_3 \cdot x_1^2 \cdot x_2 \cdot \log_2(x_2)$
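As a sketch, the extended normal form can be evaluated with a nested loop over terms and parameters. The term representation, the function name `evaluate_multi`, and the coefficient values are assumptions chosen for illustration; the example encodes the "several options" model from the list above.

```python
import math

def evaluate_multi(terms, xs):
    """Evaluate f(x_1..x_m) = sum_k c_k * prod_l x_l**i_kl * log2(x_l)**j_kl.

    `terms` is a list of (c_k, [(i_k1, j_k1), ..., (i_km, j_km)]) entries
    (hypothetical representation).
    """
    total = 0.0
    for coefficient, exponents in terms:
        term = coefficient
        for x, (i, j) in zip(xs, exponents):
            term *= x ** i * math.log2(x) ** j
        total += term
    return total

# c1 + c2*x1*x3 + c3*x1^2*x2*log2(x2) with hypothetical c1=1, c2=2, c3=0.5
terms = [
    (1.0, [(0, 0), (0, 0), (0, 0)]),
    (2.0, [(1, 0), (0, 0), (1, 0)]),
    (0.5, [(2, 0), (1, 1), (0, 0)]),
]
value = evaluate_multi(terms, [2, 4, 8])
print(value)  # 1 + 2*2*8 + 0.5*4*4*2 = 49.0
```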
Requirements engineering
Applications: Sweep3d, Lulesh, OpenFoam, Milc, Clover Leaf, HOMME, BLAST, Re-learn, Kripke
Requirements engineering – a per-process view
• Memory capacity
• Memory bandwidth
• Computational performance
• Network bandwidth
Application requirements
Models represent per-process effects
• p: number of processes
• n: problem size per process
Lulesh (requirement, metric: model)
• Computation, #FLOPs: $10^5 \cdot n \cdot \log(n) \cdot p^{0.25} \cdot \log(p)$
• Communication, #Bytes sent & received: $10^3 \cdot n \cdot p^{0.25} \cdot \log(p)$
• Memory access, #Loads & stores: $10^5 \cdot n \cdot \log(n) \cdot \log(p)$
• Memory footprint, #Bytes used: $10^5 \cdot n \cdot \log(n)$
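The four Lulesh requirement models can be written down as a small function, shown here as a sketch. The base-2 logarithm is an assumption (the slides write only "log"), and the function name, dictionary keys, and example inputs are hypothetical.

```python
import math

def lulesh_requirements(n, p):
    """Per-process requirement models for Lulesh (log assumed to be base 2)."""
    log_n = math.log2(n)
    log_p = math.log2(p)
    return {
        "flops": 1e5 * n * log_n * p ** 0.25 * log_p,
        "bytes_sent_received": 1e3 * n * p ** 0.25 * log_p,
        "loads_stores": 1e5 * n * log_n * log_p,
        "bytes_used": 1e5 * n * log_n,
    }

# Hypothetical configuration: problem size 1024 per process, 16 processes.
requirements = lulesh_requirements(n=1024, p=16)
for metric, value in requirements.items():
    print(f"{metric}: {value:.3e}")
```

Only the memory footprint is independent of p; the other three requirements grow with the process count even when the per-process problem size stays fixed.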
Co-design using performance models
Lulesh: which is the best investment?
• Double the memory
• Double the processors
• Double the racks
Co-design using performance models
Double the racks
I. Number of processes: $p' = 2 \cdot p$; memory per process: $m' = m$
II. Memory requirement: $m' = m = 10^5 \cdot n' \cdot \log(n')$; problem size per process: $n' = n$
III. Overall problem size: $n' \cdot p' = 2 \cdot n \cdot p$
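Step II inverts the memory footprint model to find the problem size per process that fits the available memory. For doubled racks the answer is simply $n' = n$, but for other memory budgets the model has no closed-form inverse. The sketch below uses bisection, which is a technique chosen here for illustration and not necessarily what the author used; the base-2 logarithm, the function names, the search bounds, and the example value of n are assumptions.

```python
import math

def memory_footprint(n):
    """Memory footprint model: 1e5 * n * log2(n) bytes (log base assumed)."""
    return 1e5 * n * math.log2(n)

def problem_size_for_memory(m, lo=2.0, hi=1e12, iterations=200):
    """Largest n in [lo, hi] with memory_footprint(n) <= m, found by bisection.

    Valid because the footprint model increases monotonically for n >= 1.
    """
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if memory_footprint(mid) <= m:
            lo = mid
        else:
            hi = mid
    return lo

n = 1024.0                                   # hypothetical current problem size
m = memory_footprint(n)                      # memory per process in use today
n_same_memory = problem_size_for_memory(m)   # m' = m, so n' = n
n_double_memory = problem_size_for_memory(2 * m)  # m' = 2 * m
print(n_same_memory, n_double_memory)
```

Because the footprint grows faster than linearly in n, doubling the memory per process allows less than twice the problem size per process.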
Co-design using performance models
Double the racks
IV. #FLOPs: $10^5 \cdot n \cdot \log(n) \cdot (2p)^{0.25} \cdot \log(2p)$; ratio new to old: $2^{0.25} \cdot (1 + 1/\log(p))$
V. #Bytes sent & received: $10^3 \cdot n \cdot (2p)^{0.25} \cdot \log(2p)$; ratio new to old: $2^{0.25} \cdot (1 + 1/\log(p))$
Visual representation of requirements
[Figure: requirements along four axes (communication, computation, problem size, memory access) for three configurations: $p' = 2 \cdot p$, $m' = m$; $p' = p$, $m' = 2 \cdot m$; $p' = 2 \cdot p$, $m' = m/2$.]
Co-design using performance models
Double the racks
VI. #Loads & stores: $10^5 \cdot n \cdot \log(n) \cdot \log(2p)$; ratio new to old: $1 + 1/\log(p)$
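The new-to-old ratios for doubled racks follow from $\log(2p) = 1 + \log(p)$, which holds when the logarithm is base 2 (an assumption, since the slides write only "log"). Since $n' = n$, all factors depending on n cancel. The following sketch checks the closed forms numerically; the function names and the chosen process count are hypothetical.

```python
import math

def ratio_flops_or_bytes(p):
    """New-to-old ratio for #FLOPs and #Bytes sent & received when p' = 2p."""
    return ((2 * p) ** 0.25 * math.log2(2 * p)) / (p ** 0.25 * math.log2(p))

def ratio_loads_stores(p):
    """New-to-old ratio for #Loads & stores when p' = 2p."""
    return math.log2(2 * p) / math.log2(p)

def closed_form_flops_or_bytes(p):
    """Closed form from the slides: 2^0.25 * (1 + 1 / log2(p))."""
    return 2 ** 0.25 * (1 + 1 / math.log2(p))

def closed_form_loads_stores(p):
    """Closed form from the slides: 1 + 1 / log2(p)."""
    return 1 + 1 / math.log2(p)

p = 2 ** 16  # hypothetical process count
print(ratio_flops_or_bytes(p), closed_form_flops_or_bytes(p))
print(ratio_loads_stores(p), closed_form_loads_stores(p))
```

For large p the term $1/\log(p)$ shrinks, so the per-process computation and communication requirements approach a factor of $2^{0.25} \approx 1.19$ while memory accesses approach a factor of 1.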