Energy-Aware Matrix Computations on Multi-Core and Many-Core Platforms
Enrique S. Quintana-Ortí
Universidad Complutense de Madrid, June 2012
Performance and energy consumption
Top500 (November 2011)

Rank  Site                                                             #Cores   LINPACK (TFLOPS)
1     RIKEN AICS – K Computer, SPARC64 VIIIfx (8-core)                 705,024  10,510.00 *
2     NSC Tianjin – NUDT YH MPP, Xeon X5670 6C 2.93 GHz, NVIDIA 2050   186,368   2,566.00
3     DOE ORNL – Cray XT5-HE, Opteron 6-core 2.6 GHz                   224,162   1,759.00
9     CEA (France) – Bull bullx super-node S6010/S6030                 138,368   1,050.00
114   BSC (Spain) – Bull B505, Xeon E5649 6C 2.53 GHz, NVIDIA 2090       5,544     103.20

* 1 day of the K Computer = 394 years of the world population (7,000 million people) working with hand calculators
Performance and energy consumption
Green500 (November 2011)

Rank (Green/Top)  Site                                                             #Cores   MFLOPS/W  LINPACK (TFLOPS)
1/29              IBM Rochester – BlueGene/Q, Power BQC 16C 1.60 GHz                32,768  2,026.48     339.83
7/114             BSC (Spain) – Bull B505, Xeon E5649 6C 2.53 GHz, NVIDIA 2090       5,544  1,266.26     103.20
32/1              RIKEN AICS – K Computer, SPARC64 VIIIfx (8-core)                 705,024    830.18  10,510.00
47/2              NSC Tianjin – NUDT YH MPP, Xeon X5670 6C 2.93 GHz, NVIDIA 2050   186,368    635.15   2,566.00
53/3              DOE ORNL – Cray XT5-HE, Opteron 6-core 2.6 GHz                   224,162    582.00   1,759.00
Multi-core and many-core platforms
“Conventional” architectures
New challengers…
Matrix computations
Linear algebra? Please, don’t run away!
Determinants, linear systems, least-squares fitting, FFT, etc.
Importance: Intel MKL, AMD ACML, IBM ESSL, NVIDIA CUBLAS, and an ongoing effort for TI
Index
1. Scientific applications
2. Leveraging hardware concurrency
3. Cost of energy
Scientific applications
Biological systems: simulations of molecular dynamics
Solve the generalized eigenproblem A X = B X Λ, with dense A, B → n × n, n = 134,484
Scientific applications
Industrial processes: optimal cooling of steel profiles
Solve the algebraic Riccati equation AᵀX + X A – X S X + Q = 0, with dense A → n × n, n = 5,177 for a mesh width of 6.91·10^-3
Scientific applications
Summary: dense linear algebra is at the bottom of the “food chain” for many scientific and engineering applications:
Fast acoustic scattering problems
Dielectric polarization of nanostructures
Magneto-hydrodynamics
Macro-economics
Index
1. Scientific applications
2. Leveraging hardware concurrency
3. Cost of energy
Leveraging hw. concurrency
Threads

Linear system: 2x + 3y = 3, 4x – 5y = 6
In general, A X = B, with dense A, B → n × n: ≈ 2n³/3 + 2n³ flops
Intel Xeon: 4 DP flops/cycle, e.g., at f = 2.0 GHz

n        1 core       8 cores     16-node cluster, 8 cores/node (192 cores)
100      0.33 ms      --          --
1,000    0.33 s       --          --
10^4     333.33 s     41.62 s     --
10^5     > 92 h       > 11 h      > 28 min
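A minimal C sketch of this back-of-the-envelope estimate, assuming the quoted peak rate of 4 DP flops/cycle at 2.0 GHz per core and ideal scaling (no memory traffic, no parallel overhead):

#include <stdio.h>

/* Idealized time for solving A X = B with dense n x n A and B:
   flops ~ 2n^3/3 (factorization) + 2n^3 (triangular solves with n rhs),
   executed at cores * 4 flops/cycle * 2.0 GHz. Peak-only estimate. */
static double solve_time(double n, int cores) {
    double flops = 2.0 * n * n * n / 3.0 + 2.0 * n * n * n;
    double rate  = cores * 4.0 * 2.0e9;   /* flops per second */
    return flops / rate;
}

int main(void) {
    double sizes[] = { 100, 1000, 1e4, 1e5 };
    int    cores[] = { 1, 8, 192 };
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 3; j++)
            printf("n=%8.0f, %3d cores: %12.4g s\n",
                   sizes[i], cores[j], solve_time(sizes[i], cores[j]));
    return 0;
}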
Leveraging hw. concurrency
Threads

                 2010: PFLOPS (10^15 flops/sec.)                2020: EFLOPS (10^18 flops/sec.)
Core level       10^9    (PowerPC 450, 850 MHz → 3.4 GFLOPS)    10^9.5
Node level       10^1    (quad-core)                            10^3 !
Cluster level    10^5    (73,728 nodes)                         10^5.5
2010 reference machine: JUGENE
Leveraging hw. concurrency
Cholesky factorization: A = L * Lᵀ
Key in the solution of s.p.d. linear systems A x = b:
(L Lᵀ) x = b  →  L y = b (solve for y),  Lᵀ x = y (solve for x)
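A minimal sketch of this solve with LAPACK’s Cholesky routines (dpotrf for A = L Lᵀ, dpotrs for the two triangular solves); the small s.p.d. system here is made-up illustrative data:

#include <stdio.h>
#include <lapacke.h>   /* LAPACKE C interface to LAPACK */

int main(void) {
    /* Small s.p.d. matrix A (column-major, lower triangle referenced) and rhs b */
    const lapack_int n = 3, nrhs = 1;
    double A[] = { 4, 2, 1,
                   2, 5, 3,
                   1, 3, 6 };
    double b[] = { 1, 2, 3 };

    /* A = L * L^T (in place; lower triangle overwritten by L) */
    lapack_int info = LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', n, A, n);
    if (info != 0) { printf("dpotrf failed: %d\n", (int)info); return 1; }

    /* Solve L y = b and then L^T x = y; x overwrites b */
    LAPACKE_dpotrs(LAPACK_COL_MAJOR, 'L', n, nrhs, A, n, b, n);

    printf("x = [%f %f %f]\n", b[0], b[1], b[2]);
    return 0;
}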
Leveraging hw. concurrency
Cholesky factorization (blocked), 1st iteration:
F:  A11 = L11 * L11ᵀ
T:  L21 ← A21 * L11^-T
U:  A22 ← A22 – L21 * L21ᵀ
Blocking reuses data in cache
MT processor: employ a multithreaded (MT) implementation of T and U
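These three updates follow from a 2 × 2 blocked partition of A = L Lᵀ (standard right-looking derivation):

\[
\begin{pmatrix} A_{11} & \ast \\ A_{21} & A_{22} \end{pmatrix}
=
\begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix}
\begin{pmatrix} L_{11}^T & L_{21}^T \\ 0 & L_{22}^T \end{pmatrix}
\]
\[
\begin{aligned}
A_{11} &= L_{11} L_{11}^T  && \Rightarrow\ \text{F: factor the diagonal block}\\
A_{21} &= L_{21} L_{11}^T  && \Rightarrow\ \text{T: } L_{21} = A_{21} L_{11}^{-T}\\
A_{22} &= L_{21} L_{21}^T + L_{22} L_{22}^T && \Rightarrow\ \text{U: } A_{22} - L_{21} L_{21}^T = L_{22} L_{22}^T \text{, then recurse}
\end{aligned}
\]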
Leveraging hw. concurrency
Cholesky factorization (blocked)
[Figure: state of the matrix after the 1st, 2nd and 3rd iterations]
Leveraging hw. concurrency
Cholesky factorization (blocked)

for (k=1; k<=n/b; k++){
  F: Chol(A[k,k]);                 // A_kk = L_kk * L_kk^T
  if (k < n/b){
    T: Trsm(A[k,k], A[k+1,k]);     // L_k+1,k <- A_k+1,k * L_kk^-T
    U: Syrk(A[k+1,k], A[k+1,k+1]); // A_k+1,k+1 <- A_k+1,k+1 - L_k+1,k * L_k+1,k^T
  }
}
Leveraging hw. concurrency
Cholesky factorization (blocked)
[Performance plot: the blocked factorization attains 57%, 71%, and 80% of peak]
Leveraging hw. concurrency
Algorithmic parallelism
Why? Extracting parallelism only inside the multithreaded kernels implies excessive thread synchronization:

for (k=1; k<=n/b; k++){
  F: Chol(A[k,k]);                 // A_kk = L_kk * L_kk^T
  if (k < n/b){
    T: Trsm(A[k,k], A[k+1,k]);     // L_k+1,k <- A_k+1,k * L_kk^-T
    U: Syrk(A[k+1,k], A[k+1,k+1]); // A_k+1,k+1 <- A_k+1,k+1 - L_k+1,k * L_k+1,k^T
  }
}
Leveraging hw. concurrency
Algorithmic parallelism
…but there is much more parallelism!
[Figure: tasks from the 1st, 2nd and 3rd iterations]
Leveraging hw. concurrency
Algorithmic parallelism
…but there is much more parallelism: inside the same iteration and across different iterations.
How can we leverage it?
Leveraging hw. concurrency
Task parallelism
Scalar code on a (super)scalar processor (pipeline stages: IF, ID, ISS, functional units UF0, UF1, UF2):

loop:  ld    f0, 0(r1)
       addd  f4, f0, f2
       sd    f4, 0(r1)
       addi  r1, r1, #8
       subi  r2, r2, #1
       bnez  r2, loop
Leveraging hw. concurrency
Task parallelism
Something similar for (dense) linear algebra?

for (k=1; k<=n/b; k++){
  F:   Chol(A[k,k]);
       for (i=k+1; i<=n/b; i++)
  T:     Trsm(A[k,k], A[i,k]);
       for (i=k+1; i<=n/b; i++){
  U1:    Syrk(A[i,k], A[i,i]);
         for (j=k+1; j<i; j++)
  U2:      Gemm(A[i,k], A[j,k], A[i,j]);
       }
}
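A more concrete, sequential rendering of this loop nest, sketched with LAPACK/BLAS kernels (dpotrf, dtrsm, dsyrk, dgemm); it assumes the matrix is stored as b × b column-major tiles addressed through an array of tile pointers, with only the lower triangle referenced:

#include <cblas.h>
#include <lapacke.h>

/* Tiled Cholesky sketch: A holds nt*nt pointers to b x b column-major tiles
   (row-major over tiles); only tiles with i >= j are referenced.
   On exit the lower triangle holds the Cholesky factor L, tile by tile. */
int tiled_cholesky(int nt, int b, double **A) {
    #define T(i,j) A[(i)*nt + (j)]
    for (int k = 0; k < nt; k++) {
        /* F: A_kk = L_kk * L_kk^T */
        lapack_int info = LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', b, T(k,k), b);
        if (info != 0) return (int)info;

        /* T: L_ik <- A_ik * L_kk^{-T} */
        for (int i = k + 1; i < nt; i++)
            cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans,
                        CblasNonUnit, b, b, 1.0, T(k,k), b, T(i,k), b);

        /* U: trailing-submatrix update */
        for (int i = k + 1; i < nt; i++) {
            /* U1: A_ii <- A_ii - L_ik * L_ik^T */
            cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans,
                        b, b, -1.0, T(i,k), b, 1.0, T(i,i), b);
            /* U2: A_ij <- A_ij - L_ik * L_jk^T   (j < i) */
            for (int j = k + 1; j < i; j++)
                cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                            b, b, b, -1.0, T(i,k), b, T(j,k), b, 1.0, T(i,j), b);
        }
    }
    #undef T
    return 0;
}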
Leveraging hw. concurrency
Task parallelism
Apply “scalar” techniques at the block level:
Software implementation
Thread/task-level parallelism
Target the cores/GPUs of the platform
Leveraging hw. concurrency
Task parallelism
Read/written blocks determine dependencies, as registers do in the scalar case:

loop:  ld    f0, 0(r1)          for (k=1; k<=n/b; k++){
       addd  f4, f0, f2           Chol(A[k,k]);
       sd    f4, 0(r1)            for (i=k+1; i<=n/b; i++)
       addi  r1, r1, #8             Trsm(A[k,k], A[i,k]);
       …                           …

Dependencies form a dependency DAG (task tree)
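As an illustration of the idea (not the runtimes described in these slides), OpenMP 4.5 task dependences let the programmer name the tiles each task reads and writes, and the runtime builds the DAG from those clauses; chol_tile, trsm_tile, syrk_tile and gemm_tile stand for assumed wrappers around the kernels of the previous sketch:

/* Assumed kernel wrappers around dpotrf / dtrsm / dsyrk / dgemm on b x b tiles */
void chol_tile(int b, double *A);
void trsm_tile(int b, const double *L, double *A);
void syrk_tile(int b, const double *L, double *A);
void gemm_tile(int b, const double *L1, const double *L2, double *A);

/* Task-parallel tiled Cholesky: in/out/inout clauses declare the tiles each
   task reads and writes; the OpenMP runtime derives the dependency DAG. */
void tiled_cholesky_tasks(int nt, int b, double **A) {
    #define T(i,j) A[(i)*nt + (j)]
    #pragma omp parallel
    #pragma omp single
    {
        for (int k = 0; k < nt; k++) {
            #pragma omp task depend(inout: T(k,k)[0:b*b])
            chol_tile(b, T(k,k));                            /* F  */

            for (int i = k + 1; i < nt; i++) {
                #pragma omp task depend(in: T(k,k)[0:b*b]) depend(inout: T(i,k)[0:b*b])
                trsm_tile(b, T(k,k), T(i,k));                /* T  */
            }
            for (int i = k + 1; i < nt; i++) {
                #pragma omp task depend(in: T(i,k)[0:b*b]) depend(inout: T(i,i)[0:b*b])
                syrk_tile(b, T(i,k), T(i,i));                /* U1 */
                for (int j = k + 1; j < i; j++) {
                    #pragma omp task depend(in: T(i,k)[0:b*b], T(j,k)[0:b*b]) depend(inout: T(i,j)[0:b*b])
                    gemm_tile(b, T(i,k), T(j,k), T(i,j));    /* U2 */
                }
            }
        }
    }   /* all tasks complete at the implicit barrier */
    #undef T
}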
Leveraging hw. concurrency
Task parallelism
Runtime (analogous to the ID and ISS stages of a superscalar processor):
Decode (ID): generate the task tree with a “symbolic analysis” of the code at execution time
Issue (ISS): architecture-aware execution of the tasks in the tree
Leveraging hw. concurrency
Task parallelism
Decode stage: “symbolic analysis” of the code, mapping the blocked code to a task tree:

for (k=1; k<=n/b; k++){
  Chol(A[k,k]);
  for (i=k+1; i<=n/b; i++)
    Trsm(A[k,k], A[i,k]);
  …
}
Leveraging hw. concurrency
Task parallelism
Issue stage:
Temporal scheduling of tasks, attending to dependencies
Mapping (spatial scheduling) of tasks to resources, aware of locality
Leveraging hw. concurrency
Implementations
SuperMatrix (UT Austin and UJI):
Read/written blocks defined implicitly by the operations
Only valid for dense linear algebra operations encoded in libflame
SMPSs (BSC) and GPUSs (BSC and UJI):
OpenMP-like languages, e.g.
  #pragma css task inout(A[b*b])
  void Chol(double *A);
Applicable to task-parallel codes on different platforms: multi-core, multi-GPU, multi-accelerator, Grid, …
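Extending the annotation style of the example above, a sketch of how the remaining Cholesky kernels could be declared (the exact SMPSs clause syntax may differ; this only illustrates how the read/written tiles are exposed to the runtime, with b an assumed compile-time tile dimension):

/* input = tiles only read by the task, inout = tiles read and overwritten */
#pragma css task inout(A[b*b])
void Chol(double *A);                               /* A = L * L^T        */

#pragma css task input(L[b*b]) inout(A[b*b])
void Trsm(double *L, double *A);                    /* A <- A * L^{-T}    */

#pragma css task input(L[b*b]) inout(A[b*b])
void Syrk(double *L, double *A);                    /* A <- A - L * L^T   */

#pragma css task input(L1[b*b], L2[b*b]) inout(A[b*b])
void Gemm(double *L1, double *L2, double *A);       /* A <- A - L1 * L2^T */

With these declarations, the sequential-looking loop nest shown earlier becomes the task-parallel program: each kernel call spawns a task, and the runtime infers the dependencies from the tiles named in the clauses.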
Index
1. Scientific applications
2. Leveraging hardware concurrency
3. Cost of energy
Cost of energy
“Computer Architecture: A Quantitative Approach”, J. Hennessy, D. Patterson, 2011
Cost of energy
“The free lunch is over” (H. Sutter, 2005)
Frequency wall
Instruction-level parallelism (ILP) wall
Memory wall
Cost of energy
Frequency wall
Power and energy consumption grow roughly as f³ and f², respectively
Electricity = money
1st Law of Thermodynamics: energy cannot be created or destroyed, only converted → cost of extracting the heat
Heat reduces lifetime
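The back-of-the-envelope argument behind these exponents, assuming dynamic power dominates and supply voltage scales roughly with frequency:

\[
P_{\mathrm{dyn}} \approx C\,V^2 f, \qquad V \propto f \;\Rightarrow\; P \propto f^3,
\]
\[
E = P \cdot t_{\mathrm{exec}}, \qquad t_{\mathrm{exec}} \propto \frac{1}{f} \;\Rightarrow\; E \propto \frac{f^3}{f} = f^2 .
\]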