
Energy-Aware Matrix Computations on Multi-Core and Many-Core Platforms (PowerPoint PPT presentation)



  1. Energy-Aware Matrix Computations on Multi-Core and Many-Core Platforms
     Enrique S. Quintana-Ortí
     Universidad Complutense de Madrid, June 2012

  2. Performance and energy consumption
     Top500 (November 2011)

     Rank  #Cores   LINPACK (TFLOPS)  Site
        1  705,024       10,510.00 *  RIKEN AICS K Computer – SPARC64 VIIIfx (8-core)
        2  186,368        2,566.00    NSC Tianjin – NUDT YH MPP, Xeon X5670 6C 2.93 GHz, NVIDIA 2050
        3  224,162        1,759.00    DOE ORNL – Cray XT5-HE Opteron 6-core 2.6 GHz
        9  138,368        1,050.00    CEA (France) – Bull bullx super-node S6010/S6030
      114    5,544          103.20    BSC (Spain) – Bull B505, Xeon E5649 6C 2.53 GHz, NVIDIA 2090

     * One day of the K Computer equals 394 years of the entire world population
       (7,000 million people) computing with hand calculators

  3. Performance and energy consumption
     Green500 (November 2011)

     Green/Top  MFLOPS/W  LINPACK (TFLOPS)  #Cores   Site
       1/29     2,026.48       339.83        32,768  IBM Rochester – BlueGene/Q, Power BQC 16C 1.60 GHz
       7/114    1,266.26       103.20         5,544  BSC (Spain) – Bull B505, Xeon E5649 6C 2.53 GHz, NVIDIA 2090
      32/1        830.18    10,510.00       705,024  RIKEN AICS K Computer – SPARC64 VIIIfx (8-core)
      47/2        635.15     2,566.00       186,368  NSC Tianjin – NUDT YH MPP, Xeon X5670 6C 2.93 GHz, NVIDIA 2050
      53/3        582.00     1,759.00       224,162  DOE ORNL – Cray XT5-HE Opteron 6-core 2.6 GHz

  4. Multi-core and many-core platforms
     - “Conventional” architectures
     - New challengers…

  5. Matrix computations
     Linear algebra? Please, don’t run away!
     Determinants, linear systems, least-squares fitting, FFT, etc.
     Importance: Intel MKL, AMD ACML, IBM ESSL, NVIDIA CUBLAS; an implementation
     for TI is ongoing

  6. Index
     1. Scientific applications
     2. Leveraging concurrency
     3. Cost of energy

  7. Index
     1. Scientific applications
     2. Leveraging concurrency
     3. Cost of energy

  8. Scientific applications
     Biological systems
     Simulations of molecular dynamics
     Solve A X = B X Λ, with dense A, B → n x n; n = 134,484
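     One standard route to this dense generalized eigenproblem is LAPACK's dsygv
     (reduction to standard form plus a symmetric eigensolver). A minimal C sketch
     on a toy size, with illustrative matrix entries; at n = 134,484 the O(n^2)
     storage and O(n^3) flops are exactly what motivates the platforms that follow:

       /* Hedged sketch: solving A X = B X Λ with LAPACK's dsygv on a toy size.
          Entries are illustrative; build with -llapacke -llapack -lblas. */
       #include <stdio.h>
       #include <stdlib.h>
       #include <lapacke.h>

       int main(void) {
           lapack_int n = 1000;  /* toy size; the application above uses n = 134,484 */
           double *A = malloc((size_t)n * n * sizeof *A);  /* symmetric */
           double *B = malloc((size_t)n * n * sizeof *B);  /* symmetric positive definite */
           double *w = malloc((size_t)n * sizeof *w);      /* eigenvalues: diagonal of Λ */

           for (lapack_int i = 0; i < n; i++)
               for (lapack_int j = 0; j < n; j++) {
                   A[i*n + j] = 1.0 / (1.0 + i + j);       /* Hilbert-like test matrix */
                   B[i*n + j] = (i == j) ? 1.0 : 0.0;      /* identity keeps B s.p.d. */
               }

           /* itype=1 selects A x = λ B x; 'V' also computes eigenvectors (into A) */
           lapack_int info = LAPACKE_dsygv(LAPACK_ROW_MAJOR, 1, 'V', 'L',
                                           n, A, n, B, n, w);
           if (info != 0) { fprintf(stderr, "dsygv: info = %d\n", (int)info); return 1; }
           printf("smallest eigenvalue: %g\n", w[0]);
           free(A); free(B); free(w);
           return 0;
       }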

  9. Scientific applications
     Industrial processes
     Optimal cooling of steel profiles
     Solve A^T X + X A – X S X + Q = 0, with dense A → n x n;
     n = 5,177 for a mesh width of 6.91·10^-3

  10. Scientific applications
     Summary
     Dense linear algebra is at the bottom of the “food chain” for many scientific
     and engineering apps:
     - Fast acoustic scattering problems
     - Dielectric polarization of nanostructures
     - Magneto-hydrodynamics
     - Macro-economics

  11. Index
     1. Scientific applications
     2. Leveraging hardware concurrency
     3. Cost of energy

  12. Leveraging hw. concurrency
     Threads

     Linear system example:
       2x + 3y = 3
       4x – 5y = 6
     A X = B, with dense A, B → n x n; cost ≈ 2n^3/3 + 2n^3 flops
     Intel Xeon: 4 DP flops/cycle, e.g., at f = 2.0 GHz

     n        Time, 1 core   Time, 8 cores   Time, 16-node cluster,
                                             8 cores per node (192 cores)
     100      0.33 ms        --              --
     1,000    0.33 s         --              --
     10^4     333.33 s       --              41.62 s
     10^5     > 92 h         > 11 h          > 28 m
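     The single-core entries can be reproduced as time = flops / peak, with a
     per-core peak of 4 DP flops/cycle x 2.0 GHz = 8 GFLOPS; the 2n^3 term is the
     cost of the triangular solves with n right-hand sides. A back-of-the-envelope
     sketch (the cluster column assumes perfect efficiency, which is why the table
     hedges with “>”):

       /* Check of the table: cost(n) = 2n^3/3 + 2n^3 flops (factorization plus
          triangular solves with n right-hand sides), peak = 8 GFLOPS per core. */
       #include <stdio.h>

       int main(void) {
           const double peak = 4.0 * 2.0e9;             /* flops/s per core */
           const double ns[] = { 1e2, 1e3, 1e4, 1e5 };
           for (int i = 0; i < 4; i++) {
               double n = ns[i];
               double flops = 2.0*n*n*n/3.0 + 2.0*n*n*n;
               printf("n = %8.0f: %12.4g s (1 core) %12.4g s (192 cores, ideal)\n",
                      n, flops/peak, flops/(192.0*peak));
           }
           return 0;
       }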

  13. Leveraging hw. concurrency
     Threads

     Concurrency needed at each level:

     Level                                        2010: PFLOPS          2020: EFLOPS
                                                  (10^15 flops/sec.),   (10^18 flops/sec.)
                                                  e.g., JUGENE
     Core level (PowerPC 450,                     10^9                  10^9.5
       850 MHz → 3.4 GFLOPS)
     Node level (quad-core)                       10^1                  10^3 (!)
     Cluster level (73,728 nodes)                 10^5                  10^5.5

     The three levels multiply to the total: 10^9 x 10^1 x 10^5 = 10^15 flops/sec.
     in 2010, and 10^9.5 x 10^3 x 10^5.5 = 10^18 in 2020.

  14. Leveraging hw. concurrency
     Cholesky factorization: A = L * L^T
     Key in the solution of s.p.d. linear systems:
       A x = b  →  (L L^T) x = b
       L y = b  →  y
       L^T x = y  →  x
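     A minimal unblocked version of this factorization and the two triangular
     solves, in plain C (a pedagogical sketch; production code would call LAPACK's
     dpotrf/dpotrs instead):

       #include <math.h>
       #include <stdio.h>

       /* Factor the s.p.d. matrix A (n x n, row-major) in place: lower triangle <- L. */
       static int chol(double *A, int n) {
           for (int k = 0; k < n; k++) {
               if (A[k*n + k] <= 0.0) return k + 1;     /* not positive definite */
               A[k*n + k] = sqrt(A[k*n + k]);
               for (int i = k + 1; i < n; i++)          /* scale column k */
                   A[i*n + k] /= A[k*n + k];
               for (int j = k + 1; j < n; j++)          /* update trailing submatrix */
                   for (int i = j; i < n; i++)
                       A[i*n + j] -= A[i*n + k] * A[j*n + k];
           }
           return 0;
       }

       /* Solve A x = b given the factor L stored in A: L y = b, then L^T x = y. */
       static void solve(const double *A, double *b, int n) {
           for (int i = 0; i < n; i++) {                /* forward substitution */
               for (int j = 0; j < i; j++) b[i] -= A[i*n + j] * b[j];
               b[i] /= A[i*n + i];
           }
           for (int i = n - 1; i >= 0; i--) {           /* backward substitution */
               for (int j = i + 1; j < n; j++) b[i] -= A[j*n + i] * b[j];
               b[i] /= A[i*n + i];
           }
       }

       int main(void) {
           double A[4] = {4.0, 1.0, 1.0, 3.0};          /* s.p.d. 2 x 2 example */
           double b[2] = {1.0, 2.0};
           if (chol(A, 2) == 0) { solve(A, b, 2); printf("x = (%g, %g)\n", b[0], b[1]); }
           return 0;
       }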

  15. Leveraging hw. concurrency
     Cholesky factorization (blocked), 1st iteration:
       F: A_11 = L_11 * L_11^T
       T: L_21 ← A_21 * L_11^{-T}
       U: A_22 ← A_22 – L_21 * L_21^T
     Blocking reuses data in cache.
     On a multi-threaded (MT) processor: employ an MT implementation of T and U.

  16. Leveraging hw. concurrency
     Cholesky factorization (blocked)
     [Figure: progress of the factorization over the 1st, 2nd, and 3rd iterations]

  17. Leveraging hw. concurrency
     Cholesky factorization (blocked)

       for (k=1; k<=n/b; k++){
         F: Chol(A[k,k]);                 // A_kk = L_kk * L_kk^T
         if (k < n/b){
           T: Trsm(A[k,k], A[k+1,k]);     // L_{k+1,k} <- A_{k+1,k} * L_kk^{-T}
           U: Syrk(A[k+1,k], A[k+1,k+1]); // A_{k+1,k+1} <- A_{k+1,k+1} - L_{k+1,k} * L_{k+1,k}^T
         }
       }
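     In practice the three operations map directly onto standard kernels: F is
     LAPACK's dpotrf, T is BLAS dtrsm, and U is BLAS dsyrk. A hedged sketch of one
     iteration with LAPACKE/CBLAS calls (the block pointers, block size b, trailing
     dimension r, and leading dimension lda are assumptions about the surrounding
     column-major storage):

       #include <cblas.h>
       #include <lapacke.h>

       /* One iteration of the blocked loop above: Akk, Ak1k, Ak1k1 point at the
          b x b, r x b, and r x r blocks of a column-major matrix with leading
          dimension lda, where r is the trailing dimension left after step k. */
       void chol_iteration(double *Akk, double *Ak1k, double *Ak1k1,
                           int b, int r, int lda) {
           /* F: A_kk = L_kk * L_kk^T */
           LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', b, Akk, lda);
           if (r > 0) {
               /* T: L_{k+1,k} <- A_{k+1,k} * L_kk^{-T} */
               cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans,
                           CblasNonUnit, r, b, 1.0, Akk, lda, Ak1k, lda);
               /* U: A_{k+1,k+1} <- A_{k+1,k+1} - L_{k+1,k} * L_{k+1,k}^T */
               cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans,
                           r, b, -1.0, Ak1k, lda, 1.0, Ak1k1, lda);
           }
       }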

  18. Leveraging hw. concurrency
     Cholesky factorization (blocked)
     [Performance plot: the blocked variants attain 57%, 71%, and 80% of peak]

  19. Leveraging hw. concurrency
     Algorithmic parallelism
     Why? The loop-based code imposes excessive thread synchronization:

       for (k=1; k<=n/b; k++){
         F: Chol(A[k,k]);                 // A_kk = L_kk * L_kk^T
         if (k < n/b){
           T: Trsm(A[k,k], A[k+1,k]);     // L_{k+1,k} <- A_{k+1,k} * L_kk^{-T}
           U: Syrk(A[k+1,k], A[k+1,k+1]); // A_{k+1,k+1} <- A_{k+1,k+1} - L_{k+1,k} * L_{k+1,k}^T
         }
       }

  20. Leveraging hw. concurrency
     Algorithmic parallelism
     …but there is much more parallelism!!!
     [Figure: tasks of the 1st, 2nd, and 3rd iterations]

  21. Leveraging hw. concurrency
     Algorithmic parallelism
     …but there is much more parallelism!!!
     - Inside the same iteration
     - In different iterations (1st, 2nd, …)
     How can we leverage it?

  22. Leveraging hw. concurrency
     Task parallelism

     Scalar code:                      (Super)scalar processor:
       loop: ld   f0, 0(r1)              IF → ID → ISS → { UF0, UF1, UF2 }
             addd f4, f0, f2
             sd   f4, 0(r1)
             addi r1, r1, #8
             subi r2, r2, #1
             bnez r2, loop

  23. Leveraging hw. concurrency
     Task parallelism
     Something similar for (dense) linear algebra?

       for (k=1; k<=n/b; k++){
         F:  Chol(A[k,k]);
         T:  for (i=k+1; i<=n/b; i++)
               Trsm(A[k,k], A[i,k]);
         for (i=k+1; i<=n/b; i++){
           U1: Syrk(A[i,k], A[i,i]);
           U2: for (j=k+1; j<i; j++)
                 Gemm(A[i,k], A[j,k], A[i,j]);
         }
       }

  24. Leveraging hw. concurrency
     Task parallelism
     Apply “scalar” techniques at the block level:
     - Software implementation
     - Thread-/task-level parallelism
     - Target the cores/GPUs of the platform

  25. Leveraging hw. concurrency
     Task parallelism
     Read/written blocks determine dependencies, just as in the scalar case:

     Scalar code:                      Blocked code:
       loop: ld   f0, 0(r1)              for (k=1; k<=n/b; k++){
             addd f4, f0, f2               Chol(A[k,k]);
             sd   f4, 0(r1)                for (i=k+1; i<=n/b; i++)
             addi r1, r1, #8                 Trsm(A[k,k], A[i,k]);
             …                               …

     The dependencies form a dependency DAG (task tree).

  26. Leveraging hw. concurrency
     Task parallelism
     Runtime:
     - Decode (ID): generate the task tree with a “symbolic analysis” of the code
       at execution time
     - Issue (ISS): architecture-aware execution of the tasks in the tree
     [Figure: ID → ISS pipeline feeding a task tree with nodes N0, N1, N2]

  27. Leveraging hw. concurrency
     Task parallelism
     Decode stage: “symbolic analysis” of the code.

     Blocked code:
       for (k=1; k<=n/b; k++){
         Chol(A[k,k]);
         for (i=k+1; i<=n/b; i++)
           Trsm(A[k,k], A[i,k]);
         …

     [Figure: resulting task tree with nodes N0, N1, N2, …]

  28. Leveraging hw. concurrency
     Task parallelism
     Issue stage:
     - Temporal scheduling of tasks, attending to dependencies
     - Mapping (spatial scheduling) of tasks to resources, aware of locality

  29. Leveraging hw. concurrency
     Implementations
     - SuperMatrix (UT Austin and UJI)
       - Read/written blocks defined implicitly by the operations
       - Only valid for dense linear algebra operations encoded in libflame
     - SMPSs (BSC) and GPUSs (BSC and UJI)
       - OpenMP-like languages:
           #pragma css task inout(A[b*b])
           void Chol(double *A);
       - Applicable to task-parallel codes on different platforms: multi-core,
         multi-GPU, multi-accelerator, Grid, …
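     The SMPSs-style annotations above have a close analogue in standard OpenMP
     (the task depend clauses introduced in OpenMP 4.0), so the task-parallel
     Cholesky of slide 23 can be sketched portably as follows. This is an
     illustrative sketch, not the SMPSs code itself: Chol/Trsm/Syrk/Gemm are
     assumed to wrap the kernels shown earlier, and A[i*nb + j] is assumed to
     point to block (i,j):

       /* Hedged sketch: blocked Cholesky with OpenMP task dependences. The
          runtime derives the same task DAG from the in/inout clauses. */
       void Chol(double *Akk);                           /* assumed kernel wrappers */
       void Trsm(const double *Akk, double *Aik);
       void Syrk(const double *Aik, double *Aii);
       void Gemm(const double *Aik, const double *Ajk, double *Aij);

       void chol_tasks(double *A[], int nb)   /* A[i*nb + j]: b x b block (i,j) */
       {
           #pragma omp parallel
           #pragma omp single
           for (int k = 0; k < nb; k++) {
               #pragma omp task depend(inout: A[k*nb+k][0])
               Chol(A[k*nb+k]);                                          /* F  */
               for (int i = k + 1; i < nb; i++) {
                   #pragma omp task depend(in: A[k*nb+k][0]) depend(inout: A[i*nb+k][0])
                   Trsm(A[k*nb+k], A[i*nb+k]);                           /* T  */
               }
               for (int i = k + 1; i < nb; i++) {
                   #pragma omp task depend(in: A[i*nb+k][0]) depend(inout: A[i*nb+i][0])
                   Syrk(A[i*nb+k], A[i*nb+i]);                           /* U1 */
                   for (int j = k + 1; j < i; j++) {
                       #pragma omp task depend(in: A[i*nb+k][0], A[j*nb+k][0]) depend(inout: A[i*nb+j][0])
                       Gemm(A[i*nb+k], A[j*nb+k], A[i*nb+j]);            /* U2 */
                   }
               }
           }   /* the single region's implicit barrier waits for all tasks */
       }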

  30. Index
     1. Scientific applications
     2. Leveraging hardware concurrency
     3. Cost of energy

  31. Cost of energy
     “Computer Architecture: A Quantitative Approach”, J. Hennessy and
     D. Patterson, 2011

  32. Cost of energy
     “The free lunch is over” (H. Sutter, 2005):
     - Frequency wall
     - Instruction-level parallelism (ILP) wall
     - Memory wall

  33. Cost of energy
     Frequency wall:
     - Power consumption grows proportionally to f^3; energy, to f^2
     - Electricity = money
     - 1st Law of Thermodynamics: energy cannot be created or destroyed, only
       converted (ultimately into heat)
     - Cost of extracting heat
     - Heat reduces lifetime
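     A quick numeric illustration of that rule of thumb, under the common
     assumption that supply voltage is scaled down together with frequency and
     that the task is compute-bound (so its runtime grows as 1/f):

       /* Dynamic power ~ C V^2 f; with V scaled along with f this is ~ f^3, and
          the energy of a fixed task is power * time ~ f^3 * (1/f) = f^2. */
       #include <stdio.h>

       int main(void) {
           const double f0 = 2.0;                        /* baseline frequency, GHz */
           for (double s = 1.0; s >= 0.49; s -= 0.25) {  /* frequency scaling factor */
               printf("f = %.2f GHz: power x%.3f, time x%.2f, energy x%.3f\n",
                      s * f0, s*s*s, 1.0/s, s*s);
           }
           return 0;
       }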
