Energy-Aware Matrix Computations on Multi-Core and Many-Core Platforms
Enrique S. Quintana-Ortí
Universidad Complutense de Madrid, June 2012
Performance and energy consumption
Top500 (November 2011)

Rank  Site                                                             #Cores   LINPACK (TFLOPS)
1     RIKEN AICS – K Computer, SPARC64 VIIIfx (8-core)                 705,024  10,510.00 *
2     NSC Tianjin – NUDT YH MPP, Xeon X5670 6C 2.93 GHz, NVIDIA 2050   186,368   2,566.00
3     DOE ORNL – Cray XT5-HE, Opteron 6-core 2.6 GHz                   224,162   1,759.00
9     CEA (France) – Bull bullx super-node S6010/S6030                 138,368   1,050.00
114   BSC (Spain) – Bull B505, Xeon E5649 6C 2.53 GHz, NVIDIA 2090       5,544     103.20

* 1 day of the K Computer = 394 years of the world population (7,000 million people) working with hand calculators
Performance and energy consumption
Green500 (November 2011)

Rank (Green/Top)  Site                                                             #Cores   MFLOPS/W  LINPACK (TFLOPS)
1/29              IBM Rochester – BlueGene/Q, Power BQC 16C 1.60 GHz                32,768  2,026.48     339.83
7/114             BSC (Spain) – Bull B505, Xeon E5649 6C 2.53 GHz, NVIDIA 2090       5,544  1,266.26     103.20
32/1              RIKEN AICS – K Computer, SPARC64 VIIIfx (8-core)                 705,024    830.18  10,510.00
47/2              NSC Tianjin – NUDT YH MPP, Xeon X5670 6C 2.93 GHz, NVIDIA 2050   186,368    635.15   2,566.00
53/3              DOE ORNL – Cray XT5-HE, Opteron 6-core 2.6 GHz                   224,162    582.00   1,759.00
Multi-core and many-core platforms
“Conventional” architectures
New challengers…
Matrix computations
Linear algebra? Please, don’t run away!
Determinants, linear systems, least-squares fitting, FFT, etc.
Importance: Intel MKL, AMD ACML, IBM ESSL, NVIDIA CUBLAS, and an ongoing effort for TI
Index
1. Scientific applications
2. Leveraging hardware concurrency
3. Cost of energy
Scientific applications
Biological systems: simulations of molecular dynamics
Solve the generalized eigenproblem A X = B X Λ, with dense A, B → n × n, n = 134,484
Scientific applications
Industrial processes: optimal cooling of steel profiles
Solve the algebraic Riccati equation AᵀX + X A – X S X + Q = 0, with dense A → n × n, n = 5,177 for a mesh width of 6.91·10^-3
Scientific applications
Summary: dense linear algebra is at the bottom of the “food chain” for many scientific and engineering applications:
Fast acoustic scattering problems
Dielectric polarization of nanostructures
Magneto-hydrodynamics
Macro-economics
Index
1. Scientific applications
2. Leveraging hardware concurrency
3. Cost of energy
Leveraging hw. concurrency
Threads

Linear system: 2x + 3y = 3, 4x – 5y = 6
In general, A X = B, with dense A, B → n × n: ≈ 2n³/3 + 2n³ flops
Intel Xeon: 4 DP flops/cycle, e.g., at f = 2.0 GHz

n        1 core       8 cores     16-node cluster, 8 cores/node (192 cores)
100      0.33 ms      --          --
1,000    0.33 s       --          --
10^4     333.33 s     41.62 s     --
10^5     > 92 h       > 11 h      > 28 min
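A minimal C sketch of this back-of-the-envelope estimate, assuming the quoted peak rate of 4 DP flops/cycle at 2.0 GHz per core and ideal scaling (no memory traffic, no parallel overhead):

#include <stdio.h>

/* Idealized time for solving A X = B with dense n x n A and B:
   flops ~ 2n^3/3 (factorization) + 2n^3 (triangular solves with n rhs),
   executed at cores * 4 flops/cycle * 2.0 GHz. Peak-only estimate. */
static double solve_time(double n, int cores) {
    double flops = 2.0 * n * n * n / 3.0 + 2.0 * n * n * n;
    double rate  = cores * 4.0 * 2.0e9;   /* flops per second */
    return flops / rate;
}

int main(void) {
    double sizes[] = { 100, 1000, 1e4, 1e5 };
    int    cores[] = { 1, 8, 192 };
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 3; j++)
            printf("n=%8.0f, %3d cores: %12.4g s\n",
                   sizes[i], cores[j], solve_time(sizes[i], cores[j]));
    return 0;
}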
Leveraging hw. concurrency
Threads

                 2010: PFLOPS (10^15 flops/sec.)                2020: EFLOPS (10^18 flops/sec.)
Core level       10^9    (PowerPC 450, 850 MHz → 3.4 GFLOPS)    10^9.5
Node level       10^1    (quad-core)                            10^3 !
Cluster level    10^5    (73,728 nodes)                         10^5.5
2010 reference machine: JUGENE
Leveraging hw. concurrency
Cholesky factorization: A = L * Lᵀ
Key in the solution of s.p.d. linear systems A x = b:
(L Lᵀ) x = b  →  L y = b (solve for y),  Lᵀ x = y (solve for x)
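A minimal sketch of this solve with LAPACK’s Cholesky routines (dpotrf for A = L Lᵀ, dpotrs for the two triangular solves); the small s.p.d. system here is made-up illustrative data:

#include <stdio.h>
#include <lapacke.h>   /* LAPACKE C interface to LAPACK */

int main(void) {
    /* Small s.p.d. matrix A (column-major, lower triangle referenced) and rhs b */
    const lapack_int n = 3, nrhs = 1;
    double A[] = { 4, 2, 1,
                   2, 5, 3,
                   1, 3, 6 };
    double b[] = { 1, 2, 3 };

    /* A = L * L^T (in place; lower triangle overwritten by L) */
    lapack_int info = LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', n, A, n);
    if (info != 0) { printf("dpotrf failed: %d\n", (int)info); return 1; }

    /* Solve L y = b and then L^T x = y; x overwrites b */
    LAPACKE_dpotrs(LAPACK_COL_MAJOR, 'L', n, nrhs, A, n, b, n);

    printf("x = [%f %f %f]\n", b[0], b[1], b[2]);
    return 0;
}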
Leveraging hw. concurrency
Cholesky factorization (blocked), 1st iteration:
F:  A11 = L11 * L11ᵀ
T:  L21 ← A21 * L11^-T
U:  A22 ← A22 – L21 * L21ᵀ
Blocking reuses data in cache
MT processor: employ a multithreaded (MT) implementation of T and U
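These three updates follow from a 2 × 2 blocked partition of A = L Lᵀ (standard right-looking derivation):

\[
\begin{pmatrix} A_{11} & \ast \\ A_{21} & A_{22} \end{pmatrix}
=
\begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix}
\begin{pmatrix} L_{11}^T & L_{21}^T \\ 0 & L_{22}^T \end{pmatrix}
\]
\[
\begin{aligned}
A_{11} &= L_{11} L_{11}^T  && \Rightarrow\ \text{F: factor the diagonal block}\\
A_{21} &= L_{21} L_{11}^T  && \Rightarrow\ \text{T: } L_{21} = A_{21} L_{11}^{-T}\\
A_{22} &= L_{21} L_{21}^T + L_{22} L_{22}^T && \Rightarrow\ \text{U: } A_{22} - L_{21} L_{21}^T = L_{22} L_{22}^T \text{, then recurse}
\end{aligned}
\]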
Leveraging hw. concurrency
Cholesky factorization (blocked)
[Figure: state of the matrix after the 1st, 2nd and 3rd iterations]
Leveraging hw. concurrency
Cholesky factorization (blocked)

for (k=1; k<=n/b; k++){
  F: Chol(A[k,k]);                 // A_kk = L_kk * L_kk^T
  if (k < n/b){
    T: Trsm(A[k,k], A[k+1,k]);     // L_k+1,k <- A_k+1,k * L_kk^-T
    U: Syrk(A[k+1,k], A[k+1,k+1]); // A_k+1,k+1 <- A_k+1,k+1 - L_k+1,k * L_k+1,k^T
  }
}
Leveraging hw. concurrency
Cholesky factorization (blocked)
[Performance plot: the blocked factorization attains 57%, 71%, and 80% of peak]
Leveraging hw. concurrency
Algorithmic parallelism
Why? Extracting parallelism only inside the multithreaded kernels implies excessive thread synchronization:

for (k=1; k<=n/b; k++){
  F: Chol(A[k,k]);                 // A_kk = L_kk * L_kk^T
  if (k < n/b){
    T: Trsm(A[k,k], A[k+1,k]);     // L_k+1,k <- A_k+1,k * L_kk^-T
    U: Syrk(A[k+1,k], A[k+1,k+1]); // A_k+1,k+1 <- A_k+1,k+1 - L_k+1,k * L_k+1,k^T
  }
}
Leveraging hw. concurrency
Algorithmic parallelism
…but there is much more parallelism!
[Figure: tasks from the 1st, 2nd and 3rd iterations]
Leveraging hw. concurrency
Algorithmic parallelism
…but there is much more parallelism: inside the same iteration and across different iterations.
How can we leverage it?
Leveraging hw. concurrency
Task parallelism
Scalar code on a (super)scalar processor (pipeline stages: IF, ID, ISS, functional units UF0, UF1, UF2):

loop:  ld    f0, 0(r1)
       addd  f4, f0, f2
       sd    f4, 0(r1)
       addi  r1, r1, #8
       subi  r2, r2, #1
       bnez  r2, loop
Leveraging hw. concurrency
Task parallelism
Something similar for (dense) linear algebra?

for (k=1; k<=n/b; k++){
  F:   Chol(A[k,k]);
       for (i=k+1; i<=n/b; i++)
  T:     Trsm(A[k,k], A[i,k]);
       for (i=k+1; i<=n/b; i++){
  U1:    Syrk(A[i,k], A[i,i]);
         for (j=k+1; j<i; j++)
  U2:      Gemm(A[i,k], A[j,k], A[i,j]);
       }
}
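A more concrete, sequential rendering of this loop nest, sketched with LAPACK/BLAS kernels (dpotrf, dtrsm, dsyrk, dgemm); it assumes the matrix is stored as b × b column-major tiles addressed through an array of tile pointers, with only the lower triangle referenced:

#include <cblas.h>
#include <lapacke.h>

/* Tiled Cholesky sketch: A holds nt*nt pointers to b x b column-major tiles
   (row-major over tiles); only tiles with i >= j are referenced.
   On exit the lower triangle holds the Cholesky factor L, tile by tile. */
int tiled_cholesky(int nt, int b, double **A) {
    #define T(i,j) A[(i)*nt + (j)]
    for (int k = 0; k < nt; k++) {
        /* F: A_kk = L_kk * L_kk^T */
        lapack_int info = LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', b, T(k,k), b);
        if (info != 0) return (int)info;

        /* T: L_ik <- A_ik * L_kk^{-T} */
        for (int i = k + 1; i < nt; i++)
            cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans,
                        CblasNonUnit, b, b, 1.0, T(k,k), b, T(i,k), b);

        /* U: trailing-submatrix update */
        for (int i = k + 1; i < nt; i++) {
            /* U1: A_ii <- A_ii - L_ik * L_ik^T */
            cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans,
                        b, b, -1.0, T(i,k), b, 1.0, T(i,i), b);
            /* U2: A_ij <- A_ij - L_ik * L_jk^T   (j < i) */
            for (int j = k + 1; j < i; j++)
                cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                            b, b, b, -1.0, T(i,k), b, T(j,k), b, 1.0, T(i,j), b);
        }
    }
    #undef T
    return 0;
}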
Leveraging hw. concurrency
Task parallelism
Apply “scalar” techniques at the block level:
Software implementation
Thread/task-level parallelism
Target the cores/GPUs of the platform
Leveraging hw. concurrency
Task parallelism
Read/written blocks determine dependencies, as registers do in the scalar case:

loop:  ld    f0, 0(r1)          for (k=1; k<=n/b; k++){
       addd  f4, f0, f2           Chol(A[k,k]);
       sd    f4, 0(r1)            for (i=k+1; i<=n/b; i++)
       addi  r1, r1, #8             Trsm(A[k,k], A[i,k]);
       …                           …

Dependencies form a dependency DAG (task tree)
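As an illustration of the idea (not the runtimes described in these slides), OpenMP 4.5 task dependences let the programmer name the tiles each task reads and writes, and the runtime builds the DAG from those clauses; chol_tile, trsm_tile, syrk_tile and gemm_tile stand for assumed wrappers around the kernels of the previous sketch:

/* Assumed kernel wrappers around dpotrf / dtrsm / dsyrk / dgemm on b x b tiles */
void chol_tile(int b, double *A);
void trsm_tile(int b, const double *L, double *A);
void syrk_tile(int b, const double *L, double *A);
void gemm_tile(int b, const double *L1, const double *L2, double *A);

/* Task-parallel tiled Cholesky: in/out/inout clauses declare the tiles each
   task reads and writes; the OpenMP runtime derives the dependency DAG. */
void tiled_cholesky_tasks(int nt, int b, double **A) {
    #define T(i,j) A[(i)*nt + (j)]
    #pragma omp parallel
    #pragma omp single
    {
        for (int k = 0; k < nt; k++) {
            #pragma omp task depend(inout: T(k,k)[0:b*b])
            chol_tile(b, T(k,k));                            /* F  */

            for (int i = k + 1; i < nt; i++) {
                #pragma omp task depend(in: T(k,k)[0:b*b]) depend(inout: T(i,k)[0:b*b])
                trsm_tile(b, T(k,k), T(i,k));                /* T  */
            }
            for (int i = k + 1; i < nt; i++) {
                #pragma omp task depend(in: T(i,k)[0:b*b]) depend(inout: T(i,i)[0:b*b])
                syrk_tile(b, T(i,k), T(i,i));                /* U1 */
                for (int j = k + 1; j < i; j++) {
                    #pragma omp task depend(in: T(i,k)[0:b*b], T(j,k)[0:b*b]) depend(inout: T(i,j)[0:b*b])
                    gemm_tile(b, T(i,k), T(j,k), T(i,j));    /* U2 */
                }
            }
        }
    }   /* all tasks complete at the implicit barrier */
    #undef T
}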
Leveraging hw. concurrency
Task parallelism
Runtime (analogous to the ID and ISS stages of a superscalar processor):
Decode (ID): generate the task tree with a “symbolic analysis” of the code at execution time
Issue (ISS): architecture-aware execution of the tasks in the tree
Leveraging hw. concurrency
Task parallelism
Decode stage: “symbolic analysis” of the code, mapping the blocked code to a task tree:

for (k=1; k<=n/b; k++){
  Chol(A[k,k]);
  for (i=k+1; i<=n/b; i++)
    Trsm(A[k,k], A[i,k]);
  …
}
Leveraging hw. concurrency
Task parallelism
Issue stage:
Temporal scheduling of tasks, attending to dependencies
Mapping (spatial scheduling) of tasks to resources, aware of locality
Leveraging hw. concurrency
Implementations
SuperMatrix (UT Austin and UJI):
Read/written blocks defined implicitly by the operations
Only valid for dense linear algebra operations encoded in libflame
SMPSs (BSC) and GPUSs (BSC and UJI):
OpenMP-like languages, e.g.
  #pragma css task inout(A[b*b])
  void Chol(double *A);
Applicable to task-parallel codes on different platforms: multi-core, multi-GPU, multi-accelerator, Grid, …
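Extending the annotation style of the example above, a sketch of how the remaining Cholesky kernels could be declared (the exact SMPSs clause syntax may differ; this only illustrates how the read/written tiles are exposed to the runtime, with b an assumed compile-time tile dimension):

/* input = tiles only read by the task, inout = tiles read and overwritten */
#pragma css task inout(A[b*b])
void Chol(double *A);                               /* A = L * L^T        */

#pragma css task input(L[b*b]) inout(A[b*b])
void Trsm(double *L, double *A);                    /* A <- A * L^{-T}    */

#pragma css task input(L[b*b]) inout(A[b*b])
void Syrk(double *L, double *A);                    /* A <- A - L * L^T   */

#pragma css task input(L1[b*b], L2[b*b]) inout(A[b*b])
void Gemm(double *L1, double *L2, double *A);       /* A <- A - L1 * L2^T */

With these declarations, the sequential-looking loop nest shown earlier becomes the task-parallel program: each kernel call spawns a task, and the runtime infers the dependencies from the tiles named in the clauses.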
Index
1. Scientific applications
2. Leveraging hardware concurrency
3. Cost of energy
Cost of energy
“Computer Architecture: A Quantitative Approach”, J. Hennessy, D. Patterson, 2011
Cost of energy
“The free lunch is over” (H. Sutter, 2005)
Frequency wall
Instruction-level parallelism (ILP) wall
Memory wall
Cost of energy
Frequency wall
Power and energy consumption grow roughly as f³ and f², respectively
Electricity = money
1st Law of Thermodynamics: energy cannot be created or destroyed, only converted → cost of extracting the heat
Heat reduces lifetime
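The back-of-the-envelope argument behind these exponents, assuming dynamic power dominates and supply voltage scales roughly with frequency:

\[
P_{\mathrm{dyn}} \approx C\,V^2 f, \qquad V \propto f \;\Rightarrow\; P \propto f^3,
\]
\[
E = P \cdot t_{\mathrm{exec}}, \qquad t_{\mathrm{exec}} \propto \frac{1}{f} \;\Rightarrow\; E \propto \frac{f^3}{f} = f^2 .
\]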