Modeling Power and Energy of the Task-Parallel Cholesky


  1. International Conference on Energy-Aware High Performance Computing, September 12, 2012, Hamburg (Germany).
     Modeling Power and Energy of the Task-Parallel Cholesky Factorization on Multicore Processors.
     Pedro Alonso (1), Manuel F. Dolz (2), Rafael Mayo (2), Enrique S. Quintana-Ortí (2).

  2. Motivation
     - High performance computing: optimization of algorithms applied to solve complex problems.
     - Technological advance ⇒ improved performance: higher number of cores per socket (processor).
     - Large number of processors and cores ⇒ high energy consumption.
     - Tools to analyze performance and power are needed in order to detect code inefficiencies and reduce energy consumption.
     Slide header (sections): Introduction; Task-parallelism in the Cholesky factorization; Power model; Experimental results; Conclusions and future work.
     Slide footer: Manuel F. Dolz et al., Modeling Power and Energy of Task-Parallel Cholesky on Multicore Proc.

  3. Outline
     1. Introduction
     2. Task-parallelism in the Cholesky factorization: algorithm specification, parallelization, SMPSs operation
     3. Power model: formulation, environment setup, component estimation, power/energy model testing
     4. Experimental results: energy model evaluation, power model evaluation
     5. Conclusions and future work

  4. Introduction
     - Parallel scientific applications. Examples for dense linear algebra: Cholesky, QR and LU factorizations.
     - Tools for power and energy analysis: power profiling in combination with performance/tracing tools for HPC.
     - Parallel applications + power profiling ⇒ Is it possible to predict power/energy consumption?
     - Objective: power modeling. Predict the power consumed by applications without power measurement devices.
     - Estimations are needed to determine how to address the power challenge for energy-efficient hardware and software ⇒ energy savings.

  6. Algorithm specification
     Cholesky factorization: A = U^T U, where
     - A ∈ R^(n×n) is a symmetric positive definite (s.p.d.) matrix
     - U ∈ R^(n×n) is an upper triangular matrix
     ⇒ Consider a partitioning of matrix A into s × s blocks of size b × b:

     for k = 1, 2, ..., s do
       A_kk = U_kk^T U_kk                   Chol: Cholesky factorization
       for j = k+1, k+2, ..., s do
         A_kj ← U_kk^(-T) A_kj              Trsm: Triangular system solve
       end for
       for i = k+1, k+2, ..., s do
         A_ii ← A_ii − A_ki^T A_ki          Syrk: Symmetric rank-b update
         for j = i+1, i+2, ..., s do
           A_ij ← A_ij − A_ki^T A_kj        Gemm: Matrix-matrix product
         end for
       end for
     end for

     [Figure: iteration 1 of the blocked algorithm]
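The blocked algorithm on this slide can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the helper names (`to_blocks`, `chol`, `trsm`, `gemm`, `syrk`, `blocked_cholesky`) are hypothetical, the matrix is assumed to be stored as an s × s grid of b × b nested lists, and the triangular solve is applied from the left, as the upper-triangular form A = U^T U requires. A real code would call tuned LAPACK/BLAS kernels instead of these loops.

```python
import math

def to_blocks(a, b):
    """Split a square matrix (list of rows) into a grid of b x b blocks."""
    s = len(a) // b
    return [[[row[j*b:(j+1)*b] for row in a[i*b:(i+1)*b]]
             for j in range(s)] for i in range(s)]

def chol(a):
    """Chol: overwrite block a with upper triangular U such that a = U^T U."""
    n = len(a)
    for j in range(n):
        for i in range(j + 1):
            t = a[i][j] - sum(a[p][i] * a[p][j] for p in range(i))
            if i == j:
                if t <= 0.0:
                    raise ValueError("matrix is not positive definite")
                a[i][j] = math.sqrt(t)
            else:
                a[i][j] = t / a[i][i]
    for i in range(n):              # clear the strictly lower part
        for j in range(i):
            a[i][j] = 0.0

def trsm(u, x):
    """Trsm: x <- u^(-T) x, by forward substitution with the lower factor u^T."""
    for c in range(len(x[0])):
        for i in range(len(u)):
            t = x[i][c] - sum(u[p][i] * x[p][c] for p in range(i))
            x[i][c] = t / u[i][i]

def gemm(a, b, c):
    """Gemm: c <- c - a^T b."""
    for i in range(len(c)):
        for j in range(len(c[0])):
            c[i][j] -= sum(a[p][i] * b[p][j] for p in range(len(a)))

def syrk(a, c):
    """Syrk: c <- c - a^T a (written via gemm for brevity)."""
    gemm(a, a, c)

def blocked_cholesky(blk):
    """Right-looking blocked Cholesky; only blocks with i <= j hold the result."""
    s = len(blk)
    for k in range(s):
        chol(blk[k][k])
        for j in range(k + 1, s):
            trsm(blk[k][k], blk[k][j])
        for i in range(k + 1, s):
            syrk(blk[k][i], blk[i][i])
            for j in range(i + 1, s):
                gemm(blk[k][i], blk[k][j], blk[i][j])
```

A convenient check is the matrix with entries A[i][j] = min(i, j) + 1 (0-based indices), whose upper Cholesky factor has every entry on and above the diagonal equal to 1.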

  10. Algorithm specification (iteration 5, last animation step of the blocked algorithm shown on slide 6)
      Parallelization ⇒ Not trivial at code level!

  11. Parallelization. Option 1: Use multi-threaded BLAS
      - Straightforward approach towards LAPACK-level parallelization.
      - Highly tuned multi-threaded kernels: Intel MKL, AMD ACML, IBM ESSL, ...
      - Fork/join approach: parallelism is not fully exploited.
      [Figure: sequence of fork/join steps]

  12. Parallelization. Option 2: Use a runtime task scheduler
      - We use the SMPSs runtime-compiler framework to exploit task-parallelism.
      - Functions in the code are annotated as tasks using OpenMP-like pragmas: #pragma css task
      - Operations are not necessarily executed in the order in which they appear in the code, but in an order that respects data dependencies.
      - SMPSs easily obtains performance traces, which can be analyzed using Paraver (performance analysis tools from the Barcelona Supercomputing Center).
      - SMPSs proceeds in two stages:
        1. A symbolic execution produces a DAG containing the dependencies.
        2. The DAG dictates the feasible orderings in which tasks can be executed.
      Figure: Right-looking Cholesky DAG with a matrix consisting of 4 × 4 blocks (nodes: Chol, Trsm, Syrk, Gemm tasks).
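The two stages described on this slide can be imitated with a small Python sketch. It is not SMPSs itself: `cholesky_tasks`, `build_dag` and `levels` are hypothetical names, and the dependency rule is a simplification that assumes every written block is a read-write operand, so tracking the last writer of each block is enough for the dependencies that arise in this algorithm. The sketch walks the loop nest symbolically, records which earlier task each task must wait for, and then computes the earliest "wave" in which each task could run.

```python
def cholesky_tasks(s):
    """Yield (name, blocks_read, block_written) in program order (0-based)."""
    for k in range(s):
        yield ("Chol", [], (k, k))
        for j in range(k + 1, s):
            yield ("Trsm", [(k, k)], (k, j))
        for i in range(k + 1, s):
            yield ("Syrk", [(k, i)], (i, i))
            for j in range(i + 1, s):
                yield ("Gemm", [(k, i), (k, j)], (i, j))

def build_dag(s):
    """Symbolic execution: return task names and, per task, its predecessors."""
    names, deps = [], []
    last_writer = {}                      # block -> id of the task that wrote it last
    for name, reads, write in cholesky_tasks(s):
        tid = len(names)
        # The written block is also read (in/out operand), so it counts as an input.
        preds = {last_writer[blk] for blk in reads + [write] if blk in last_writer}
        names.append(name)
        deps.append(preds)
        last_writer[write] = tid
    return names, deps

def levels(deps):
    """Earliest wave per task: one more than the latest wave among its predecessors."""
    lvl = []
    for preds in deps:                    # predecessors always have smaller ids
        lvl.append(1 + max((lvl[p] for p in preds), default=-1))
    return lvl
```

For a 4 × 4 block matrix, `build_dag(4)` yields 20 tasks (4 Chol, 6 Trsm, 6 Syrk, 4 Gemm), the same node counts as in the figure, and `max(levels(deps)) + 1` gives a longest dependency chain of 10 tasks. Tasks that share a level have no dependency path between them, which is the freedom a runtime scheduler can exploit and a fork/join scheme cannot.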
