  1. A High Performance Computing Course Guided by the LU Factorization
     Gregorio Bernabé, Javier Cuenca, Domingo Giménez, Luis P. García and Sergio Rivas
     Universidad de Murcia / Universidad Politécnica de Cartagena
     Scientific Computing and Parallel Programming Group (SCPPG)
     Workshop on Teaching Computational Science (WTCS), International Conference on Computational Science, June 10-12, 2014

  2. Outline
     1. General organization of the course
     2. The LU factorization
     3. Development of the course
     4. Evaluating Teaching

  3. Outline (section: General organization of the course)

  4. Course description
     Parallel Programming and High Performance Computing.
     Master in New Technologies in Computer Science, specialization in High Performance Architectures and Supercomputing.
     Small class ⇒ high-level students, interested in the subject.
     Initiation to research ⇒ techniques for the Master's Thesis.
     Guided by the LU factorization.

  5. Syllabus
     Parallel programming environments: OpenMP, MPI, CUDA.
     Matrix computation: sequential algorithms, algorithms by blocks, out-of-core algorithms, parallel algorithms.
     Numerical libraries: BLAS, LAPACK, MKL, PLASMA, MAGMA, ScaLAPACK.

  6. Proposed problem
     LU factorization of large matrices in today's heterogeneous computational systems.
     Students use the LU factorization to develop their own implementations, based on:
     - Efficient use of optimized libraries
     - Use of different parallel programming paradigms
     - Out-of-core techniques for large matrices
     - Combination of the different approaches for clusters of multicore+GPU

  7. Outline (section: The LU factorization)

  8. LU factorization by blocks
     A basic blocked version of the LU factorization is explained, for the students to work with. Partitioning the matrix into blocks,

     $$\begin{pmatrix} A_{00} & A_{01} & A_{02} \\ A_{10} & A_{11} & A_{12} \\ A_{20} & A_{21} & A_{22} \end{pmatrix} = \begin{pmatrix} L_{00} & & \\ L_{10} & L_{11} & \\ L_{20} & L_{21} & L_{22} \end{pmatrix} \ast \begin{pmatrix} U_{00} & U_{01} & U_{02} \\ & U_{11} & U_{12} \\ & & U_{22} \end{pmatrix}$$

     the version is based on four steps:
     Step 1: $A_{00} = L_{00} \ast U_{00}$ (unblocked LU factorization)
     Step 2: $A_{0i} = L_{00} \ast U_{0i}$ (multiple lower triangular systems)
     Step 3: $A_{i0} = L_{i0} \ast U_{00}$ (multiple upper triangular systems)
     Step 4: $A_{ij} = A_{ij} - L_{i0} \ast U_{0j}$ (update south-east blocks)
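To make the four steps concrete, here is a minimal C sketch of this blocked factorization. It is illustrative only: it assumes no pivoting, a square n x n matrix in row-major order, and a block size b that divides n; all function names are ours, not from the slides.

```c
#include <stdio.h>

/* Step 1: unblocked LU on a b x b block (L unit lower, stored in place). */
void lu_unblocked(double *a, int lda, int b) {
    for (int k = 0; k < b; k++)
        for (int i = k + 1; i < b; i++) {
            a[i * lda + k] /= a[k * lda + k];          /* multiplier -> L */
            for (int j = k + 1; j < b; j++)
                a[i * lda + j] -= a[i * lda + k] * a[k * lda + j];
        }
}

/* Step 2: solve L00 * X = A0j for X = U0j (L00 unit lower triangular). */
void solve_lower(const double *l, double *a, int lda, int b) {
    for (int k = 0; k < b; k++)
        for (int i = k + 1; i < b; i++)
            for (int j = 0; j < b; j++)
                a[i * lda + j] -= l[i * lda + k] * a[k * lda + j];
}

/* Step 3: solve X * U00 = Ai0 for X = Li0 (U00 upper triangular). */
void solve_upper(const double *u, double *a, int lda, int b) {
    for (int j = 0; j < b; j++)
        for (int i = 0; i < b; i++) {
            for (int k = 0; k < j; k++)
                a[i * lda + j] -= a[i * lda + k] * u[k * lda + j];
            a[i * lda + j] /= u[j * lda + j];
        }
}

/* Step 4: Aij -= Li0 * U0j (the south-east update). */
void update(const double *l, const double *u, double *a, int lda, int b) {
    for (int i = 0; i < b; i++)
        for (int j = 0; j < b; j++)
            for (int k = 0; k < b; k++)
                a[i * lda + j] -= l[i * lda + k] * u[k * lda + j];
}

void lu_blocked(double *a, int n, int b) {
    for (int k = 0; k < n; k += b) {
        double *akk = &a[k * n + k];
        lu_unblocked(akk, n, b);                               /* Step 1 */
        for (int j = k + b; j < n; j += b)
            solve_lower(akk, &a[k * n + j], n, b);             /* Step 2 */
        for (int i = k + b; i < n; i += b)
            solve_upper(akk, &a[i * n + k], n, b);             /* Step 3 */
        for (int i = k + b; i < n; i += b)                     /* Step 4 */
            for (int j = k + b; j < n; j += b)
                update(&a[i * n + k], &a[k * n + j], &a[i * n + j], n, b);
    }
}

int main(void) {
    double a[16] = { 4, 3, 2, 1,  3, 4, 3, 2,  2, 3, 4, 3,  1, 2, 3, 4 };
    lu_blocked(a, 4, 2);
    for (int i = 0; i < 16; i++)                /* L and U stored in place */
        printf("%7.3f%c", a[i], (i % 4 == 3) ? '\n' : ' ');
    return 0;
}
```

With b = n the code reduces to the unblocked factorization, which is one of the comparisons made in the practical of slide 15.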

  9. Implementations
     Different implementations based on the structure by blocks:
     - Shared memory: assignment of the work on the blocks to different threads; use of multithread libraries (a sketch follows this slide)
     - Message passing: distribution of blocks to the processes; communication of the blocks needed for local computation
     - GPU: use of libraries for GPU; assignment of blocks to CPU and GPU
     - Out-of-core: blocks stored in secondary memory and brought to main memory for computation
     - Heterogeneous systems: balanced assignment of blocks to the computational components
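For the shared-memory option, a natural first exercise is to parallelize the independent block operations of each step with OpenMP. The sketch below reuses the helpers from the previous sketch; the barrier placement (nowait after Step 2, a full barrier before Step 4) reflects the data dependences: each Step 4 update needs its Li0 and U0j blocks to be ready.

```c
#include <omp.h>

/* Helpers from the blocked-LU sketch above. */
void lu_unblocked(double *a, int lda, int b);
void solve_lower(const double *l, double *a, int lda, int b);
void solve_upper(const double *u, double *a, int lda, int b);
void update(const double *l, const double *u, double *a, int lda, int b);

void lu_blocked_omp(double *a, int n, int b) {
    for (int k = 0; k < n; k += b) {
        double *akk = &a[k * n + k];
        lu_unblocked(akk, n, b);                 /* sequential panel (Step 1) */
        #pragma omp parallel
        {
            #pragma omp for nowait               /* Step 2: independent block columns */
            for (int j = k + b; j < n; j += b)
                solve_lower(akk, &a[k * n + j], n, b);
            #pragma omp for                      /* Step 3: independent block rows;   */
            for (int i = k + b; i < n; i += b)   /* implicit barrier: 2 and 3 finish  */
                solve_upper(akk, &a[i * n + k], n, b);
            #pragma omp for collapse(2)          /* Step 4: independent (i, j) pairs */
            for (int i = k + b; i < n; i += b)
                for (int j = k + b; j < n; j += b)
                    update(&a[i * n + k], &a[k * n + j], &a[i * n + j], n, b);
        }
    }
}
```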

  10. Outline (section: Development of the course)

  11. Organization and methodology
      Students with different knowledge (from different universities, degrees and specializations) and different interests:
      - from companies: optional subject
      - HPC used in their Master's Thesis
      - Master's Thesis on HPC
      ⇒ Problem-based learning, which favors autonomous work and individual supervision.

  12. Initial sessions
      Presentation: presents the course, its organization, the problem to work with and the tasks to be done by the students.
      OpenMP and MPI: two sessions are organized outside the general course timetable for students without knowledge of parallel programming.

  13. Matrix algorithms
      Basic concepts of sparse and dense basic linear algebra routines.
      Column-major and row-major storage schemes; the concept of leading dimension.
      Algorithms by blocks. Basic routines.
      LU factorization, versions without blocks and by blocks.
      Precision issues.
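A small C example can make the leading-dimension concept tangible: a submatrix inside a larger matrix keeps the leading dimension of the container, not its own size. The matrix contents and names here are purely illustrative.

```c
#include <stdio.h>

int main(void) {
    /* A 4 x 4 column-major matrix; columns are contiguous in memory. */
    double a[16];
    int ld = 4;                       /* leading dimension of the full matrix */
    for (int j = 0; j < 4; j++)
        for (int i = 0; i < 4; i++)
            a[j * ld + i] = 10.0 * i + j;     /* element (i, j) */

    /* The trailing 2 x 2 submatrix starts at element (2, 2); it keeps the
     * leading dimension of the full matrix, ld = 4, not its own size 2. */
    double *sub = &a[2 * ld + 2];
    printf("sub(0,0) = %.0f, sub(1,1) = %.0f\n",
           sub[0 * ld + 0], sub[1 * ld + 1]); /* prints 22 and 33 */
    return 0;
}
```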

  14. Numerical libraries
      General structure of numerical libraries.
      Centered on dense linear algebra libraries:
      - Basic routines: structure of BLAS; multithread implementations (MKL, GotoBLAS, ATLAS); auto-tuning (ATLAS)
      - Higher-level routines: structure of LAPACK; multithread implementations (MKL); alternative approaches (PLAPACK); recent efforts of optimization for multicore (PLASMA)
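As an illustration of how these libraries plug into the block algorithm, the Step 4 update Aij -= Li0 * U0j is exactly one BLAS-3 dgemm call; any of the mentioned implementations (MKL, GotoBLAS, ATLAS) can be linked behind the same CBLAS interface. A possible wrapper, with the same row-major layout and naming as the earlier sketches:

```c
#include <cblas.h>

/* Step 4 via BLAS-3: Aij = -1.0 * Li0 * U0j + 1.0 * Aij.
 * n is the leading dimension of the full matrix, b the block size. */
void update_blas(const double *li0, const double *u0j, double *aij, int n, int b) {
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                b, b, b,
                -1.0, li0, n,
                      u0j, n,
                 1.0, aij, n);
}
```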

  15. Practical on basic algorithms and multithread libraries
      Compare the execution time of versions of the LU:
      - Sequential, without and with blocks
      - Blocks, with the matrix multiplication done with different basic libraries (MKL, GotoBLAS and ATLAS)
      - Direct calls to the LU in MKL and PLASMA
      [Figure: speed-up of the different versions of the LU factorization with respect to the sequential implementation, on a NUMA system with 4 hexa-cores.]
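One way the "direct call to the LU" might be timed is through LAPACKE, the standard C interface to LAPACK that MKL also ships; LAPACKE_dgetrf computes the factorization with partial pivoting. The matrix size and timing details below are illustrative, not the ones used for the figure.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <lapacke.h>

int main(void) {
    int n = 2000;                                  /* illustrative size */
    double *a = malloc((size_t)n * n * sizeof(double));
    lapack_int *ipiv = malloc(n * sizeof(lapack_int));
    srand(1);
    for (long i = 0; i < (long)n * n; i++)
        a[i] = (double)rand() / RAND_MAX;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* LU with partial pivoting, as provided by the linked library. */
    lapack_int info = LAPACKE_dgetrf(LAPACK_ROW_MAJOR, n, n, a, n, ipiv);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("dgetrf info=%d, time=%.3f s\n", (int)info,
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9);
    free(a); free(ipiv);
    return 0;
}
```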

  16. GPU
      Basic concepts of GPU programming with CUDA (a course on Advanced Programming of Multicore Architectures follows in the second semester).
      No implementations of the LU for GPU; use of linear algebra libraries for GPU (CULA, CUBLAS, MAGMA).
      Load balancing CPU-GPU. Cost of data transfers.
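As one concrete example of using a library instead of implementing the LU on GPU, the sketch below calls the LU of cuSOLVER (our choice for illustration; the slides mention CULA, CUBLAS and MAGMA). Note how the host-device copies surround the call: this is the data-transfer cost the slide refers to. Error checking is omitted for brevity.

```c
#include <cuda_runtime.h>
#include <cusolverDn.h>

/* hA: n x n matrix on the host, column-major; factorized in place. */
void lu_gpu(double *hA, int n) {
    cusolverDnHandle_t h;
    cusolverDnCreate(&h);

    double *dA, *dWork;
    int *dIpiv, *dInfo, lwork;
    cudaMalloc((void **)&dA, (size_t)n * n * sizeof(double));
    cudaMalloc((void **)&dIpiv, n * sizeof(int));
    cudaMalloc((void **)&dInfo, sizeof(int));

    /* Cost of data transfers: H2D copy before, D2H copy after. */
    cudaMemcpy(dA, hA, (size_t)n * n * sizeof(double), cudaMemcpyHostToDevice);

    cusolverDnDgetrf_bufferSize(h, n, n, dA, n, &lwork);
    cudaMalloc((void **)&dWork, lwork * sizeof(double));
    cusolverDnDgetrf(h, n, n, dA, n, dWork, dIpiv, dInfo);

    cudaMemcpy(hA, dA, (size_t)n * n * sizeof(double), cudaMemcpyDeviceToHost);

    cudaFree(dA); cudaFree(dWork); cudaFree(dIpiv); cudaFree(dInfo);
    cusolverDnDestroy(h);
}
```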

  17. Shared-memory algorithms
      OpenMP versions reusing the ideas from the block algorithms.
      Multilevel parallelism:
      - two-level OpenMP routines
      - OpenMP + multithread libraries: different numbers of threads at BLAS level and at the higher level in MKL routines
      In the practical, study of the optimal number of OpenMP threads and library threads.
      [Figure: comparison of the execution time of different OpenMP+MKL versions.]
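A minimal sketch of the OpenMP+MKL two-level scheme studied in the practical: nt_omp threads at the block level and nt_mkl threads inside each dgemm. The exact controls can differ between MKL versions; mkl_set_dynamic(0) is used here so that MKL honors the requested thread count even inside a parallel region, and nested parallelism may also need to be enabled in the environment.

```c
#include <omp.h>
#include <mkl.h>

/* Step 4 updates of block step k with two levels of threads:
 * nt_omp threads over (i, j) block pairs, nt_mkl threads per dgemm.
 * nt_omp and nt_mkl are the tuning parameters studied in the practical. */
void update_two_level(double *a, int n, int b, int k, int nt_omp, int nt_mkl) {
    mkl_set_dynamic(0);               /* obey the requested thread counts */
    mkl_set_num_threads(nt_mkl);      /* threads at BLAS level */

    #pragma omp parallel for collapse(2) num_threads(nt_omp)
    for (int i = k + b; i < n; i += b)
        for (int j = k + b; j < n; j += b)
            /* Aij -= Li0 * U0j; each dgemm itself runs with nt_mkl threads */
            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, b, b, b,
                        -1.0, &a[i * n + k], n, &a[k * n + j], n,
                         1.0, &a[i * n + j], n);
}
```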

  18. Out-of-core algorithms
      Scientific problems with large memory requirements.
      Out-of-core linear algebra libraries. In/out (I/O) libraries.
      Algorithms for the out-of-core LU factorization.
      In the practical, out-of-core implementations and their combination with OpenMP.
      [Figure: comparison of the execution time of different out-of-core versions.]
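The out-of-core pattern can be sketched in a few lines of C: the matrix lives on disk as a sequence of b x b blocks and only the blocks involved in the current operation are held in main memory. The file layout (blocks stored in block-row order) and all names are our own illustrative choices.

```c
#include <stdio.h>
#include <stdlib.h>

/* The matrix is stored on disk as nb x nb square blocks of size b x b,
 * laid out contiguously in block-row order. */
static void read_block(FILE *f, double *blk, int nb, int i, int j, int b) {
    long off = ((long)i * nb + j) * b * b * sizeof(double);
    fseek(f, off, SEEK_SET);
    fread(blk, sizeof(double), (size_t)b * b, f);
}

static void write_block(FILE *f, const double *blk, int nb, int i, int j, int b) {
    long off = ((long)i * nb + j) * b * b * sizeof(double);
    fseek(f, off, SEEK_SET);
    fwrite(blk, sizeof(double), (size_t)b * b, f);
}

/* One Step 4 update Aij -= Li0 * U0j with only three blocks in memory. */
void update_ooc(FILE *f, int nb, int i, int j, int k, int b) {
    double *li0 = malloc((size_t)b * b * sizeof(double));
    double *u0j = malloc((size_t)b * b * sizeof(double));
    double *aij = malloc((size_t)b * b * sizeof(double));
    read_block(f, li0, nb, i, k, b);
    read_block(f, u0j, nb, k, j, b);
    read_block(f, aij, nb, i, j, b);
    for (int r = 0; r < b; r++)                   /* aij -= li0 * u0j */
        for (int c = 0; c < b; c++)
            for (int t = 0; t < b; t++)
                aij[r * b + c] -= li0[r * b + t] * u0j[t * b + c];
    write_block(f, aij, nb, i, j, b);
    free(li0); free(u0j); free(aij);
}
```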

  19. Message-passing algorithms
      Some basic ideas for the development of CPU+GPU and message-passing versions of the LU are discussed:
      - Message-passing linear algebra routines
      - Libraries for distributed systems (ScaLAPACK)
      - Distributed-memory LU factorization
      In the practical, combination of the paradigms studied with MPI to implement the LU for large matrices on a heterogeneous cluster with 52 cores and 10 GPUs:
      - One quad-core + 1 GeForce GPU (112 cores)
      - One NUMA system with 4 hexa-cores + 1 Kepler GPU (2048 cores)
      - Two hexa-cores, each with 1 GeForce GPU (512 cores)
      - One node with 2 hexa-cores + 4 GeForce GPUs (512 cores each) + 2 Tesla GPUs (448 cores each)
      There are many possible combinations; the students decide which to explore, depending on their interests and the possible application to their work for the Master's Thesis. A simplified MPI skeleton is sketched after this slide.
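A much-simplified skeleton of the message-passing version, for illustration only: block columns are dealt cyclically to the processes (a 1-D distribution, simpler than ScaLAPACK's 2-D block-cyclic one), there is no pivoting, and each process allocates the full matrix, using the non-owned part as scratch for the broadcast panel. The single-block helpers are those of the earlier sketch.

```c
#include <mpi.h>

/* Single-block helpers from the blocked-LU sketch. */
void lu_unblocked(double *a, int lda, int b);
void solve_lower(const double *l, double *a, int lda, int b);
void solve_upper(const double *u, double *a, int lda, int b);
void update(const double *l, const double *u, double *a, int lda, int b);

/* a: n x n row-major on every process; only block columns J with
 * J % size == rank hold valid data, the rest is scratch space. */
void lu_mpi(double *a, int n, int b, MPI_Comm comm) {
    int rank, size, nb = n / b;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    for (int K = 0; K < nb; K++) {
        int owner = K % size;
        double *akk = &a[(long)K * b * n + K * b];
        if (rank == owner) {
            lu_unblocked(akk, n, b);                          /* Step 1 */
            for (int i = K + 1; i < nb; i++)                  /* Step 3 */
                solve_upper(akk, &a[(long)i * b * n + K * b], n, b);
        }
        /* Broadcast panel column K (rows K*b..n-1, b doubles per row). */
        MPI_Datatype panel;
        MPI_Type_vector(n - K * b, b, n, MPI_DOUBLE, &panel);
        MPI_Type_commit(&panel);
        MPI_Bcast(akk, 1, panel, owner, comm);
        MPI_Type_free(&panel);

        for (int J = K + 1; J < nb; J++) {
            if (J % size != rank) continue;                   /* only my columns */
            solve_lower(akk, &a[(long)K * b * n + J * b], n, b);   /* Step 2 */
            for (int i = K + 1; i < nb; i++)                  /* Step 4 */
                update(&a[(long)i * b * n + K * b], &a[(long)K * b * n + J * b],
                       &a[(long)i * b * n + J * b], n, b);
        }
    }
}
```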
