ManyCore Computing: The Impact on Numerical Software for Linear Algebra Libraries




  1. ManyCore Computing: The Impact on Numerical Software for Linear Algebra Libraries. Jack Dongarra, Innovative Computing Laboratory, University of Tennessee; Oak Ridge National Laboratory; University of Manchester. 11/20/2007

  2. Performance Projection: Top500 Data. [Chart: Top500 performance from 1993 to 2009 on a log scale (100 MF/s through the PF/s range), with trend lines for SUM, N=1, and N=500.]

  3. What Will a Petascale System Look Like? Possible petascale system: 1. cores per node: 10-100; 2. performance per node: 100-1,000 GFlop/s; 3. number of nodes: 1,000-10,000; 4. inter-node latency: 1 μsec; 5. inter-node bandwidth: 10 GB/s; 6. memory per node: 10 GB. Part I: the first rule in linear algebra is to have an efficient DGEMM (motivated by items 2, 5, and 6). Part II: algorithms for multicore and latency-avoiding algorithms for LU, QR, ... (motivated by items 1, 2, and 4). Part III: algorithms for fault tolerance (motivated by items 1 and 3).
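Since the first rule above is to have an efficient DGEMM, the following minimal sketch shows what a single call to a tuned BLAS looks like from C. It assumes a CBLAS interface (OpenBLAS, MKL, ATLAS, ...) is installed and exposed through cblas.h; the matrix size and contents are arbitrary test values.

```c
/* Minimal DGEMM call: C = alpha*A*B + beta*C on column-major matrices.
 * Assumes a CBLAS implementation (OpenBLAS, MKL, ATLAS, ...) provides cblas.h.
 * Build (OpenBLAS example): cc dgemm_demo.c -lopenblas
 */
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>

int main(void) {
    const int n = 1024;                       /* square matrices for simplicity */
    double *A = malloc((size_t)n * n * sizeof *A);
    double *B = malloc((size_t)n * n * sizeof *B);
    double *C = malloc((size_t)n * n * sizeof *C);
    if (!A || !B || !C) return 1;

    for (int i = 0; i < n * n; i++) {         /* arbitrary test data */
        A[i] = 1.0 / (i + 1);
        B[i] = (i % 7) * 0.5;
        C[i] = 0.0;
    }

    /* C = 1.0*A*B + 0.0*C, all matrices n-by-n, leading dimension n */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);

    printf("C[0] = %f\n", C[0]);
    free(A); free(B); free(C);
    return 0;
}
```

On a node expected to deliver 100-1,000 GFlop/s, nearly all of that rate has to come from calls like this one, which is why much of the later discussion is about keeping most of the work inside such Level-3 BLAS kernels.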

  4. Major Changes to Software. We must rethink the design of our software: this is another disruptive technology, similar to what happened with cluster computing and message passing, and it means rethinking and rewriting the applications, algorithms, and software. Numerical libraries, for example, will change: both LAPACK and ScaLAPACK will undergo major changes to accommodate this.

  5. Coding for an Abstract Multicore. Parallel software for multicores should have two characteristics. Fine granularity: a high level of parallelism is needed, and cores will probably be associated with relatively small local memories; this requires splitting an operation into tasks that operate on small portions of data in order to reduce bus traffic and improve data locality. Asynchronicity: as the degree of TLP grows and the granularity of the operations becomes smaller, the presence of synchronization points in a parallel execution seriously affects the efficiency of an algorithm.
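As a small illustration of the fine-granularity point, the sketch below splits one large matrix multiply into independent tasks, each of which updates a single nb-by-nb tile of C and therefore touches only a small portion of the data. This is not PLASMA code; it is a sketch assuming OpenMP tasks and a CBLAS dgemm, plain column-major storage, and a dimension n divisible by the tile size nb.

```c
/* Sketch: split C = C + A*B into tile tasks, one per nb-by-nb tile of C.
 * Assumes column-major n-by-n matrices with n divisible by nb,
 * OpenMP (compile with -fopenmp) and a CBLAS implementation.
 */
#include <cblas.h>
#include <omp.h>

void tiled_gemm(int n, int nb, const double *A, const double *B, double *C)
{
    #pragma omp parallel
    #pragma omp single
    for (int i = 0; i < n; i += nb)
        for (int j = 0; j < n; j += nb) {
            /* Each task owns one tile of C; the tasks are independent of each other. */
            #pragma omp task firstprivate(i, j)
            for (int k = 0; k < n; k += nb)
                cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                            nb, nb, nb, 1.0,
                            &A[i + (size_t)k * n], n,   /* nb-by-nb tile of A */
                            &B[k + (size_t)j * n], n,   /* nb-by-nb tile of B */
                            1.0,
                            &C[i + (size_t)j * n], n);  /* nb-by-nb tile of C */
        }
}
```

A call such as tiled_gemm(4096, 256, A, B, C) creates 256 independent tile tasks that a runtime can execute in any order, with no synchronization among them.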

  6. ManyCore: Parallelism for the Masses. We are looking at the following concepts in designing the next numerical library implementation: dynamic data-driven execution; self adapting; block data layout; mixed precision in the algorithm; exploiting hybrid architectures; fault tolerant methods.
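Of these concepts, mixed precision is the easiest to show in a few lines: factor in fast single precision, then recover double-precision accuracy through iterative refinement. LAPACK already provides this for general systems as dsgesv; the sketch below assumes the LAPACKE C interface (lapacke.h) is available and solves a small, diagonally dominant test system with it.

```c
/* Sketch: mixed-precision solve of Ax = b with LAPACK's dsgesv
 * (single-precision LU plus double-precision iterative refinement).
 * Assumes the LAPACKE interface, e.g. cc mixed.c -llapacke -llapack -lblas
 */
#include <stdio.h>
#include <stdlib.h>
#include <lapacke.h>

int main(void) {
    const lapack_int n = 500, nrhs = 1;
    double *A = malloc((size_t)n * n * sizeof *A);
    double *b = malloc((size_t)n * sizeof *b);
    double *x = malloc((size_t)n * sizeof *x);
    lapack_int *ipiv = malloc((size_t)n * sizeof *ipiv);
    if (!A || !b || !x || !ipiv) return 1;

    /* Diagonally dominant test matrix so the system is well conditioned. */
    for (lapack_int j = 0; j < n; j++) {
        for (lapack_int i = 0; i < n; i++)
            A[i + j * n] = (i == j) ? 2.0 * n : 1.0 / (1.0 + i + j);
        b[j] = 1.0;
    }

    lapack_int iter;   /* >= 0: refinement steps used; < 0: fell back to double DGETRF */
    lapack_int info = LAPACKE_dsgesv(LAPACK_COL_MAJOR, n, nrhs,
                                     A, n, ipiv, b, n, x, n, &iter);
    printf("info = %d, refinement steps = %d, x[0] = %g\n",
           (int)info, (int)iter, x[0]);

    free(A); free(b); free(x); free(ipiv);
    return 0;
}
```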

  7. A New Generation of Software. Algorithms follow hardware evolution in time: LINPACK (70's) relies on Level-1 BLAS operations (vector operations); LAPACK (80's) relies on Level-3 BLAS operations (blocking, cache friendly); ScaLAPACK (90's) relies on PBLAS and message passing (distributed memory); PLASMA (00's) relies on new, many-core friendly algorithms built on a DAG scheduler, block data layout, and some extra kernels. These new algorithms have a very low granularity, so they scale very well (multicore, petascale computing, ...); they remove a lot of dependencies among the tasks (multicore, distributed computing); they avoid latency (distributed computing, out-of-core); and they rely on fast kernels. They need new kernels and rely on efficient scheduling algorithms.

  8. A New Generation of Software: Parallel Linear Algebra Software for Multicore Architectures (PLASMA). [Same hardware-evolution summary as slide 7, now under the PLASMA title: PLASMA relies on new, many-core friendly algorithms built on a DAG scheduler and block data layout.]

  9. Developing Parallel Algorithms. [Diagram: where the parallelism lives. Either sequential LAPACK runs on top of a threaded BLAS, or LAPACK itself is parallelized with PThreads or OpenMP on top of a sequential BLAS.]

  10. Steps in the LAPACK LU. DGETF2 (LAPACK): factor a panel. DLASWP (LAPACK): backward swap. DLASWP (LAPACK): forward swap. DTRSM (BLAS): triangular solve. DGEMM (BLAS): matrix multiply.
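To make the five steps concrete, here is a plain, fork-join style sketch of the blocked right-looking LU written against LAPACKE and CBLAS: each pass factors a panel, applies the row interchanges to the columns on either side of it, does the triangular solve for the block row of U, and then the DGEMM update of the trailing submatrix. This is a simplification rather than the LAPACK source: LAPACKE_dgetrf stands in for DGETF2 on the panel, the matrix is assumed square and column-major with lda = n, and error checking is omitted.

```c
/* Sketch of the blocked right-looking LU from slide 10, using LAPACKE/CBLAS.
 * LAPACKE_dgetrf stands in for DGETF2 on the panel. Column-major, n-by-n, lda = n.
 */
#include <cblas.h>
#include <lapacke.h>

void blocked_lu(int n, int nb, double *A, lapack_int *ipiv)
{
    for (int j = 0; j < n; j += nb) {
        int jb = (nb < n - j) ? nb : n - j;

        /* 1. Factor the current panel A(j:n-1, j:j+jb-1).            (DGETF2) */
        LAPACKE_dgetrf(LAPACK_COL_MAJOR, n - j, jb, &A[j + (size_t)j * n], n, &ipiv[j]);
        for (int i = j; i < j + jb; i++)
            ipiv[i] += j;                     /* make the 1-based pivots global */

        /* 2. Apply the panel's row swaps to the columns on the left.  (DLASWP) */
        if (j > 0)
            LAPACKE_dlaswp(LAPACK_COL_MAJOR, j, A, n, j + 1, j + jb, ipiv, 1);

        if (j + jb < n) {
            /* 3. Apply the row swaps to the columns on the right.     (DLASWP) */
            LAPACKE_dlaswp(LAPACK_COL_MAJOR, n - j - jb,
                           &A[(size_t)(j + jb) * n], n, j + 1, j + jb, ipiv, 1);

            /* 4. Triangular solve for the block row of U.             (DTRSM)  */
            cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower, CblasNoTrans, CblasUnit,
                        jb, n - j - jb, 1.0,
                        &A[j + (size_t)j * n], n,
                        &A[j + (size_t)(j + jb) * n], n);

            /* 5. Rank-jb update of the trailing submatrix.            (DGEMM)  */
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                        n - j - jb, n - j - jb, jb, -1.0,
                        &A[(j + jb) + (size_t)j * n], n,
                        &A[j + (size_t)(j + jb) * n], n,
                        1.0, &A[(j + jb) + (size_t)(j + jb) * n], n);
        }
    }
}
```

Every call in this loop can run on top of a multithreaded BLAS, which is precisely the fork-join picture the next slides criticize: each step ends in a synchronization point before the next one starts.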

  11. LU Timing Profile (4-core system). [Chart: time spent in each component (DGETF2, DLASWP left, DLASWP right, DTRSM, DGEMM) with threads and no lookahead; the execution proceeds in bulk-synchronous phases.]

  12. Adaptive Lookahead - Dynamic. Reorganizing algorithms to use this approach: event-driven multithreading.

  13. Fork-Join vs. Dynamic Execution. [Diagram: a fork-join trace using a parallel BLAS, with tasks laid out along the time axis and a synchronization point after each step.] Experiments on Intel's quad-core Clovertown with 2 sockets / 8 threads.

  14. Fork-Join vs. Dynamic Execution. [Diagram: the same fork-join trace compared with DAG-based dynamic scheduling, showing the time saved.] Experiments on Intel's quad-core Clovertown with 2 sockets / 8 threads.

  15. Achieving Asynchronicity. The matrix factorization can be represented as a DAG: nodes are tasks that operate on “tiles”; edges are dependencies among tasks. Tasks can be scheduled asynchronously and in any order as long as dependencies are not violated.

  16. Achieving Asynchronicity. A critical path can be defined as the shortest path that connects all the nodes with the highest number of outgoing edges; tasks along this path are given scheduling priority.

  17. Achieving Asynchronicity. Very fine granularity; few dependencies, i.e., high flexibility for the scheduling of tasks; asynchronous scheduling with no idle times; some degree of adaptivity; better locality thanks to block data layout.

  18. Cholesky Factorization: DAG-based Dependency Tracking. [Diagram: the tiles of a 4x4 tile matrix, indexed (1,1) through (4,4), with the dependencies among the tasks that touch them.] Dependencies expressed by the DAG are enforced on a tile basis: fine-grained parallelization and flexible scheduling.
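The tile-level dependency tracking can be sketched with OpenMP task depend clauses standing in for PLASMA's own scheduler: each POTRF/TRSM/SYRK/GEMM on an nb-by-nb tile becomes a task, and the in/inout clauses encode the DAG edges, so the runtime may execute tasks asynchronously in any order that respects them. The tile storage convention, the fixed sizes, and the use of OpenMP rather than a dedicated DAG scheduler are all assumptions of this sketch.

```c
/* Sketch: tile Cholesky (lower) with DAG-style dependency tracking via OpenMP tasks.
 * A is stored as NT x NT tiles of NB x NB doubles, each tile column-major.
 * Assumes LAPACKE + CBLAS and a compiler with OpenMP task depend support (-fopenmp).
 */
#include <cblas.h>
#include <lapacke.h>

#define NT 8     /* tiles per dimension */
#define NB 128   /* tile size           */

void tile_cholesky(double A[NT][NT][NB * NB])
{
    #pragma omp parallel
    #pragma omp single
    for (int k = 0; k < NT; k++) {
        /* Factor the diagonal tile. */
        #pragma omp task depend(inout: A[k][k][0])
        LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', NB, A[k][k], NB);

        /* Panel: solve each tile below the diagonal against L(k,k). */
        for (int i = k + 1; i < NT; i++) {
            #pragma omp task depend(in: A[k][k][0]) depend(inout: A[i][k][0])
            cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans, CblasNonUnit,
                        NB, NB, 1.0, A[k][k], NB, A[i][k], NB);
        }

        /* Trailing update: each tile task depends only on the tiles it reads and writes. */
        for (int i = k + 1; i < NT; i++) {
            #pragma omp task depend(in: A[i][k][0]) depend(inout: A[i][i][0])
            cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans,
                        NB, NB, -1.0, A[i][k], NB, 1.0, A[i][i], NB);

            for (int j = k + 1; j < i; j++) {
                #pragma omp task depend(in: A[i][k][0], A[j][k][0]) depend(inout: A[i][j][0])
                cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                            NB, NB, NB, -1.0, A[i][k], NB, A[j][k], NB,
                            1.0, A[i][j], NB);
            }
        }
    }
}
```

In a driver, the tile array would be allocated statically or on the heap and filled with a symmetric positive definite matrix in block layout; each depend item is the first element of a tile, which is enough for the runtime to match readers and writers of the same tile.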

  19. Cholesky on the IBM Cell. Pipelining: between loop iterations. Double buffering: within BLAS, between BLAS, and between loop iterations. Result: minimum load imbalance, minimum dependency stalls, minimum memory stalls (no waiting for data). Achieves 174 Gflop/s, 85% of peak in single precision.

  20. Cholesky - Using 2 Cell Chips.

  21. Parallelism in LAPACK: Blocked Storage. [Diagram: column-major layout.]

  22. Parallelism in LAPACK: Blocked Storage. [Diagram: column-major layout vs. blocked layout.]

  23. Parallelism in LAPACK: Blocked Storage. [Diagram: column-major layout vs. blocked layout, continued.]

  24. Parallelism in LAPACK: Blocked Storage. The use of blocked storage can significantly improve performance. [Chart: blocking speedup for DGEMM and DTRSM versus block size (0-256), with speedups approaching 2x.]
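For reference, the sketch below shows one way to repack a column-major matrix into the blocked layout the chart argues for, so that each nb-by-nb tile becomes contiguous in memory. The layout convention used here (tiles stored tile-column by tile-column, column-major inside each tile) and the requirement that nb divides n are assumptions of the sketch, not a description of PLASMA's actual format.

```c
/* Sketch: repack a column-major n-by-n matrix into contiguous nb-by-nb tiles.
 * Tiles are laid out tile-column by tile-column; inside a tile the storage is
 * column-major. Assumes n is a multiple of nb.
 */
#include <stdlib.h>
#include <string.h>

double *to_blocked(const double *A, int n, int nb)
{
    int nt = n / nb;                                   /* tiles per dimension */
    double *T = malloc((size_t)n * n * sizeof *T);
    if (!T) return NULL;

    for (int tj = 0; tj < nt; tj++)                    /* tile column */
        for (int ti = 0; ti < nt; ti++) {              /* tile row    */
            double *tile = T + ((size_t)tj * nt + ti) * nb * nb;
            for (int j = 0; j < nb; j++)               /* copy one column of the tile */
                memcpy(tile + (size_t)j * nb,
                       A + (size_t)(tj * nb + j) * n + ti * nb,
                       (size_t)nb * sizeof *A);
        }
    return T;
}
```

Once tiles are contiguous, kernels such as DGEMM and DTRSM operate on cache-friendly blocks with leading dimension nb, which is where the blocking speedups in the chart come from.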
