Five Important Features to Consider When Computing at Scale, Jack Dongarra - PowerPoint PPT Presentation

  1. Five Important Features to Consider When Computing at Scale. Jack Dongarra, University of Tennessee, Oak Ridge National Laboratory, University of Manchester. 2/13/2009

  2. 10 Fastest Computers
     Rank | Site | Computer | Country | Procs/Cores | Rmax [Tflop/s] | Rmax/Rpeak | Power [MW] | MF/W
     1  | DOE/NNSA/LANL | IBM / Roadrunner - BladeCenter QS22/LS21 | USA | 129600 | 1105.0 | 76% | 2.48 | 445
     2  | DOE/Oak Ridge National Laboratory | Cray / Jaguar - Cray XT5 QC 2.3 GHz | USA | 150152 | 1059.0 | 77% | 6.95 | 152
     3  | NASA/Ames Research Center/NAS | SGI / Pleiades - SGI Altix ICE 8200EX | USA | 51200 | 487.0 | 80% | 2.09 | 233
     4  | DOE/NNSA/LLNL | IBM / eServer Blue Gene Solution | USA | 212992 | 478.2 | 80% | 2.32 | 205
     5  | DOE/Argonne National Laboratory | IBM / Blue Gene/P Solution | USA | 163840 | 450.3 | 81% | 1.26 | 357
     6  | NSF/Texas Advanced Computing Center/Univ. of Texas | Sun / Ranger - SunBlade x6420 | USA | 62976 | 433.2 | 75% | 2.0 | 217
     7  | DOE/NERSC/LBNL | Cray / Franklin - Cray XT4 | USA | 38642 | 266.3 | 75% | 1.15 | 232
     8  | DOE/Oak Ridge National Laboratory | Cray / Jaguar - Cray XT4 | USA | 30976 | 205.0 | 79% | 1.58 | 130
     9  | DOE/NNSA/Sandia National Laboratories | Cray / Red Storm - XT3/4 | USA | 38208 | 204.2 | 72% | 2.5 | 81
     10 | Shanghai Supercomputer Center | Dawning 5000A, Windows HPC 2008 | China | 30720 | 180.6 | 77% | - | -

  3. Numerical Linear Algebra Library
     • Interested in developing a numerical library for the fastest, largest computer platforms for scientific computing.
     • Today we have machines with 100K processors (cores), going to 1M in the next generation.
     • Many important issues must be addressed in the design of algorithms and software.

  4. Five Important Features to Consider When Computing at Scale
     • Effective use of many-core and hybrid architectures
       - Dynamic data-driven execution
       - Block data layout
     • Exploiting mixed precision in the algorithms
       - Single precision is 2x faster than double precision
       - With GP-GPUs, 10x
       - (See the sketch after this list.)
     • Self-adapting / auto-tuning of software
       - Too hard to do by hand
     • Fault-tolerant algorithms
       - With 100K - 1M cores, things will fail
     • Communication-avoiding algorithms
       - For dense computations, from O(n log p) to O(log p) communications
       - s-step GMRES: compute (x, Ax, A^2 x, ..., A^s x)
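
     The mixed-precision bullet is usually realized as iterative refinement: do the expensive factorization in the fast lower precision and recover full accuracy with cheap double-precision residuals. The following is a minimal, self-contained sketch of that idea, not code from the talk; plain Gaussian elimination stands in for an optimized sgetrf/sgetrs, and the 3x3 system, iteration count, and tolerance are made up for illustration.

     /* Sketch: mixed-precision iterative refinement.
        Factor/solve in SINGLE precision (the fast path), then refine the
        solution with residuals computed in DOUBLE precision. */
     #include <math.h>
     #include <stdio.h>

     #define N 3

     /* Solve A*x = b in single precision with plain Gaussian elimination
        (a stand-in for an optimized single-precision factor/solve). */
     static void solve_single(const double A[N][N], const double b[N], double x[N])
     {
         float a[N][N], rhs[N];
         for (int i = 0; i < N; i++) {
             rhs[i] = (float)b[i];
             for (int j = 0; j < N; j++) a[i][j] = (float)A[i][j];
         }
         for (int k = 0; k < N; k++) {                 /* forward elimination */
             for (int i = k + 1; i < N; i++) {
                 float m = a[i][k] / a[k][k];
                 for (int j = k; j < N; j++) a[i][j] -= m * a[k][j];
                 rhs[i] -= m * rhs[k];
             }
         }
         for (int i = N - 1; i >= 0; i--) {            /* back substitution */
             float s = rhs[i];
             for (int j = i + 1; j < N; j++) s -= a[i][j] * (float)x[j];
             x[i] = s / a[i][i];
         }
     }

     int main(void)
     {
         const double A[N][N] = {{4, 1, 0}, {1, 4, 1}, {0, 1, 4}};  /* made-up system */
         const double b[N]    = {1, 2, 3};
         double x[N] = {0};

         solve_single(A, b, x);                        /* low-precision solve */

         for (int it = 0; it < 5; it++) {              /* refinement loop */
             double r[N], d[N] = {0};
             for (int i = 0; i < N; i++) {             /* r = b - A*x in double */
                 r[i] = b[i];
                 for (int j = 0; j < N; j++) r[i] -= A[i][j] * x[j];
             }
             solve_single(A, r, d);                    /* correction in single */
             double nrm = 0.0;
             for (int i = 0; i < N; i++) { x[i] += d[i]; nrm += r[i] * r[i]; }
             printf("iter %d  ||r||_2 = %.3e\n", it, sqrt(nrm));
             if (sqrt(nrm) < 1e-14) break;             /* illustrative tolerance */
         }
         return 0;
     }

     The O(n^3) work runs at single-precision speed; each refinement step costs only an O(n^2) residual and a cheap triangular solve, which is what makes the 2x (or 10x on GP-GPUs) speedup usable without giving up double-precision results.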

  5.–8. A New Generation of Software: Parallel Linear Algebra Software for Multicore Architectures (PLASMA)
     Software/Algorithms follow hardware evolution in time:
     • LINPACK (70's), vector operations: relies on Level-1 BLAS operations
     • LAPACK (80's), blocking, cache friendly: relies on Level-3 BLAS operations
     • ScaLAPACK (90's), distributed memory: relies on PBLAS message passing
     • PLASMA (00's), new algorithms (many-core friendly): relies on a DAG/scheduler, block data layout, and some extra kernels
     These new algorithms:
     • have a very low granularity, so they scale very well (multicore, petascale computing, ...)
     • remove a lot of the dependencies among the tasks (multicore, distributed computing)
     • avoid latency (distributed computing, out-of-core)
     • rely on fast kernels
     These new algorithms need new kernels and rely on efficient scheduling algorithms.

  9. Major Changes to Software
     • Must rethink the design of our software
       - Another disruptive technology, similar to what happened with cluster computing and message passing
     • Rethink and rewrite the applications, algorithms, and software
     • Numerical libraries will change
       - For example, both LAPACK and ScaLAPACK will undergo major changes to accommodate this

  10. LAPACK and ScaLAPACK
     • ScaLAPACK (global addressing, distributed parallelism): built on the PBLAS and the BLACS, with message passing (MPI, PVM, ...)
     • LAPACK (local addressing): built on threaded BLAS (PThreads, OpenMP)
     • About 1 million lines of code
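
     To make the layering concrete, here is a hedged example of the LAPACK side of that stack through the LAPACKE C interface: the user calls one routine and any parallelism comes from the threaded BLAS it sits on (assuming a multithreaded BLAS is linked). The 3x3 system is made up.

     /* One LAPACK call; internally it is Level-3 BLAS work (DGETRF + DGETRS). */
     #include <stdio.h>
     #include <lapacke.h>

     int main(void)
     {
         double A[3 * 3] = { 4, 1, 0,      /* column-major, made-up system */
                             1, 4, 1,
                             0, 1, 4 };
         double b[3] = { 1, 2, 3 };
         lapack_int ipiv[3];

         lapack_int info = LAPACKE_dgesv(LAPACK_COL_MAJOR, 3, 1, A, 3, ipiv, b, 3);
         if (info == 0)
             printf("x = [%f, %f, %f]\n", b[0], b[1], b[2]);
         return 0;
     }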

  11. Steps in the LAPACK LU
     • DGETF2 (LAPACK): factor a panel
     • DLASWP (LAPACK): backward swap
     • DLASWP (LAPACK): forward swap
     • DTRSM (BLAS): triangular solve
     • DGEMM (BLAS): matrix multiply
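
     These five routines are the building blocks of LAPACK's blocked right-looking LU (DGETRF). As a reference point for the fork-join profile on the next slide, here is a hedged sketch of that loop in C using the LAPACKE/CBLAS interfaces; the matrix size and block size are illustrative, error handling (info codes) is omitted, and this is not LAPACK's or PLASMA's actual source.

     #include <stdio.h>
     #include <stdlib.h>
     #include <lapacke.h>
     #include <cblas.h>

     static void blocked_lu(int n, double *A, int lda, lapack_int *ipiv, int nb)
     {
         for (int k = 0; k < n; k += nb) {
             int jb = (nb < n - k) ? nb : n - k;

             /* DGETF2: factor the current panel A(k:n-1, k:k+jb-1) */
             LAPACKE_dgetf2(LAPACK_COL_MAJOR, n - k, jb,
                            &A[k + (size_t)k * lda], lda, &ipiv[k]);

             /* pivots come back relative to the panel; shift to global rows */
             for (int i = k; i < k + jb; i++) ipiv[i] += k;

             /* DLASWP (backward swap): apply new pivots to columns left of the panel */
             if (k > 0)
                 LAPACKE_dlaswp(LAPACK_COL_MAJOR, k, A, lda, k + 1, k + jb, ipiv, 1);

             if (k + jb < n) {
                 /* DLASWP (forward swap): apply pivots to columns right of the panel */
                 LAPACKE_dlaswp(LAPACK_COL_MAJOR, n - k - jb,
                                &A[(size_t)(k + jb) * lda], lda, k + 1, k + jb, ipiv, 1);

                 /* DTRSM: U12 = L11^{-1} A12 (triangular solve on the block row) */
                 cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower, CblasNoTrans, CblasUnit,
                             jb, n - k - jb, 1.0,
                             &A[k + (size_t)k * lda], lda,
                             &A[k + (size_t)(k + jb) * lda], lda);

                 /* DGEMM: A22 -= L21 * U12 (trailing-matrix update, most of the flops) */
                 cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                             n - k - jb, n - k - jb, jb, -1.0,
                             &A[(k + jb) + (size_t)k * lda], lda,
                             &A[k + (size_t)(k + jb) * lda], lda, 1.0,
                             &A[(k + jb) + (size_t)(k + jb) * lda], lda);
             }
         }
     }

     int main(void)
     {
         enum { N = 8, NB = 3 };                /* illustrative sizes */
         double A[N * N];
         lapack_int ipiv[N];
         srand(1);
         for (int i = 0; i < N * N; i++) A[i] = (double)rand() / RAND_MAX;
         blocked_lu(N, A, N, ipiv, NB);
         printf("U(0,0) = %f, last pivot row = %d\n", A[0], (int)ipiv[N - 1]);
         return 0;
     }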

  12. LU Timing Profile (4-processor system)
     [Figure: time spent in each component (DGETF2, DLASWP(L), DLASWP(R), DTRSM, DGEMM) with threads and no lookahead; execution proceeds in bulk synchronous phases.]

  13. Adaptive Lookahead - Dynamic Event-Driven Multithreading
     • The ideas are not new; many papers use the DAG approach.
     • Reorganizing algorithms to use this approach.

  14.–15. Achieving Fine Granularity
     • Fine granularity may require novel data formats to overcome the limitations of the BLAS on small chunks of data.
     • [Figure: column-major layout vs. blocked (tile) layout.]
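
     To make "blocked" concrete, here is a small, hypothetical repacking routine that turns a column-major matrix into contiguous NB x NB tiles, the kind of block data layout these slides refer to. The function names and sizes are illustrative, not PLASMA's API, and the dimensions are assumed divisible by NB to keep the sketch short.

     #include <stdio.h>
     #include <stdlib.h>

     /* Copy column-major A (lda >= m) into tiles[]: tile (ti,tj) occupies a
        contiguous nb*nb block, itself stored column-major, so each tile
        kernel touches one contiguous chunk of memory. */
     static void colmajor_to_tiles(int m, int n, int nb,
                                   const double *A, int lda, double *tiles)
     {
         int mt = m / nb, nt = n / nb;              /* number of tile rows/cols */
         for (int tj = 0; tj < nt; tj++)
             for (int ti = 0; ti < mt; ti++) {
                 double *tile = tiles + ((size_t)(tj * mt + ti)) * nb * nb;
                 for (int j = 0; j < nb; j++)
                     for (int i = 0; i < nb; i++)
                         tile[i + j * nb] = A[(ti * nb + i) + (size_t)(tj * nb + j) * lda];
             }
     }

     int main(void)
     {
         enum { M = 8, N = 8, NB = 4 };
         double A[M * N], tiles[M * N];
         for (int i = 0; i < M * N; i++) A[i] = (double)i;  /* column-major fill */
         colmajor_to_tiles(M, N, NB, A, M, tiles);
         /* element (row 5, col 6) lives in tile (1,1), local position (1,2) */
         printf("A(5,6)=%g  tile copy=%g\n", A[5 + 6 * M],
                tiles[((size_t)(1 * (M / NB) + 1)) * NB * NB + 1 + 2 * NB]);
         return 0;
     }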

  16. PLASMA (Redesign of LAPACK/ScaLAPACK)
     Parallel Linear Algebra Software for Multicore Architectures
     • Asynchronicity: avoid fork-join (bulk synchronous design)
     • Dynamic scheduling: out-of-order execution
     • Fine granularity: independent block operations
     • Locality of reference: data storage, block data layout
     Led by Tennessee and Berkeley, similar to LAPACK/ScaLAPACK, as a community effort.

  17. Intel's Clovertown Quad Core
     Three implementations of LU factorization, on quad-core with 2 sockets per board (8 threads):
     1. LAPACK (BLAS fork-join parallelism)
     2. ScaLAPACK (message passing, using memory copy)
     3. DAG-based (dynamic scheduling)
     [Figure: 8-core experiments, Mflop/s vs. problem size from 1000 to 15000.]

  18. If We Had a Small Matrix Problem
     • We would generate the DAG, find the critical path, and execute it.
     • The DAG is too large to generate ahead of time
       - Do not generate it explicitly
       - Dynamically generate the DAG as we go
     • Machines will have a large number of cores in a distributed fashion
       - Will have to engage in message passing
       - Distributed management
       - Locally, have a run-time system

  19. The DAGs are Large
     • Here is the DAG for a factorization of a 20 x 20 matrix [figure].
     • For a large matrix, say of size O(10^6), the DAG is huge.
     • Many challenges for the software.

  20. Each Node or Core Will Have a Run-Time System
     • BIN 1: some dependencies satisfied, waiting for all dependencies
     • BIN 2: all dependencies satisfied, some data delivered, waiting for all data
     • BIN 3: all data delivered, waiting for execution
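
     A minimal sketch of what such per-core bookkeeping could look like, with the three bins as task states. The types, names, and counters are hypothetical, not from an actual run-time system; a real scheduler would also track the DAG edges and the execution queue.

     #include <stdio.h>

     enum bin { BIN1_WAIT_DEPS, BIN2_WAIT_DATA, BIN3_READY };

     struct task {
         int deps_needed, deps_done;     /* dependency counters */
         int data_needed, data_recv;     /* input-data counters */
         enum bin bin;
     };

     /* Re-evaluate which bin a task belongs in after an event arrives. */
     static void update_bin(struct task *t)
     {
         if (t->deps_done < t->deps_needed)      t->bin = BIN1_WAIT_DEPS;
         else if (t->data_recv < t->data_needed) t->bin = BIN2_WAIT_DATA;
         else                                    t->bin = BIN3_READY;
     }

     static void dependency_satisfied(struct task *t) { t->deps_done++; update_bin(t); }
     static void data_delivered(struct task *t)       { t->data_recv++; update_bin(t); }

     int main(void)
     {
         struct task t = { .deps_needed = 2, .data_needed = 1, .bin = BIN1_WAIT_DEPS };
         dependency_satisfied(&t);       /* still BIN 1 */
         dependency_satisfied(&t);       /* all dependencies satisfied -> BIN 2 */
         data_delivered(&t);             /* all data delivered -> BIN 3 */
         printf("final bin = %d (2 == BIN3_READY)\n", (int)t.bin);
         return 0;
     }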

  21. Some Questions
     • What's the best way to represent the DAG?
     • What's the best approach to dynamically generating the DAG?
     • What run-time system should we use?
       - We will probably build something that we would target to the underlying system's RTS.
       - Per node or per core?
     • What about work stealing?
       - Can we do better than nearest-neighbor work stealing?
     • What does the program look like?
       - Experimenting with SMPSs, Cilk, Charm++, UPC, Intel Threads
       - We would like to reuse as much of the existing software as possible
       - For software reuse, looking at a set of Task-BLAS that work with an RTS
