Lecture 18: Large-scale computing
CSE 260 – Parallel Computation (Fall 2015)
Scott B. Baden
Announcements
• Office hours on Wednesday
  - 3:30 PM until 5:30 PM
  - I’ll stay after 5:30 until the last one leaves
• Test on the last day of class
  - BRING A BLUE BOOK
  - Tests your ability to apply the knowledge you’ve gained in the course
  - Open book, open notes
  - You may bring a PDF viewer (e.g. Preview, Acrobat) to look at course materials only
    • No web browsing - turn off the internet
    • No cell phones
Scott B. Baden / CSE 260, UCSD / Fall '15 2
Today’s lecture
• Supercomputers
• Architectures
• Applications
Scott B. Baden / CSE 260, UCSD / Fall '15 3
What is the purpose of a supercomputer?
• Improve our understanding of scientifically and technologically important phenomena
• Improve the quality of life through technological innovation, simulations, data processing
  - Data mining
  - Image processing
  - Simulations: financial modeling, weather, biomedical
• Economic benefits
Scott B. Baden / CSE 260, UCSD / Fall '15 4
What is the world’s fastest supercomputer?
• Top500 #1 (>2 years): Tianhe-2 @ NUDT (China)
  3.12M cores, 54.8 Pflops peak, 17.8 MW power + 6 MW cooling,
  12-core Ivy Bridge + Intel Phi
• #2: Titan @ Oak Ridge, USA
  561K cores, 27 PF, 8.2 MW, Cray XK7: AMD Opteron + NVIDIA Kepler K20x
top500.org
Scott B. Baden / CSE 260, UCSD / Fall '15 5
What does a supercomputer look like?
• Hierarchically organized servers
• Hybrid communication
  - Threads within the server
  - Message passing between servers (or among groups of cores)
Edison @ nersc.gov
conferences.computer.org/sc/2012/papers/1000a079.pdf
Scott B. Baden / CSE 260, UCSD / Fall '15 6
State-of-the-art applications
• Blood flow simulation on Jaguar (Georgia Tech team)
• Ab Initio Molecular Dynamics (AIMD) using Plane-Wave Density Functional Theory, Eric Bylaska (PNNL), on Hopper

Strong scaling (exchange time):
  p          :    48     384    3072   24576
  Time (sec) : 899.8   116.7    16.7     4.9
  Efficiency :  1.00    0.96    0.84    0.35

Weak scaling:
  p          : 24576   98304  196608
  Time (sec) : 228.3     258   304.9
  Efficiency :  1.00    0.88    0.75

Slide courtesy Tan Nguyen, UCSD
Scott B. Baden / CSE 260, UCSD / Fall '15 7
Performance differs across application domains
• Colella’s 7 dwarfs: patterns of communication and computation that persist over time and across implementations
  - Structured grids
    • Panfilov method
  - Dense linear algebra: C[i,j] += A[i,:] * B[:,j]
    • Matrix multiply, matrix-vector multiply, Gaussian elimination
  - N-body methods
  - Sparse linear algebra
    • With a sparse matrix, use knowledge about the locations of the non-zeroes to improve some aspect of performance
  - Unstructured grids
  - Spectral methods (FFT)
  - Monte Carlo
Scott B. Baden / CSE 260, UCSD / Fall '15 8
Application-specific knowledge is important
• There is currently no tool that can convert a serial program into an efficient parallel program
  ... for all applications ... all of the time ... on all hardware
• The more we know about the application ...
  ... the specific problem ... the math/physics ... the initial data ...
  ... the context for analyzing the output ...
  ... the more we can improve performance
• Performance programming issues
  - Data motion and locality
  - Load balancing
  - Serial sections
Scott B. Baden / CSE 260, UCSD / Fall '15 9
Sparse Matrices
• A matrix where we employ knowledge about the location of the non-zeroes
• Consider Jacobi’s method with a 5-point stencil:
  u’[i,j] = (u[i-1,j] + u[i+1,j] + u[i,j-1] + u[i,j+1] - h² f[i,j]) / 4
Scott B. Baden / CSE 260, UCSD / Fall '15 10
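A minimal C sketch of one sweep of this 5-point Jacobi update (the function name, the (N+2)x(N+2) ghost-cell layout, and the OpenMP pragma are illustrative assumptions, not taken from the slide):

// One Jacobi sweep: reads u, writes unew. u and unew are (N+2)x(N+2)
// grids with a one-cell boundary layer; f is the source term, h the mesh spacing.
void jacobi_sweep_2d(int N, double h, double **u, double **unew, double **f)
{
    const double h2 = h * h;
    #pragma omp parallel for
    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= N; j++)
            unew[i][j] = (u[i-1][j] + u[i+1][j] +
                          u[i][j-1] + u[i][j+1] - h2 * f[i][j]) / 4.0;
}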
Web connectivity matrix: 1M x 1M
• A 1M x 1M submatrix of the web connectivity graph, constructed from an archive at the Stanford WebBase
• 3 non-zeroes per row
• Dense storage: 2²⁰ × 2²⁰ = 2⁴⁰ = 1024 Gwords
• Sparse storage: (3/2²⁰) × 2⁴⁰ = 3 Mwords
• The sparse representation saves a factor of ~1 million in storage
Jim Demmel
Scott B. Baden / CSE 260, UCSD / Fall '15 11
Circuit Simulation
• Motorola circuit: 170,998 × 170,998 matrix
• 958,936 non-zeroes (.003% non-zero, 5.6 non-zeroes per row)
www.cise.ufl.edu/research/sparse/matrices/Hamm/scircuit.html
Scott B. Baden / CSE 260, UCSD / Fall '15 12
Generating sparse matrices from unstructured grids
• In some applications of sparse matrices, we generate the matrix from an “unstructured” mesh, e.g. the finite element method
• In some cases we apply direct mesh updates, using nearest neighbors
• Irregular partitioning
[Figure: a 2D airfoil mesh]
Scott B. Baden / CSE 260, UCSD / Fall '15 13
Sparse Matrix Vector Multiplication
• An important kernel used in linear algebra
• Assume x[] fits in the memory of one processor
• y[i] += A[i,j] × x[j]
• Many formats exist; a common format for CPUs is Compressed Sparse Row (CSR)
Jim Demmel
Scott B. Baden / CSE 260, UCSD / Fall '15 14
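A brief sketch of the CSR layout used by the kernel on the next slide; the struct name and the small example matrix are made up for illustration:

/* CSR stores only the non-zeroes of an n x n matrix in three arrays. */
typedef struct {
    int     n;    /* number of rows                               */
    int    *ptr;  /* length n+1: offset where each row starts     */
    int    *ind;  /* length nnz: column index of each value       */
    double *val;  /* length nnz: the non-zero values, row by row  */
} csr_matrix;

/* Example (made up): the 3x3 matrix
       [ 10  0  2 ]
       [  0  5  0 ]
       [  1  0  7 ]
   is stored as
       val = {10, 2, 5, 1, 7}
       ind = { 0, 2, 1, 0, 2}
       ptr = { 0, 2, 3, 5}   // row i occupies entries ptr[i] .. ptr[i+1]-1 */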
Sparse matrix vector multiply kernel
// y[i] += A[i,j] × x[j], with A stored in CSR format (ptr, ind, val)
#pragma omp parallel for schedule(dynamic, chunk)
for (int i = 0; i < N; i++) {        // rows
    int i0 = ptr[i];
    int i1 = ptr[i+1];
    for (int j = i0; j < i1; j++)    // non-zeroes of row i
        y[i] += val[j] * x[ind[j]];
}
Scott B. Baden / CSE 260, UCSD / Fall '15 15
Up and beyond to Exascale
• In 1961, President Kennedy mandated a landing on the Moon by the end of the decade
• July 20, 1969, at Tranquility Base: “The Eagle has landed”
• The US government set an ambitious schedule to reach 10¹⁸ flops by ~2023
• DOE is taking the lead in the US; the EU is also engaged
• Massive technical challenges
Scott B. Baden / CSE 260, UCSD / Fall '15 16
The challenges to landing “Eagle”
• High levels of parallelism within and across nodes
  - 10¹⁸ flops using NVIDIA devices @ 10¹² flops each
  - 10⁶ devices, 10⁹+ threads
• Power: ≤ 20 MW. Today: 18 MW @ 0.05 Exaflops
  - Power consumption: 1-2 nJ/op today → 20 pJ/op at Exascale
  - Data storage & access consume most of the energy
• Ever-lengthening communication delays
  - Complicated memory hierarchies
  - Raise the amount of computation per unit of communication
  - Hide latency, conserve locality
• Reliability and resilience
  - Blue Gene/L’s Mean Time Between Failures (MTBF) measured in days
• Application code complexity; domain-specific languages
  - NUMA processors, not fully cache coherent on-chip
  - Mixture of accelerators and conventional cores
Scott B. Baden / CSE 260, UCSD / Fall '15 17
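A quick arithmetic check of where the 20 pJ/op target comes from, using only the numbers on this slide: at 10¹⁸ ops/s, a 20 MW budget allows 20×10⁶ J/s ÷ 10¹⁸ ops/s = 20×10⁻¹² J/op = 20 pJ/op, i.e. roughly a 50-100x reduction from today’s 1-2 nJ/op.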
Technological trends
• Growth in cores/socket rather than sockets
• Heterogeneous processors
• Memory per core is shrinking
• Complicated software-managed parallel memory hierarchy
• Communication costs increasing relative to computation
Intel Sandy Bridge, anandtech.com
Scott B. Baden / CSE 260, UCSD / Fall '15 18
35 years of processor trends Scott B. Baden / CSE 260, UCSD / Fall '15 19
How do we manage these constraints?
• Increase the amount of computation performed per unit of communication
  - Conserve locality, “communication avoiding”
• Hide communication
• Many threads
[Figure: improvement over time (Giga, Tera, Peta, Exa?) for processor performance vs. memory bandwidth and latency; memory lags behind the processor]
Scott B. Baden / CSE 260, UCSD / Fall '15 20
A crosscutting issue: hiding communication
• Little’s law [1961]: the number of threads must equal the parallelism times the latency
  - T = p × λ
  - p and λ are increasing with time
• Difficult to implement
  - Split-phase algorithms
  - Partitioning and scheduling
• The state of the art enables, but doesn’t support, the activity
• Distracts from the focus on the domain science
• Implementation policies entangled with correctness issues
  - Non-robust performance
  - High development costs
Scott B. Baden / CSE 260, UCSD / Fall '15 21
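An illustrative application of Little’s law with assumed numbers (chosen only for the example, not from the slide): if the latency to memory or to the network is λ = 100 ns and we want to sustain p = 4 operations per ns, then

  T = p × λ = 4 ops/ns × 100 ns = 400

so each processor must keep roughly 400 independent operations (threads or outstanding requests) in flight to hide the latency.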
Motivating application
• Solve Laplace’s equation in 3 dimensions with Dirichlet boundary conditions:
  Δϕ = ρ(x,y,z),  ϕ = 0 on ∂Ω
  [Figure: domain Ω with boundary ∂Ω; ρ ≠ 0 inside a subregion]
• Building block: an iterative solver using Jacobi’s method (7-point stencil)
  for (i,j,k) in 1:N x 1:N x 1:N
      u’[i][j][k] = (u[i-1][j][k] + u[i+1][j][k] +
                     u[i][j-1][k] + u[i][j+1][k] +
                     u[i][j][k+1] + u[i][j][k-1]) / 6.0
Scott B. Baden / CSE 260, UCSD / Fall '15 22
Classic message passing implementation
• Decompose the domain into sub-regions, one per process
  - Transmit halo regions between processes
  - Compute the inner region after communication completes
• Loop-carried dependences impose a strict ordering on communication and computation
Scott B. Baden / CSE 260, UCSD / Fall '15 23
Communication tolerant variant
• Only a subset of the domain exhibits loop-carried dependences with respect to the halo region
• Subdivide the domain to remove some of the dependences
• We may now sweep the inner region in parallel with communication
• Sweep the annulus after communication finishes
Scott B. Baden / CSE 260, UCSD / Fall '15 24
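A hedged C/MPI sketch of this overlap pattern for a 1D slab decomposition of the 3D grid; the function names (including the helper sweep_region), the array layout, and the neighbor/tag conventions are illustrative assumptions, not the course’s actual code:

#include <mpi.h>

/* Assumed helper: applies the 7-point Jacobi update to planes k_lo..k_hi. */
void sweep_region(const double *u, double *unew, int N, int k_lo, int k_hi);

/* Each rank owns planes 1..nlocal of an (N+2)x(N+2)x(nlocal+2) slab u,
   with ghost planes 0 and nlocal+1 received from the neighbors. */
void exchange_and_sweep(double *u, double *unew, int N, int nlocal,
                        int up, int down /* neighbor ranks or MPI_PROC_NULL */)
{
    const int plane = (N + 2) * (N + 2);   /* doubles per k-plane */
    MPI_Request reqs[4];

    /* 1. Post receives for the ghost planes and send our own boundary planes. */
    MPI_Irecv(u,                        plane, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(u + (nlocal + 1) * plane, plane, MPI_DOUBLE, up,   1, MPI_COMM_WORLD, &reqs[1]);
    MPI_Isend(u + 1 * plane,            plane, MPI_DOUBLE, down, 1, MPI_COMM_WORLD, &reqs[2]);
    MPI_Isend(u + nlocal * plane,       plane, MPI_DOUBLE, up,   0, MPI_COMM_WORLD, &reqs[3]);

    /* 2. Sweep the inner planes, which do not depend on the halo,
          while the messages are in flight. */
    sweep_region(u, unew, N, 2, nlocal - 1);

    /* 3. Wait for the halo, then sweep the two boundary planes (the annulus). */
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
    sweep_region(u, unew, N, 1, 1);
    sweep_region(u, unew, N, nlocal, nlocal);
}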
Processor Virtualization • Virtualize the processors by overdecomposing • AMPI [Kalé et al.] • When an MPI call blocks, thread yields to another virtual process • How do we inform the scheduler about ready tasks? Scott B. Baden / CSE 260, UCSD / Fall '15 25
Observations
• The exact execution order depends on the data dependence structure: communication & computation
• But many other correct orderings are possible, and some can enable us to hide communication
• We can characterize the running program in terms of a task precedence graph
• There is a deterministic procedure for translating MPI code into the graph
[Figure: SPMD MPI code (Irecv, Irecv, Send, Send, Wait, Wait, Comp, Comp) and the corresponding task graph with nodes 0-4]
Scott B. Baden / CSE 260, UCSD / Fall '15 26