Lecture 18: Large-scale computing
CSE 260 – Parallel Computation (Fall 2015)
Scott B. Baden
Announcements
• Office hours on Wednesday
  - 3:30 PM until 5:30 PM
  - I’ll stay after 5:30 until the last one leaves
• Test on the last day of class
  - BRING A BLUE BOOK
  - Tests your ability to apply the knowledge you’ve gained in the course
  - Open book, open notes
  - You may bring a PDF viewer (e.g. Preview, Acrobat) to look at course materials only
    • No web browsing - turn off the internet
    • No cell phones
Scott B. Baden / CSE 260, UCSD / Fall '15 2
Today’s lecture
• Supercomputers
• Architectures
• Applications
Scott B. Baden / CSE 260, UCSD / Fall '15 3
What is the purpose of a supercomputer?
• Improve our understanding of scientifically and technologically important phenomena
• Improve the quality of life through technological innovation, simulations, data processing
  - Data mining
  - Image processing
  - Simulations: financial modeling, weather, biomedical
• Economic benefits
Scott B. Baden / CSE 260, UCSD / Fall '15 4
What is the world’s fastest supercomputer?
• Top500 #1 (>2 years): Tianhe-2 @ NUDT (China)
  3.12M cores, 54.8 Pflops peak, 17.8 MW power + 6 MW cooling,
  12-core Ivy Bridge + Intel Phi
• #2: Titan @ Oak Ridge, USA
  561K cores, 27 PF, 8.2 MW, Cray XK7: AMD Opteron + NVIDIA Kepler K20x
top500.org
Scott B. Baden / CSE 260, UCSD / Fall '15 5
What does a supercomputer look like?
• Hierarchically organized servers
• Hybrid communication
  - Threads within the server
  - Message passing between servers (or among groups of cores)
Edison @ nersc.gov
conferences.computer.org/sc/2012/papers/1000a079.pdf
Scott B. Baden / CSE 260, UCSD / Fall '15 6
State-of-the-art applications
• Blood flow simulation on Jaguar (Georgia Tech team)
• Ab Initio Molecular Dynamics (AIMD) using Plane-Wave Density Functional Theory, Eric Bylaska (PNNL), on Hopper

Strong scaling (exchange time):
  p          :    48     384    3072   24576
  Time (sec) : 899.8   116.7    16.7     4.9
  Efficiency :  1.00    0.96    0.84    0.35

Weak scaling:
  p          : 24576   98304  196608
  Time (sec) : 228.3     258   304.9
  Efficiency :  1.00    0.88    0.75

Slide courtesy Tan Nguyen, UCSD
Scott B. Baden / CSE 260, UCSD / Fall '15 7
Performance differs across application domains
• Colella’s 7 dwarfs: patterns of communication and computation that persist over time and across implementations
  - Structured grids
    • Panfilov method
  - Dense linear algebra: C[i,j] += A[i,:] * B[:,j]
    • Matrix multiply, matrix-vector multiply, Gaussian elimination
  - N-body methods
  - Sparse linear algebra
    • With a sparse matrix, use knowledge about the locations of the non-zeroes to improve some aspect of performance
  - Unstructured grids
  - Spectral methods (FFT)
  - Monte Carlo
Scott B. Baden / CSE 260, UCSD / Fall '15 8
Application-specific knowledge is important
• There is currently no tool that can convert a serial program into an efficient parallel program
  ... for all applications ... all of the time ... on all hardware
• The more we know about the application ...
  ... the specific problem ... the math/physics ... the initial data ...
  ... the context for analyzing the output ...
  ... the more we can improve performance
• Performance programming issues
  - Data motion and locality
  - Load balancing
  - Serial sections
Scott B. Baden / CSE 260, UCSD / Fall '15 9
Sparse Matrices
• A matrix where we employ knowledge about the location of the non-zeroes
• Consider Jacobi’s method with a 5-point stencil:
  u’[i,j] = (u[i-1,j] + u[i+1,j] + u[i,j-1] + u[i,j+1] - h² f[i,j]) / 4
Scott B. Baden / CSE 260, UCSD / Fall '15 10
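A minimal C sketch of one sweep of this 5-point Jacobi update (the function name, the (N+2)x(N+2) ghost-cell layout, and the OpenMP pragma are illustrative assumptions, not taken from the slide):

// One Jacobi sweep: reads u, writes unew. u and unew are (N+2)x(N+2)
// grids with a one-cell boundary layer; f is the source term, h the mesh spacing.
void jacobi_sweep_2d(int N, double h, double **u, double **unew, double **f)
{
    const double h2 = h * h;
    #pragma omp parallel for
    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= N; j++)
            unew[i][j] = (u[i-1][j] + u[i+1][j] +
                          u[i][j-1] + u[i][j+1] - h2 * f[i][j]) / 4.0;
}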
Web connectivity matrix: 1M x 1M
• A 1M x 1M submatrix of the web connectivity graph, constructed from an archive at the Stanford WebBase
• 3 non-zeroes per row
• Dense storage: 2²⁰ × 2²⁰ = 2⁴⁰ = 1024 Gwords
• Sparse storage: (3/2²⁰) × 2⁴⁰ = 3 Mwords
• The sparse representation saves a factor of ~1 million in storage
Jim Demmel
Scott B. Baden / CSE 260, UCSD / Fall '15 11
Circuit Simulation
• Motorola circuit: 170,998 × 170,998 matrix
• 958,936 non-zeroes (.003% non-zero, 5.6 non-zeroes per row)
www.cise.ufl.edu/research/sparse/matrices/Hamm/scircuit.html
Scott B. Baden / CSE 260, UCSD / Fall '15 12
Generating sparse matrices from unstructured grids
• In some applications of sparse matrices, we generate the matrix from an “unstructured” mesh, e.g. the finite element method
• In some cases we apply direct mesh updates, using nearest neighbors
• Irregular partitioning
[Figure: a 2D airfoil mesh]
Scott B. Baden / CSE 260, UCSD / Fall '15 13
Sparse Matrix Vector Multiplication
• An important kernel used in linear algebra
• Assume x[] fits in the memory of one processor
• y[i] += A[i,j] × x[j]
• Many formats exist; a common format for CPUs is Compressed Sparse Row (CSR)
Jim Demmel
Scott B. Baden / CSE 260, UCSD / Fall '15 14
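A brief sketch of the CSR layout used by the kernel on the next slide; the struct name and the small example matrix are made up for illustration:

/* CSR stores only the non-zeroes of an n x n matrix in three arrays. */
typedef struct {
    int     n;    /* number of rows                               */
    int    *ptr;  /* length n+1: offset where each row starts     */
    int    *ind;  /* length nnz: column index of each value       */
    double *val;  /* length nnz: the non-zero values, row by row  */
} csr_matrix;

/* Example (made up): the 3x3 matrix
       [ 10  0  2 ]
       [  0  5  0 ]
       [  1  0  7 ]
   is stored as
       val = {10, 2, 5, 1, 7}
       ind = { 0, 2, 1, 0, 2}
       ptr = { 0, 2, 3, 5}   // row i occupies entries ptr[i] .. ptr[i+1]-1 */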
Sparse matrix vector multiply kernel
// y[i] += A[i,j] × x[j], with A stored in CSR format (ptr, ind, val)
#pragma omp parallel for schedule(dynamic, chunk)
for (int i = 0; i < N; i++) {        // rows
    int i0 = ptr[i];
    int i1 = ptr[i+1];
    for (int j = i0; j < i1; j++)    // non-zeroes of row i
        y[i] += val[j] * x[ind[j]];
}
Scott B. Baden / CSE 260, UCSD / Fall '15 15
Up and beyond to Exascale
• In 1961, President Kennedy mandated a landing on the Moon by the end of the decade
• July 20, 1969, at Tranquility Base: “The Eagle has landed”
• The US government set an ambitious schedule to reach 10¹⁸ flops by ~2023
• DOE is taking the lead in the US; the EU is also engaged
• Massive technical challenges
Scott B. Baden / CSE 260, UCSD / Fall '15 16
The challenges to landing “Eagle”
• High levels of parallelism within and across nodes
  - 10¹⁸ flops using NVIDIA devices @ 10¹² flops each
  - 10⁶ devices, 10⁹+ threads
• Power: ≤ 20 MW. Today: 18 MW @ 0.05 Exaflops
  - Power consumption: 1-2 nJ/op today → 20 pJ/op at Exascale
  - Data storage & access consume most of the energy
• Ever-lengthening communication delays
  - Complicated memory hierarchies
  - Raise the amount of computation per unit of communication
  - Hide latency, conserve locality
• Reliability and resilience
  - Blue Gene/L’s Mean Time Between Failures (MTBF) measured in days
• Application code complexity; domain-specific languages
  - NUMA processors, not fully cache coherent on-chip
  - Mixture of accelerators and conventional cores
Scott B. Baden / CSE 260, UCSD / Fall '15 17
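A quick arithmetic check of where the 20 pJ/op target comes from, using only the numbers on this slide: at 10¹⁸ ops/s, a 20 MW budget allows 20×10⁶ J/s ÷ 10¹⁸ ops/s = 20×10⁻¹² J/op = 20 pJ/op, i.e. roughly a 50-100x reduction from today’s 1-2 nJ/op.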
Technological trends
• Growth in cores/socket rather than sockets
• Heterogeneous processors
• Memory per core is shrinking
• Complicated software-managed parallel memory hierarchy
• Communication costs increasing relative to computation
Intel Sandy Bridge, anandtech.com
Scott B. Baden / CSE 260, UCSD / Fall '15 18
35 years of processor trends Scott B. Baden / CSE 260, UCSD / Fall '15 19
How do we manage these constraints?
• Increase the amount of computation performed per unit of communication
  - Conserve locality, “communication avoiding”
• Hide communication
• Many threads
[Figure: improvement over time (Giga, Tera, Peta, Exa?) for processor performance vs. memory bandwidth and latency; memory lags behind the processor]
Scott B. Baden / CSE 260, UCSD / Fall '15 20
A crosscutting issue: hiding communication
• Little’s law [1961]: the number of threads must equal the parallelism times the latency
  - T = p × λ
  - p and λ are increasing with time
• Difficult to implement
  - Split-phase algorithms
  - Partitioning and scheduling
• The state of the art enables, but doesn’t support, the activity
• Distracts from the focus on the domain science
• Implementation policies entangled with correctness issues
  - Non-robust performance
  - High development costs
Scott B. Baden / CSE 260, UCSD / Fall '15 21
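An illustrative application of Little’s law with assumed numbers (chosen only for the example, not from the slide): if the latency to memory or to the network is λ = 100 ns and we want to sustain p = 4 operations per ns, then

  T = p × λ = 4 ops/ns × 100 ns = 400

so each processor must keep roughly 400 independent operations (threads or outstanding requests) in flight to hide the latency.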
Motivating application
• Solve Laplace’s equation in 3 dimensions with Dirichlet boundary conditions:
  Δϕ = ρ(x,y,z),  ϕ = 0 on ∂Ω
  [Figure: domain Ω with boundary ∂Ω; ρ ≠ 0 inside a subregion]
• Building block: an iterative solver using Jacobi’s method (7-point stencil)
  for (i,j,k) in 1:N x 1:N x 1:N
      u’[i][j][k] = (u[i-1][j][k] + u[i+1][j][k] +
                     u[i][j-1][k] + u[i][j+1][k] +
                     u[i][j][k+1] + u[i][j][k-1]) / 6.0
Scott B. Baden / CSE 260, UCSD / Fall '15 22
Classic message passing implementation
• Decompose the domain into sub-regions, one per process
  - Transmit halo regions between processes
  - Compute the inner region after communication completes
• Loop-carried dependences impose a strict ordering on communication and computation
Scott B. Baden / CSE 260, UCSD / Fall '15 23
Communication tolerant variant
• Only a subset of the domain exhibits loop-carried dependences with respect to the halo region
• Subdivide the domain to remove some of the dependences
• We may now sweep the inner region in parallel with communication
• Sweep the annulus after communication finishes
Scott B. Baden / CSE 260, UCSD / Fall '15 24
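A hedged C/MPI sketch of this overlap pattern for a 1D slab decomposition of the 3D grid; the function names (including the helper sweep_region), the array layout, and the neighbor/tag conventions are illustrative assumptions, not the course’s actual code:

#include <mpi.h>

/* Assumed helper: applies the 7-point Jacobi update to planes k_lo..k_hi. */
void sweep_region(const double *u, double *unew, int N, int k_lo, int k_hi);

/* Each rank owns planes 1..nlocal of an (N+2)x(N+2)x(nlocal+2) slab u,
   with ghost planes 0 and nlocal+1 received from the neighbors. */
void exchange_and_sweep(double *u, double *unew, int N, int nlocal,
                        int up, int down /* neighbor ranks or MPI_PROC_NULL */)
{
    const int plane = (N + 2) * (N + 2);   /* doubles per k-plane */
    MPI_Request reqs[4];

    /* 1. Post receives for the ghost planes and send our own boundary planes. */
    MPI_Irecv(u,                        plane, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(u + (nlocal + 1) * plane, plane, MPI_DOUBLE, up,   1, MPI_COMM_WORLD, &reqs[1]);
    MPI_Isend(u + 1 * plane,            plane, MPI_DOUBLE, down, 1, MPI_COMM_WORLD, &reqs[2]);
    MPI_Isend(u + nlocal * plane,       plane, MPI_DOUBLE, up,   0, MPI_COMM_WORLD, &reqs[3]);

    /* 2. Sweep the inner planes, which do not depend on the halo,
          while the messages are in flight. */
    sweep_region(u, unew, N, 2, nlocal - 1);

    /* 3. Wait for the halo, then sweep the two boundary planes (the annulus). */
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
    sweep_region(u, unew, N, 1, 1);
    sweep_region(u, unew, N, nlocal, nlocal);
}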
Processor Virtualization • Virtualize the processors by overdecomposing • AMPI [Kalé et al.] • When an MPI call blocks, thread yields to another virtual process • How do we inform the scheduler about ready tasks? Scott B. Baden / CSE 260, UCSD / Fall '15 25
Observations
• The exact execution order depends on the data dependence structure: communication & computation
• But many other correct orderings are possible, and some can enable us to hide communication
• We can characterize the running program in terms of a task precedence graph
• There is a deterministic procedure for translating MPI code into the graph
[Figure: SPMD MPI code (Irecv, Irecv, Send, Send, Wait, Wait, Comp, Comp) and the corresponding task graph with nodes 0-4]
Scott B. Baden / CSE 260, UCSD / Fall '15 26