Tutorial: Partitioning, Load Balancing and the Zoltan Toolkit
Erik Boman and Karen Devine
Discrete Algorithms and Math Dept.
Sandia National Laboratories, NM
CSCAPES Institute
SciDAC Tutorial, MIT, June 2007
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

Slide 2: Outline
Part 1:
• Partitioning and load balancing
  – “Owner computes” approach
• Static vs. dynamic partitioning
• Models and algorithms
  – Geometric (RCB, SFC)
  – Graph & hypergraph
Part 2:
• Zoltan
  – Capabilities
  – How to get it, configure, build
  – How to use Zoltan with your application

Slide 3: Parallel Computing in CS&E
• Parallel Computing Challenge
  – Scientific simulations critical to modern science.
    • Models grow in size, higher fidelity/resolution.
    • Simulations must be done on parallel computers.
  – Clusters with 64-256 nodes are widely available.
  – High-performance computers have 100,000+ processors.
• How can we use such machines efficiently?

Slide 4: Parallel Computing Approaches
• We focus on distributed memory systems.
  – Two common approaches:
    • Master–slave
      – A “master” processor is a global synchronization point, hands out work to the slaves.
    • Data decomposition + “Owner computes”:
      – The data is distributed among the processors.
      – The owner performs all computation on its data.
      – Data distribution defines work assignment.
      – Data dependencies among data items owned by different processors incur communication.
Slide 5: Partitioning and Load Balancing
• Assignment of application data to processors for parallel computation.
• Applied to grid points, elements, matrix rows, particles, ….
[Figure: Partition of an unstructured finite element mesh for three processors]

Slide 6: Partitioning Goals
• Minimize total execution time by…
  – Minimizing processor idle time.
    • Load balance data and work.
  – Keeping inter-processor communication low.
    • Reduce total volume, max volume.
    • Reduce number of messages.

Slide 7: “Simple” Example (1)
• Finite difference method.
  – Assign equal numbers of grid points to processors.
  – Keep amount of data communicated small.
[Figure: 7x5 grid, 5-point stencil, 4 processors]

Slide 8: “Simple” Example (2)
• Finite difference method.
  – Assign equal numbers of grid points to processors.
  – Keep amount of data communicated small.
First 35/4 points to processor 0; next 35/4 points to processor 1; etc.

  3 3 3 3 3 3 3
  2 2 2 2 2 2 3
  1 1 1 1 2 2 2
  0 0 1 1 1 1 1
  0 0 0 0 0 0 0

Max Data Comm: 14
Total Volume: 42
Max Nbor Proc: 2
Max Imbalance: 3%
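The linear assignment on Slide 8 and its four quality figures can be sketched in a few lines of Python. The slides do not define the metrics, so the definitions here are assumptions: communication is counted as distinct (grid point, destination processor) pairs under the 5-point stencil, and imbalance is the largest load divided by the average load, minus one. The function names `block_partition` and `partition_metrics` are illustrative only. With these assumptions the sketch reproduces the numbers shown on Slide 8.

```python
from collections import Counter


def block_partition(num_points, num_procs):
    """Give consecutive points to processors in near-equal chunks."""
    return [i * num_procs // num_points for i in range(num_points)]


def partition_metrics(owner, nx, ny):
    """Quality metrics for a partition of an nx-by-ny grid.

    owner[j * nx + i] is the processor that owns grid point (i, j).
    Assumed definitions: a point is counted once per remote processor
    that owns one of its 5-point-stencil neighbors.
    """
    num_procs = max(owner) + 1
    sends = set()  # distinct (point, destination processor) pairs
    for j in range(ny):
        for i in range(nx):
            p = owner[j * nx + i]
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                a, b = i + di, j + dj
                if 0 <= a < nx and 0 <= b < ny:
                    q = owner[b * nx + a]
                    if q != p:
                        sends.add((j * nx + i, q))

    volume = [0] * num_procs
    neighbors = [set() for _ in range(num_procs)]
    for point, dest in sends:
        volume[owner[point]] += 1
        neighbors[owner[point]].add(dest)

    load = Counter(owner)
    average = len(owner) / num_procs
    return {
        "max_data_comm": max(volume),
        "total_volume": sum(volume),
        "max_nbor_proc": max(len(s) for s in neighbors),
        "max_imbalance": max(load.values()) / average - 1.0,
    }


# 7x5 grid, 4 processors, points numbered row by row.
metrics = partition_metrics(block_partition(35, 4), nx=7, ny=5)
print(metrics["max_data_comm"], metrics["total_volume"],
      metrics["max_nbor_proc"], f"{metrics['max_imbalance']:.0%}")
# prints: 14 42 2 3%
```

The same `partition_metrics` function accepts any owner map, so other layouts of the 7x5 grid can be compared under the same assumed definitions.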
Slide 9: “Simple” Example (3)
• Finite difference method.
  – Assign equal numbers of grid points to processors.
  – Keep amount of data communicated small.
Two-dimensional structured grid partition

  1 1 1 1 2 2 2
  1 1 1 1 2 2 2
  0 0 0 0 3 3 3
  0 0 0 0 3 3 3
  0 0 0 0 3 3 3

Max Data Comm: 7
Total Volume: 26
Max Nbor Proc: 2
Max Imbalance: 37%

Slide 10: “Simple” Example (4)
• Finite difference method.
  – Assign equal numbers of grid points to processors.
  – Keep amount of data communicated small.
One-dimensional striped partition

  0 0 1 1 2 2 3
  0 0 1 1 2 2 3
  0 0 1 1 2 2 3
  0 0 1 1 2 2 3
  0 0 1 1 2 2 3

Max Data Comm: 10
Total Volume: 30
Max Nbor Proc: 2
Max Imbalance: 14%

Slide 11: Static Partitioning
[Figure: Initialize Application → Partition Data → Distribute Data → Compute Solutions → Output & End]
• Static partitioning in an application:
  – Data partition is computed.
  – Data are distributed according to partition map.
  – Application computes.
• Ideal partition:
  – Processor idle time is minimized.
  – Inter-processor communication costs are kept low.

Slide 12: Dynamic Applications
• Characteristics:
  – Work per processor is unpredictable or changes during a computation; and/or
  – Locality of objects changes during computations.
  – Dynamic redistribution of work is needed during computation.
• Example: adaptive mesh refinement (AMR) methods
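The imbalance figures quoted for the two partitions on Slides 9 and 10 can be checked by hand. The slides do not state the formula; the worked arithmetic below assumes that imbalance is the largest per-processor point count n_p relative to the average N/P, minus one. The largest part holds 12 points (a 4x3 block) in the two-dimensional partition and 10 points (a two-column stripe) in the striped partition.

```latex
\[
\text{imbalance} \;=\; \frac{\max_p n_p}{N/P} - 1,
\qquad N = 35,\quad P = 4,\quad N/P = 8.75
\]
\[
\text{Two-dimensional (Slide 9):}\quad \frac{12}{8.75} - 1 \approx 0.37
\qquad\qquad
\text{Striped (Slide 10):}\quad \frac{10}{8.75} - 1 \approx 0.14
\]
```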
Slide 13: Dynamic Repartitioning (a.k.a. Dynamic Load Balancing)
[Figure: Initialize Application → Partition Data → Redistribute Data → Compute Solutions & Adapt → Output & End]
• Dynamic repartitioning (load balancing) in an application:
  – Data partition is computed.
  – Data are distributed according to partition map.
  – Application computes and, perhaps, adapts.
  – Process repeats until the application is done.
• Ideal partition:
  – Processor idle time is minimized.
  – Inter-processor communication costs are kept low.
  – Cost to redistribute data is also kept low.

Slide 14: Static vs. Dynamic: Usage and Implementation
• Static:
  – Pre-processor to application.
  – Can be implemented serially.
  – May be slow, expensive.
  – File-based interface acceptable.
  – No consideration of existing decomposition required.
• Dynamic:
  – Must run side-by-side with application.
  – Must be implemented in parallel.
  – Must be fast, scalable.
  – Library application interface required.
    • Should be easy to use.
  – Incremental algorithms preferred.
    • Small changes in input result in small changes in partitions.
    • Explicit or implicit incrementality acceptable.

Slide 15: Two Types of Models/Algorithms
• Geometric
  – Computations are tied to a geometric domain.
  – Coordinates for data items are available.
  – Geometric locality is loosely correlated to data dependencies.
• Combinatorial (topological)
  – No geometry.
  – Connectivity among data items is known.
    • Represent as graph or hypergraph.

Slide 16: Recursive Coordinate Geometric Bisection (RCB)
• Developed by Berger & Bokhari (1987) for AMR.
  – Independently discovered by others.
• Idea:
  – Divide work into two equal parts using a cutting plane orthogonal to a coordinate axis.
  – Recursively cut the resulting subdomains.
[Figure: domain divided by a 1st cut, two 2nd cuts, and four 3rd cuts]
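The RCB idea on Slide 16 can be sketched serially in Python. This is a minimal illustration, not Zoltan's implementation. It assumes unit work per point, and it cuts along the coordinate axis with the largest extent, a choice the slide does not specify. The function name `rcb` and its parameters are hypothetical.

```python
def rcb(points, ids, num_parts, first_part=0, out=None):
    """Recursively bisect the points listed in `ids` into `num_parts` parts.

    points: list of coordinate tuples; ids: indices into `points`.
    Returns a dict mapping point index -> part number.
    """
    if out is None:
        out = {}
    if not ids:
        return out
    if num_parts == 1:
        for i in ids:
            out[i] = first_part
        return out

    # Assumption: cut orthogonal to the axis with the largest extent.
    dims = len(points[0])
    axis = max(
        range(dims),
        key=lambda d: max(points[i][d] for i in ids)
        - min(points[i][d] for i in ids),
    )

    # Sort along that axis and place the cut so that the work (here, the
    # point count) is split in proportion to the parts on each side.
    ordered = sorted(ids, key=lambda i: points[i][axis])
    left_parts = num_parts // 2
    split = len(ordered) * left_parts // num_parts

    rcb(points, ordered[:split], left_parts, first_part, out)
    rcb(points, ordered[split:], num_parts - left_parts,
        first_part + left_parts, out)
    return out


# Usage: an 8x8 set of grid points split among 4 processors.
pts = [(x, y) for x in range(8) for y in range(8)]
parts = rcb(pts, list(range(len(pts))), 4)
print([sum(1 for v in parts.values() if v == p) for p in range(4)])
# prints: [16, 16, 16, 16]
```

Splitting in proportion to the number of parts on each side lets the sketch handle processor counts that are not powers of two.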
Slide 17: RCB Advantages and Disadvantages
• Advantages:
  – Conceptually simple; fast and inexpensive.
  – Regular subdomains.
    • Can be used for structured or unstructured applications.
    • All processors can inexpensively know entire decomposition.
  – Effective when connectivity info is not available.
• Disadvantages:
  – No explicit control of communication costs.
  – Can generate disconnected subdomains.
  – Mediocre partition quality.
  – Geometric coordinates needed.

Slide 18: RCB Repartitioning
• Implicitly incremental.
• Small changes in data result in small movement of cuts.

Slide 19: Applications of RCB
[Figure panels: Particle Simulations; Adaptive Mesh Refinement; Crash Simulations and Contact Detection (frames at 1.6 ms and 3.2 ms); Parallel Volume Rendering]

Slide 20: Variations on RCB: RIB
• Recursive Inertial Bisection
  – Simon, Taylor, et al., 1991
  – Cutting planes orthogonal to principal axes of geometry.
  – Not incremental.
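One bisection step of the RIB variant on Slide 20 can be sketched as follows. The sketch is restricted to two dimensions with unit weights. It finds the principal axis from the closed-form orientation of the 2x2 scatter matrix, which is a standard formula and not taken from the slides, and it cuts at the median of the projections onto that axis. The name `inertial_bisect` is hypothetical. A full RIB would apply the step recursively to each half.

```python
import math


def inertial_bisect(points):
    """Split 2-D points in half with a cut orthogonal to the principal axis.

    Returns (left_ids, right_ids, axis), where axis is a unit vector
    along the direction of greatest spread.
    """
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n

    # Entries of the scatter matrix about the centroid.
    sxx = sum((p[0] - cx) ** 2 for p in points)
    syy = sum((p[1] - cy) ** 2 for p in points)
    sxy = sum((p[0] - cx) * (p[1] - cy) for p in points)

    # Orientation of the eigenvector with the largest eigenvalue.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    ux, uy = math.cos(theta), math.sin(theta)

    # Project onto the principal axis and cut at the median projection.
    order = sorted(
        range(n),
        key=lambda i: (points[i][0] - cx) * ux + (points[i][1] - cy) * uy,
    )
    half = n // 2
    return order[:half], order[half:], (ux, uy)


# Usage: points spread along the diagonal are cut across the diagonal,
# which an axis-aligned RCB cut cannot do.
left, right, axis = inertial_bisect([(i, i) for i in range(10)])
print(sorted(left), sorted(right))
# prints: [0, 1, 2, 3, 4] [5, 6, 7, 8, 9]
```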
Slide 21: Space-Filling Curve Partitioning (SFC)
• Developed by Peano, 1890.
• Space-Filling Curve:
  – Mapping from R^3 to R^1 that completely fills a domain.
  – Applied recursively to obtain desired granularity.
• Used for partitioning by …
  – Warren and Salmon, 1993, gravitational simulations.
  – Pilkington and Baden, 1994, smoothed particle hydrodynamics.
  – Patra and Oden, 1995, adaptive mesh refinement.

Slide 22: SFC Algorithm
• Run space-filling curve through domain.
• Order objects according to position on curve.
• Perform 1-D partition of curve.
[Figure: three panels showing 20 objects numbered 1-20 in the order the curve visits them]

Slide 23: SFC Advantages and Disadvantages
• Advantages:
  – Simple, fast, inexpensive.
  – Maintains geometric locality of objects in processors.
  – Linear ordering of objects may improve cache performance.
• Disadvantages:
  – No explicit control of communication costs.
  – Can generate disconnected subdomains.
  – Often lower quality partitions than RCB.
  – Geometric coordinates needed.

Slide 24: SFC Repartitioning
• Implicitly incremental.
• Small changes in data result in small movement of cuts in linear ordering.
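The three steps of the SFC algorithm on Slide 22 can be sketched in Python. The slides do not name a particular curve, so this sketch assumes a 2-D Hilbert curve, using the widely published bit-manipulation form of the coordinate-to-index mapping. It also assumes objects sit at integer coordinates on an n-by-n grid with n a power of two; real coordinates would first need to be quantized. The names `hilbert_index` and `sfc_partition` are illustrative.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along a Hilbert curve over an n-by-n grid.

    n must be a power of two, with 0 <= x, y < n.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/reflect the quadrant so the sub-curve has standard orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def sfc_partition(points, n, num_parts):
    """Order points along the curve, then cut the ordering into equal pieces.

    Returns a list giving the part number of each point.
    """
    order = sorted(
        range(len(points)),
        key=lambda i: hilbert_index(n, points[i][0], points[i][1]),
    )
    parts = [0] * len(points)
    for rank, i in enumerate(order):
        parts[i] = rank * num_parts // len(points)  # 1-D partition of the curve
    return parts


# Usage: 16 objects on a 4x4 grid, 4 processors.
pts = [(x, y) for x in range(4) for y in range(4)]
parts = sfc_partition(pts, 4, 4)
print([parts.count(p) for p in range(4)])
# prints: [4, 4, 4, 4]
```

Because the partition is a set of cut positions in a one-dimensional ordering, a small change in the objects moves those cuts only slightly, which is the implicit incrementality noted on Slide 24.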