  1. Lecture 9: Distributed Memory • More on MPI • Communication costs • Hardware / Network topologies* (* from Rauber and Rünger)

  2. MPI Review

  3. Computational Example A common approach for grid-based computations on distributed memory uses domain decomposition: split the domain into smaller problems on subdomains and iterate on each, coordinating the solution between adjacent subdomains. • This approach to parallelism works well for problems that exhibit locality - nearby objects interact more strongly than distant ones (the same property that gives good cache performance). • Stencil-based finite difference equations are good candidates.

  4. Jacobi Example with MPI Suppose an n-by-n domain and p processors. Left (strip decomposition): each processor sends n gridpoints to its top and bottom neighbors, so total communication = 2pn. Right (block decomposition): each side = n/√p, four sides per processor, so total = 4pn/√p = 4n√p.
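For concreteness (numbers added here for illustration, not from the slide): with n = 1024 and p = 16, the strip decomposition communicates 2pn = 32,768 gridpoints per iteration while the block decomposition communicates 4n√p = 16,384, and the per-processor volume drops from 2n = 2048 to 4n/√p = 1024.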

  5. MPI Jacobi Elements (a concrete sketch of the communication step follows)
     compute istart, iend, jstart, jend from rank;
     /* update interior grid points */
     for (i = istart; i < iend; i++)
       for (j = jstart; j < jend; j++) {
         (x,y) = fn(i,j);
         update u(i,j);
         compute norm of update;
       }
     /* update ghost cells */
     MPI_Sendrecv(to proc to the left);
     MPI_Sendrecv(to proc to the right);
     MPI_Sendrecv(to proc on top);
     MPI_Sendrecv(to proc on bottom);
     /* compute global updateNorm */
     MPI_Allreduce(...);
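The MPI_Sendrecv lines above are pseudocode; a minimal sketch of the left/right ghost-column exchange (the packed buffers send_left, recv_right, etc., the count nloc, and the nbrs[] array from the next slide are assumptions, not from the original):

     MPI_Status status;
     /* send my left edge, receive my right ghost column */
     MPI_Sendrecv(send_left,  nloc, MPI_DOUBLE, nbrs[LEFT],  0,
                  recv_right, nloc, MPI_DOUBLE, nbrs[RIGHT], 0,
                  comm2d, &status);
     /* send my right edge, receive my left ghost column */
     MPI_Sendrecv(send_right, nloc, MPI_DOUBLE, nbrs[RIGHT], 1,
                  recv_left,  nloc, MPI_DOUBLE, nbrs[LEFT],  1,
                  comm2d, &status);
     /* every rank contributes its local norm and gets the sum */
     MPI_Allreduce(&localNorm, &updateNorm, 1, MPI_DOUBLE,
                   MPI_SUM, MPI_COMM_WORLD);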

  6. MPI Cartesian Communicator
     [Figure: 3 x 4 process grid; ranks 0-11 with Cartesian coordinates (0,0) through (2,3)]
     #define UP    0
     #define DOWN  1
     #define LEFT  2
     #define RIGHT 3
     MPI_Comm comm2d;
     int dims[2] = {3, 4}, nbrs[4];
     int reorder = 0, coords[2];
     int periods[2] = {x_periodic, y_periodic};   /* 0 or 1 */
     MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, reorder, &comm2d);
     MPI_Comm_rank(comm2d, &my_rank);
     MPI_Cart_coords(comm2d, my_rank, 2, coords);
     MPI_Cart_shift(comm2d, 0, 1, &nbrs[UP], &nbrs[DOWN]);
     MPI_Cart_shift(comm2d, 1, 1, &nbrs[LEFT], &nbrs[RIGHT]);
     Then communicate using nbrs[UP], nbrs[LEFT], etc.
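A useful property of this API (from the MPI standard, not shown on the slide): in a non-periodic dimension, MPI_Cart_shift returns MPI_PROC_NULL for neighbors that would fall off the grid, and a send or receive involving MPI_PROC_NULL is a no-op - so boundary processes can execute exactly the same MPI_Sendrecv calls as interior ones, with no special-casing.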

  7. MPE MPI Multi-Processing Environment = package of MPI tools including • profiling libraries, event logging, and convenient wrappers (use mpecc -mpilog for logging) • the Jumpshot viewer for logfiles • graphics, debugging routines, and more. Works with any compliant MPI implementation (e.g., MPICH and OpenMPI); distributed with MPICH. The current version, MPE2, comes with MPICH2 or can be downloaded standalone.

  8. Sample MPI Bugs Use of wildcards can lead to a race condition. MPI_Bcast need not be synchronizing. If it is, rank 0 gets the message from rank 1 first. If not, rank 0 could receive the message from either rank 1 or rank 2 first. (See the reconstruction below.)
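The slide's code did not survive extraction; the classic example matching this description (a reconstruction, not the original) is:

     /* Whether rank 0's first receive matches rank 1 or rank 2   */
     /* depends on whether MPI_Bcast happens to synchronize.      */
     if (rank == 0) {
         MPI_Recv(&buf1, 1, MPI_INT, MPI_ANY_SOURCE, tag, comm, &status);
         MPI_Bcast(&data, 1, MPI_INT, 0, comm);
         MPI_Recv(&buf2, 1, MPI_INT, MPI_ANY_SOURCE, tag, comm, &status);
     } else if (rank == 1) {
         MPI_Send(&val, 1, MPI_INT, 0, tag, comm);   /* before the bcast */
         MPI_Bcast(&data, 1, MPI_INT, 0, comm);
     } else if (rank == 2) {
         MPI_Bcast(&data, 1, MPI_INT, 0, comm);
         MPI_Send(&val, 1, MPI_INT, 0, tag, comm);   /* after the bcast  */
     }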

  9. Sample MPI Bugs Only works for an even number of processors (see the reconstruction below).
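The code itself is missing from the transcript; a standard example with exactly this property (a reconstruction under that assumption) is a pairwise exchange ordered by even/odd rank:

     /* Even rank r swaps with odd rank r+1: evens send first,    */
     /* odds receive first. With even p everyone is paired up;    */
     /* with odd p the last (even) rank has no partner - rank+1   */
     /* does not exist - so the program errors or hangs.          */
     if (rank % 2 == 0) {
         MPI_Send(sbuf, n, MPI_DOUBLE, rank + 1, tag, comm);
         MPI_Recv(rbuf, n, MPI_DOUBLE, rank + 1, tag, comm, &status);
     } else {
         MPI_Recv(rbuf, n, MPI_DOUBLE, rank - 1, tag, comm, &status);
         MPI_Send(sbuf, n, MPI_DOUBLE, rank - 1, tag, comm);
     }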

  10. Sample MPI Bugs Suppose we have a local variable, e.g. energy, and want to sum all the processors' energies to find the total energy of the system. Recall MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm). Using the same variable for both buffers, as in MPI_Reduce(&energy, &energy, 1, MPI_REAL, MPI_SUM, 0, MPI_COMM_WORLD), will bomb - MPI forbids aliasing the send and receive buffers. (Correct alternatives below.)
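Two correct alternatives (standard MPI usage, added here; MPI_DOUBLE is used since this is C code):

     /* Option 1: separate receive buffer */
     double energy, total;     /* energy holds this rank's local value */
     MPI_Reduce(&energy, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

     /* Option 2: reduce into the same variable at the root only */
     if (rank == 0)
         MPI_Reduce(MPI_IN_PLACE, &energy, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
     else
         MPI_Reduce(&energy, NULL, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
         /* recvbuf is ignored on non-root ranks */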

  11. Sample MPI Bugs
     while (stillIterating) {
       if (my_rank == 0) {
         for (i = 1; i < nProcs; i++) {
           MPI_Recv(buf, count, type, MPI_ANY_SOURCE, tag,
                    MPI_COMM_WORLD, &status);
           /* process stuff from other processors */
         }
       } else {
         MPI_Send(buf, count, type, 0, tag, MPI_COMM_WORLD);
       }
     }
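The slide does not spell out the bug; presumably it is the MPI_ANY_SOURCE wildcard: nothing ties the nProcs - 1 receives in one iteration to that iteration's sends, so a fast processor's message from iteration k+1 can be matched by rank 0's receive loop for iteration k. Receiving from a specific source, or encoding the iteration number in the tag, removes the race.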

  12. MPI + OpenMP Example
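The example's source code is not in the transcript; a minimal hybrid sketch consistent with the output on the next slide (the choice of MPI_THREAD_FUNNELED is an assumption) is:

     #include <stdio.h>
     #include <mpi.h>
     #include <omp.h>

     int main(int argc, char *argv[]) {
         int provided, rank, size, nthreads = 1;
         /* Request threaded MPI; FUNNELED = only the main thread
            makes MPI calls. */
         MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
         MPI_Comm_size(MPI_COMM_WORLD, &size);
         printf("Hello from MPI rank %d of %d\n", rank, size);
         #pragma omp parallel
         {
             #pragma omp master
             nthreads = omp_get_num_threads();
         }
         printf("Number of threads %d on process rank %d\n", nthreads, rank);
         MPI_Finalize();
         return 0;
     }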

  13. MPI + OpenMP Example Sample laptop output:
     % mpirun -np 2 a.out
     Hello from MPI rank 0 of 2
     Hello from MPI rank 1 of 2
     Number of threads 4 on process rank 0
     Number of threads 4 on process rank 1

  14. MPI References • Lawrence Livermore tutorial: https://computing.llnl.gov/tutorials/mpi/ • Using MPI: Portable Parallel Programming with the Message-Passing Interface by Gropp, Lusk, Skjellum • Using MPI-2: Advanced Features of the Message-Passing Interface by Gropp, Lusk, Thakur • Lots of other online tutorials, books, etc.

  15. Parallel Performance Recall Amdahl's law: if T_1 = serial cost + parallel cost, then T_p = serial cost + parallel cost / p. But really T_p = serial cost + parallel cost / p + T_communication. How expensive is communication?
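A standard first-order answer (textbook model, added here, not from the slide): the time to send one message of m words is roughly T_msg = α + β·m, where α is the startup latency and β the per-word transfer time. On typical networks α exceeds β by several orders of magnitude, so a few large messages are much cheaper than many small ones - one reason the block decomposition's smaller but more numerous messages do not automatically win.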

  16. Network Characteristics The interconnection network connects nodes and transfers data. Important qualities: • Topology - the structure used to connect the nodes • Routing algorithm - how messages are transmitted between processors, along which path (= the nodes along which the message is transferred) • Switching strategy - how a message is cut into pieces and assigned a path • Flow control (for dealing with congestion) - stall, store data in buffers, re-route data, tell the source to halt, discard, etc.

  17. Interconnection Network Represent it as a graph G = (V, E): V = set of nodes to be connected, E = direct links between the nodes. Links are usually bidirectional - they transfer messages in both directions at the same time. Characterize a network by: • diameter - maximum over all pairs of nodes of the shortest path between the nodes (worst-case path length in message transmission) • degree - number of direct links at a node (number of direct neighbors) • bisection bandwidth - minimum number of edges that must be removed to partition the network into two parts of equal size with no connection between them (measures network capacity for transmitting messages simultaneously) • node/edge connectivity - number of nodes/edges that must fail to disconnect the network (a measure of reliability)

  18. Linear Array • p vertices, p − 1 links • Diameter = p − 1 • Degree = 2 • Bisection bandwidth = 1 • Node connectivity = 1, edge connectivity = 1

  19. Ring topology • diameter = ⌊p/2⌋ • degree = 2 • bisection bandwidth = 2 • node connectivity = 2, edge connectivity = 2

  20. Mesh topology • diameter = 2(√p − 1); a 3D mesh has diameter 3(∛p − 1) • degree = 4 (6 in 3D) • bisection bandwidth = √p • node connectivity = 2, edge connectivity = 2 • Route along each dimension in turn

  21. Torus topology Compared with the mesh: diameter halved, bisection bandwidth doubled, edge and node connectivity doubled.

  22. Hypercube topology [Figure: hypercubes of dimension 1 through 4, nodes labelled with k-bit binary strings] • p = 2^k processors labelled with binary numbers of length k • a k-dimensional cube is constructed from two (k − 1)-cubes • connect corresponding processors if their labels differ in 1 bit (Hamming distance d between two k-bit binary words = path of length d between the 2 nodes)
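For example (illustration added here): to route from 0101 to 1100 in a 4-cube, flip the differing bits one at a time: 0101 → 1101 → 1100. The labels differ in 2 bits (Hamming distance 2), so the path has length 2.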

  23. Hypercube topology [Figure: same hypercube diagrams as above] • diameter = k (= log p) • degree = k • bisection bandwidth = p/2 • node connectivity = k, edge connectivity = k

  24. Dynamic Networks The networks above were direct, or static, interconnection networks: processors connected directly with each other through fixed physical links. Indirect, or dynamic, networks contain switches which provide an indirect connection between the nodes; the switches are configured dynamically to establish a connection. • bus • crossbar • multistage network - e.g. butterfly, omega, baseline

  25. Crossbar [Figure: n processors P1..Pn connected to m memories M1..Mm through an n × m grid of switches] • Connecting n inputs and m outputs takes nm switches (typically only used for small numbers of processors) • At each switch the signal can either go straight or change direction • Diameter = 1, bisection bandwidth = p

  26. Butterfly [Figure: 16 × 16 butterfly network with stages 0-3 and switch rows 000-111] For p = 2^(k+1) processors: k + 1 stages, 2^k switches per stage, each a 2 × 2 switch.

  27. Fat tree • Complete binary tree • Processors at leaves • Increase links for higher bandwidth near root

  28. Current picture • Old style: mapped algorithms to topologies • New style: avoid topology-specific optimizations • Want code that runs on next year's machines too • Topology awareness in vendor MPI libraries? • Software topology - ease of programming, but not used for performance?

  29. Top500 Interconnects [Figure: chart of interconnect families used in Top500 systems; legend not preserved in transcript]

  30. Networks* (* from the Top500 2007 recent supercomputers overview) [Figure/table not preserved in transcript]

  31. Routing and Switching Routing = determining a path from source to destination through the network. Avoid deadlock; try to find a minimum-cost path, which depends on the length of the path (topology), contention, and congestion. • Deterministic routing - always uses the same path; can suffer unbalanced network load and network contention (2 or more messages transmitting at the same time over the same link, delaying message transmission). • Adaptive routing - dynamically selects the routing path based on load information, trying to spread network traffic evenly. Also more fault tolerant.

  32. Routing and Switching • Circuit switching: the full path is reserved for the entire message (like a telephone call); a short probe message establishes the path, then all message units use the same path. • Packet switching: the message is broken into separately-routed packets • store-and-forward - each switch on the path must receive the entire packet (store) before sending it to the next switch (forward); needs enough memory at the switches • pipelining - packets are sent so that all links are used by successive packets in an overlapping way (if all packets travel the same path) • Cut-through routing, wormhole routing (see the cost comparison below)
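A rough cost comparison (textbook model, added here, not from the slide): sending m words over l links with per-hop time t_h and per-word time β costs about l·(t_h + m·β) under store-and-forward, but only about l·t_h + m·β under cut-through/wormhole routing, since the head of the message establishes the path while the rest pipelines behind it. For long messages, cut-through therefore hides the distance l almost entirely.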
