Tutorial on MPI: The Message-Passing Interface
William Gropp
Mathematics and Computer Science Division, Argonne National Laboratory


  1. Getting started
     - Writing MPI programs
     - Compiling and linking
     - Running MPI programs
     - More information:
       - Using MPI by William Gropp, Ewing Lusk, and Anthony Skjellum
       - The LAM companion to "Using MPI..." by Zdzislaw Meglicki
       - Designing and Building Parallel Programs by Ian Foster
       - A Tutorial/User's Guide for MPI by Peter Pacheco (ftp://math.usfca.edu/pub/MPI/mpi.guide.ps)
       - The MPI standard and other information is available at http://www.mcs.anl.gov/mpi, also the source for several implementations.

  2. Writing MPI programs

         #include "mpi.h"
         #include <stdio.h>

         int main( argc, argv )
         int argc;
         char **argv;
         {
             MPI_Init( &argc, &argv );
             printf( "Hello world\n" );
             MPI_Finalize();
             return 0;
         }

  3. Commentary
     - #include "mpi.h" provides basic MPI definitions and types
     - MPI_Init starts MPI
     - MPI_Finalize exits MPI
     - Note that all non-MPI routines are local; thus the printf runs on each process

  4. Compiling and linking
     For simple programs, special compiler commands can be used. For large projects, it is best to use a standard Makefile.
     The MPICH implementation provides the commands mpicc and mpif77, as well as 'Makefile' examples in '/usr/local/mpi/examples/Makefile.in'.

  5. Special compilation commands
     The commands
         mpicc  -o first  first.c
         mpif77 -o firstf firstf.f
     may be used to build simple programs when using MPICH.
     These provide special options that exploit the profiling features of MPI:
         -mpilog    Generate log files of MPI calls
         -mpitrace  Trace execution of MPI calls
         -mpianim   Real-time animation of MPI (not available on all systems)
     These are specific to the MPICH implementation; other implementations may provide similar commands (e.g., mpcc and mpxlf on the IBM SP2).

  6. Using Makefiles
     The file 'Makefile.in' is a template Makefile. The program (script) 'mpireconfig' translates this to a Makefile for a particular system. This allows you to use the same Makefile for a network of workstations and a massively parallel computer, even when they use different compilers, libraries, and linker options.
         mpireconfig Makefile
     Note that you must have 'mpireconfig' in your PATH.

  7. Sample Makefile.in

         ##### User configurable options #####
         ARCH        = @ARCH@
         COMM        = @COMM@
         INSTALL_DIR = @INSTALL_DIR@
         CC          = @CC@
         F77         = @F77@
         CLINKER     = @CLINKER@
         FLINKER     = @FLINKER@
         OPTFLAGS    = @OPTFLAGS@
         #
         LIB_PATH    = -L$(INSTALL_DIR)/lib/$(ARCH)/$(COMM)
         FLIB_PATH   = @FLIB_PATH_LEADER@$(INSTALL_DIR)/lib/$(ARCH)/$(COMM)
         LIB_LIST    = @LIB_LIST@
         #
         INCLUDE_DIR = @INCLUDE_PATH@ -I$(INSTALL_DIR)/include
         ### End User configurable options ###

  8. Sample Makefile.in (con't)

         CFLAGS  = @CFLAGS@ $(OPTFLAGS) $(INCLUDE_DIR) -DMPI_$(ARCH)
         FFLAGS  = @FFLAGS@ $(INCLUDE_DIR) $(OPTFLAGS)
         LIBS    = $(LIB_PATH) $(LIB_LIST)
         FLIBS   = $(FLIB_PATH) $(LIB_LIST)
         EXECS   = hello

         default: hello

         all: $(EXECS)

         hello: hello.o $(INSTALL_DIR)/include/mpi.h
                 $(CLINKER) $(OPTFLAGS) -o hello hello.o \
                   $(LIB_PATH) $(LIB_LIST) -lm

         clean:
                 /bin/rm -f *.o *~ PI* $(EXECS)

         .c.o:
                 $(CC) $(CFLAGS) -c $*.c
         .f.o:
                 $(F77) $(FFLAGS) -c $*.f

  9. Running MPI programs
         mpirun -np 2 hello
     'mpirun' is not part of the standard, but some version of it is common with several MPI implementations. The version shown here is for the MPICH implementation of MPI.
     Just as Fortran does not specify how Fortran programs are started, MPI does not specify how MPI programs are started.
     The option -t shows the commands that mpirun would execute; you can use this to find out how mpirun starts programs on your system. The option -help shows all options to mpirun.

  10. Finding out about the environment
      Two of the first questions asked in a parallel program are: How many processes are there? and Who am I?
      "How many" is answered with MPI_Comm_size and "who am I" is answered with MPI_Comm_rank. The rank is a number between zero and size-1.

  11. A simple program

          #include "mpi.h"
          #include <stdio.h>

          int main( argc, argv )
          int argc;
          char **argv;
          {
              int rank, size;
              MPI_Init( &argc, &argv );
              MPI_Comm_rank( MPI_COMM_WORLD, &rank );
              MPI_Comm_size( MPI_COMM_WORLD, &size );
              printf( "Hello world! I'm %d of %d\n", rank, size );
              MPI_Finalize();
              return 0;
          }

  12. Caveats
      These sample programs have been kept as simple as possible by assuming that all processes can do output. Not all parallel systems provide this feature, and MPI provides a way to handle this case.

  13. Exercise - Getting Started
      Objective: Learn how to login, write, compile, and run a simple MPI program.
      Run the "Hello world" programs. Try two different parallel computers. What does the output look like?

  14. Sending and Receiving messages
      [Figure: Process 0 holds buffer A and calls Send; Process 1 holds buffer B and calls Recv.]
      Questions:
      - To whom is data sent?
      - What is sent?
      - How does the receiver identify it?

  15. Current Message-Passing
      A typical blocking send looks like
          send( dest, type, address, length )
      where
      - dest is an integer identifier representing the process to receive the message.
      - type is a nonnegative integer that the destination can use to selectively screen messages.
      - (address, length) describes a contiguous area in memory containing the message to be sent.
      A typical global operation looks like:
          broadcast( type, address, length )
      All of these specifications are a good match to hardware, easy to understand, but too inflexible.

  16. The Buffer
      Sending and receiving only a contiguous array of bytes:
      - hides the real data structure from hardware which might be able to handle it directly
      - requires pre-packing dispersed data
        - rows of a matrix stored columnwise
        - general collections of structures
      - prevents communications between machines with different representations (even lengths) for the same data type

  17. Generalizing the Buffer Description
      Specified in MPI by starting address, datatype, and count, where datatype is:
      - elementary (all C and Fortran datatypes)
      - contiguous array of datatypes
      - strided blocks of datatypes
      - indexed array of blocks of datatypes
      - general structure
      Datatypes are constructed recursively.
      - Specification of elementary datatypes allows heterogeneous communication.
      - Elimination of length in favor of count is clearer.
      - Specifying application-oriented layout of data allows maximal use of special hardware.

  18. Generalizing the Type
      A single type field is too constraining. Often overloaded to provide needed flexibility.
      Problems:
      - under user control
      - wild cards allowed (MPI_ANY_TAG)
      - library use conflicts with user and with other libraries

  19. Sample Program using Library Calls
      Sub1 and Sub2 are from different libraries:
          Sub1();
          Sub2();
      Sub1a and Sub1b are from the same library:
          Sub1a();
          Sub2();
          Sub1b();
      Thanks to Marc Snir for the following four examples.

  20. Correct Execution of Library Calls
      [Figure: time lines for Process 0, Process 1, and Process 2 showing the send and recv(any) operations issued inside Sub1 and Sub2; each receive matches the message intended for it.]

  21. Incorrect Execution of Library Calls
      [Figure: the same time lines for Process 0, Process 1, and Process 2, but a wildcard recv(any) in one routine can match a message intended for the other, so the execution is incorrect.]

  22. Correct Execution of Library Calls with Pending Communication
      [Figure: time lines for Process 0, Process 1, and Process 2 with the library call split into Sub1a and Sub1b; a receive posted in Sub1a is still pending while Sub2 runs, and all messages are matched correctly.]

  23. Incorrect Execution of Library Calls with Pending Communication
      [Figure: the same time lines with Sub1a and Sub1b, but the pending wildcard receive intercepts a message intended for Sub2, so the execution is incorrect.]

  24. Solution to the type problem
      - A separate communication context for each family of messages, used for queueing and matching. (This has often been simulated in the past by overloading the tag field.)
      - No wild cards allowed, for security.
      - Allocated by the system, for security.
      - Types (tags, in MPI) retained for normal use (wild cards OK).

  25. Delimiting Scope of Communication
      - Separate groups of processes working on subproblems
        - Merging of process name space interferes with modularity
        - "Local" process identifiers desirable
      - Parallel invocation of parallel libraries
        - Messages from application must be kept separate from messages internal to library
        - Knowledge of library message types interferes with modularity
        - Synchronizing before and after library calls is undesirable

  26. Generalizing the Process Identifier
      - Collective operations typically operated on all processes (although some systems provide subgroups). This is too restrictive (e.g., need minimum over a column or a sum across a row of processes).
      - MPI provides groups of processes:
        - initial "all" group
        - group management routines (build, delete groups)
      - All communication (not just collective operations) takes place in groups.
      - A group and a context are combined in a communicator.
      - Source/destination in send/receive operations refer to rank in the group associated with a given communicator; MPI_ANY_SOURCE is permitted in a receive.

  27. MPI Basic Send/Receive
      Thus the basic (blocking) send has become:
          MPI_Send( start, count, datatype, dest, tag, comm )
      and the receive:
          MPI_Recv( start, count, datatype, source, tag, comm, status )
      The source, tag, and count of the message actually received can be retrieved from status.
      Two simple collective operations:
          MPI_Bcast( start, count, datatype, root, comm )
          MPI_Reduce( start, result, count, datatype, operation, root, comm )
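
      The slides defer a complete C send/receive program until later; the following is a minimal sketch (not from the tutorial) of how the two calls fit together, assuming at least two processes in MPI_COMM_WORLD and an arbitrary tag of 99.

          #include "mpi.h"
          #include <stdio.h>

          int main( int argc, char **argv )
          {
              int rank, value = 0;
              MPI_Status status;

              MPI_Init( &argc, &argv );
              MPI_Comm_rank( MPI_COMM_WORLD, &rank );
              if (rank == 0) {
                  value = 42;                      /* arbitrary payload */
                  MPI_Send( &value, 1, MPI_INT, 1, 99, MPI_COMM_WORLD );
              } else if (rank == 1) {
                  MPI_Recv( &value, 1, MPI_INT, 0, 99, MPI_COMM_WORLD, &status );
                  printf( "Process 1 received %d\n", value );
              }
              MPI_Finalize();
              return 0;
          }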

  28. Getting information about a message

          MPI_Status status;
          MPI_Recv( ..., &status );
          ... status.MPI_TAG;
          ... status.MPI_SOURCE;
          MPI_Get_count( &status, datatype, &count );

      MPI_TAG and MPI_SOURCE are primarily of use when MPI_ANY_TAG and/or MPI_ANY_SOURCE were used in the receive.
      MPI_Get_count may be used to determine how much data of a particular type was received.
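
      A minimal sketch (not from the slides) that puts these pieces together: process 0 receives one message from every other process using wildcards, then inspects the status and message length.

          #include "mpi.h"
          #include <stdio.h>

          int main( int argc, char **argv )
          {
              int rank, size, i, count, buf[10];
              MPI_Status status;

              MPI_Init( &argc, &argv );
              MPI_Comm_rank( MPI_COMM_WORLD, &rank );
              MPI_Comm_size( MPI_COMM_WORLD, &size );
              if (rank == 0) {
                  for (i = 1; i < size; i++) {
                      /* Receive from any source, with any tag, up to 10 ints */
                      MPI_Recv( buf, 10, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                                MPI_COMM_WORLD, &status );
                      MPI_Get_count( &status, MPI_INT, &count );
                      printf( "Got %d ints from process %d with tag %d\n",
                              count, status.MPI_SOURCE, status.MPI_TAG );
                  }
              } else {
                  buf[0] = rank;
                  MPI_Send( buf, 1, MPI_INT, 0, rank, MPI_COMM_WORLD );
              }
              MPI_Finalize();
              return 0;
          }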

  29. Simple Fortran example

          program main
          include 'mpif.h'

          integer rank, size, to, from, tag, count, i, ierr
          integer src, dest
          integer st_source, st_tag, st_count
          integer status(MPI_STATUS_SIZE)
          double precision data(100)

          call MPI_INIT( ierr )
          call MPI_COMM_RANK( MPI_COMM_WORLD, rank, ierr )
          call MPI_COMM_SIZE( MPI_COMM_WORLD, size, ierr )
          print *, 'Process ', rank, ' of ', size, ' is alive'
          dest = size - 1
          src  = 0
    C
          if (rank .eq. src) then
             to    = dest
             count = 10
             tag   = 2001
             do 10 i=1, 10
    10          data(i) = i
             call MPI_SEND( data, count, MPI_DOUBLE_PRECISION, to,
         +                  tag, MPI_COMM_WORLD, ierr )
          else if (rank .eq. dest) then
             tag   = MPI_ANY_TAG
             count = 10
             from  = MPI_ANY_SOURCE
             call MPI_RECV( data, count, MPI_DOUBLE_PRECISION, from,
         +                  tag, MPI_COMM_WORLD, status, ierr )

  30. Simple Fortran example (cont.)

             call MPI_GET_COUNT( status, MPI_DOUBLE_PRECISION,
         +                       st_count, ierr )
             st_source = status(MPI_SOURCE)
             st_tag    = status(MPI_TAG)
    C
             print *, 'Status info: source = ', st_source,
         +            ' tag = ', st_tag, ' count = ', st_count
             print *, rank, ' received', (data(i), i=1,10)
          endif

          call MPI_FINALIZE( ierr )
          end

  31. Six Function MPI
      MPI is very simple. These six functions allow you to write many programs:
          MPI_Init
          MPI_Finalize
          MPI_Comm_size
          MPI_Comm_rank
          MPI_Send
          MPI_Recv

  32. A taste of things to come
      The following examples show a C and Fortran version of the same program. This program computes PI (with a very simple method) but does not use MPI_Send and MPI_Recv. Instead, it uses collective operations to send data to and from all of the running processes. This gives a different six-function MPI set:
          MPI_Init
          MPI_Finalize
          MPI_Comm_size
          MPI_Comm_rank
          MPI_Bcast
          MPI_Reduce

  33. Broadcast and Reduction
      The routine MPI_Bcast sends data from one process to all others.
      The routine MPI_Reduce combines data from all processes (by adding them in this case) and returns the result to a single process.

  34. Fortran example: PI

          program main
          include 'mpif.h'

          double precision PI25DT
          parameter (PI25DT = 3.141592653589793238462643d0)

          double precision mypi, pi, h, sum, x, f, a
          integer n, myid, numprocs, i, rc
    c     function to integrate
          f(a) = 4.d0 / (1.d0 + a*a)

          call MPI_INIT( ierr )
          call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
          call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )

    10    if ( myid .eq. 0 ) then
             write(6,98)
    98       format('Enter the number of intervals: (0 quits)')
             read(5,99) n
    99       format(i10)
          endif
          call MPI_BCAST( n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr )

  35. Fortran example (cont.)

    c     check for quit signal
          if ( n .le. 0 ) goto 30
    c     calculate the interval size
          h = 1.0d0/n
          sum = 0.0d0
          do 20 i = myid+1, n, numprocs
             x = h * (dble(i) - 0.5d0)
             sum = sum + f(x)
    20    continue
          mypi = h * sum
    c     collect all the partial sums
          call MPI_REDUCE( mypi, pi, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0,
         +                 MPI_COMM_WORLD, ierr )
    c     node 0 prints the answer.
          if (myid .eq. 0) then
             write(6, 97) pi, abs(pi - PI25DT)
    97       format('  pi is approximately: ', F18.16,
         +          '  Error is: ', F18.16)
          endif
          goto 10
    30    call MPI_FINALIZE( rc )
          stop
          end

  36. C example: PI

          #include "mpi.h"
          #include <math.h>

          int main( argc, argv )
          int argc;
          char *argv[];
          {
              int done = 0, n, myid, numprocs, i, rc;
              double PI25DT = 3.141592653589793238462643;
              double mypi, pi, h, sum, x, a;

              MPI_Init( &argc, &argv );
              MPI_Comm_size( MPI_COMM_WORLD, &numprocs );
              MPI_Comm_rank( MPI_COMM_WORLD, &myid );

  37. C example (cont.)

              while (!done) {
                  if (myid == 0) {
                      printf( "Enter the number of intervals: (0 quits) " );
                      scanf( "%d", &n );
                  }
                  MPI_Bcast( &n, 1, MPI_INT, 0, MPI_COMM_WORLD );
                  if (n == 0) break;

                  h   = 1.0 / (double) n;
                  sum = 0.0;
                  for (i = myid + 1; i <= n; i += numprocs) {
                      x = h * ((double)i - 0.5);
                      sum += 4.0 / (1.0 + x*x);
                  }
                  mypi = h * sum;

                  MPI_Reduce( &mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0,
                              MPI_COMM_WORLD );

                  if (myid == 0)
                      printf( "pi is approximately %.16f, Error is %.16f\n",
                              pi, fabs(pi - PI25DT) );
              }
              MPI_Finalize();
          }

  38. Exercise - PI
      Objective: Experiment with send/receive
      Run either program for PI. Write new versions that replace the calls to MPI_Bcast and MPI_Reduce with MPI_Send and MPI_Recv.
      - The MPI broadcast and reduce operations use at most log p send and receive operations on each process, where p is the size of MPI_COMM_WORLD. How many operations do your versions use?

  39. Exercise - Ring
      Objective: Experiment with send/receive
      Write a program to send a message around a ring of processors. That is, processor 0 sends to processor 1, who sends to processor 2, etc. The last processor returns the message to processor 0.
      - You can use the routine MPI_Wtime to time code in MPI. The statement
            t = MPI_Wtime();
        returns the time as a double (DOUBLE PRECISION in Fortran).

  40. Topologies
      MPI provides routines to provide structure to collections of processes.
      This helps to answer the question: Who are my neighbors?

  41. Cartesian Topologies
      A Cartesian topology is a mesh.
      Example of a 4 x 3 Cartesian mesh with arrows pointing at the right neighbors:
      [Figure: grid of processes labeled (0,0) through (3,2), each with an arrow to its right neighbor.]

  42. Defining a Cartesian Topology
      The routine MPI_Cart_create creates a Cartesian decomposition of the processes, with the number of dimensions given by the ndim argument.

          dims(1)    = 4
          dims(2)    = 3
          periods(1) = .false.
          periods(2) = .false.
          reorder    = .true.
          ndim       = 2
          call MPI_CART_CREATE( MPI_COMM_WORLD, ndim, dims,
         +                      periods, reorder, comm2d, ierr )

  43. Finding neighbors
      MPI_Cart_create creates a new communicator with the same processes as the input communicator, but with the specified topology.
      The question "Who are my neighbors?" can now be answered with MPI_Cart_shift:

          call MPI_CART_SHIFT( comm2d, 0, 1, nbrleft, nbrright, ierr )
          call MPI_CART_SHIFT( comm2d, 1, 1, nbrbottom, nbrtop, ierr )

      The values returned are the ranks, in the communicator comm2d, of the neighbors shifted by ±1 in the two dimensions.

  44. Who am I?
      Can be answered with

          integer coords(2)
          call MPI_COMM_RANK( comm2d, myrank, ierr )
          call MPI_CART_COORDS( comm2d, myrank, 2, coords, ierr )

      Returns the Cartesian coordinates of the calling process in coords.

  45. Partitioning
      When creating a Cartesian topology, one question is "What is a good choice for the decomposition of the processors?"
      This question can be answered with MPI_Dims_create:

          integer dims(2)
          dims(1) = 0
          dims(2) = 0
          call MPI_COMM_SIZE( MPI_COMM_WORLD, size, ierr )
          call MPI_DIMS_CREATE( size, 2, dims, ierr )
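
      The topology slides use the Fortran bindings; the following is a minimal C sketch (not from the tutorial) that combines MPI_Dims_create, MPI_Cart_create, MPI_Cart_shift, and MPI_Cart_coords for a non-periodic 2-D grid sized to however many processes are available.

          #include "mpi.h"
          #include <stdio.h>

          int main( int argc, char **argv )
          {
              int size, myrank, coords[2];
              int nbrleft, nbrright, nbrbottom, nbrtop;
              int dims[2]    = {0, 0};   /* let MPI choose both dimensions */
              int periods[2] = {0, 0};   /* non-periodic in both dimensions */
              MPI_Comm comm2d;

              MPI_Init( &argc, &argv );
              MPI_Comm_size( MPI_COMM_WORLD, &size );

              MPI_Dims_create( size, 2, dims );
              MPI_Cart_create( MPI_COMM_WORLD, 2, dims, periods, 1, &comm2d );

              MPI_Comm_rank( comm2d, &myrank );
              MPI_Cart_coords( comm2d, myrank, 2, coords );
              /* at the edges, missing neighbors are returned as MPI_PROC_NULL */
              MPI_Cart_shift( comm2d, 0, 1, &nbrleft, &nbrright );
              MPI_Cart_shift( comm2d, 1, 1, &nbrbottom, &nbrtop );

              printf( "Rank %d at (%d,%d): left/right = %d/%d, bottom/top = %d/%d\n",
                      myrank, coords[0], coords[1],
                      nbrleft, nbrright, nbrbottom, nbrtop );

              MPI_Comm_free( &comm2d );
              MPI_Finalize();
              return 0;
          }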

  46. Other Topology Routines
      MPI contains routines to translate between Cartesian coordinates and ranks in a communicator, and to access the properties of a Cartesian topology.
      The routine MPI_Graph_create allows the creation of a general graph topology.

  47. Why are these routines in MPI?
      In many parallel computer interconnects, some processors are closer to each other than others. These routines allow the MPI implementation to provide an ordering of processes in a topology that makes logical neighbors close in the physical interconnect.
      - Some parallel programmers may remember hypercubes and the effort that went into assigning nodes in a mesh to processors in a hypercube through the use of Gray codes. Many new systems have different interconnects; ones with multiple paths may have notions of near neighbors that change with time. These routines free the programmer from many of these considerations.
      - The reorder argument is used to request the best ordering.

  48. The periods argument
      Who are my neighbors if I am at the edge of a Cartesian mesh?

  49. Periodic Grids
      Specify this in MPI_Cart_create with

          dims(1)    = 4
          dims(2)    = 3
          periods(1) = .TRUE.
          periods(2) = .TRUE.
          reorder    = .true.
          ndim       = 2
          call MPI_CART_CREATE( MPI_COMM_WORLD, ndim, dims,
         +                      periods, reorder, comm2d, ierr )

  50. Nonperiodic Grids
      In the nonperiodic case, a neighbor may not exist. This is indicated by a rank of MPI_PROC_NULL.
      This rank may be used in send and receive calls in MPI. The action in both cases is as if the call was not made.
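
      A minimal sketch (not from the slides) of why this is convenient: a shift along a non-periodic grid can use the ranks returned by MPI_Cart_shift directly, because sends and receives involving MPI_PROC_NULL simply do nothing.

          #include "mpi.h"
          #include <stdio.h>

          int main( int argc, char **argv )
          {
              int size, rank, left, right, sendval, recvval = -1;
              int dims[1], periods[1] = {0};   /* one non-periodic dimension */
              MPI_Comm comm1d;
              MPI_Status status;

              MPI_Init( &argc, &argv );
              MPI_Comm_size( MPI_COMM_WORLD, &size );
              dims[0] = size;
              MPI_Cart_create( MPI_COMM_WORLD, 1, dims, periods, 1, &comm1d );
              MPI_Comm_rank( comm1d, &rank );

              /* left neighbor of rank 0 and right neighbor of the last rank
                 are MPI_PROC_NULL */
              MPI_Cart_shift( comm1d, 0, 1, &left, &right );

              /* shift a value to the right; no special case at the edges */
              sendval = rank;
              MPI_Sendrecv( &sendval, 1, MPI_INT, right, 0,
                            &recvval, 1, MPI_INT, left,  0, comm1d, &status );
              printf( "Rank %d received %d (-1 means nothing received)\n",
                      rank, recvval );

              MPI_Comm_free( &comm1d );
              MPI_Finalize();
              return 0;
          }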

  51. Collective Communications in MPI
      - Communication is coordinated among a group of processes.
      - Groups can be constructed "by hand" with MPI group-manipulation routines or by using MPI topology-definition routines.
      - Message tags are not used. Different communicators are used instead.
      - No non-blocking collective operations.
      - Three classes of collective operations:
        - synchronization
        - data movement
        - collective computation

  52. Synchronization
          MPI_Barrier( comm )
      Function blocks until all processes in comm call it.
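
      A common use (not shown in the slides) is to separate timing phases; a minimal sketch, using the MPI_Wtime routine mentioned earlier and a placeholder do_work routine standing in for the code being timed:

          #include "mpi.h"
          #include <stdio.h>

          /* hypothetical stand-in for the code being timed */
          static void do_work( void ) { /* ... */ }

          int main( int argc, char **argv )
          {
              int rank;
              double t0, t1;

              MPI_Init( &argc, &argv );
              MPI_Comm_rank( MPI_COMM_WORLD, &rank );

              MPI_Barrier( MPI_COMM_WORLD );   /* everyone starts together */
              t0 = MPI_Wtime();
              do_work();
              MPI_Barrier( MPI_COMM_WORLD );   /* wait for the slowest process */
              t1 = MPI_Wtime();

              if (rank == 0)
                  printf( "Elapsed time: %f seconds\n", t1 - t0 );

              MPI_Finalize();
              return 0;
          }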

  53. Available Collective Patterns
      [Figure: schematic representation of collective data movement in MPI among processes P0-P3, showing Broadcast, Scatter/Gather, Allgather, and Alltoall.]

  54. Available Collective Computation Patterns
      [Figure: schematic representation of collective computation in MPI among processes P0-P3, showing Reduce (one process gets the combination of all contributions) and Scan (each process gets the prefix combination up to its own contribution).]

  55. MPI Collective Routines
      - Many routines:
            Allgather  Allgatherv  Allreduce  Alltoall  Alltoallv
            Bcast      Gather      Gatherv    Reduce    ReduceScatter
            Scan       Scatter     Scatterv
      - "All" versions deliver results to all participating processes.
      - "V" versions allow the chunks to have different sizes.
      - Allreduce, Reduce, ReduceScatter, and Scan take both built-in and user-defined combination functions.
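
      As an illustration (not from the slides), a minimal sketch in which the root scatters one integer to each process and gathers the doubled values back:

          #include "mpi.h"
          #include <stdio.h>
          #include <stdlib.h>

          int main( int argc, char **argv )
          {
              int rank, size, i, myval;
              int *sendbuf = NULL, *recvbuf = NULL;

              MPI_Init( &argc, &argv );
              MPI_Comm_rank( MPI_COMM_WORLD, &rank );
              MPI_Comm_size( MPI_COMM_WORLD, &size );

              if (rank == 0) {
                  sendbuf = (int *) malloc( size * sizeof(int) );
                  recvbuf = (int *) malloc( size * sizeof(int) );
                  for (i = 0; i < size; i++) sendbuf[i] = i;
              }

              /* each process gets one element of sendbuf from the root */
              MPI_Scatter( sendbuf, 1, MPI_INT, &myval, 1, MPI_INT,
                           0, MPI_COMM_WORLD );
              myval *= 2;
              /* the root collects one element from every process */
              MPI_Gather( &myval, 1, MPI_INT, recvbuf, 1, MPI_INT,
                          0, MPI_COMM_WORLD );

              if (rank == 0) {
                  for (i = 0; i < size; i++)
                      printf( "recvbuf[%d] = %d\n", i, recvbuf[i] );
                  free( sendbuf ); free( recvbuf );
              }
              MPI_Finalize();
              return 0;
          }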

  56. Built-in Collective Computation Operations

      MPI Name      Operation
      MPI_MAX       Maximum
      MPI_MIN       Minimum
      MPI_PROD      Product
      MPI_SUM       Sum
      MPI_LAND      Logical and
      MPI_LOR       Logical or
      MPI_LXOR      Logical exclusive or (xor)
      MPI_BAND      Bitwise and
      MPI_BOR       Bitwise or
      MPI_BXOR      Bitwise xor
      MPI_MAXLOC    Maximum value and location
      MPI_MINLOC    Minimum value and location
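
      MPI_MAXLOC and MPI_MINLOC operate on value/index pairs; a minimal sketch (not from the slides) using the predefined MPI_DOUBLE_INT pair type:

          #include "mpi.h"
          #include <stdio.h>

          /* MPI_DOUBLE_INT expects a double followed by an int */
          struct valrank { double value; int rank; };

          int main( int argc, char **argv )
          {
              int rank;
              struct valrank in, out;

              MPI_Init( &argc, &argv );
              MPI_Comm_rank( MPI_COMM_WORLD, &rank );

              in.value = (double) ((rank * 7) % 5);  /* some per-process value */
              in.rank  = rank;                       /* "location" carried along */

              /* out.value is the global maximum; out.rank is who owned it */
              MPI_Reduce( &in, &out, 1, MPI_DOUBLE_INT, MPI_MAXLOC,
                          0, MPI_COMM_WORLD );

              if (rank == 0)
                  printf( "Maximum value %f found on process %d\n",
                          out.value, out.rank );

              MPI_Finalize();
              return 0;
          }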

  57. Defining Your Own Collective Operations

          MPI_Op_create( user_function, commute, op )
          MPI_Op_free( op )

          user_function( invec, inoutvec, len, datatype )

      The user function should perform
          inoutvec[i] = invec[i] op inoutvec[i]
      for i from 0 to len-1.
      user_function can be non-commutative (e.g., matrix multiply).
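
      The next slide gives a Fortran user function; for comparison, here is a minimal C sketch (not from the slides) of a user-defined sum over doubles:

          #include "mpi.h"
          #include <stdio.h>

          /* user combination function: inoutvec[i] = invec[i] + inoutvec[i] */
          void mysum( void *invec, void *inoutvec, int *len, MPI_Datatype *datatype )
          {
              double *in = (double *) invec, *inout = (double *) inoutvec;
              int i;
              for (i = 0; i < *len; i++)
                  inout[i] = in[i] + inout[i];
          }

          int main( int argc, char **argv )
          {
              int rank;
              double a, b;
              MPI_Op myop;

              MPI_Init( &argc, &argv );
              MPI_Comm_rank( MPI_COMM_WORLD, &rank );

              MPI_Op_create( mysum, 1, &myop );   /* 1 = the operation commutes */
              a = (double) rank;
              MPI_Reduce( &a, &b, 1, MPI_DOUBLE, myop, 0, MPI_COMM_WORLD );
              if (rank == 0)
                  printf( "Sum of ranks = %f\n", b );
              MPI_Op_free( &myop );

              MPI_Finalize();
              return 0;
          }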

  58. Sample user function
      For example, to create an operation that has the same effect as MPI_SUM on Fortran double precision values, use

          subroutine myfunc( invec, inoutvec, len, datatype )
          integer len, datatype
          double precision invec(len), inoutvec(len)
          integer i
          do 10 i=1,len
    10       inoutvec(i) = invec(i) + inoutvec(i)
          return
          end

      To use, just

          integer myop
          call MPI_Op_create( myfunc, .true., myop, ierr )
          call MPI_Reduce( a, b, 1, MPI_DOUBLE_PRECISION, myop, ... )

      The routine MPI_Op_free destroys user functions when they are no longer needed.

  59. Defining groups
      All MPI communication is relative to a communicator, which contains a context and a group.
      The group is just a set of processes.

  60. Subdividing a communicator
      The easiest way to create communicators with new groups is with MPI_COMM_SPLIT.
      For example, to form groups of rows of processes
      [Figure: a grid of processes with Rows 0-2 and Columns 0-4, grouped by row.]
      use
          MPI_Comm_split( oldcomm, row, 0, &newcomm );
      To maintain the order by rank, use
          MPI_Comm_rank( oldcomm, &rank );
          MPI_Comm_split( oldcomm, row, rank, &newcomm );

  61. Subdividing (con't)
      Similarly, to form groups of columns,
      [Figure: the same grid of processes with Rows 0-2 and Columns 0-4, now grouped by column.]
      use
          MPI_Comm_split( oldcomm, column, 0, &newcomm2 );
      To maintain the order by rank, use
          MPI_Comm_rank( oldcomm, &rank );
          MPI_Comm_split( oldcomm, column, rank, &newcomm2 );

  62. Manipulating Groups
      Another way to create a communicator with specific members is to use MPI_Comm_create:
          MPI_Comm_create( oldcomm, group, &newcomm );
      The group can be created in many ways.

  63. Creating Groups
      All group creation routines create a group by specifying the members to take from an existing group.
      - MPI_Group_incl specifies specific members
      - MPI_Group_excl excludes specific members
      - MPI_Group_range_incl and MPI_Group_range_excl use ranges of members
      - MPI_Group_union and MPI_Group_intersection create a new group from two existing groups
      To get an existing group, use
          MPI_Comm_group( oldcomm, &group );
      Free a group with
          MPI_Group_free( &group );
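
      Putting these pieces together, a minimal sketch (not from the slides) that builds a communicator containing only the even-ranked processes:

          #include "mpi.h"
          #include <stdio.h>
          #include <stdlib.h>

          int main( int argc, char **argv )
          {
              int rank, size, i, nevens, *ranks;
              MPI_Group worldgroup, evengroup;
              MPI_Comm evencomm;

              MPI_Init( &argc, &argv );
              MPI_Comm_rank( MPI_COMM_WORLD, &rank );
              MPI_Comm_size( MPI_COMM_WORLD, &size );

              /* list the even ranks of MPI_COMM_WORLD */
              nevens = (size + 1) / 2;
              ranks = (int *) malloc( nevens * sizeof(int) );
              for (i = 0; i < nevens; i++) ranks[i] = 2 * i;

              MPI_Comm_group( MPI_COMM_WORLD, &worldgroup );
              MPI_Group_incl( worldgroup, nevens, ranks, &evengroup );
              MPI_Comm_create( MPI_COMM_WORLD, evengroup, &evencomm );

              /* processes not in the group get MPI_COMM_NULL */
              if (evencomm != MPI_COMM_NULL) {
                  int newrank;
                  MPI_Comm_rank( evencomm, &newrank );
                  printf( "World rank %d has rank %d among the evens\n",
                          rank, newrank );
                  MPI_Comm_free( &evencomm );
              }

              MPI_Group_free( &evengroup );
              MPI_Group_free( &worldgroup );
              free( ranks );
              MPI_Finalize();
              return 0;
          }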

  64. Buffering issues
      Where does data go when you send it? One possibility is:
      [Figure: buffer A on Process 1 is copied into a local buffer, across the network into a local buffer on Process 2, and then into buffer B.]

  65. Better buffering
      This is not very efficient. There are three copies in addition to the exchange of data between processes. We prefer
      [Figure: buffer A on Process 1 is transferred directly into buffer B on Process 2.]
      But this requires either that MPI_Send not return until the data has been delivered, or that we allow a send operation to return before completing the transfer. In this case, we need to test for completion later.

  66. Blocking and Non-Blocking communication
      So far we have used blocking communication:
      - MPI_Send does not complete until the buffer is empty (available for reuse).
      - MPI_Recv does not complete until the buffer is full (available for use).
      Simple, but can be "unsafe":
          Process 0        Process 1
          Send(1)          Send(0)
          Recv(1)          Recv(0)
      Completion depends in general on size of message and amount of system buffering.
      - Send works for small enough messages but fails when messages get too large. "Too large" ranges from zero bytes to 100's of Megabytes.

  67. Some Solutions to the "Unsafe" Problem
      - Order the operations more carefully:
            Process 0        Process 1
            Send(1)          Recv(0)
            Recv(1)          Send(0)
      - Supply receive buffer at same time as send, with MPI_Sendrecv:
            Process 0        Process 1
            Sendrecv(1)      Sendrecv(0)
      - Use non-blocking operations:
            Process 0        Process 1
            Isend(1)         Isend(0)
            Irecv(1)         Irecv(0)
            Waitall          Waitall
      - Use MPI_Bsend.
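
      A minimal sketch (not from the slides) of the MPI_Sendrecv variant, in which each process exchanges a value with a partner without worrying about send/receive ordering; it assumes an even number of processes.

          #include "mpi.h"
          #include <stdio.h>

          int main( int argc, char **argv )
          {
              int rank, partner, sendval, recvval;
              MPI_Status status;

              MPI_Init( &argc, &argv );
              MPI_Comm_rank( MPI_COMM_WORLD, &rank );

              /* pair up neighbors: 0<->1, 2<->3, ... (assumes even size) */
              partner = (rank % 2 == 0) ? rank + 1 : rank - 1;

              sendval = rank;
              MPI_Sendrecv( &sendval, 1, MPI_INT, partner, 0,
                            &recvval, 1, MPI_INT, partner, 0,
                            MPI_COMM_WORLD, &status );
              printf( "Process %d got %d from process %d\n",
                      rank, recvval, partner );

              MPI_Finalize();
              return 0;
          }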

  68. MPI's Non-Blocking Operations
      Non-blocking operations return (immediately) "request handles" that can be waited on and queried:
          MPI_Isend( start, count, datatype, dest, tag, comm, request )
          MPI_Irecv( start, count, datatype, source, tag, comm, request )
          MPI_Wait( request, status )
      One can also test without waiting:
          MPI_Test( request, flag, status )
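
      A minimal sketch (not from the slides) of the non-blocking exchange from the previous slide: each process posts both operations on a ring and then waits for them together, so neither side can deadlock.

          #include "mpi.h"
          #include <stdio.h>

          int main( int argc, char **argv )
          {
              int rank, size, next, prev, sendval, recvval;
              MPI_Request requests[2];
              MPI_Status statuses[2];

              MPI_Init( &argc, &argv );
              MPI_Comm_rank( MPI_COMM_WORLD, &rank );
              MPI_Comm_size( MPI_COMM_WORLD, &size );

              /* send to the next process around a ring, receive from the previous */
              next = (rank + 1) % size;
              prev = (rank - 1 + size) % size;

              sendval = rank;
              MPI_Irecv( &recvval, 1, MPI_INT, prev, 0, MPI_COMM_WORLD, &requests[0] );
              MPI_Isend( &sendval, 1, MPI_INT, next, 0, MPI_COMM_WORLD, &requests[1] );
              /* ... computation could overlap with the communication here ... */
              MPI_Waitall( 2, requests, statuses );

              printf( "Process %d received %d from process %d\n", rank, recvval, prev );

              MPI_Finalize();
              return 0;
          }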

  69. Multiple completions
      It is often desirable to wait on multiple requests. An example is a master/slave program, where the master waits for one or more slaves to send it a message.
          MPI_Waitall( count, array_of_requests, array_of_statuses )
          MPI_Waitany( count, array_of_requests, index, status )
          MPI_Waitsome( incount, array_of_requests, outcount, array_of_indices, array_of_statuses )
      There are corresponding versions of test for each of these.
      - MPI_WAITSOME and MPI_TESTSOME may be used to implement master/slave algorithms that provide fair access to the master by the slaves.

  70. Fairness
      What happens with this program?

          #include "mpi.h"
          #include <stdio.h>
          int main( argc, argv )
          int argc;
          char **argv;
          {
              int rank, size, i, buf[1];
              MPI_Status status;

              MPI_Init( &argc, &argv );
              MPI_Comm_rank( MPI_COMM_WORLD, &rank );
              MPI_Comm_size( MPI_COMM_WORLD, &size );
              if (rank == 0) {
                  for (i = 0; i < 100*(size-1); i++) {
                      MPI_Recv( buf, 1, MPI_INT, MPI_ANY_SOURCE,
                                MPI_ANY_TAG, MPI_COMM_WORLD, &status );
                      printf( "Msg from %d with tag %d\n",
                              status.MPI_SOURCE, status.MPI_TAG );
                  }
              }
              else {
                  for (i = 0; i < 100; i++)
                      MPI_Send( buf, 1, MPI_INT, 0, i, MPI_COMM_WORLD );
              }
              MPI_Finalize();
              return 0;
          }

  71. Fairness in message-passing
      A parallel algorithm is fair if no process is effectively ignored. In the preceding program, processes with low rank (like process zero) may be the only ones whose messages are received.
      MPI makes no guarantees about fairness. However, MPI makes it possible to write efficient, fair programs.

  72. Providing Fairness
      One alternative is

          #define large 128
          MPI_Request requests[large];
          MPI_Status  statuses[large];
          int         indices[large];
          int         buf[large];
          for (i = 1; i < size; i++)
              MPI_Irecv( buf+i, 1, MPI_INT, i, MPI_ANY_TAG,
                         MPI_COMM_WORLD, &requests[i-1] );
          while (not done) {
              MPI_Waitsome( size-1, requests, &ndone, indices, statuses );
              for (i = 0; i < ndone; i++) {
                  j = indices[i];
                  printf( "Msg from %d with tag %d\n",
                          statuses[i].MPI_SOURCE, statuses[i].MPI_TAG );
                  MPI_Irecv( buf+j+1, 1, MPI_INT, j+1, MPI_ANY_TAG,
                             MPI_COMM_WORLD, &requests[j] );
              }
          }

  73. Providing Fairness (Fortran)
      One alternative is

              parameter( large = 128 )
              integer requests(large)
              integer statuses(MPI_STATUS_SIZE,large)
              integer indices(large)
              integer buf(large)
              logical done
              do 10 i = 1,size-1
    10           call MPI_Irecv( buf(i), 1, MPI_INTEGER, i,
         +            MPI_ANY_TAG, MPI_COMM_WORLD, requests(i), ierr )
    20        if (.not. done) then
                 call MPI_Waitsome( size-1, requests, ndone,
         +            indices, statuses, ierr )
                 do 30 i=1, ndone
                    j = indices(i)
                    print *, 'Msg from ', statuses(MPI_SOURCE,i),
         +                   ' with tag', statuses(MPI_TAG,i)
                    call MPI_Irecv( buf(j), 1, MPI_INTEGER, j,
         +               MPI_ANY_TAG, MPI_COMM_WORLD, requests(j), ierr )
                    done = ...
    30           continue
                 goto 20
              endif

  74. Exercise - Fairness
      Objective: Use nonblocking communications
      Complete the program fragment on "providing fairness". Make sure that you leave no uncompleted requests. How would you test your program?

  75. More on nonblocking communication
      In applications where the time to send data between processes is large, it is often helpful to cause communication and computation to overlap. This can easily be done with MPI's non-blocking routines.
      For example, in a 2-D finite difference mesh, moving data needed for the boundaries can be done at the same time as computation on the interior.

          MPI_Irecv( ... each ghost edge ... );
          MPI_Isend( ... data for each ghost edge ... );
          ... compute on interior
          while (still some uncompleted requests) {
              MPI_Waitany( ... requests ... )
              if (request is a receive)
                  ... compute on that edge ...
          }

      Note that we call MPI_Waitany several times. This exploits the fact that after a request is satisfied, it is set to MPI_REQUEST_NULL, and that this is a valid request object to the wait and test routines.
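
      The fragment above is schematic; a slightly more concrete sketch of the Waitany loop (still not from the slides) follows. The routines compute_interior and compute_edge, the request arrays, and NEDGES are hypothetical placeholders for application code; it assumes the MPI_Irecv/MPI_Isend calls for each ghost edge have already been posted.

          #include "mpi.h"

          #define NEDGES 4   /* hypothetical: one ghost edge per side of the mesh */

          /* hypothetical stand-ins for the application's stencil updates */
          static void compute_interior( void ) { /* ... */ }
          static void compute_edge( int edge )  { /* ... */ }

          void exchange_and_compute( MPI_Request recv_reqs[NEDGES],
                                     MPI_Request send_reqs[NEDGES] )
          {
              MPI_Request all[2*NEDGES];
              MPI_Status  status;
              int i, index;

              /* receives first, then sends, in one array for MPI_Waitany */
              for (i = 0; i < NEDGES; i++) {
                  all[i]          = recv_reqs[i];
                  all[NEDGES + i] = send_reqs[i];
              }

              compute_interior();   /* overlaps with the communication */

              for (i = 0; i < 2*NEDGES; i++) {
                  /* completed requests become MPI_REQUEST_NULL, so repeated
                     calls simply skip them */
                  MPI_Waitany( 2*NEDGES, all, &index, &status );
                  if (index != MPI_UNDEFINED && index < NEDGES)
                      compute_edge( index );   /* this edge's data has arrived */
              }
          }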

  76. Communication Modes
      MPI provides multiple modes for sending messages:
      - Synchronous mode (MPI_Ssend): the send does not complete until a matching receive has begun. (Unsafe programs become incorrect and usually deadlock within an MPI_Ssend.)
      - Buffered mode (MPI_Bsend): the user supplies the buffer to the system for its use. (User supplies enough memory to make an unsafe program safe.)
      - Ready mode (MPI_Rsend): user guarantees that the matching receive has been posted.
        - allows access to fast protocols
        - undefined behavior if the matching receive is not posted
      Non-blocking versions: MPI_Issend, MPI_Irsend, MPI_Ibsend.
      Note that an MPI_Recv may receive messages sent with any send mode.

  77. Buffered Send
      MPI provides a send routine that may be used when MPI_Isend is awkward to use (e.g., lots of small messages). MPI_Bsend makes use of a user-provided buffer to save any messages that can not be immediately sent.

          int bufsize;
          char *buf = malloc( bufsize );
          MPI_Buffer_attach( buf, bufsize );
          ...
          MPI_Bsend( ... same as MPI_Send ... );
          ...
          MPI_Buffer_detach( &buf, &bufsize );

      The MPI_Buffer_detach call does not complete until all messages are sent.
      - The performance of MPI_Bsend depends on the implementation of MPI and may also depend on the size of the message. For example, making a message one byte longer may cause a significant drop in performance.

  78. Reusing the same buffer
      Consider a loop

          MPI_Buffer_attach( buf, bufsize );
          while (!done) {
              ...
              MPI_Bsend( ... );
          }

      where buf is large enough to hold the message in the MPI_Bsend. This code may fail because the buffer space used by earlier sends may not yet have been freed when the next MPI_Bsend needs it. One fix is to detach and re-attach the buffer inside the loop (MPI_Buffer_detach waits until the buffered messages have been delivered):

          void *buf; int bufsize;
          MPI_Buffer_detach( &buf, &bufsize );
          MPI_Buffer_attach( buf, bufsize );

  79. Other Point-to-Point Features
      - MPI_SENDRECV
      - MPI_SENDRECV_REPLACE
      - MPI_CANCEL
      - Persistent communication requests
