How I Learned to Stop Worrying About User-Visible Endpoints and Love MPI - PowerPoint PPT Presentation

  1. How I Learned to Stop Worrying About User-Visible Endpoints and Love MPI
     Rohit Zambre,* Aparna Chandramowlishwaran,* Pavan Balaji
     *University of California, Irvine
     Argonne National Laboratory

  2. MPI everywhere
     [Diagram: a node running one MPI process per core. Legend: Node, Core, Process]

  3. MPI everywhere
     ▸ Model artifact: high memory requirements that worsen with increasing domain dimensionality and number of ranks.
     ▸ Hardware usage: resource wastage from a static split of the processor's limited resources.
     [Diagram: one MPI process per core; trend arrows show increasing number of cores and decreasing memory per core. Legend: Node, Core, Process]

  4. MPI+threads
     ▸ Model artifact: reduces duplicated data by a factor of the number of threads.
     ▸ Hardware usage: able to use the many cores while sharing all of the processor's resources.
     [Diagram: one MPI process per node with one thread per core; trend arrows show increasing number of cores and decreasing memory per core. Legend: Node, Core, Process, Thread]
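A minimal MPI+threads sketch of this model (my own illustration, not code from the talk), assuming OpenMP for threading and an MPI library that provides MPI_THREAD_MULTIPLE; the neighbor-exchange pattern and per-thread tags are assumptions made only for the example:

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank, size;

        /* All threads of the process may call MPI concurrently. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE) {
            fprintf(stderr, "MPI_THREAD_MULTIPLE not available\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        int partner = (rank + 1) % size;   /* neighbor exchange; also works with 1 rank */

        /* One MPI process per node; threads share the process's memory and rank.
         * Assumes both ranks run the same number of OpenMP threads. */
        #pragma omp parallel
        {
            int tid = omp_get_thread_num();
            int sendval = rank * 1000 + tid, recvval = -1;

            /* Tag by thread id so each thread's message pairs with its peer thread. */
            MPI_Sendrecv(&sendval, 1, MPI_INT, partner, tid,
                         &recvval, 1, MPI_INT, partner, tid,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }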

  5. [Chart: time (seconds) split into Computation, corresponding Allgatherv, and AlltoAllv for MPI everywhere vs. MPI+threads on processor grids (threads x processor rows x processor columns) 1x440x110, 1x220x220, 1x110x440, 6x180x45, 6x90x90, 6x45x180; some configurations run out of memory (OOM). Buluc et al., Distributed BFS (https://arxiv.org/abs/1705.04590)]

  6. [Same Distributed BFS chart as slide 5, plus a chart of MPI_Isend (8 B) message rate (million messages/s) on 1-16 cores for MPI everywhere, MPI+threads (MPI_THREAD_FUNNELED), and MPI+threads (MPI_THREAD_MULTIPLE)]
     Communication performance of MPI+threads is dismal.

  7. Outdated view: the network is a single device.
     Modern reality: the network features parallelism.
     [Diagram: a node attached to a Network Interface Card that exposes multiple network hardware contexts]

  8. MPI everywhere vs. MPI+threads
     [Diagram: MPI everywhere runs processes P0-P3, each with its own MPI library and software communication channel to the Network Interface Card; MPI+threads runs a single process P0. Legend: Application, MPI library, Network Interface Card, Software communication channel, Network hardware context]

  9. MPI everywhere vs. MPI+threads
     MPI+threads today: global critical section + 1 communication channel per process.
     [Same diagram as slide 8]

  10. MPI everywhere vs. MPI+threads: no logical parallelism expressed
      Global critical section + 1 communication channel per process.
      [Same diagram as slide 9]

  11. MPI_Comm_create_endpoints(…, num_ep, …, comm_eps[]);
      MPI_Isend/Irecv(…, comm_eps[tid], ep_rank, …);
      [Diagram: an MPI process with endpoints (EP0 … EP4), each driving its own software communication channel and network hardware context through the MPI library and the Network Interface Card. Legend: MPI Communicator, MPI Endpoint, Software communication channel, Network hardware context]

  12. MPI_Comm_create_endpoints(…, num_ep, …, comm_eps[]);
      MPI_Isend/Irecv(…, comm_eps[tid], ep_rank, …);
      Pros
      ▸ Explicit control over network contexts
      Cons
      ▸ Intrusive extension of the MPI standard
      ▸ Onus of managing network contexts falls on the user
      [Diagram as on slide 11]
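For concreteness, here is a sketch of how the proposed endpoints extension would be used from an MPI+threads program, following the two calls shown on the slide. MPI_Comm_create_endpoints is not part of MPI-3.1 and is not implemented by standard MPI libraries; the exact argument list (info argument, destination rank, tag) and the per-thread attachment shown here are assumptions for illustration only:

    #include <mpi.h>
    #include <omp.h>

    /* Hypothetical use of the proposed endpoints API (not MPI-3.1). */
    void endpoints_sketch(MPI_Comm parent_comm, int num_ep)
    {
        MPI_Comm comm_eps[num_ep];   /* one endpoint communicator per thread */

        /* Collectively create num_ep endpoints for this process; each endpoint
         * gets its own rank in the resulting communicators (assumed signature). */
        MPI_Comm_create_endpoints(parent_comm, num_ep, MPI_INFO_NULL, comm_eps);

        #pragma omp parallel num_threads(num_ep)
        {
            int tid = omp_get_thread_num();
            int ep_rank, payload = tid;
            MPI_Request req;

            MPI_Comm_rank(comm_eps[tid], &ep_rank);

            /* Each thread drives communication through "its" endpoint, giving the
             * MPI library a dedicated network context per thread. A matching
             * receive on the destination endpoint (rank 0 here) is assumed. */
            MPI_Isend(&payload, 1, MPI_INT, 0, 0, comm_eps[tid], &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        }
    }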

  13. MPI everywhere vs. MPI+threads: logical parallelism expressed
      Fine-grained critical sections + multiple communication channels per process.
      [Diagram: the MPI+threads process P0 uses communicators C0-C3, one per thread, which the MPI library maps to separate software communication channels and network hardware contexts on the Network Interface Card. Legend: MPI Communicator, Software communication channel, Network hardware context]
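The same logical parallelism can be expressed within MPI-3.1 by giving each thread its own communicator (the C0-C3 above). A minimal sketch, again my own illustration, assuming OpenMP and MPI_THREAD_MULTIPLE; the thread count and neighbor-exchange pattern are assumptions for the example:

    #include <mpi.h>
    #include <omp.h>

    #define NTHREADS 4   /* assumed thread count, matching C0-C3 on the slide */

    int main(int argc, char **argv)
    {
        int provided, rank, size;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* One communicator per thread: operations on different communicators are
         * independent, so the MPI library may drive them over parallel channels. */
        MPI_Comm comm[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            MPI_Comm_dup(MPI_COMM_WORLD, &comm[i]);

        int partner = (rank + 1) % size;

        #pragma omp parallel num_threads(NTHREADS)
        {
            int tid = omp_get_thread_num();
            int sendval = tid, recvval = -1;

            /* Each thread communicates only on its own communicator. */
            MPI_Sendrecv(&sendval, 1, MPI_INT, partner, 0,
                         &recvval, 1, MPI_INT, partner, 0,
                         comm[tid], MPI_STATUS_IGNORE);
        }

        for (int i = 0; i < NTHREADS; i++)
            MPI_Comm_free(&comm[i]);
        MPI_Finalize();
        return 0;
    }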

  14. Do we need user-visible endpoints?
      [Same diagram as slide 13: MPI+threads with logical parallelism expressed through per-thread communicators, fine-grained critical sections, and multiple communication channels per process]

  15. CONTRIBUTIONS AS DEVIL'S ADVOCATE
      ▸ In-depth comparison between MPI-3.1 and user-visible endpoints
      ▸ A fast MPI+threads library that adheres to MPI-3.1's constraints
      ▸ Optimized parallel communication streams applicable to all MPI libraries
      ▸ Recommendations for the MPI user to express logical parallelism with MPI-3.1
      Evaluation platforms
      ▸ MPI library: based on MPICH:CH4
      ▸ Interconnects: Intel Omni-Path (OPA) with OFI:PSM2; Mellanox InfiniBand (IB) with UCX:Verbs

  16. OUTLINE
      ▸ Introduction
      ▸ For MPI users: parallelism in the MPI standard
      ▸ For MPI developers: fast MPI+threads
        ▸ Fine-grained critical sections for thread safety
        ▸ Virtual Communication Interfaces (VCIs) for parallel communication streams
      ▸ Microbenchmark and application analysis

  17. POINT-TO-POINT COMMUNICATION
      ▸ <comm,rank,tag> decides matching
      ▸ Non-overtaking order
      ▸ Receive wildcards
      Question (table built up over the next three slides): can two or more operations on a process, with a given combination of Comm, Rank, and Tag, be issued on parallel communication streams (Send / Recv)?

  18. POINT-TO-POINT COMMUNICATION
      Can two or more operations on a process be issued on parallel communication streams?
      Comm        Rank                Tag                  Send   Recv
      Different   Different or Same   Different or Same    Yes    Yes
      Example: Rank 0 (sender) posts <CA,R1,T1> and <CB,R1,T1>; Rank 1 (receiver) posts <CA,R0,T1> and <CB,R0,T1>. The two operations use different communicators, so both the sends and the receives may go on parallel streams.

  19. POINT-TO-POINT COMMUNICATION
      Can two or more operations on a process be issued on parallel communication streams?
      Comm        Rank                Tag                  Send   Recv
      Different   Different or Same   Different or Same    Yes    Yes
      Same        Different           Different or Same    Yes    No (wildcards)
      Example: Rank 0 (sender) posts <CA,R1,T1> and <CA,R2,T1>; Rank 1 (receiver) posts <CA,R0,T1> and <CA,ANY,T1>. The sends target different ranks and may go on parallel streams, but the wildcard (ANY-source) receive could match either message, so the receives may not.

  20. POINT-TO-POINT COMMUNICATION
      Can two or more operations on a process be issued on parallel communication streams?
      Comm        Rank                Tag                  Send   Recv
      Different   Different or Same   Different or Same    Yes    Yes
      Same        Different           Different or Same    Yes    No (wildcards)
      Same        Same                Different or Same    No     No (non-overtaking order)
      Example: Rank 0 (sender) posts <CA,R1,T1> and <CA,R1,T2>; Rank 1 (receiver) posts <CA,R0,T3> and <CA,R0,ANY>. Operations on the same communicator to/from the same rank must preserve MPI's non-overtaking order, so neither the sends nor the receives may be issued on parallel streams.
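A small send-side illustration of the first and last rows (my own example, not from the talk; the peer rank, tags, and payloads are assumptions, and matching receives on the peer are assumed):

    #include <mpi.h>

    /* Two back-to-back sends from one thread to the same peer. With identical
     * <comm, rank, tag>, the non-overtaking rule forces the first message to be
     * matched first, so an MPI library must keep them on one ordered stream.
     * With different communicators, no mutual ordering is required and the
     * library may use parallel streams. */
    static void ordered_vs_parallel(MPI_Comm commA, MPI_Comm commB, int peer)
    {
        int first = 1, second = 2;

        /* Row 3: same <comm, rank, tag>; must be matched in this order. */
        MPI_Send(&first,  1, MPI_INT, peer, 0, commA);
        MPI_Send(&second, 1, MPI_INT, peer, 0, commA);

        /* Row 1: different communicators; no ordering between these two. */
        MPI_Send(&first,  1, MPI_INT, peer, 0, commA);
        MPI_Send(&second, 1, MPI_INT, peer, 0, commB);
    }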

  21. RMA COMMUNICATION
      Can two or more operations on a process be issued on parallel communication streams?
      Window      Rank                Put   Get   Accumulate
      Different   Different or Same   Yes   Yes   Yes
      Same        Different           Yes   Yes   Yes
      Same        Same                Yes   Yes   No

  22. RMA COMMUNICATION
      [Same table as slide 21]
      Different windows: explicitly expressed parallelism. Same window: implicit parallelism, since MPI defines no order between multiple Gets and Puts.

  23. RMA COMMUNICATION
      [Same table as slide 21]
      Different windows: explicitly expressed parallelism. Same window: implicit parallelism, since MPI defines no order between multiple Gets and Puts. The exception is Accumulate: accumulate operations to the same memory location are ordered, hence the "No" for the same window and same rank.
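A small sketch of the same-window, same-target case (my own illustration; the window, target offsets, epoch style, and a displacement unit of sizeof(double) are assumptions): two threads issue Puts inside one passive-target epoch. MPI imposes no order between the Puts, so an MPI library may drive them over parallel streams; two MPI_Accumulate calls to the same target location would instead be ordered by default (the accumulate_ordering window info key), which is why the Accumulate column says No:

    #include <mpi.h>
    #include <omp.h>

    /* Two threads of one process update disjoint locations in the target's
     * window. No ordering is required between the Puts, so they can map to
     * parallel communication streams. */
    static void two_thread_puts(MPI_Win win, int target)
    {
        double local[2] = { 1.0, 2.0 };

        /* One shared passive-target epoch covering both threads' Puts. */
        MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);

        #pragma omp parallel num_threads(2)
        {
            int tid = omp_get_thread_num();
            /* Put one double at displacement tid in the target window. */
            MPI_Put(&local[tid], 1, MPI_DOUBLE,
                    target, /*target_disp=*/tid, 1, MPI_DOUBLE, win);
        }

        /* Completes both Puts at the target. */
        MPI_Win_unlock(target, win);
    }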

  24. OUTLINE
      ▸ Introduction
      ▸ For MPI users: parallelism in the MPI standard
      ▸ For MPI developers: fast MPI+threads
        ▸ Fine-grained critical sections for thread safety
        ▸ Virtual Communication Interfaces (VCIs) for parallel communication streams
      ▸ Microbenchmark and application analysis
