  1. High Performance Pipelined Process Migration with RDMA
     Xiangyong Ouyang, Raghunath Rajachandrasekar, Xavier Besseron, Dhabaleswar K. (DK) Panda
     Department of Computer Science & Engineering, The Ohio State University

  2. Outline
     • Introduction and Motivation
     • Profiling Process Migration
     • Pipelined Process Migration with RDMA
     • Performance Evaluation
     • Conclusions and Future Work

  3. Motivation
     • Computer clusters continue to grow larger
       – Heading towards the Multi-PetaFlop and ExaFlop era
       – Mean time between failures (MTBF) is getting smaller → fault tolerance becomes imperative
     • Checkpoint/Restart (C/R): the common approach to fault tolerance
       – Checkpoint: save snapshots of all processes (I/O overhead)
       – Restart: restore and resubmit the job (I/O overhead + queue delay)
     • C/R drawbacks
       × Unnecessarily dumps all processes → I/O bottleneck
       × Resubmission incurs queuing delay
     ⇒ Checkpoint/Restart alone doesn't scale to large systems

  4. Job/Process Migration
     • Pro-active fault tolerance
       – Only handles processes on the failing node
       – Relies on health monitoring mechanisms and failure prediction models
     • Five steps (see the sketch after this list)
       (1) Suspend communication channels
       (2) Write process snapshots on the source node
       (3) Transfer process image files (source => target)
       (4) Read image files on the target node
       (5) Reconnect communication channels
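
A minimal sketch of these five steps as a sequential driver, to make explicit that in the baseline schemes each step must finish before the next begins, so their costs add up. The step functions below are hypothetical stubs for illustration only, not the MVAPICH2 migration API.

```c
/* Illustrative only: the five migration steps run back to back, so total
 * migration time is the sum of all steps. Helper names are hypothetical. */
#include <stdio.h>

static void suspend_channels(void)   { puts("(1) suspend communication channels"); }
static void write_snapshots(void)    { puts("(2) write snapshots on source node"); }
static void transfer_images(void)    { puts("(3) transfer image files source => target"); }
static void read_images(void)        { puts("(4) read image files on target node"); }
static void reconnect_channels(void) { puts("(5) reconnect communication channels"); }

int main(void)
{
    /* Strictly sequential: each step waits for the previous one.
     * PPMR (later slides) overlaps steps (2)-(4). */
    suspend_channels();
    write_snapshots();
    transfer_images();
    read_images();
    reconnect_channels();
    return 0;
}
```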

  5. Process Migration Advantages
     • Overcomes the C/R drawbacks
       × Unnecessary dump of all processes
       × Resubmit queuing delay
     • Desirable feature for other applications
       – Cluster-wide load balancing
       – Server consolidation
       – Performance isolation

  6. Existing MPI Process Migration
     • Available in MVAPICH2 and OpenMPI
     • Both suffer from low performance
     • Cause? Solution?

  7. Problem Statements
     • What are the dominant factors in the high cost of process migration?
     • How to design an efficient protocol to minimize overhead?
       – How to optimize the checkpoint-related I/O path?
       – How to optimize the data transfer path?
       – How to leverage the RDMA transport to accelerate data transmission?
     • What will be the performance benefits?

  8. Outline
     • Introduction and Motivation
     • Profiling Process Migration
     • Pipelined Process Migration with RDMA
     • Performance Evaluation
     • Conclusions and Future Work

  9. MVAPICH/MVAPICH2 Software
     • MVAPICH: MPI over InfiniBand, 10GigE/iWARP and RDMA over Converged Enhanced Ethernet (RoCE)
       – MVAPICH (MPI-1) and MVAPICH2 (MPI-2)
       – Used by more than 1,550 organizations worldwide (in 60 countries)
       – Empowering many TOP500 clusters (11th, 15th, …)
       – Available with the software stacks of many IB, 10GE/iWARP, RoCE and server vendors, including the OpenFabrics Enterprise Distribution (OFED)
       – Available with RedHat and SuSE distributions
       – http://mvapich.cse.ohio-state.edu/
     • Has supported Checkpoint/Restart and Process Migration for the last several years
       – Already used by many organizations

  10. Three Process Migration Approaches
      • MVAPICH2 already supports three process migration strategies
        – Local Filesystem-based Migration (Local)
        – Shared Filesystem-based Migration (Shared)
        – RDMA+Local Filesystem-based Migration (RDMA+Local)

  11. Local Filesystem-based Process Migration (Local)
      [Diagram: (1) suspend; (2) write the checkpoint to the local filesystem on the source node; (3) transfer the image files over the network to the target node's local filesystem; (4) read them on the target node; (5) reconnect. Timeline: Write, Transfer and Read run sequentially.]

  12. Shared Filesystem-based Process Migration (Shared)
      [Diagram: (1) suspend; (2)+(3) the source node writes the checkpoint to a shared filesystem, which moves the data over the network; (4) the target node reads it back from the shared filesystem, again over the network; (5) reconnect. Timeline: Write overlaps Transfer 1 and Read overlaps Transfer 2, but Write and Read remain sequential.]

  13. RDMA + Local Filesystem-based Process Migration (RDMA+Local)
      [Diagram: (1) suspend; (2)+(3) the source node writes the checkpoint into an RDMA buffer pool and transfers it to an RDMA buffer pool on the target node, which stages it to the local filesystem; (4) the target node reads the image files from its local filesystem; (5) reconnect. Timeline: Write overlaps Transfer, but Read starts only after the data is staged locally.]

  14. Profiling Process Migration Time Cost
      [Chart: time to migrate 8 processes, broken down into (a) source node writes checkpoint files, (b) copy checkpoint from source to target, (c) read image files on target node, for the Local, Shared and RDMA+Local approaches.]
      ⇒ All three approaches suffer from I/O cost
      ⇒ Conclusion: all three steps (Write, Transfer, Read) should be optimized

  15. Outline
      • Introduction and Motivation
      • Profiling Process Migration
      • Pipelined Process Migration with RDMA
      • Performance Evaluation
      • Conclusions and Future Work

  16. Pipelined Process Migration with RDMA (PPMR)
      [Diagram: on the source node, the migrating processes write their snapshots through FUSE into a buffer manager backed by an RDMA buffer pool; the data is transferred over RDMA to a buffer pool on the target node, where the restarted processes read it back through FUSE. Timeline: Write, Transfer and Read are fully overlapped.]

  17. Comparisons
      [Timeline comparison of the four approaches:
       – Local: Write, then Transfer, then Read (fully sequential)
       – Shared: Write overlaps Transfer 1, then Read overlaps Transfer 2
       – RDMA+Local: Write overlaps Transfer, then Read
       – PPMR: Write, Transfer and Read fully overlapped]

  18. PPMR Design Strategy
      • Fully pipelines the three key steps (a pipelining sketch follows this list)
        – Write at the source node
        – Transfer checkpoint data to the target node
        – Read process images
      • Efficient restart mechanism on the target node
        – Restart directly from the RDMA data streams
      • Design choices
        – Buffer pool size, chunk size
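
A minimal sketch of the pipelining idea, in the spirit of (but not taken from) the PPMR implementation: three stages hand fixed-size chunks through bounded buffer pools, so Write, Transfer and Read overlap and total time is governed by the slowest stage rather than the sum of all three. The RDMA transfer is stood in for by a memcpy between two in-process pools, and all names and sizes are illustrative (128 KB chunks, an 8 MB pool, mirroring the values evaluated later in the talk).

```c
/* Sketch of a 3-stage migration pipeline: Write -> Transfer -> Read.
 * Not the MVAPICH2/PPMR code; the "RDMA" hop is simulated with memcpy. */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define CHUNK_SIZE   (128 * 1024)   /* 128 KB chunk, as in the evaluation   */
#define POOL_CHUNKS  64             /* 64 x 128 KB = 8 MB buffer pool       */
#define TOTAL_CHUNKS 1024           /* 128 MB of checkpoint data to migrate */

struct pool {                       /* bounded ring buffer of filled chunks */
    char data[POOL_CHUNKS][CHUNK_SIZE];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_full, not_empty;
};

static struct pool src_pool, dst_pool;  /* source-side and target-side pools */

static void pool_init(struct pool *p) {
    p->head = p->tail = p->count = 0;
    pthread_mutex_init(&p->lock, NULL);
    pthread_cond_init(&p->not_full, NULL);
    pthread_cond_init(&p->not_empty, NULL);
}

static void pool_put(struct pool *p, const char *chunk) {
    pthread_mutex_lock(&p->lock);
    while (p->count == POOL_CHUNKS)                /* wait for a free slot   */
        pthread_cond_wait(&p->not_full, &p->lock);
    memcpy(p->data[p->tail], chunk, CHUNK_SIZE);
    p->tail = (p->tail + 1) % POOL_CHUNKS;
    p->count++;
    pthread_cond_signal(&p->not_empty);
    pthread_mutex_unlock(&p->lock);
}

static void pool_get(struct pool *p, char *chunk) {
    pthread_mutex_lock(&p->lock);
    while (p->count == 0)                          /* wait for a filled slot */
        pthread_cond_wait(&p->not_empty, &p->lock);
    memcpy(chunk, p->data[p->head], CHUNK_SIZE);
    p->head = (p->head + 1) % POOL_CHUNKS;
    p->count--;
    pthread_cond_signal(&p->not_full);
    pthread_mutex_unlock(&p->lock);
}

static void *write_stage(void *arg) {     /* (2) checkpoint writer, source node */
    char chunk[CHUNK_SIZE];
    for (int i = 0; i < TOTAL_CHUNKS; i++) {
        memset(chunk, i & 0xff, CHUNK_SIZE);       /* stand-in for snapshot data */
        pool_put(&src_pool, chunk);
    }
    return NULL;
}

static void *transfer_stage(void *arg) {  /* (3) source pool -> target pool */
    char chunk[CHUNK_SIZE];
    for (int i = 0; i < TOTAL_CHUNKS; i++) {
        pool_get(&src_pool, chunk);
        pool_put(&dst_pool, chunk);        /* real PPMR would RDMA-write here */
    }
    return NULL;
}

static void *read_stage(void *arg) {      /* (4) restarted process reads image */
    char chunk[CHUNK_SIZE];
    for (int i = 0; i < TOTAL_CHUNKS; i++)
        pool_get(&dst_pool, chunk);
    return NULL;
}

int main(void) {
    pthread_t t[3];
    pool_init(&src_pool);
    pool_init(&dst_pool);
    pthread_create(&t[0], NULL, write_stage, NULL);
    pthread_create(&t[1], NULL, transfer_stage, NULL);
    pthread_create(&t[2], NULL, read_stage, NULL);
    for (int i = 0; i < 3; i++)
        pthread_join(t[i], NULL);
    puts("write, transfer and read completed in a fully overlapped pipeline");
    return 0;
}
```

With the stages decoupled this way, enlarging the buffer pool beyond a few chunks mainly adds slack rather than bandwidth, which is consistent with the later observation that pipeline bandwidth is insensitive to buffer pool size.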

  19. Outline
      • Introduction and Motivation
      • Profiling Process Migration
      • Pipelined Process Migration with RDMA
      • Performance Evaluation
      • Conclusions and Future Work

  20. Experiment Environment
      • System setup: Linux cluster
        – Dual-socket quad-core Xeon processors, 2.33 GHz
        – Nodes connected by InfiniBand DDR (16 Gbps)
        – Linux 2.6.30, FUSE 2.8.5
      • NAS Parallel Benchmark suite, version 3.2.1
        – LU/BT/SP, Class C/D input
      • MVAPICH2 with the Job Migration framework
        – PPMR
        – Local, Shared, RDMA+Local

  21. Raw Data Bandwidth Test (1): Aggregation Bandwidth
      [Diagram: measures the bandwidth at which the migrating processes' writes are aggregated through FUSE into the buffer manager and RDMA buffer pool on the source node.]

  22. Aggregation Bandwidth (Write Unit Size = 128 KB)
      ⇒ Saturates with 8-16 processes (~800 MB/s)
      ⇒ Bandwidth is determined by FUSE (insensitive to buffer pool size)
      ⇒ Chunk size = 128 KB is generally the best
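
A minimal sketch of how such a write-bandwidth measurement can be run: write a fixed number of 128 KB units to a file and divide the volume by the elapsed time. The mount path is a hypothetical placeholder for the PPMR FUSE mount point; in the real test each migrating process would run one such writer.

```c
/* Write-bandwidth measurement sketch (path below is a hypothetical mount). */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

#define WRITE_UNIT (128 * 1024)   /* 128 KB write unit, as on this slide */
#define NUM_UNITS  8192           /* 1 GB of data in total               */

int main(void)
{
    static char buf[WRITE_UNIT];
    memset(buf, 'x', sizeof(buf));

    int fd = open("/mnt/ppmr_fuse/ckpt.img",       /* hypothetical mount path */
                  O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int i = 0; i < NUM_UNITS; i++) {
        if (write(fd, buf, WRITE_UNIT) != WRITE_UNIT) { perror("write"); return 1; }
    }
    gettimeofday(&t1, NULL);
    close(fd);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    double mb   = (double)NUM_UNITS * WRITE_UNIT / (1024.0 * 1024.0);
    printf("wrote %.0f MB in %.2f s: %.1f MB/s\n", mb, secs, mb / secs);
    return 0;
}
```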

  23. Raw Data Bandwidth Test (2): Network Transfer Bandwidth
      [Diagram: measures the bandwidth of the RDMA transfer between the buffer pool on the source node and the buffer pool on the target node.]

  24. InfiniBand DDR Bandwidth
      [Chart: RDMA bandwidth vs. chunk size on InfiniBand DDR (16 Gbps). Peak bandwidth of ~1450 MB/s is reached for chunk sizes above 16 KB.]
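
As a rough sanity check (this arithmetic is an addition, not from the slide): the 16 Gb/s DDR data rate corresponds to about 2 GB/s of payload bandwidth, so the measured peak is roughly three quarters of the nominal rate.

\[
\frac{16\ \text{Gb/s}}{8\ \text{bits/byte}} = 2000\ \text{MB/s},
\qquad
\frac{1450\ \text{MB/s}}{2000\ \text{MB/s}} \approx 0.73 .
\]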

  25. Network Transfer Bandwidth (Chunk Size = 128 KB)
      ⇒ Bandwidth is insensitive to buffer pool size
      ⇒ 8 I/O streams can saturate the network

  26. Raw Data Bandwidth Test (3): Pipeline Bandwidth
      [Diagram: measures the end-to-end bandwidth of the full pipeline, from the FUSE writes on the source node through the RDMA buffer pools to the FUSE reads on the target node.]

  27. Pipeline Bandwidth (Buffer Pool = 8 MB, Chunk Size = 128 KB)
      ⇒ Determined by the aggregation bandwidth, the slowest pipeline stage (see the note below)
      ⇒ Chunk size = 128 KB is generally the best
      ⇒ Insensitive to buffer pool size
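
This matches the usual pipeline argument: once the stages overlap, the sustained end-to-end bandwidth is bounded by the slowest stage. Plugging in the bandwidths measured on the earlier slides, and assuming the read-side FUSE bandwidth is comparable to the write side (an assumption; it is not shown separately on the slides):

\[
BW_{\text{pipeline}} \;\approx\; \min\bigl(BW_{\text{aggregation}},\, BW_{\text{transfer}},\, BW_{\text{read}}\bigr)
\;\approx\; \min(\sim\!800,\ \sim\!1450,\ \sim\!800)\ \text{MB/s}
\;\approx\; 800\ \text{MB/s}.
\]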

  28. Time to Complete a Process Migration (Lower is Better)
      (PPMR: Buffer Pool = 8 MB, Chunk Size = 128 KB)
      [Chart: time to migrate 8 processes. PPMR achieves speedups of 10.7X, 4.3X and 2.3X over the three existing approaches.]

  29. Application Execution Time (Lower is Better)
      [Chart: application execution time with migration, relative to the no-migration baseline; the compared approaches add overheads of +38%, +9.2% and +5.1%.]

  30. Scalability: Memory Footprint
      [Chart: migration time for different problem sizes (64 processes on 8 nodes), with PPMR speedups of 10.9X, 2.6X and 7.3X.]

  31. Scalability: I/O Multiplexing
      • Processes per node: 1 → 4 gives better pipeline bandwidth
      [Chart: LU.D with 8/16/32/64 processes on 8 compute nodes; migration data = 1500 MB.]

  32. Outline
      • Introduction and Motivation
      • Profiling Process Migration
      • Pipelined Process Migration with RDMA
      • Performance Evaluation
      • Conclusions and Future Work

  33. Conclusions
      • Process migration overcomes the drawbacks of Checkpoint/Restart
      • Process migration must be optimized along its I/O path
      • Pipelined Process Migration with RDMA (PPMR)
        – Pipelines all steps in the I/O path

  34. Software Distribution
      • The PPMR design has been released in MVAPICH2 1.7
        – Downloadable from http://mvapich.cse.ohio-state.edu/

  35. Future Work
      • How PPMR can benefit general cluster applications
        – Cluster-wide load balancing
        – Server consolidation
      • How a diskless cluster architecture can utilize PPMR

  36. Thank you!
      http://mvapich.cse.ohio-state.edu
      {ouyangx, rajachan, besseron, panda}@cse.ohio-state.edu
      Network-Based Computing Laboratory
