High Performance Pipelined Process Migration with RDMA Xiangyong Ouyang, Raghunath Rajachandrasekar, Xavier Besseron, Dhabaleswar K. (DK) Panda Department of Computer Science & Engineering The Ohio State University
Outline • Introduction and Motivation • Profiling Process Migration • Pipelined Process Migration with RDMA • Performance Evaluation • Conclusions and Future Work CCGrid 2011 2
Motivation • Computer clusters continue to grow larger – Heading towards the multi-petaflop and exaflop era – Mean time between failures (MTBF) keeps shrinking => Fault tolerance becomes imperative • Checkpoint/Restart (C/R) – Common approach to fault tolerance – Checkpoint: save snapshots of all processes (I/O overhead) – Restart: restore and resubmit the job (I/O overhead + queue delay) • C/R drawbacks × Unnecessarily dumps all processes => I/O bottleneck × Resubmission => queuing delay => Checkpoint/Restart alone doesn't scale to large systems CCGrid 2011 3
Job/Process Migration • Pro-active fault tolerance – Only handles the processes on the failing node – Relies on health monitoring mechanisms and failure prediction models • Five steps: (1) Suspend communication channels (2) Write snapshots on the source node (3) Transfer process image files (source => target) (4) Read image files on the target node (5) Reconnect communication channels (a stub sketch of this sequence follows) CCGrid 2011 4
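To make the ordering concrete, here is a minimal, hedged sketch of the five-step sequence. All helper functions and the image path are hypothetical placeholders, not MVAPICH2 or BLCR calls; steps (1)-(3) would run on the source node and (4)-(5) on the target, shown in one process only to illustrate the ordering.

```c
/* Sketch of the five-step migration sequence.  Every helper below is a
 * hypothetical stub standing in for the real MPI-library, checkpoint
 * (e.g. BLCR) and transport calls. */
#include <stdio.h>

static int suspend_channels(void)           { puts("(1) suspend communication channels");   return 0; }
static int write_checkpoint(const char *p)  { printf("(2) write process image to %s\n", p); return 0; }
static int transfer_image(const char *p, const char *target)
                                            { printf("(3) transfer %s -> %s\n", p, target); return 0; }
static int restart_from_image(const char *p){ printf("(4) read/restart from %s\n", p);      return 0; }
static int reconnect_channels(void)         { puts("(5) reconnect communication channels"); return 0; }

int main(void)
{
    const char *image  = "/tmp/ckpt.img";   /* hypothetical image path  */
    const char *target = "target-node";     /* hypothetical target host */

    /* The non-pipelined approaches execute these strictly one after another,
     * which is exactly what the profiling section shows to be expensive. */
    if (suspend_channels())            return 1;
    if (write_checkpoint(image))       return 1;
    if (transfer_image(image, target)) return 1;
    if (restart_from_image(image))     return 1;
    if (reconnect_channels())          return 1;
    return 0;
}
```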
Process Migration Advantages • Overcomes C/R drawbacks × Unnecessary dump of all processes × Resubmit queuing delay • Desirable feature for other applications – Cluster-wide load balancing – Server consolidation – Performance isolation CCGrid 2011 5
Existing MPI Process Migration • Available in MVAPICH2 and OpenMPI • Both suffer from low performance • Cause? Solution? CCGrid 2011 6
Problem Statements • What are the dominant factors in the high cost of process migration? • How to design an efficient protocol to minimize this overhead? – How to optimize the checkpoint-related I/O path? – How to optimize the data transfer path? – How to leverage the RDMA transport to accelerate data transmission? • What are the performance benefits? CCGrid 2011 7
Outline • Introduction and Motivation • Profiling Process Migration • Pipelined Process Migration with RDMA • Performance Evaluation • Conclusions and Future Work CCGrid 2011 8
MVAPICH/MVAPICH2 Software • MVAPICH: MPI over InfiniBand, 10GigE/iWARP and RDMA over Converged Enhanced Ethernet (RoCE) – MVAPICH (MPI-1) and MVAPICH2 (MPI-2) – Used by more than 1,550 organizations worldwide (in 60 countries) – Empowering many TOP500 clusters (11th, 15th, …) – Available in the software stacks of many IB, 10GigE/iWARP, RoCE and server vendors, including the OpenFabrics Enterprise Distribution (OFED) – Available with RedHat and SuSE distributions – http://mvapich.cse.ohio-state.edu/ • Has supported Checkpoint/Restart and Process Migration for the last several years – Already used by many organizations CCGrid 2011 9
Three Process Migration Approaches • MVAPICH2 already supports three process migration strategies – Local Filesystem-based Migration ( Local ) – Shared Filesystem-based Migration ( Shared ) – RDMA+Local Filesystem-based Migration ( RDMA+Local) CCGrid 2011 10
Local Filesystem-based Process Migration (Local) [Diagram: (1) Suspend communication; (2) Write the process image through the VFS page cache to the source node's local filesystem; (3) Transfer the image files over the network to the target node's local filesystem; (4) Read the image on the target node; (5) Reconnect. Timeline: Write, Transfer and Read run strictly one after another.] CCGrid 2011 11
Shared Filesystem-based Process Migration (Shared) [Diagram: (1) Suspend communication; (2) Write + (3) Transfer: writing the process image to the shared filesystem pushes the data over the network (transfer 1); (4) Read: the target node pulls the image back from the shared filesystem (transfer 2); (5) Reconnect. Timeline: Write/Transfer 1 completes before Read/Transfer 2 starts.] CCGrid 2011 12
RDMA + Local Filesystem-based Process Migration (RDMA+Local) [Diagram: (1) Suspend communication; (2) Write + (3) Transfer: checkpoint data is staged through an RDMA buffer pool on the source and sent to an RDMA buffer pool on the target, where it is written to the target's local filesystem; (4) Read the image on the target node; (5) Reconnect. Timeline: Write and Transfer overlap, but Read starts only after they finish.] CCGrid 2011 13
Profiling Process Migration Time Cost [Chart: time to migrate 8 processes, broken down into the Write, Transfer and Read phases; all three approaches suffer from I/O cost] • Source node writes the checkpoint files • Checkpoint is copied from source to target • Target node reads the image files Conclusion: all three steps (Write, Transfer, Read) need to be optimized CCGrid 2011 14
Outline • Introduction and Motivation • Profiling Process Migration • Pipelined Process Migration with RDMA • Performance Evaluation • Conclusions and Future Work CCGrid 2011 15
Pipelined Process Migration with RDMA (PPMR) [Diagram: (1) Suspend communication; on the source node the checkpoint writes are intercepted by FUSE, staged by a buffer manager into an RDMA buffer pool and transferred chunk by chunk to the target node's RDMA buffer pool; there a buffer manager feeds the reads of the restarting process through FUSE; (5) Reconnect. Timeline: Write, Transfer and Read all overlap.] CCGrid 2011 16
Comparisons [Timelines: Local — Write, then Transfer, then Read, strictly sequential. Shared — Write/Transfer 1, then Read/Transfer 2. RDMA+Local — Write and Transfer overlap, then Read. PPMR — Write, Transfer and Read fully overlap.] CCGrid 2011 17
PPMR Design Strategy • Fully pipelines the three key steps – Write at the source node – Transfer checkpoint data to the target node – Read process images at the target node • Efficient restart mechanism on the target node – Restart directly from the RDMA data stream • Design choices – Buffer pool size, chunk size (a simplified pipelining sketch follows) CCGrid 2011 18
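The sketch below illustrates the pipelining idea only: checkpoint data is cut into fixed-size chunks that flow through a small bounded buffer pool, so the write stage overlaps with the transfer stage instead of completing first. It is a minimal pthreads simulation, not the PPMR implementation; the network transfer is faked with usleep(), whereas the real design posts each chunk as an RDMA write while a symmetric pool on the target feeds the restarting process through FUSE. The 128 KB chunk and 8 MB pool match the values evaluated later.

```c
/* Pipelining sketch: a writer stage and a transfer stage share a small
 * bounded pool of 128 KB chunks, so data starts leaving the node while
 * the checkpoint is still being written.  The network is simulated with
 * usleep(); this is an illustration, not the PPMR code. */
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define CHUNK_SIZE   (128 * 1024)   /* 128 KB chunk                  */
#define POOL_CHUNKS  64             /* 64 x 128 KB = 8 MB pool       */
#define TOTAL_CHUNKS 512            /* 64 MB of fake checkpoint data */

static char pool[POOL_CHUNKS][CHUNK_SIZE];
static int  head, tail, count;      /* bounded ring over the pool    */
static pthread_mutex_t lock      = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;

/* Stage 1: the checkpointer's writes land in pool chunks as they arrive. */
static void *writer(void *arg)
{
    (void)arg;
    for (int i = 0; i < TOTAL_CHUNKS; i++) {
        pthread_mutex_lock(&lock);
        while (count == POOL_CHUNKS)               /* pool full: wait      */
            pthread_cond_wait(&not_full, &lock);
        memset(pool[tail], i & 0xff, CHUNK_SIZE);  /* fake checkpoint data */
        tail = (tail + 1) % POOL_CHUNKS;
        count++;
        pthread_cond_signal(&not_empty);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

/* Stage 2: the transfer thread drains chunks as soon as they appear. */
static void *sender(void *arg)
{
    static char staging[CHUNK_SIZE];
    (void)arg;
    for (int i = 0; i < TOTAL_CHUNKS; i++) {
        pthread_mutex_lock(&lock);
        while (count == 0)                         /* pool empty: wait     */
            pthread_cond_wait(&not_empty, &lock);
        memcpy(staging, pool[head], CHUNK_SIZE);   /* real design: post an
                                                      RDMA write of this chunk */
        head = (head + 1) % POOL_CHUNKS;
        count--;
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&lock);
        usleep(100);                               /* simulated network time */
    }
    return NULL;
}

int main(void)
{
    pthread_t w, s;
    pthread_create(&w, NULL, writer, NULL);
    pthread_create(&s, NULL, sender, NULL);
    pthread_join(w, NULL);
    pthread_join(s, NULL);
    printf("streamed %d chunks of %d KB through a %d-chunk (%d MB) pool\n",
           TOTAL_CHUNKS, CHUNK_SIZE / 1024, POOL_CHUNKS,
           POOL_CHUNKS * CHUNK_SIZE / (1024 * 1024));
    return 0;
}
```

Build with `gcc -pthread`; on the target node the same pattern runs in reverse, with the transfer stage producing chunks and the FUSE-served reads of the restarting process consuming them.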
Outline • Introduction and Motivation • Profiling Process Migration • Pipelined Process Migration with RDMA • Performance Evaluation • Conclusions and Future Work CCGrid 2011 19
Experiment Environment • System setup – Linux cluster • Dual-socket quad-core Xeon processors, 2.33 GHz • Nodes connected by InfiniBand DDR (16 Gbps) • Linux 2.6.30, FUSE 2.8.5 • NAS Parallel Benchmarks, version 3.2.1 – LU/BT/SP with Class C/D input • MVAPICH2 with the job migration framework – PPMR – Local, Shared, RDMA+Local CCGrid 2011 20
Raw Data Bandwidth Test (1): Aggregation Bandwidth [Diagram: measures how fast checkpoint writes from the migrating processes are aggregated through FUSE and the buffer manager into the source node's RDMA buffer pool] CCGrid 2011 21
Aggregation Bandwidth [Chart: write unit size = 128 KB] • Bandwidth saturates with 8-16 processes at ~800 MB/s • Bandwidth is determined by FUSE (insensitive to the buffer pool size) • A chunk size of 128 KB is generally the best (a minimal write-bandwidth probe in this spirit follows) CCGrid 2011 22
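For illustration, a minimal single-stream write-bandwidth probe in the spirit of this test: it writes a fixed amount of data in 128 KB units to a file (for instance under the FUSE mount exposed by the buffer manager) and reports MB/s. The path, data volume and single-writer setup are assumptions for the sketch; the paper's measurement aggregates multiple concurrent writer processes.

```c
/* Minimal write-bandwidth probe: write 256 MB in 128 KB units to the
 * given path and report MB/s.  Path and sizes are placeholders, and a
 * single writer is used; the paper's test aggregates several writers. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char  *path  = (argc > 1) ? argv[1] : "/tmp/bw_probe.dat";
    const size_t unit  = 128 * 1024;            /* 128 KB write unit */
    const size_t total = 256UL * 1024 * 1024;   /* 256 MB per run    */

    char *buf = malloc(unit);
    if (!buf) return 1;
    memset(buf, 0xab, unit);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t done = 0; done < total; done += unit)
        if (write(fd, buf, unit) != (ssize_t)unit) { perror("write"); return 1; }
    fsync(fd);                                  /* include the flush cost */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.1f MB/s with %zu KB write units\n", total / 1e6 / sec, unit / 1024);
    close(fd);
    free(buf);
    return 0;
}
```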
Raw Data Bandwidth Test (2): Network Transfer Bandwidth [Diagram: measures the bandwidth of chunk transfers from the source node's RDMA buffer pool to the target node's RDMA buffer pool] CCGrid 2011 23
InfiniBand DDR Bandwidth [Chart: point-to-point RDMA bandwidth vs. chunk size on InfiniBand DDR (16 Gbps)] • Bandwidth reaches its peak of ~1450 MB/s once the chunk size exceeds 16 KB (a sketch of posting one chunk as an RDMA write follows) CCGrid 2011 24
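For readers unfamiliar with the verbs API, the fragment below sketches how one 128 KB chunk could be pushed with an RDMA write. It assumes a queue pair and completion queue that are already connected and set up, a locally registered buffer (ibv_reg_mr) and the target's remote address/rkey obtained out of band; none of that setup is shown, and this is not MVAPICH2's transfer code.

```c
/* Fragment: post one RDMA write of a 128 KB chunk over an already-connected
 * queue pair.  QP/CQ creation, memory registration and rkey exchange are
 * assumed to have happened elsewhere. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CHUNK_SIZE (128 * 1024)

int post_chunk_write(struct ibv_qp *qp, struct ibv_cq *cq,
                     void *chunk, struct ibv_mr *mr,      /* local, registered */
                     uint64_t remote_addr, uint32_t rkey) /* from the target   */
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)chunk,
        .length = CHUNK_SIZE,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id               = (uintptr_t)chunk;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.send_flags          = IBV_SEND_SIGNALED;   /* request a completion */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    int ret = ibv_post_send(qp, &wr, &bad_wr);
    if (ret) {
        fprintf(stderr, "ibv_post_send failed: %d\n", ret);
        return -1;
    }

    /* Busy-poll for the completion; a pipelined sender would instead keep
     * filling and posting further chunks and reap completions in batches. */
    struct ibv_wc wc;
    int n;
    while ((n = ibv_poll_cq(cq, 1, &wc)) == 0)
        ;
    if (n < 0 || wc.status != IBV_WC_SUCCESS) {
        fprintf(stderr, "RDMA write failed: %s\n", ibv_wc_status_str(wc.status));
        return -1;
    }
    return 0;
}
```

Overlapping the posting of the next chunk with the completion of the previous one is what lets the transfer stage keep pace with the write and read stages.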
Network Transfer Bandwidth [Chart: chunk size = 128 KB] • Bandwidth is insensitive to the buffer pool size • 8 I/O streams can saturate the network CCGrid 2011 25
Raw Data Bandwidth Test (3): Pipeline Bandwidth [Diagram: measures the end-to-end bandwidth of the full pipeline, from FUSE-intercepted writes on the source node, through both RDMA buffer pools, to FUSE-served reads on the target node] CCGrid 2011 26
Pipeline Bandwidth [Chart: buffer pool = 8 MB, chunk size = 128 KB] • End-to-end pipeline bandwidth is determined by the aggregation bandwidth • A chunk size of 128 KB is generally the best • Insensitive to the buffer pool size (see the bottleneck estimate below) CCGrid 2011 27
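A back-of-the-envelope way to see why aggregation is the limiter, using the ~800 MB/s aggregation and ~1450 MB/s network figures from the earlier microbenchmarks and assuming the read stage on the target is no slower than aggregation: a pipeline's steady-state throughput is bounded by its slowest stage,

$$B_{\text{pipeline}} \approx \min\big(B_{\text{aggregation}},\, B_{\text{network}}\big) \approx \min(800,\ 1450)\ \text{MB/s} \approx 800\ \text{MB/s}.$$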
Time to Complete a Process Migration (lower is better) [Chart: time to migrate 8 processes; PPMR with buffer pool = 8 MB and chunk size = 128 KB achieves speedups of 10.7X, 4.3X and 2.3X over the three existing approaches] CCGrid 2011 28
Application Execution Time (lower is better) [Chart: execution time with migration, compared against a run with no migration; overheads of +38%, +9.2% and +5.1% are annotated for the compared approaches] CCGrid 2011 29
Scalability: Memory Footprint [Chart: migration time for different problem sizes with 64 processes on 8 nodes; speedups of 2.6X, 7.3X and 10.9X are annotated] CCGrid 2011 30
Scalability: I/O Multiplexing [Chart: pipeline bandwidth vs. processes per node; LU.D with 8/16/32/64 processes on 8 compute nodes, migration data = 1500 MB; more processes per node yields better pipeline bandwidth] CCGrid 2011 31
Outline • Introduction and Motivation • Profiling Process Migration • Pipelined Process Migration with RDMA • Performance Evaluation • Conclusions and Future Work CCGrid 2011 32
Conclusions • Process migration overcomes the drawbacks of C/R • Process migration needs an optimized I/O path • Pipelined Process Migration with RDMA (PPMR) – Pipelines all steps of the I/O path CCGrid 2011 33
Software Distribution • The PPMR design has been released in MVAPICH2 1.7 – Downloadable from http://mvapich.cse.ohio-state.edu/ CCGrid 2011 34
Future Work • How PPMR can benefit general cluster applications – Cluster-wide load balancing – Server consolidation • How diskless cluster architecture can utilize PPMR CCGrid 2011 35
Thank you! http://mvapich.cse.ohio-state.edu {ouyangx, rajachan, besseron, panda} @cse.ohio-state.edu Network-Based Computing Laboratory CCGrid 2011 36