HetPipe: Enabling Large DNN Training on (Whimpy) Heterogeneous GPU Clusters through Integration of Pipelined Model Parallelism and Data Parallelism
Jay H. Park, Gyeongchan Yun, Chang M. Yi, Nguyen T. Nguyen, Seungmin Lee, Jaesik Choi†, Sam H. Noh, and Young-ri Choi†
Contents ▪ Motivation & Background ▪ HetPipe in a Nutshell ▪ Our System: HetPipe ▪ Evaluation ▪ Conclusion
Motivation ▪ DNN (Deep Neural Network) models continue to grow • Need more powerful GPUs for training!
Motivation ▪ Short release cycles of new GPU architectures leave older, whimpy GPUs behind • Use of heterogeneous GPUs is inevitable! • What to do with the whimpy GPUs?
DNN Training
▪ Forward pass i: minibatch i (training data) is run through the model with weight parameters w to produce a prediction (e.g., "Cat?") and the loss
▪ Backward pass i: the update u_i is computed from the loss and the weights are updated: w_{i+1} = w_i − η · u_i
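To make the update rule concrete, here is a minimal sketch of one minibatch step on a toy one-layer model; the model, shapes, and names (train_step, eta) are illustrative assumptions, not HetPipe code.

```python
import numpy as np

# One minibatch step of DNN training on a toy linear model with a
# squared-error loss (illustrative only).
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 1))          # weight parameters w
eta = 0.01                           # learning rate

def train_step(w, x_batch, y_batch, eta):
    # Forward pass i: compute predictions and the loss for minibatch i.
    pred = x_batch @ w
    loss = np.mean((pred - y_batch) ** 2)
    # Backward pass i: compute the update u_i (here, the loss gradient).
    grad = 2 * x_batch.T @ (pred - y_batch) / len(x_batch)
    # Weight update: w_{i+1} = w_i - eta * u_i
    return w - eta * grad, loss

x = rng.normal(size=(32, 4))         # minibatch i (training data)
y = rng.normal(size=(32, 1))
w, loss = train_step(w, x, y, eta)
```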
Parallelizing DNN Training
▪ Data parallelism (DP): Worker 1 … Worker n each run forward and backward passes on a full replica of the model; the weights are synchronized through a Parameter Server (PS) or AllReduce
▪ Model parallelism (MP): the model is split across GPUs to cope with the GPU memory limitation, but suffers from low GPU utilization
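As a reference point for DP, here is a minimal parameter-server-style sketch in which every worker holds a full model replica and the PS applies the averaged update; the toy model, worker count, and shapes are assumptions for illustration only.

```python
import numpy as np

# Data parallelism with a parameter server (PS), sketched on a toy linear model.
rng = np.random.default_rng(0)
w_global = rng.normal(size=(4, 1))   # weights kept by the PS
eta = 0.01
n_workers = 3

def local_gradient(w, x_batch, y_batch):
    # Each worker computes a gradient on its own minibatch with a full replica.
    pred = x_batch @ w
    return 2 * x_batch.T @ (pred - y_batch) / len(x_batch)

for step in range(10):
    grads = []
    for _ in range(n_workers):                      # one minibatch per worker
        x = rng.normal(size=(32, 4))
        y = rng.normal(size=(32, 1))
        grads.append(local_gradient(w_global, x, y))
    # The PS synchronizes the weights by applying the averaged update.
    w_global = w_global - eta * np.mean(grads, axis=0)
```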
Parallelizing DNN Training
▪ Attempts to improve MP utilization with pipelined model parallelism (PMP) • PipeDream [SOSP’19] • GPipe [NIPS’19]
▪ However, both are designed for homogeneous GPUs and for a single PMP worker
HetPipe in a Nutshell
▪ Integrates PMP + DP and supports heterogeneous GPUs
▪ Virtual worker (VW): a group of multiple GPUs, possibly of different types (the R, G, V, Q boxes in the figure) • Each VW runs PMP internally
▪ DP across VW 1 … VW n, synchronized through a Parameter Server with WSP (Wave Synchronous Parallel)
Challenges in Integrating PMP + DP on Heterogeneous GPUs
• What weight version should be used by each VW to synchronize with the other VWs?
• How do we reduce virtual worker stragglers when we consider DP?
• Many more in the paper
HetPipe Contributions
▪ Enables large DNN training on heterogeneous GPUs • Aggregates heterogeneous resources • Reduces the straggler problem
▪ Integrates PMP + DP
▪ Novel parameter synchronization model: WSP (Wave Synchronous Parallel)
▪ Proof of WSP convergence
HetPipe Workflow
▪ Resource Allocator: given the cluster configuration and the DNN model, assigns k GPUs to each virtual worker
▪ Model Partitioner: divides the model into k partitions (P1 … Pk) per VW, placing one partition on each GPU
▪ Each VW (VW 1 … VW n) then executes its pipeline over time and synchronizes with the other VWs through the PS
▪ Weights are managed with local staleness within each VW and with global staleness across VWs through the PS (a partitioning sketch follows)
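The following is a minimal sketch of the two workflow steps above, grouping GPUs into virtual workers and splitting a model into k contiguous partitions; the naive equal-size split and all names are our own simplifications, since HetPipe's partitioner balances partitions by memory and compute requirements.

```python
from typing import List

def assign_gpus(cluster: List[str], k: int) -> List[List[str]]:
    """Group the cluster's GPUs into virtual workers of k GPUs each."""
    return [cluster[i:i + k] for i in range(0, len(cluster), k)]

def partition_layers(layers: List[str], k: int) -> List[List[str]]:
    """Divide the model into k contiguous partitions P1..Pk (naive split)."""
    per_part = (len(layers) + k - 1) // k
    return [layers[i:i + per_part] for i in range(0, len(layers), per_part)]

cluster = ["R", "R", "G", "G", "V", "V", "Q", "Q"]   # GPU types from the slide
layers = [f"layer{i}" for i in range(16)]            # illustrative DNN model
vws = assign_gpus(cluster, k=4)                      # e.g., VW1 = [R, R, G, G]
parts = partition_layers(layers, k=4)                # P1..P4, one per GPU
```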
Outline ▪ Motivation & Background ▪ HetPipe in a Nutshell ▪ Our System: HetPipe • Pipelined Model Parallelism Within a VW • Data Parallelism with Multiple VWs ▪ Evaluation ▪ Conclusion
Pipelined Model Parallelism Within a VW
▪ Execution of a virtual worker: GPU1–GPU4 process the forward and backward passes of successive minibatches in a pipelined manner over time
▪ N_m minibatches are processed concurrently in this pipelined manner (N_m = 4 in the example)
▪ w_local is a consistent version of the weights maintained within a VW
Pipelined Model Parallelism Within a VW
▪ Weight management procedure • Minibatches 1–4 all start from the initial weight version w_0, i.e., w_local = w_0 = w_1 = w_2 = w_3 = w_4 • When minibatch 1 completes, its update u_1 is applied: w_local ← w_local + u_1 • Minibatch 5 then starts with w_5 ← w_local
Pipelined Model Parallelism Within a VW
▪ Local staleness (s_local): the maximum number of missing updates • w_5 ← w_local reflects only u_1, so w_5 is missing the updates of minibatches 2 to 4 → s_local = 3
Pipelined Model Parallelism Within a VW
▪ Likewise, when minibatch 2 completes, its update u_2 is applied: w_local ← w_local + u_2 (w_local was w_0 + u_1) • Minibatch 6 starts with w_6 ← w_local, missing the updates of minibatches 3 to 5 → again s_local = 3 (see the sketch below)
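Here is a minimal sketch of this local weight-management idea: each newly admitted minibatch snapshots w_local, and w_local absorbs each update as soon as the corresponding backward pass completes. The sequential simulation, the fake_update helper, and the bookkeeping variables are our illustrative assumptions, not HetPipe's implementation.

```python
import numpy as np

N_M = 4                                  # minibatches executing concurrently (one wave)
w_local = np.zeros(8)                    # initial weight version w_0
snapshots = {}                           # weight version w_p used by minibatch p
applied = 0                              # updates already reflected in w_local

def fake_update(i):
    """Stand-in for the update u_i produced by minibatch i's backward pass."""
    return np.full(8, 0.01 * i)

for p in range(1, 9):                    # admit minibatches 1..8 into the pipeline
    snapshots[p] = w_local.copy()        # w_p <- w_local
    s_local = (p - 1) - applied          # updates started before p but not yet applied
    assert s_local <= N_M - 1            # local staleness is bounded by N_m - 1
    if p >= N_M:                         # pipeline is full: the oldest minibatch finishes
        applied += 1
        w_local += fake_update(applied)  # w_local <- w_local + u_i
```

Running this reproduces the slide example: minibatch 5 snapshots w_0 + u_1 (missing u_2..u_4) and minibatch 6 snapshots w_0 + u_1 + u_2 (missing u_3..u_5).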
Outline ▪ Motivation & Background ▪ HetPipe in a Nutshell ▪ Our System: HetPipe • Pipelined Model Parallelism Within a VW • Data Parallelism with Multiple VWs ▪ Evaluation ▪ Conclusion
Data Parallelism with Multiple VWs
▪ Wave: a sequence of N_m concurrently executing minibatches (e.g., wave 0 = minibatches 1–4, wave 1 = minibatches 5–8)
▪ VW 1 … VW n progress through their minibatches over clocks 0, 1, 2, …, and push & pull the global weights w_global kept by the Parameter Server
Data Parallelism with Multiple VWs
▪ Push occurs every clock • At the end of wave 0, VW 1 pushes the aggregated updates of the wave, u = u_1 + u_2 + u_3 + u_4, to the PS, which applies them: w_global ← w_global + u • Meanwhile minibatch 8 is blocked
Data Parallelism with Multiple VWs
▪ Pull occurs intermittently, depending on the user-defined clock distance threshold D • If D = 0, a pull occurs every clock • VW 1 waits before its pull until VW 2 pushes
Data Parallelism with Multiple VWs
▪ Pull occurs intermittently, depending on the user-defined clock distance threshold D (D = 0 here) • VW 2 pushes its aggregated wave-0 updates u, and the PS applies them: w_global ← w_global + u
Data Parallelism with Multiple VWs
▪ Pull occurs intermittently, depending on the user-defined clock distance threshold D (D = 0 here) • The pull occurs after all VWs have pushed: each VW sets w_local ← w_global
Data Parallelism with Multiple VWs
▪ Pull occurs intermittently, depending on the user-defined clock distance threshold D (D = 0 here) • Minibatch 8 then starts with w_8 = w_0 + (u_1 + u_2 + u_3 + u_4)_{vw1,vw2}, i.e., the wave-0 updates of both VWs (a push/pull sketch follows)
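The following is a minimal sketch of this clock-based push/pull behavior with a clock distance threshold D, using Python threads and a toy aggregated update; the synchronization details (including pulling every clock) are our simplifications of the WSP idea, not HetPipe's parameter-server code.

```python
import threading
import numpy as np

D = 0                                   # clock distance threshold
N_VW = 2                                # number of virtual workers
cond = threading.Condition()
w_global = np.zeros(4)                  # weights kept by the parameter server
clocks = [0] * N_VW                     # clocks pushed so far by each VW

def run_vw(vw_id, n_clocks=3):
    global w_global
    w_local = w_global.copy()
    for c in range(n_clocks):
        # ... the local pipeline executes wave c here, producing an aggregated update
        wave_update = np.full(4, 0.1 * (vw_id + 1))
        with cond:
            w_global = w_global + wave_update        # push: w_global <- w_global + u
            clocks[vw_id] = c + 1
            cond.notify_all()
            # A VW may only run ahead of the slowest VW by at most D clocks,
            # so wait until min(clocks) >= c + 1 - D before continuing.
            cond.wait_for(lambda: min(clocks) >= c + 1 - D)
            # Pull: refresh w_local from w_global (done every clock here for
            # simplicity; with D > 0 the pull can be deferred).
            w_local = w_global.copy()

threads = [threading.Thread(target=run_vw, args=(i,)) for i in range(N_VW)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With D = 0 each VW blocks at the end of every clock until all VWs have pushed; with D = 1 it may run one clock ahead, which is the case shown on the later slides.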
Data Parallelism with Multiple VWs
▪ Local staleness (s_local) and global staleness (s_global) with WSP (D = 0, clock 2) • In VW 1, minibatch 11 starts with w_11 = w_0 + (u_1 + u_2 + u_3 + u_4)_{vw1,vw2} + (u_5 + u_6 + u_7)_{vw1} • s_local spans VW 1's own in-flight updates (u_8 + u_9 + u_10)_{vw1}; s_global spans the not-yet-seen updates (u_5 + u_6 + u_7)_{vw2} • At this point w_global = w_0 + (u_1 + u_2 + u_3 + u_4)_{vw1,vw2}
Data Parallelism with Multiple VWs
▪ Local staleness (s_local) and global staleness (s_global) with WSP (D = 0, clock 2) • Minibatch 12 of VW 1 has to wait until the required push & pull completes
Data Parallelism with Multiple VWs
▪ Example of the clock distance threshold D • If D = 1, VW 1 can start minibatch 8 without a pull, running ahead of VW 2
Data Parallelism with Multiple VWs
▪ Example of the clock distance threshold D (D = 1) • Minibatch 12 still has to wait • Minibatch 11 starts with w_11 = w_0 + (u_1 + u_2 + u_3 + u_4 + u_5 + u_6 + u_7)_{vw1} • s_local spans (u_8 + u_9 + u_10)_{vw1}; s_global spans (u_1 + u_2 + u_3 + u_4 + u_5 + u_6 + u_7)_{vw2} • Here w_global = w_0
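From the pipeline examples, the local staleness can be written compactly; the small LaTeX note below states it under the assumption that the N_m = 4 pattern on the slides generalizes (the global-staleness bound as a function of D is derived in the paper and is not reproduced here).

```latex
% Local staleness read off the pipeline example (assumption: the N_m = 4
% pattern shown on the slides generalizes to any pipeline depth).
\[
  s_{\mathrm{local}} = N_m - 1
\]
% e.g., with N_m = 4 minibatches in flight, a newly admitted minibatch can
% miss at most the updates of the 3 minibatches still executing.
```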