HetPipe: Enabling Large DNN Training on (Whimpy) Heterogeneous GPU Clusters through Integration of Pipelined Model Parallelism and Data Parallelism
Jay H. Park, Gyeongchan Yun, Chang M. Yi, Nguyen T. Nguyen, Seungmin Lee, Jaesik Choi†, Sam H. Noh, and Young-ri Choi†
Contents ▪ Motivation & Background ▪ HetPipe in a Nutshell ▪ Our System: HetPipe ▪ Evaluation ▪ Conclusion
Motivation ▪ DNN (Deep Neural Network) models continue to grow • Need more powerful GPUs for training!
Motivation ▪ Short release cycles of new GPU architectures leave older, whimpy GPUs behind • Use of heterogeneous GPUs is inevitable! • What to do with the whimpy GPUs?
DNN Training
▪ Forward pass i: minibatch i (training data) is run through the model with weight parameters w to produce a prediction (e.g., "Cat?") and the loss
▪ Backward pass i: the update u_i is computed from the loss and the weights are updated: w_{i+1} = w_i − η · u_i
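To make the update rule concrete, here is a minimal sketch of one minibatch step on a toy one-layer model; the model, shapes, and names (train_step, eta) are illustrative assumptions, not HetPipe code.

```python
import numpy as np

# One minibatch step of DNN training on a toy linear model with a
# squared-error loss (illustrative only).
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 1))          # weight parameters w
eta = 0.01                           # learning rate

def train_step(w, x_batch, y_batch, eta):
    # Forward pass i: compute predictions and the loss for minibatch i.
    pred = x_batch @ w
    loss = np.mean((pred - y_batch) ** 2)
    # Backward pass i: compute the update u_i (here, the loss gradient).
    grad = 2 * x_batch.T @ (pred - y_batch) / len(x_batch)
    # Weight update: w_{i+1} = w_i - eta * u_i
    return w - eta * grad, loss

x = rng.normal(size=(32, 4))         # minibatch i (training data)
y = rng.normal(size=(32, 1))
w, loss = train_step(w, x, y, eta)
```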
Parallelizing DNN Training
▪ Data parallelism (DP): Worker 1 … Worker n each run forward and backward passes on a full replica of the model; the weights are synchronized through a Parameter Server (PS) or AllReduce
▪ Model parallelism (MP): the model is split across GPUs to cope with the GPU memory limitation, but suffers from low GPU utilization
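As a reference point for DP, here is a minimal parameter-server-style sketch in which every worker holds a full model replica and the PS applies the averaged update; the toy model, worker count, and shapes are assumptions for illustration only.

```python
import numpy as np

# Data parallelism with a parameter server (PS), sketched on a toy linear model.
rng = np.random.default_rng(0)
w_global = rng.normal(size=(4, 1))   # weights kept by the PS
eta = 0.01
n_workers = 3

def local_gradient(w, x_batch, y_batch):
    # Each worker computes a gradient on its own minibatch with a full replica.
    pred = x_batch @ w
    return 2 * x_batch.T @ (pred - y_batch) / len(x_batch)

for step in range(10):
    grads = []
    for _ in range(n_workers):                      # one minibatch per worker
        x = rng.normal(size=(32, 4))
        y = rng.normal(size=(32, 1))
        grads.append(local_gradient(w_global, x, y))
    # The PS synchronizes the weights by applying the averaged update.
    w_global = w_global - eta * np.mean(grads, axis=0)
```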
Parallelizing DNN Training
▪ Attempts to improve MP utilization with pipelined model parallelism (PMP) • PipeDream [SOSP’19] • GPipe [NIPS’19]
▪ However, both are designed for homogeneous GPUs and for a single PMP worker
HetPipe in a Nutshell
▪ Integrates PMP + DP and supports heterogeneous GPUs
▪ Virtual worker (VW): a group of multiple GPUs, possibly of different types (the R, G, V, Q boxes in the figure) • Each VW runs PMP internally
▪ DP across VW 1 … VW n, synchronized through a Parameter Server with WSP (Wave Synchronous Parallel)
Challenges in Integrating PMP + DP on Heterogeneous GPUs
• What weight version should be used by each VW to synchronize with the other VWs?
• How do we reduce virtual worker stragglers when we consider DP?
• Many more in the paper
HetPipe Contributions
▪ Enables large DNN training on heterogeneous GPUs • Aggregates heterogeneous resources • Reduces the straggler problem
▪ Integrates PMP + DP
▪ Novel parameter synchronization model: WSP (Wave Synchronous Parallel)
▪ Proof of WSP convergence
HetPipe Workflow
▪ Resource Allocator: given the cluster configuration and the DNN model, assigns k GPUs to each virtual worker
▪ Model Partitioner: divides the model into k partitions (P1 … Pk) per VW, placing one partition on each GPU
▪ Each VW (VW 1 … VW n) then executes its pipeline over time and synchronizes with the other VWs through the PS
▪ Weights are managed with local staleness within each VW and with global staleness across VWs through the PS (a partitioning sketch follows)
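The following is a minimal sketch of the two workflow steps above, grouping GPUs into virtual workers and splitting a model into k contiguous partitions; the naive equal-size split and all names are our own simplifications, since HetPipe's partitioner balances partitions by memory and compute requirements.

```python
from typing import List

def assign_gpus(cluster: List[str], k: int) -> List[List[str]]:
    """Group the cluster's GPUs into virtual workers of k GPUs each."""
    return [cluster[i:i + k] for i in range(0, len(cluster), k)]

def partition_layers(layers: List[str], k: int) -> List[List[str]]:
    """Divide the model into k contiguous partitions P1..Pk (naive split)."""
    per_part = (len(layers) + k - 1) // k
    return [layers[i:i + per_part] for i in range(0, len(layers), per_part)]

cluster = ["R", "R", "G", "G", "V", "V", "Q", "Q"]   # GPU types from the slide
layers = [f"layer{i}" for i in range(16)]            # illustrative DNN model
vws = assign_gpus(cluster, k=4)                      # e.g., VW1 = [R, R, G, G]
parts = partition_layers(layers, k=4)                # P1..P4, one per GPU
```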
Outline ▪ Motivation & Background ▪ HetPipe in a Nutshell ▪ Our System: HetPipe • Pipelined Model Parallelism Within a VW • Data Parallelism with Multiple VWs ▪ Evaluation ▪ Conclusion
Pipelined Model Parallelism Within a VW
▪ Execution of a virtual worker: GPU1–GPU4 process the forward and backward passes of successive minibatches in a pipelined manner over time
▪ N_m minibatches are processed concurrently in this pipelined manner (N_m = 4 in the example)
▪ w_local is a consistent version of the weights maintained within a VW
Pipelined Model Parallelism Within a VW
▪ Weight management procedure • Minibatches 1–4 all start from the initial weight version w_0, i.e., w_local = w_0 = w_1 = w_2 = w_3 = w_4 • When minibatch 1 completes, its update u_1 is applied: w_local ← w_local + u_1 • Minibatch 5 then starts with w_5 ← w_local
Pipelined Model Parallelism Within a VW
▪ Local staleness (s_local): the maximum number of missing updates • w_5 ← w_local reflects only u_1, so w_5 is missing the updates of minibatches 2 to 4 → s_local = 3
Pipelined Model Parallelism Within a VW
▪ Likewise, when minibatch 2 completes, its update u_2 is applied: w_local ← w_local + u_2 (w_local was w_0 + u_1) • Minibatch 6 starts with w_6 ← w_local, missing the updates of minibatches 3 to 5 → again s_local = 3 (see the sketch below)
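Here is a minimal sketch of this local weight-management idea: each newly admitted minibatch snapshots w_local, and w_local absorbs each update as soon as the corresponding backward pass completes. The sequential simulation, the fake_update helper, and the bookkeeping variables are our illustrative assumptions, not HetPipe's implementation.

```python
import numpy as np

N_M = 4                                  # minibatches executing concurrently (one wave)
w_local = np.zeros(8)                    # initial weight version w_0
snapshots = {}                           # weight version w_p used by minibatch p
applied = 0                              # updates already reflected in w_local

def fake_update(i):
    """Stand-in for the update u_i produced by minibatch i's backward pass."""
    return np.full(8, 0.01 * i)

for p in range(1, 9):                    # admit minibatches 1..8 into the pipeline
    snapshots[p] = w_local.copy()        # w_p <- w_local
    s_local = (p - 1) - applied          # updates started before p but not yet applied
    assert s_local <= N_M - 1            # local staleness is bounded by N_m - 1
    if p >= N_M:                         # pipeline is full: the oldest minibatch finishes
        applied += 1
        w_local += fake_update(applied)  # w_local <- w_local + u_i
```

Running this reproduces the slide example: minibatch 5 snapshots w_0 + u_1 (missing u_2..u_4) and minibatch 6 snapshots w_0 + u_1 + u_2 (missing u_3..u_5).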
Outline ▪ Motivation & Background ▪ HetPipe in a Nutshell ▪ Our System: HetPipe • Pipelined Model Parallelism Within a VW • Data Parallelism with Multiple VWs ▪ Evaluation ▪ Conclusion
Data Parallelism with Multiple VWs
▪ Wave: a sequence of N_m concurrently executing minibatches (e.g., wave 0 = minibatches 1–4, wave 1 = minibatches 5–8)
▪ VW 1 … VW n progress through their minibatches over clocks 0, 1, 2, …, and push & pull the global weights w_global kept by the Parameter Server
Data Parallelism with Multiple VWs
▪ Push occurs every clock • At the end of wave 0, VW 1 pushes the aggregated updates of the wave, u = u_1 + u_2 + u_3 + u_4, to the PS, which applies them: w_global ← w_global + u • Meanwhile minibatch 8 is blocked
Data Parallelism with Multiple VWs
▪ Pull occurs intermittently, depending on the user-defined clock distance threshold D • If D = 0, a pull occurs every clock • VW 1 waits before its pull until VW 2 pushes
Data Parallelism with Multiple VWs
▪ Pull occurs intermittently, depending on the user-defined clock distance threshold D (D = 0 here) • VW 2 pushes its aggregated wave-0 updates u, and the PS applies them: w_global ← w_global + u
Data Parallelism with Multiple VWs
▪ Pull occurs intermittently, depending on the user-defined clock distance threshold D (D = 0 here) • The pull occurs after all VWs have pushed: each VW sets w_local ← w_global
Data Parallelism with Multiple VWs
▪ Pull occurs intermittently, depending on the user-defined clock distance threshold D (D = 0 here) • Minibatch 8 then starts with w_8 = w_0 + (u_1 + u_2 + u_3 + u_4)_{vw1,vw2}, i.e., the wave-0 updates of both VWs (a push/pull sketch follows)
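The following is a minimal sketch of this clock-based push/pull behavior with a clock distance threshold D, using Python threads and a toy aggregated update; the synchronization details (including pulling every clock) are our simplifications of the WSP idea, not HetPipe's parameter-server code.

```python
import threading
import numpy as np

D = 0                                   # clock distance threshold
N_VW = 2                                # number of virtual workers
cond = threading.Condition()
w_global = np.zeros(4)                  # weights kept by the parameter server
clocks = [0] * N_VW                     # clocks pushed so far by each VW

def run_vw(vw_id, n_clocks=3):
    global w_global
    w_local = w_global.copy()
    for c in range(n_clocks):
        # ... the local pipeline executes wave c here, producing an aggregated update
        wave_update = np.full(4, 0.1 * (vw_id + 1))
        with cond:
            w_global = w_global + wave_update        # push: w_global <- w_global + u
            clocks[vw_id] = c + 1
            cond.notify_all()
            # A VW may only run ahead of the slowest VW by at most D clocks,
            # so wait until min(clocks) >= c + 1 - D before continuing.
            cond.wait_for(lambda: min(clocks) >= c + 1 - D)
            # Pull: refresh w_local from w_global (done every clock here for
            # simplicity; with D > 0 the pull can be deferred).
            w_local = w_global.copy()

threads = [threading.Thread(target=run_vw, args=(i,)) for i in range(N_VW)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With D = 0 each VW blocks at the end of every clock until all VWs have pushed; with D = 1 it may run one clock ahead, which is the case shown on the later slides.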
Data Parallelism with Multiple VWs
▪ Local staleness (s_local) and global staleness (s_global) with WSP (D = 0, clock 2) • In VW 1, minibatch 11 starts with w_11 = w_0 + (u_1 + u_2 + u_3 + u_4)_{vw1,vw2} + (u_5 + u_6 + u_7)_{vw1} • s_local spans VW 1's own in-flight updates (u_8 + u_9 + u_10)_{vw1}; s_global spans the not-yet-seen updates (u_5 + u_6 + u_7)_{vw2} • At this point w_global = w_0 + (u_1 + u_2 + u_3 + u_4)_{vw1,vw2}
Data Parallelism with Multiple VWs
▪ Local staleness (s_local) and global staleness (s_global) with WSP (D = 0, clock 2) • Minibatch 12 of VW 1 has to wait until the required push & pull completes
Data Parallelism with Multiple VWs
▪ Example of the clock distance threshold D • If D = 1, VW 1 can start minibatch 8 without a pull, running ahead of VW 2
Data Parallelism with Multiple VWs
▪ Example of the clock distance threshold D (D = 1) • Minibatch 12 still has to wait • Minibatch 11 starts with w_11 = w_0 + (u_1 + u_2 + u_3 + u_4 + u_5 + u_6 + u_7)_{vw1} • s_local spans (u_8 + u_9 + u_10)_{vw1}; s_global spans (u_1 + u_2 + u_3 + u_4 + u_5 + u_6 + u_7)_{vw2} • Here w_global = w_0
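From the pipeline examples, the local staleness can be written compactly; the small LaTeX note below states it under the assumption that the N_m = 4 pattern on the slides generalizes (the global-staleness bound as a function of D is derived in the paper and is not reproduced here).

```latex
% Local staleness read off the pipeline example (assumption: the N_m = 4
% pattern shown on the slides generalizes to any pipeline depth).
\[
  s_{\mathrm{local}} = N_m - 1
\]
% e.g., with N_m = 4 minibatches in flight, a newly admitted minibatch can
% miss at most the updates of the 3 minibatches still executing.
```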