PipeDream: Generalized Pipeline Parallelism for DNN Training Deepak - PowerPoint PPT Presentation

PipeDream: Generalized Pipeline Parallelism for DNN Training
Deepak Narayanan§, Aaron Harlap†, Amar Phanishayee★, Vivek Seshadri★, Nikhil R. Devanur★, Gregory R. Ganger†, Phillip B. Gibbons†, Matei Zaharia§
★Microsoft Research, †Carnegie Mellon University, §Stanford University

Deep Neural Networks have enabled state-of-the-art results across a range of applications…
• Image Classification (e.g., dog vs. cat)
• Machine Translation (e.g., Tamil text translated to "Hello, my name is Deepak")
• Speech-to-Text
• Game Playing

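…but first need to be trained!
Activations flow forward through the model to produce a prediction $\hat{y}_i$ (e.g., "lion"), which is compared against the true label $y_i$ (e.g., "tiger") using $\text{loss}(y_i, \hat{y}_i)$; gradients flow backward. Weight parameters $W$ are optimized using standard iterative optimization procedures: $W = W - \eta \cdot \nabla W$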

Background: DNN Training
Model training is time- and compute-intensive! Activations flow forward to produce a prediction $\hat{y}_i$ (e.g., "lion") that is scored against the label $y_i$ (e.g., "tiger") with $\text{loss}(y_i, \hat{y}_i)$, and gradients flow backward. Weight parameters $W$ are optimized using standard iterative optimization procedures: $W = W - \eta \cdot \nabla W$
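The update rule on this slide can be illustrated with a minimal sketch. The one-weight model, the squared-error loss, the learning rate, and the helper names `sgd_step` and `loss_and_grad` are all assumptions made for illustration; none of them come from the slides.

```python
def sgd_step(weights, grads, lr):
    """Apply W <- W - lr * grad element-wise and return the new weights."""
    return [w - lr * g for w, g in zip(weights, grads)]


def loss_and_grad(w, x, y):
    """Toy model y_hat = w * x with squared-error loss; returns (loss, dloss/dw)."""
    y_hat = w * x              # forward pass: activation / prediction
    loss = (y_hat - y) ** 2    # loss(y, y_hat)
    grad = 2 * (y_hat - y) * x # backward pass: gradient w.r.t. w
    return loss, grad


w = 0.0
for _ in range(50):
    loss, grad = loss_and_grad(w, x=2.0, y=4.0)
    (w,) = sgd_step([w], [grad], lr=0.05)
```

Each iteration runs a forward pass, a backward pass, and a weight update; with these toy values `w` approaches 2.0, the weight that maps `x = 2.0` to `y = 4.0`.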

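Parallelizing DNN Training: Data Parallelism
$n$ copies of the same model run on Worker 1 … Worker $n$. Worker $i$ computes a gradient $\nabla W^i$ on its share of the data, and the gradients are aggregated using AllReduce: $\nabla W = \nabla W^1 + \nabla W^2 + \cdots + \nabla W^n$
Despite many performance optimizations, communication overhead is high! (Figure: communication overhead measured on 8xV100s with NVLink (AWS), PyTorch + NCCL 2.4.)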
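The gradient aggregation step can be sketched in a few lines. This simulates only the result of an AllReduce sum on plain Python lists; it is not how NCCL implements the collective, and the function name `all_reduce_sum` is hypothetical.

```python
def all_reduce_sum(worker_grads):
    """Sum per-worker gradient vectors element-wise.

    Every worker ends up with its own copy of the same summed gradient,
    which is the defining property of an AllReduce.
    """
    total = [sum(values) for values in zip(*worker_grads)]
    return [list(total) for _ in worker_grads]


# Three workers, each holding a gradient for the same two parameters.
worker_grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
reduced = all_reduce_sum(worker_grads)
```

Because every worker must receive the full summed gradient on every iteration, the volume of data exchanged grows with the model size, which is the communication overhead the slide points to.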

Parallelizing DNN Training: Model Parallelism
A single version of the weights is split over workers (Worker 1 … Worker $n$), and all inputs pass through every worker. Activations and gradients are sent between workers using peer-to-peer communication. The result is low hardware efficiency.

PipeDream: Pipeline-Parallel Training
We propose pipeline parallelism, a combination of data and model parallelism with pipelining. Pipeline-parallel training is up to 5.3x faster than data parallelism without sacrificing the final accuracy of the model.

Pipelining in DNN Training != Traditional Pipelining
• How should the operators in a DNN model be partitioned into pipeline stages?
  • Each operator has a different computation time
  • Activations and gradients need to be communicated across stages
• How should forward and backward passes of different inputs be scheduled?
  • Training is bidirectional
  • Forward pass followed by backward pass to compute gradients
• How should weight and activation versions be managed?
  • Backward pass operators depend on internal state ($W$, activations)

Outline
• Background and Motivation
• Challenges for effective pipeline-parallel training
  • Partitioning and load balancing operators across workers
  • Scheduling of forward and backward passes of different inputs
  • Managing weights and activation versions for effective learning
• Evaluation

How do we assign operators to pipeline stages?
Stage 1, Stage 2, and Stage 3 have compute times $T_1$, $T_2$, $T_3$; communication between adjacent stages takes $T^{comm}_{1 \to 2}$ and $T^{comm}_{2 \to 3}$.
• Desideratum #1: $T_1$, $T_2$, $T_3$ as close to each other as possible
  • Compute resources seldom idle → better hardware efficiency
• Desideratum #2: $T^{comm}_{1 \to 2}$ and $T^{comm}_{2 \to 3}$ minimized
  • Less communication → better hardware efficiency
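The first desideratum can be sketched as a small dynamic program that splits a chain of operators into contiguous stages so that the slowest stage is as fast as possible. This is an illustrative simplification, not PipeDream's optimizer: it ignores communication cost and stage replication, and the function name `balance_stages` and the per-operator times are made up.

```python
from functools import lru_cache


def balance_stages(op_times, num_stages):
    """Split op_times into num_stages contiguous groups, minimizing the
    total time of the slowest group. Assumes len(op_times) >= num_stages."""
    n = len(op_times)
    prefix = [0.0]
    for t in op_times:
        prefix.append(prefix[-1] + t)

    @lru_cache(maxsize=None)
    def best(end, stages):
        # Best (bottleneck time, split points) for the first `end` operators.
        if stages == 1:
            return prefix[end], ()
        options = []
        for cut in range(stages - 1, end):
            left, cuts = best(cut, stages - 1)
            last_stage = prefix[end] - prefix[cut]
            options.append((max(left, last_stage), cuts + (cut,)))
        return min(options)

    bottleneck, cuts = best(n, num_stages)
    bounds = [0, *cuts, n]
    return bottleneck, [op_times[a:b] for a, b in zip(bounds, bounds[1:])]


bottleneck, stages = balance_stages([1, 2, 3, 4, 5], num_stages=3)
```

For these made-up operator times the best split is `[1, 2, 3]`, `[4]`, `[5]`, with a bottleneck stage time of 6. A fuller treatment would add the inter-stage communication terms from the second desideratum to the cost being minimized.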

How do we assign operators to pipeline stages?
Example: a stage with compute time = 2 feeds a stage with compute time = 1. Replicating the first stage on two workers gives it throughput = (1 / 2) × 2 = 1, which matches the second stage's throughput = 1.
• Better load balancing across stages
• For some operators, ∑ᵢ (…)ᵢ < 2 · (…)_int, so data-parallel communication is small
Replication of stages helps load balance computation and reduce communication between workers.
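The arithmetic on this slide can be checked with a minimal sketch, assuming that a stage's throughput is its number of replicas divided by its compute time and that the pipeline runs at the rate of its slowest stage. The function names are hypothetical.

```python
def stage_throughput(compute_time, replicas=1):
    """Inputs processed per unit time by one stage replicated `replicas` times."""
    return replicas / compute_time


def pipeline_throughput(stages):
    """A pipeline is limited by its slowest stage.

    `stages` is a list of (compute_time, replicas) pairs.
    """
    return min(stage_throughput(t, r) for t, r in stages)


unreplicated = pipeline_throughput([(2.0, 1), (1.0, 1)])
replicated = pipeline_throughput([(2.0, 2), (1.0, 1)])
```

Without replication the slow stage caps the pipeline at 0.5 inputs per unit time; replicating it twice raises the pipeline to 1.0, so the second stage is no longer left idle.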

Example PipeDream configuration
Configuration: 2-3-2-1. Stages can have different replication factors.
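One way to read such a configuration string, assuming each dash-separated number is the replication factor of one stage in pipeline order, is sketched below. The helper `parse_configuration` is hypothetical and handles only purely numeric configurations.

```python
def parse_configuration(config):
    """Turn a string such as "2-3-2-1" into per-stage replication factors."""
    return [int(part) for part in config.split("-")]


factors = parse_configuration("2-3-2-1")
num_stages = len(factors)   # one entry per pipeline stage
num_workers = sum(factors)  # each replica occupies one worker
```

Under this reading, 2-3-2-1 describes four stages spread over eight workers.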

PipeDream Profiler and Optimizer
Input DNN → Profiler → computational graph with profile → Optimizer. The optimizer also takes deployment constraints such as number of accelerators, memory and interconnect characteristics.
The optimizer determines a partitioning of operators amongst workers, while also deciding replication factors.
Generalizes along many axes:
• Hardware topologies
• Model structures
• Memory capacities of workers
See paper for details of algorithm!

1F1B Scheduling
Workers alternate between forward and backward passes
• Workers always utilized
• Gradients used to update model immediately
To support stage replication, need to modify this mechanism slightly; see paper for details!
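The alternation can be sketched as the order of operations one stage performs. This sketch assumes a startup phase in which stage `s` (0-indexed) of `S` stages runs `S - s` forward passes before its first backward pass; that warm-up count, the function name, and the omission of timing and stage replication are all simplifying assumptions.

```python
def one_f_one_b(stage, num_stages, num_minibatches):
    """Return the ordered list of ("F" | "B", minibatch_id) steps for one stage."""
    warmup = min(num_stages - stage, num_minibatches)
    schedule = [("F", m) for m in range(warmup)]  # startup: forwards only
    next_fwd, next_bwd = warmup, 0
    while next_bwd < num_minibatches:
        schedule.append(("B", next_bwd))          # one backward...
        next_bwd += 1
        if next_fwd < num_minibatches:
            schedule.append(("F", next_fwd))      # ...then one forward
            next_fwd += 1
    return schedule


first_stage = one_f_one_b(stage=0, num_stages=4, num_minibatches=6)
last_stage = one_f_one_b(stage=3, num_stages=4, num_minibatches=3)
```

In this sketch the last stage alternates strictly from the start (F0, B0, F1, B1, …), while earlier stages first admit several minibatches so that later stages have work to do.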

Naïve pipelining leads to mismatch in weight versions
(Figure: weight versions used in the forward pass vs. the backward pass of the same input.)
An input sees updates in its backward pass that were not seen in its forward pass, leading to incorrect gradients.

1F1B Scheduling + Weight Stashing
Naïve pipelining leads to mismatch in weight versions.
Store multiple <weight, activation> versions
• Ensures same weight versions used in both forward and backward pass
• Worst case memory footprint similar to data parallelism
(Figure: forward and backward passes of each input, with the stashed weights reused in the backward pass.)
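Weight stashing can be sketched with a toy single-weight stage. Everything here is an illustrative assumption (the class name `StashingStage`, the scalar model `y = w * x`, the learning rate); it is not PipeDream's implementation, only the idea of keeping a <weight, activation> pair per in-flight minibatch.

```python
class StashingStage:
    """Toy pipeline stage computing y = w * x with per-minibatch weight stashing."""

    def __init__(self, weight, lr):
        self.weight = weight
        self.lr = lr
        self.stash = {}  # minibatch id -> (weight version used, input activation)

    def forward(self, mb_id, x):
        # Remember exactly which weight version and input this minibatch saw.
        self.stash[mb_id] = (self.weight, x)
        return self.weight * x

    def backward(self, mb_id, grad_out):
        # Use the stashed version, not the (possibly newer) current weight.
        w_used, x = self.stash.pop(mb_id)
        grad_w = grad_out * x
        grad_in = grad_out * w_used
        self.weight -= self.lr * grad_w  # update applied to the latest weight
        return grad_in


stage = StashingStage(weight=2.0, lr=0.5)
stage.forward(0, x=1.0)
stage.forward(1, x=3.0)                # both forwards see weight 2.0
stage.backward(0, grad_out=1.0)        # weight becomes 1.5
grad_in_1 = stage.backward(1, grad_out=1.0)
```

The second backward pass still computes its input gradient with weight 2.0, the version its forward pass used, even though the current weight has already moved to 1.5. The stash holds one entry per in-flight minibatch, which is where the extra memory goes.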

Outline
• Background and Motivation
• Challenges for effective pipeline-parallel training
• Evaluation
  • Setup
  • Comparison to Data Parallelism on Time-to-Accuracy
  • Communication Overhead of Pipeline Parallelism
  • Comparison to Model Parallelism and Hybrid Parallelism on Throughput
  • PipeDream's Memory Footprint

Evaluation Setup
• Integrated PipeDream with PyTorch in ~3000 lines of Python code
• Integrated with PyTorch's communication library
  • NCCL backend for Data Parallelism baselines
  • Gloo backend for PipeDream
• Experiments run on three different server types
  • Cluster A: 4xV100 GPUs, PCIe intra-server, and 10 Gbps inter-server (Azure)
  • Cluster B: 8xV100 GPUs, NVLink intra-server, and 25 Gbps inter-server (AWS)
  • Cluster C: 1xTitan X, and 40 Gbps inter-server (private)

PipeDream > Data Parallelism (DP) end-to-end
(Figure: end-to-end comparison plots, annotated "2.46x faster" and "5.28x faster".)

PipeDream vs. Data Parallelism on Time-to-Accuracy

PipeDream vs. Data Parallelism on Time-to-Accuracy
Experiments on 4 different tasks: image classification, translation, language modeling, video captioning.

PipeDream vs. Data Parallelism on Time-to-Accuracy
With the same number of GPUs, PipeDream is up to 5.3x faster than Data Parallelism.

PipeDream vs. Data Parallelism on Time-to-Accuracy
The optimizer recommends a number of different configurations, such as 15-1, Straight, and a fully data-parallel setup.

PipeDream reduces communication overhead
For many models, the intermediate activations and gradients communicated by PipeDream are an order of magnitude smaller than the communication required by Data Parallelism (DP).

Conclusion
• Model and data parallelism often suffer from high communication overhead and low resource utilization for certain models and deployments
• PipeDream shows pipelining can be used to accelerate DNN training
• Pipelining, when combined with data and model parallelism in a principled way, achieves end-to-end speedups of up to 5.3x
Code available at https://github.com/msr-fiddle/pipedream
https://cs.stanford.edu/~deepakn/
