HyPar-Flow: Exploiting MPI and Keras for Scalable Hybrid-Parallel DNN Training with TensorFlow
Ammar Ahmad Awan, Arpan Jain, Quentin Anthony, Hari Subramoni, and Dhabaleswar K. Panda
Network Based Computing Laboratory (NBCL)
Dept. of Computer Science and Engineering, The Ohio State University
{awan.10, jain.575, anthony.301, subramoni.1, panda.2}@osu.edu
Agenda
• Introduction and Motivation
• Problems and Challenges
• Key Contribution
• Performance Characterization
• Conclusion
The Deep Learning (DL) Revolution
• Deep Learning – a technique to achieve Artificial Intelligence
  – Uses Deep Neural Networks
[Figure: nested view of AI ⊃ Machine Learning (ML) ⊃ Deep Learning (DL); ML example: Logistic Regression; DL examples: MLPs, DNNs]
Adopted from: http://www.deeplearningbook.org/contents/intro.html
Source: https://thenewstack.io/demystifying-deep-learning-and-artificial-intelligence/
Deep Learning meets Super Computers
• NVIDIA GPUs – a major force for accelerating DL workloads
  – Computational requirement is increasing exponentially
[Figure: Top500 "Accelerator/CP Family – Performance Share" chart, www.top500.org]
Courtesy: https://openai.com/blog/ai-and-compute/
How to make Training Faster?
• Data parallelism
  – Horovod: TensorFlow, PyTorch, and MXNet
  – TensorFlow: tf.distribute.Strategy API
  – PyTorch: torch.nn.parallel.DistributedDataParallel
• Model-parallelism and Hybrid-parallelism
  – No framework-level support (only LBANN supports it within the framework)
  – Higher-level frameworks: GPipe, Mesh-TensorFlow, etc.
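For reference, a minimal sketch of the Horovod + Keras data-parallel path listed above; the model, learning rate, and (commented-out) dataset are illustrative placeholders rather than the configuration used in this work.

```python
# Minimal data-parallel sketch with Horovod and Keras; model, learning rate,
# and dataset are placeholders for illustration only.
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()                                            # one MPI rank per worker

model = tf.keras.applications.ResNet50(weights=None)
opt = tf.keras.optimizers.SGD(0.01 * hvd.size())      # scale LR with worker count
opt = hvd.DistributedOptimizer(opt)                   # allreduce gradients each step

model.compile(optimizer=opt, loss="sparse_categorical_crossentropy")

callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]  # sync initial weights
# model.fit(train_dataset, epochs=90, callbacks=callbacks)
```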
Distributed/Parallel Training Strategies for DNNs
• Data Parallelism (most common)
• Model and Hybrid Parallelism (emerging)
• ‘X’-Parallelism
  – ‘X’ → Spatial, Channel, Filter, etc.
[Figure: Data Parallelism, Model Parallelism, and Hybrid (Model and Data) Parallelism]
Courtesy: http://engineering.skymind.io/distributed-deep-learning-part-1-an-introduction-to-distributed-training-of-neural-networks
Why Model Parallelism?
• Data Parallelism – only for models that fit in memory
• Out-of-core models
  – Deeper model → better accuracy, but more memory required!
• Model parallelism can work for out-of-core models!
• Designing a system for model-parallelism is challenging
Agenda
• Introduction and Motivation
• Problems and Challenges
• Key Contribution
• Performance Characterization
• Conclusion
Major Problems
• Defining a distributed model – necessary but difficult
  – Requires knowledge of the model, communication library, and distributed hardware
• Implementing distributed forward/back-propagation
  – Needed because partitions reside in different memory spaces and need explicit communication
• Obtaining parallel speedup on an inherently sequential task
  – Forward pass followed by a backward pass
  – Limited opportunity for parallelism and scalability
• Achieving scalability without losing out on a model’s accuracy
  – Valid concern for all types of parallelism strategies
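To make the second problem concrete, here is a rough two-partition sketch (not HyPar-Flow's actual code) of the explicit exchange a distributed forward/backward pass requires: activations travel forward to the next partition and gradients travel back. Shapes and the compute steps are placeholders; run with 2 MPI ranks.

```python
# Point-to-point exchange needed by a model-partitioned forward/backward pass
# (hypothetical two-partition layout; compute is replaced by placeholders).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()              # rank 0 holds the first half of the layers,
                                    # rank 1 holds the second half
batch, features = 32, 1024
if rank == 0:
    activations = np.random.rand(batch, features).astype(np.float32)  # output of partition 0
    comm.Send(activations, dest=1, tag=0)          # forward pass: ship activations
    grad_in = np.empty((batch, features), dtype=np.float32)
    comm.Recv(grad_in, source=1, tag=1)            # backward pass: receive gradients
elif rank == 1:
    acts_in = np.empty((batch, features), dtype=np.float32)
    comm.Recv(acts_in, source=0, tag=0)            # receive activations, run local layers
    grad_out = np.ones_like(acts_in)               # placeholder for d(loss)/d(activations)
    comm.Send(grad_out, dest=0, tag=1)             # send gradients back
```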
Research Challenges
• Challenge-1: Model-Definition APIs and Framework-specific Features
• Challenge-2: Communication between Partitions and Replicas
• Challenge-3: Applying HPC Techniques to Improve Performance
Meet HyPar-Flow!
Agenda
• Introduction and Motivation
• Problems and Challenges
• Key Contribution
• Performance Characterization
• Conclusion
Key Contribution: Propose, Design, and Evaluate HyPar-Flow
• HyPar-Flow is practical (easy-to-use) and high-performance (uses MPI)
  – Based on Keras models and exploits TF 2.0 Eager Execution
  – Leverages the performance of MPI point-to-point and collective operations for communication
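A hypothetical usage sketch of what "practical" means here. The module name hyparflow, the function hf.fit, and its arguments are illustrative guesses and may not match the published HyPar-Flow interface; the point is that the user writes a plain Keras model and requests hybrid parallelism declaratively.

```python
# Hypothetical usage sketch only: the names `hyparflow`, `hf.fit`,
# `num_partitions`, `num_replicas`, and `strategy` are illustrative and may
# not match the real HyPar-Flow interface.
import tensorflow as tf
# import hyparflow as hf                       # hypothetical import

def build_model():
    # Any standard Keras model definition is expected to work unchanged.
    return tf.keras.applications.ResNet152(weights=None)

# The call below would split the model into partitions within a node
# (model parallelism) and replicate the partitioned model across nodes
# (data parallelism), using MPI point-to-point and collectives underneath.
# hf.fit(build_model(), num_partitions=48, num_replicas=512, strategy="hybrid")
```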
HyPar-Flow: Overview
HyPar-Flow: Components
• Model Generator is crucial for productivity
• Load Balancer is crucial for performance
• Trainer
  – Core of back-propagation
  – Several system-level challenges
  – Communication of tensors – blocking or non-blocking
  – Efficient pipelining is needed
• Communication Engine
  – Isolates communication interfaces
  – Unified Data, Model, and Hybrid Parallelism
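A rough illustration of the blocking versus non-blocking point above: split the batch into micro-batches and post non-blocking sends/receives so communication of one micro-batch can overlap with computation on the next. This is an mpi4py sketch with placeholder sizes and a stand-in for the partition's compute, not HyPar-Flow's Trainer code; run with 2 MPI ranks.

```python
# Pipelined, non-blocking tensor exchange between two adjacent model-partitions;
# the "compute" here is a stand-in for running the local layers.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
shape = (8, 1024)                                   # one micro-batch of activations
micro_batches = [np.random.rand(*shape).astype(np.float32) for _ in range(4)]

if rank == 0:                                       # sender-side partition
    requests, send_bufs = [], []
    for i, mb in enumerate(micro_batches):
        out = mb * 2.0                              # placeholder for local compute
        send_bufs.append(out)                       # keep buffer alive until Waitall
        requests.append(comm.Isend(out, dest=1, tag=i))   # non-blocking send
        # ...compute on the next micro-batch while the send progresses...
    MPI.Request.Waitall(requests)
elif rank == 1:                                     # receiver-side partition
    for i in range(len(micro_batches)):
        buf = np.empty(shape, dtype=np.float32)
        req = comm.Irecv(buf, source=0, tag=i)      # non-blocking receive
        # ...other work could overlap here...
        req.Wait()
```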
Special Handling for Models with Skip Connections
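One reason skip connections need special handling: when a skip connection jumps over layers that live on another partition, the producing partition must send its output to a non-adjacent partition in addition to its neighbor. The three-partition layout below is a hypothetical illustration, not HyPar-Flow's actual scheme; run with 3 MPI ranks.

```python
# Hypothetical 3-partition layout for a residual block whose skip connection
# jumps from partition 0 straight to partition 2; partition 0 therefore sends
# its output to two destinations instead of one.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
x = np.random.rand(32, 256).astype(np.float32)

if rank == 0:
    comm.Send(x, dest=1, tag=0)     # main path to the next partition
    comm.Send(x, dest=2, tag=9)     # extra send for the skip connection
elif rank == 1:
    buf = np.empty((32, 256), dtype=np.float32)
    comm.Recv(buf, source=0, tag=0)
    comm.Send(buf + 1.0, dest=2, tag=1)   # stand-in for partition 1's layers
elif rank == 2:
    skip = np.empty((32, 256), dtype=np.float32)
    main = np.empty((32, 256), dtype=np.float32)
    comm.Recv(skip, source=0, tag=9)
    comm.Recv(main, source=1, tag=1)
    y = main + skip                 # residual addition happens where both branches meet
```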
Agenda
• Introduction and Motivation
• Problems and Challenges
• Key Contribution
• Performance Characterization
• Conclusion
Evaluation Setup
• 3 Systems
  – Frontera at the Texas Advanced Computing Center (TACC)
  – Stampede2 (Skylake partition) at TACC
  – AMD EPYC: local system with dual-socket AMD EPYC 7551 32-core processors
• Interconnects
  – Frontera – Mellanox InfiniBand HDR-100 HCAs
  – Stampede2 – Intel Omni-Path HFIs
• TensorFlow v1.13; MVAPICH2 2.3.2 on Frontera and EPYC; Intel MPI 2018 on Stampede2
• We use and modify model definitions for VGG and ResNet(s) from keras.applications
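For reference, a quick sketch of pulling stock definitions from keras.applications; the modifications made for the experiments (for example, the CIFAR-style ResNet-110/1k variants) are not shown.

```python
# Pull stock model definitions from keras.applications; the experiments start
# from definitions like these and then modify them (not shown here).
import tensorflow as tf

vgg = tf.keras.applications.VGG16(weights=None, input_shape=(224, 224, 3), classes=1000)
resnet = tf.keras.applications.ResNet50(weights=None)
print(len(vgg.layers), "layers in VGG-16,", len(resnet.layers), "layers in ResNet-50")
```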
Verifying the Correctness of HyPar-Flow
• The following variants have been compared:
  – SEQ (GT) – sequential using tf.GradientTape (GT)
  – SEQ (MF) – sequential using model.fit (MF)
  – SEQ (MF-E) – sequential using model.fit (MF) and (E)ager Execution
  – HF-MP (2)/(48) – HyPar-Flow model-parallel with 2/48 model-partitions
[Figures: correctness comparisons for VGG-16, ResNet-110, and ResNet-1k]
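For context, a minimal sketch of the two sequential baselines with a toy model and toy data: SEQ (GT) drives training with an explicit tf.GradientTape step (the style of gradient computation a partitioned trainer also needs), while SEQ (MF) uses the standard Keras model.fit path.

```python
# Toy comparison of the two sequential baselines; model, data, and optimizer
# are placeholders, not the networks used in the actual verification.
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
opt = tf.keras.optimizers.SGD(0.01)
x = tf.random.normal((32, 64))
y = tf.zeros((32,), dtype=tf.int32)

# SEQ (GT): explicit gradient computation with tf.GradientTape.
with tf.GradientTape() as tape:
    logits = model(x, training=True)
    loss = loss_fn(y, logits)
grads = tape.gradient(loss, model.trainable_variables)
opt.apply_gradients(zip(grads, model.trainable_variables))

# SEQ (MF): the standard Keras model.fit path.
model.compile(optimizer=opt, loss=loss_fn)
model.fit(x, y, epochs=1, verbose=0)
```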
Model/Hybrid Parallelism on Single/Two Nodes
• ResNet-1k – scales with batch size on one node as well as two nodes
• Reason for scaling?
  – Counter-intuitive for model-parallelism to scale better than data-parallelism
  – Poor CPU implementation?
Hybrid Parallelism for AmoebaNet
• AmoebaNet – a different architecture compared to ResNet(s)
  – More branches and skip connections
• Scales well using HyPar-Flow
• Memory-hungry, so a single node is restricted to BatchSize = 64
HyPar-Flow (HF): Flexibility and Scalability
• CPU-based results
  – AMD EPYC
  – Intel Xeon
• Excellent speedups for
  – VGG-19
  – ResNet-110
  – ResNet-1000 (1k layers)
• Able to train “future” models
  – E.g., ResNet-5000 (a synthetic 5000-layer model we benchmarked)
• 110x speedup on 128 Intel Xeon Skylake nodes (TACC Stampede2)
HyPar-Flow at Scale (512 nodes on TACC Frontera)
• ResNet-1001 with variable batch size
• Approach:
  – 48 model-partitions for 56 cores
  – 512 model-replicas for 512 nodes
  – Total cores: 48 x 512 = 24,576
• Speedup
  – 253x on 256 nodes
  – 481x on 512 nodes
• Scaling Efficiency
  – 98% up to 256 nodes
  – 93.9% for 512 nodes
• 481x speedup on 512 Intel Xeon nodes (TACC Frontera)
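A quick sanity check of the arithmetic above, assuming scaling efficiency is computed as speedup divided by node count.

```python
# Core count and scaling efficiency implied by the reported numbers
# (efficiency assumed to be speedup / node count).
partitions_per_node, nodes = 48, 512
print("total cores:", partitions_per_node * nodes)        # 24576
print("efficiency @ 256 nodes:", round(253 / 256, 3))     # 0.988 (~98%)
print("efficiency @ 512 nodes:", round(481 / 512, 3))     # 0.939 (93.9%)
```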
Agenda
• Introduction and Motivation
• Problems and Challenges
• Key Contribution
• Performance Characterization
• Conclusion
Conclusion
• In-depth analysis of Data/Model/Hybrid parallelism
  – The need for model/hybrid parallelism – larger models
• Proposed and designed HyPar-Flow
  – Flexible and user-transparent system
  – Leverages existing technologies instead of reinventing anything
  – Keras, TensorFlow, and MPI for flexibility and scalability
• Performance evaluation on large systems
  – Three HPC clusters, including Frontera at TACC (#5 on Top500)
  – Three DNNs with diverse requirements and sizes (VGG, ResNet-110/1k, and AmoebaNet)
  – 93% scaling efficiency on 512 nodes (Frontera)