HyPar-Flow: Exploiting MPI and Keras for Scalable Hybrid-Parallel DNN Training with TensorFlow
Ammar Ahmad Awan, Arpan Jain, Quentin Anthony, Hari Subramoni, and Dhabaleswar K. Panda
Network Based Computing Laboratory (NBCL)
Dept. of Computer Science and Engineering, The Ohio State University
{awan.10, jain.575, anthony.301, subramoni.1, panda.2}@osu.edu
Agenda
• Introduction and Motivation
• Problems and Challenges
• Key Contribution
• Performance Characterization
• Conclusion
The Deep Learning (DL) Revolution
• Deep Learning – a technique to achieve Artificial Intelligence
  – Uses Deep Neural Networks
[Figure: nested view of AI ⊃ Machine Learning (ML) ⊃ Deep Learning (DL); ML example: Logistic Regression; DL examples: MLPs, DNNs]
Adopted from: http://www.deeplearningbook.org/contents/intro.html
Source: https://thenewstack.io/demystifying-deep-learning-and-artificial-intelligence/
Deep Learning meets Super Computers
• NVIDIA GPUs – a major force for accelerating DL workloads
  – Computational requirement is increasing exponentially
[Figure: Top500 "Accelerator/CP Family – Performance Share" chart, www.top500.org]
Courtesy: https://openai.com/blog/ai-and-compute/
How to make Training Faster?
• Data parallelism
  – Horovod: TensorFlow, PyTorch, and MXNet
  – TensorFlow: tf.distribute.Strategy API
  – PyTorch: torch.nn.parallel.DistributedDataParallel
• Model-parallelism and Hybrid-parallelism
  – No framework-level support (only LBANN supports it within the framework)
  – Higher-level frameworks: GPipe, Mesh-TensorFlow, etc.
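For reference, a minimal sketch of the Horovod + Keras data-parallel path listed above; the model, learning rate, and (commented-out) dataset are illustrative placeholders rather than the configuration used in this work.

```python
# Minimal data-parallel sketch with Horovod and Keras; model, learning rate,
# and dataset are placeholders for illustration only.
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()                                            # one MPI rank per worker

model = tf.keras.applications.ResNet50(weights=None)
opt = tf.keras.optimizers.SGD(0.01 * hvd.size())      # scale LR with worker count
opt = hvd.DistributedOptimizer(opt)                   # allreduce gradients each step

model.compile(optimizer=opt, loss="sparse_categorical_crossentropy")

callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]  # sync initial weights
# model.fit(train_dataset, epochs=90, callbacks=callbacks)
```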
Distributed/Parallel Training Strategies for DNNs
• Data Parallelism (most common)
• Model and Hybrid Parallelism (emerging)
• ‘X’-Parallelism
  – ‘X’ → Spatial, Channel, Filter, etc.
[Figure: Data Parallelism, Model Parallelism, and Hybrid (Model and Data) Parallelism]
Courtesy: http://engineering.skymind.io/distributed-deep-learning-part-1-an-introduction-to-distributed-training-of-neural-networks
Why Model Parallelism?
• Data Parallelism – only for models that fit in memory
• Out-of-core models
  – Deeper model → better accuracy, but more memory required!
• Model parallelism can work for out-of-core models!
• Designing a system for model-parallelism is challenging
Agenda
• Introduction and Motivation
• Problems and Challenges
• Key Contribution
• Performance Characterization
• Conclusion
Major Problems
• Defining a distributed model – necessary but difficult
  – Requires knowledge of the model, communication library, and distributed hardware
• Implementing distributed forward/back-propagation
  – Needed because partitions reside in different memory spaces and need explicit communication
• Obtaining parallel speedup on an inherently sequential task
  – Forward pass followed by a backward pass
  – Limited opportunity for parallelism and scalability
• Achieving scalability without losing out on a model’s accuracy
  – Valid concern for all types of parallelism strategies
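To make the second problem concrete, here is a rough two-partition sketch (not HyPar-Flow's actual code) of the explicit exchange a distributed forward/backward pass requires: activations travel forward to the next partition and gradients travel back. Shapes and the compute steps are placeholders; run with 2 MPI ranks.

```python
# Point-to-point exchange needed by a model-partitioned forward/backward pass
# (hypothetical two-partition layout; compute is replaced by placeholders).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()              # rank 0 holds the first half of the layers,
                                    # rank 1 holds the second half
batch, features = 32, 1024
if rank == 0:
    activations = np.random.rand(batch, features).astype(np.float32)  # output of partition 0
    comm.Send(activations, dest=1, tag=0)          # forward pass: ship activations
    grad_in = np.empty((batch, features), dtype=np.float32)
    comm.Recv(grad_in, source=1, tag=1)            # backward pass: receive gradients
elif rank == 1:
    acts_in = np.empty((batch, features), dtype=np.float32)
    comm.Recv(acts_in, source=0, tag=0)            # receive activations, run local layers
    grad_out = np.ones_like(acts_in)               # placeholder for d(loss)/d(activations)
    comm.Send(grad_out, dest=0, tag=1)             # send gradients back
```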
Research Challenges
• Challenge-1: Model-Definition APIs and Framework-specific Features
• Challenge-2: Communication between Partitions and Replicas
• Challenge-3: Applying HPC Techniques to Improve Performance
Meet HyPar-Flow!
Agenda
• Introduction and Motivation
• Problems and Challenges
• Key Contribution
• Performance Characterization
• Conclusion
Key Contribution: Propose, Design, and Evaluate HyPar-Flow
• HyPar-Flow is practical (easy-to-use) and high-performance (uses MPI)
  – Based on Keras models and exploits TF 2.0 Eager Execution
  – Leverages the performance of MPI point-to-point and collective operations for communication
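A hypothetical usage sketch of what "practical" means here. The module name hyparflow, the function hf.fit, and its arguments are illustrative guesses and may not match the published HyPar-Flow interface; the point is that the user writes a plain Keras model and requests hybrid parallelism declaratively.

```python
# Hypothetical usage sketch only: the names `hyparflow`, `hf.fit`,
# `num_partitions`, `num_replicas`, and `strategy` are illustrative and may
# not match the real HyPar-Flow interface.
import tensorflow as tf
# import hyparflow as hf                       # hypothetical import

def build_model():
    # Any standard Keras model definition is expected to work unchanged.
    return tf.keras.applications.ResNet152(weights=None)

# The call below would split the model into partitions within a node
# (model parallelism) and replicate the partitioned model across nodes
# (data parallelism), using MPI point-to-point and collectives underneath.
# hf.fit(build_model(), num_partitions=48, num_replicas=512, strategy="hybrid")
```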
HyPar-Flow: Overview
HyPar-Flow: Components
• Model Generator is crucial for productivity
• Load Balancer is crucial for performance
• Trainer
  – Core of back-propagation
  – Several system-level challenges
  – Communication of tensors – blocking or non-blocking
  – Efficient pipelining is needed
• Communication Engine
  – Isolates communication interfaces
  – Unified Data, Model, and Hybrid Parallelism
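A rough illustration of the blocking versus non-blocking point above: split the batch into micro-batches and post non-blocking sends/receives so communication of one micro-batch can overlap with computation on the next. This is an mpi4py sketch with placeholder sizes and a stand-in for the partition's compute, not HyPar-Flow's Trainer code; run with 2 MPI ranks.

```python
# Pipelined, non-blocking tensor exchange between two adjacent model-partitions;
# the "compute" here is a stand-in for running the local layers.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
shape = (8, 1024)                                   # one micro-batch of activations
micro_batches = [np.random.rand(*shape).astype(np.float32) for _ in range(4)]

if rank == 0:                                       # sender-side partition
    requests, send_bufs = [], []
    for i, mb in enumerate(micro_batches):
        out = mb * 2.0                              # placeholder for local compute
        send_bufs.append(out)                       # keep buffer alive until Waitall
        requests.append(comm.Isend(out, dest=1, tag=i))   # non-blocking send
        # ...compute on the next micro-batch while the send progresses...
    MPI.Request.Waitall(requests)
elif rank == 1:                                     # receiver-side partition
    for i in range(len(micro_batches)):
        buf = np.empty(shape, dtype=np.float32)
        req = comm.Irecv(buf, source=0, tag=i)      # non-blocking receive
        # ...other work could overlap here...
        req.Wait()
```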
Special Handling for Models with Skip Connections
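One reason skip connections need special handling: when a skip connection jumps over layers that live on another partition, the producing partition must send its output to a non-adjacent partition in addition to its neighbor. The three-partition layout below is a hypothetical illustration, not HyPar-Flow's actual scheme; run with 3 MPI ranks.

```python
# Hypothetical 3-partition layout for a residual block whose skip connection
# jumps from partition 0 straight to partition 2; partition 0 therefore sends
# its output to two destinations instead of one.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
x = np.random.rand(32, 256).astype(np.float32)

if rank == 0:
    comm.Send(x, dest=1, tag=0)     # main path to the next partition
    comm.Send(x, dest=2, tag=9)     # extra send for the skip connection
elif rank == 1:
    buf = np.empty((32, 256), dtype=np.float32)
    comm.Recv(buf, source=0, tag=0)
    comm.Send(buf + 1.0, dest=2, tag=1)   # stand-in for partition 1's layers
elif rank == 2:
    skip = np.empty((32, 256), dtype=np.float32)
    main = np.empty((32, 256), dtype=np.float32)
    comm.Recv(skip, source=0, tag=9)
    comm.Recv(main, source=1, tag=1)
    y = main + skip                 # residual addition happens where both branches meet
```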
Agenda
• Introduction and Motivation
• Problems and Challenges
• Key Contribution
• Performance Characterization
• Conclusion
Evaluation Setup
• 3 Systems
  – Frontera at the Texas Advanced Computing Center (TACC)
  – Stampede2 (Skylake partition) at TACC
  – AMD EPYC: local system with dual-socket AMD EPYC 7551 32-core processors
• Interconnects
  – Frontera – Mellanox InfiniBand HDR-100 HCAs
  – Stampede2 – Intel Omni-Path HFIs
• TensorFlow v1.13; MVAPICH2 2.3.2 on Frontera and EPYC; Intel MPI 2018 on Stampede2
• We use and modify model definitions for VGG and ResNet(s) from keras.applications
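For reference, a quick sketch of pulling stock definitions from keras.applications; the modifications made for the experiments (for example, the CIFAR-style ResNet-110/1k variants) are not shown.

```python
# Pull stock model definitions from keras.applications; the experiments start
# from definitions like these and then modify them (not shown here).
import tensorflow as tf

vgg = tf.keras.applications.VGG16(weights=None, input_shape=(224, 224, 3), classes=1000)
resnet = tf.keras.applications.ResNet50(weights=None)
print(len(vgg.layers), "layers in VGG-16,", len(resnet.layers), "layers in ResNet-50")
```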
Verifying the Correctness of HyPar-Flow
• The following variants have been compared:
  – SEQ (GT) – sequential using tf.GradientTape (GT)
  – SEQ (MF) – sequential using model.fit (MF)
  – SEQ (MF-E) – sequential using model.fit (MF) and (E)ager Execution
  – HF-MP (2)/(48) – HyPar-Flow model-parallel with 2/48 model-partitions
[Figures: correctness comparisons for VGG-16, ResNet-110, and ResNet-1k]
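For context, a minimal sketch of the two sequential baselines with a toy model and toy data: SEQ (GT) drives training with an explicit tf.GradientTape step (the style of gradient computation a partitioned trainer also needs), while SEQ (MF) uses the standard Keras model.fit path.

```python
# Toy comparison of the two sequential baselines; model, data, and optimizer
# are placeholders, not the networks used in the actual verification.
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
opt = tf.keras.optimizers.SGD(0.01)
x = tf.random.normal((32, 64))
y = tf.zeros((32,), dtype=tf.int32)

# SEQ (GT): explicit gradient computation with tf.GradientTape.
with tf.GradientTape() as tape:
    logits = model(x, training=True)
    loss = loss_fn(y, logits)
grads = tape.gradient(loss, model.trainable_variables)
opt.apply_gradients(zip(grads, model.trainable_variables))

# SEQ (MF): the standard Keras model.fit path.
model.compile(optimizer=opt, loss=loss_fn)
model.fit(x, y, epochs=1, verbose=0)
```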
Model/Hybrid Parallelism on Single/Two Nodes
• ResNet-1k – scales with batch size on one node as well as two nodes
• Reason for scaling?
  – Counter-intuitive for model-parallelism to scale better than data-parallelism
  – Poor CPU implementation?
Hybrid Parallelism for AmoebaNet
• AmoebaNet – a different architecture compared to ResNet(s)
  – More branches and skip connections
• Scales well using HyPar-Flow
• Memory-hungry, so a single node is restricted to BatchSize = 64
HyPar-Flow (HF): Flexibility and Scalability
• CPU-based results
  – AMD EPYC
  – Intel Xeon
• Excellent speedups for
  – VGG-19
  – ResNet-110
  – ResNet-1000 (1k layers)
• Able to train “future” models
  – E.g., ResNet-5000 (a synthetic 5000-layer model we benchmarked)
• 110x speedup on 128 Intel Xeon Skylake nodes (TACC Stampede2)
HyPar-Flow at Scale (512 nodes on TACC Frontera)
• ResNet-1001 with variable batch size
• Approach:
  – 48 model-partitions for 56 cores
  – 512 model-replicas for 512 nodes
  – Total cores: 48 x 512 = 24,576
• Speedup
  – 253x on 256 nodes
  – 481x on 512 nodes
• Scaling Efficiency
  – 98% up to 256 nodes
  – 93.9% for 512 nodes
• 481x speedup on 512 Intel Xeon nodes (TACC Frontera)
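A quick sanity check of the arithmetic above, assuming scaling efficiency is computed as speedup divided by node count.

```python
# Core count and scaling efficiency implied by the reported numbers
# (efficiency assumed to be speedup / node count).
partitions_per_node, nodes = 48, 512
print("total cores:", partitions_per_node * nodes)        # 24576
print("efficiency @ 256 nodes:", round(253 / 256, 3))     # 0.988 (~98%)
print("efficiency @ 512 nodes:", round(481 / 512, 3))     # 0.939 (93.9%)
```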
Agenda
• Introduction and Motivation
• Problems and Challenges
• Key Contribution
• Performance Characterization
• Conclusion
Conclusion
• In-depth analysis of Data/Model/Hybrid parallelism
  – The need for model/hybrid parallelism – larger models
• Proposed and designed HyPar-Flow
  – Flexible and user-transparent system
  – Leverages existing technologies instead of reinventing anything
  – Keras, TensorFlow, and MPI for flexibility and scalability
• Performance evaluation on large systems
  – Three HPC clusters, including Frontera at TACC (#5 on Top500)
  – Three DNNs with diverse requirements and sizes (VGG, ResNet-110/1k, and AmoebaNet)
  – 93% scaling efficiency on 512 nodes (Frontera)