Scalable and Distributed DNN Training on Modern HPC Systems: Challenges and Solutions (Keynote Talk at HPML '19)


  1. Scalable and Distributed DNN Training on Modern HPC Systems: Challenges and Solutions. Keynote Talk at HPML '19 by Dhabaleswar K. (DK) Panda, The Ohio State University. E-mail: panda@cse.ohio-state.edu, http://www.cse.ohio-state.edu/~panda

  2. Understanding the Deep Learning Resurgence
  • Deep Learning is a subset of Machine Learning
    – But, it is perhaps the most radical and revolutionary subset
    – Automatic feature extraction vs. hand-crafted features
  • Deep Learning – A renewed interest and a lot of hype!
    – Key success: Deep Neural Networks (DNNs)
    – Everything was there since the late 80s except the "computability of DNNs"
  Courtesy: http://www.deeplearningbook.org/contents/intro.html

  3. Deep Learning Use Cases and Growth Trends. Courtesy: https://www.top500.org/news/market-for-artificial-intelligence-projected-to-hit-36-billion-by-2025/

  4. Increasing Usage of HPC, Big Data and Deep Learning
  • HPC (MPI, RDMA, Lustre, etc.)
  • Big Data (Hadoop, Spark, HBase, Memcached, etc.)
  • Deep Learning (Caffe, TensorFlow, BigDL, etc.)
  Convergence of HPC, Big Data, and Deep Learning! Increasing need to run these applications on the Cloud!!

  5. Newer Workflows - Deep Learning over Big Data (DLoBD)
  • Deep Learning over Big Data (DLoBD) is one of the most efficient analytics paradigms
  • More and more deep learning tools and libraries (e.g., Caffe, TensorFlow) are starting to run over big data stacks, such as Apache Hadoop and Spark
  • Benefits of the DLoBD approach
    – Easily build a powerful data analytics pipeline: (1) Prepare Datasets @Scale, (2) Deep Learning @Scale, (3) Non-deep learning analytics @Scale, (4) Apply ML model @Scale
      • E.g., Flickr DL/ML Pipeline, "How Deep Learning Powers Flickr", http://bit.ly/1KIDfof
    – Better data locality
    – Efficient resource sharing and cost effective

  6. Drivers of Modern HPC Cluster Architectures
  [Figure captions: Multi-core Processors; High Performance Interconnects - InfiniBand (<1usec latency, 200Gbps Bandwidth); Accelerators / Coprocessors - high compute density, high performance/watt, >1 TFlop DP on a chip; SSD, NVMe-SSD, NVRAM]
  • Multi-core/many-core technologies
  • Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
  • Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
  • Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)
  • Available on HPC Clouds, e.g., Amazon EC2, NSF Chameleon, Microsoft Azure, etc.
  Example systems: K-Computer, Summit, Sunway TaihuLight, Sierra

  7. Key Phases of Deep Learning
  • Deep Learning has two major tasks
    1. Training of the Deep Neural Network
    2. Inference (or deployment) that uses a trained DNN
  • DNN Training
    – Training is a compute/communication intensive process – can take days to weeks
    – Faster training is necessary!
  • Faster training can be achieved by
    – Using newer and faster hardware – but there is a limit!
    – Can we use more GPUs or nodes?
      • The need for Parallel and Distributed Training

  8. Scale-up and Scale-out Desired
  [Figure: Scale-up Performance vs. Scale-out Performance, positioning NCCL2, cuDNN, MPI, MKL-DNN, gRPC, and Hadoop]
  • Scale-up: Intra-node Communication
    – Many improvements like:
      • NVIDIA cuDNN, cuBLAS, NCCL, etc.
      • CUDA 9 Co-operative Groups
  • Scale-out: Inter-node Communication
    – DL Frameworks – most are optimized for single-node only
    – Distributed (Parallel) Training is an emerging trend
      • OSU-Caffe – MPI-based
      • Microsoft CNTK – MPI/NCCL2
      • Google TensorFlow – gRPC-based/MPI/NCCL2
      • Facebook Caffe2 – Hybrid (NCCL2/Gloo/MPI)

  9. Holistic Evaluation is Important!!
  • My framework is faster than your framework!
  • This needs to be understood in a holistic way.
  • Performance depends on the entire execution environment (the full stack)
  • Isolated view of performance is not helpful
  A. A. Awan, H. Subramoni, and D. K. Panda, "An In-depth Performance Characterization of CPU- and GPU-based DNN Training on Modern Architectures", In Proceedings of the Machine Learning on HPC Environments (MLHPC '17), ACM, New York, NY, USA, Article 8.

  10. Broad Challenge: Exploiting HPC for Deep Learning. How to efficiently scale out a Deep Learning (DL) framework and take advantage of heterogeneous High Performance Computing (HPC) resources?

  11. Research Challenges to Exploit HPC Technologies
  [Figure: DL software stack – Deep Learning and Machine Learning Frameworks (Caffe/OSU-Caffe, CNTK, Caffe2, TensorFlow, MXNet); Major Computation and Communication Phases in DL Frameworks (Forward, Backward, Gradient Aggregation, Model Propagation); Communication Runtimes to support Distributed Training; HPC Platforms (CPU, GPU, InfiniBand)]
  1. What are the fundamental issues in designing DL frameworks?
    – Memory requirements
    – Computation requirements
    – Communication overhead
  2. Why do we need to support distributed training?
    – To overcome the limits of single-node training
    – To better utilize hundreds of existing HPC clusters

  12. Research Challenges to Exploit HPC Technologies (Cont'd)
  [Figure: same DL stack as the previous slide, with Communication Runtimes (MPI/NCCL/Gloo/MLSL) expanded into Point-to-Point Operations, Large-message Collectives, and CUDA-Awareness, and Co-Design Opportunities spanning the framework and runtime layers]
  3. What are the new design challenges brought forward by DL frameworks for communication runtimes?
    – Large Message Collective Communication and Reductions
    – GPU Buffers (CUDA-Awareness); see the sketch after this slide
  4. Can a Co-design approach help in achieving Scale-up and Scale-out efficiently?
    – Co-Design the support at the Runtime level and exploit it at the DL Framework level
    – What performance benefits can be observed?
    – What needs to be fixed at the communication runtime layer?
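To make the CUDA-awareness challenge above concrete, here is a minimal sketch (not from the talk) of a gradient reduction that hands a GPU buffer directly to MPI. It assumes a CUDA-aware MPI library such as MVAPICH2-GDR; the buffer size and variable names are illustrative.

/*
 * Minimal sketch of a CUDA-aware MPI gradient reduction.
 * Assumes a CUDA-aware MPI library (e.g., MVAPICH2-GDR) that accepts
 * device pointers directly in MPI calls; otherwise the buffer would
 * first have to be staged through host memory.
 */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    const size_t count = 1 << 20;              /* e.g., 1M gradient elements (illustrative) */
    float *d_grads;                            /* gradients live on the GPU */
    cudaMalloc((void **)&d_grads, count * sizeof(float));
    cudaMemset(d_grads, 0, count * sizeof(float));

    /* Large-message Allreduce on a GPU buffer: the device pointer is
     * passed straight to MPI; a CUDA-aware runtime moves the data
     * (e.g., via GPUDirect RDMA) without an explicit cudaMemcpy. */
    MPI_Allreduce(MPI_IN_PLACE, d_grads, (int)count,
                  MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    cudaFree(d_grads);
    MPI_Finalize();
    return 0;
}

Without CUDA-awareness, the framework would have to copy d_grads to a host buffer before every Allreduce and copy the result back afterwards; removing that staging step is exactly the kind of overhead the co-design of runtime and framework targets.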

  13. Multiple Approaches taken up by OSU
  • MPI-driven Deep Learning
    – CPU-based Deep Learning
    – GPU-based Deep Learning
  • Co-designing Deep Learning Stacks with High-Performance MPI
  • Out-of-core DNN training
  • Accelerating TensorFlow on HPC Systems
  • Accelerating Big Data Stacks
  • Efficient Deep Learning over Big Data

  14. Data Parallel Deep Learning and MPI Collectives
  [Figure: data-parallel training loop across GPU 0-3 – (1) Data Propagation via MPI_Bcast of a packed_comm_buff from GPU 0; (2) Forward-Backward Pass through layers L1..Ln on every GPU; (3) Gradient Aggregation via MPI_Reduce on each GPU's packed_reduce_buff followed by ApplyUpdates]
  • Major MPI Collectives involved in designing distributed frameworks
    – MPI_Bcast – required for DNN parameter exchange
    – MPI_Reduce – needed for gradient accumulation from multiple solvers
    – MPI_Allreduce – use just one Allreduce instead of Reduce and Broadcast (see the sketch after this slide)
  A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, "S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters", In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17)
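As a companion to the slide above, the following minimal sketch (not taken from the talk, and not the S-Caffe implementation) shows how these collectives map onto a data-parallel training loop on host buffers: one MPI_Bcast for parameter propagation, then one MPI_Allreduce per iteration in place of a Reduce plus Bcast pair. The model size, learning rate, and the forward_backward placeholder are assumptions for illustration.

/*
 * Minimal sketch of a data-parallel training loop with MPI collectives.
 * The forward/backward computation is a placeholder; sizes and names
 * are illustrative.
 */
#include <mpi.h>
#include <stdlib.h>

#define NUM_PARAMS (1 << 20)
#define NUM_ITERS  100

/* Placeholder: compute local gradients from this worker's mini-batch. */
static void forward_backward(const float *params, float *grads)
{
    (void)params; (void)grads; /* ... framework-specific compute ... */
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    float *params = calloc(NUM_PARAMS, sizeof(float));
    float *grads  = calloc(NUM_PARAMS, sizeof(float));

    /* 1. Data (parameter) propagation: every worker starts from the
     *    same model, broadcast from rank 0. */
    MPI_Bcast(params, NUM_PARAMS, MPI_FLOAT, 0, MPI_COMM_WORLD);

    for (int iter = 0; iter < NUM_ITERS; iter++) {
        /* 2. Forward-backward pass on the local mini-batch. */
        forward_backward(params, grads);

        /* 3. Gradient aggregation: a single Allreduce replaces the
         *    Reduce + Bcast pair, then average and apply updates. */
        MPI_Allreduce(MPI_IN_PLACE, grads, NUM_PARAMS,
                      MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
        for (int i = 0; i < NUM_PARAMS; i++)
            params[i] -= 0.01f * (grads[i] / size);   /* SGD step, lr = 0.01 (assumed) */
    }

    free(params);
    free(grads);
    MPI_Finalize();
    return 0;
}

In practice, layer-wise gradients are packed into a single contiguous buffer before the reduction, which is what the packed_comm_buff and packed_reduce_buff buffers in the slide's figure refer to; one large-message collective then replaces many small ones.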

  15. Overview of the MVAPICH2 Project
  • High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
    – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), Started in 2001, First version available in 2002
    – MVAPICH2-X (MPI + PGAS), Available since 2011
    – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014
    – Support for Virtualization (MVAPICH2-Virt), Available since 2015
    – Support for Energy-Awareness (MVAPICH2-EA), Available since 2015
    – Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
    – Used by more than 3,000 organizations in 88 countries
    – More than 539,000 (> 0.5 million) downloads from the OSU site directly
    – Empowering many TOP500 clusters (Nov '18 ranking)
      • 3rd ranked, 10,649,640-core cluster (Sunway TaihuLight) at NSC, Wuxi, China
      • 14th, 556,104 cores (Oakforest-PACS) in Japan
      • 17th, 367,024 cores (Stampede2) at TACC
      • 27th, 241,108 cores (Pleiades) at NASA, and many others
    – Available with software stacks of many vendors and Linux Distros (RedHat, SuSE, and OpenHPC)
    – Partner in the upcoming TACC Frontera System
    – http://mvapich.cse.ohio-state.edu
  • Empowering Top500 systems for over a decade
