IBM Research AI
Efficient Communication Library for Large-Scale Deep Learning
Mar 26, 2018
Minsik Cho (minsikcho@us.ibm.com)
Deep Learning Changing Our Life
• Automotive/transportation
• Medicine and biology
• Security/public safety
• Consumer web, mobile, retail
• Media and entertainment
IBM Deep Learning Workflow
• Training (this is my focus)
  – Latency to model: typically days to train complex models
  – Limited by training compute throughput
  – Loop: forward and backward passes over data grouped in large minibatches, next minibatch, next epoch
  – Conversion/retraining: needed if training and inference precisions differ
• Inference
  – Latency to action: typically ms to complete the full inference workflow
  – Batching of individual inputs (e.g., from microservices); batch size is smaller, varied, and application-dependent
  – Limited by batching latency (to enable efficient inference) + inference compute + resultant action
Advance in Computation for Deep Learning [P. Goldsborough] [MichaelGalloy.com]
• GPU/FPGA: 10-100 TFLOPS
• Very good scaling for the last 15 years
Motivation: OK, ever-faster computation. Is this enough?
• ImageNet-1K: 1.2M images, 1K classes, ResNet-101
  – Batch size = 32 (limited by GPU memory)
  – Iteration time = 300 ms
  – Iterations per epoch ≈ 38,000
  – Total training time for 100 epochs ≈ 13.2 days
• ImageNet-22K: 7.5M images, 22K classes, ResNet-101
  – Total training time for 100 epochs ≈ 35.2 days
• No, it is NOT enough
  – 1.2M samples are still at toy scale
  – Computation scaling cannot keep up with the data explosion and model complexity
  – Innovation will take too long, or even stop at some point
  – I cannot wait for days to get my model trained!
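A back-of-envelope check of the 13.2-day figure above; the dataset size, batch size, and iteration time are the numbers quoted on this slide, and the rest is plain arithmetic:

```python
# Rough single-GPU training-time estimate for ImageNet-1K + ResNet-101,
# using the numbers quoted on the slide.
images = 1_200_000          # ImageNet-1K training images
batch_size = 32             # limited by GPU memory
iter_time_s = 0.300         # 300 ms per iteration
epochs = 100

iters_per_epoch = images / batch_size            # 37,500 (the slide rounds to ~38,000)
total_seconds = iters_per_epoch * iter_time_s * epochs
print(f"{iters_per_epoch:,.0f} iterations/epoch, "
      f"{total_seconds / 86_400:.1f} days for {epochs} epochs")   # ~13 days
```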
Faster Training Time with Distributed Deep Learning
• A recognition model that takes 9 days to train on one system trains in 4 hours when distributed: ~54x more learning runs with POWER8
• What will you do? Iterate more and create more accurate models? Create more models? Both?
Distributed Deep Learning [P. Goldsborough]
• Data parallelism: parameter server
• Data parallelism: allreduce
• Model parallelism (complex partitioning)
• Learners exchange gradients/weights of 10 MB-1 GB per iteration
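As a minimal sketch of the allreduce flavor of data parallelism, the snippet below averages gradients across learners with mpi4py and NumPy; the buffer size, learning rate, and placeholder gradient function are illustrative assumptions, not details from the slides:

```python
# One data-parallel SGD step with gradient allreduce (illustrative sketch).
# Run with e.g.: mpirun -np 4 python data_parallel_step.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, world = comm.Get_rank(), comm.Get_size()

weights = np.zeros(1_000_000, dtype=np.float32)   # one flat parameter buffer (assumed size)
lr = 0.01                                         # assumed learning rate

def local_gradient(w):
    # Placeholder for the real forward/backward pass on this learner's minibatch.
    rng = np.random.default_rng(rank)
    return rng.standard_normal(w.shape).astype(np.float32)

grad = local_gradient(weights)
avg_grad = np.empty_like(grad)
comm.Allreduce(grad, avg_grad, op=MPI.SUM)        # sum gradients across all learners
avg_grad /= world                                 # average them
weights -= lr * avg_grad                          # every learner applies the same update
```

This allreduce of the 10 MB-1 GB gradient buffer is the communication step the rest of the deck is about.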
Communication : Overhead
• Each learner computes on its own 32-image minibatch, then all learners sync after every iteration
• In weak scaling
  – Computation cost remains constant
  – Communication cost increases with more learners/GPUs
• The computation/communication ratio is the key for large-scale deep learning
  – Increase computation
  – Faster communication
Advance in Communication for Deep Learning
• Still scaling, but not fast enough
  – Computation is still ahead
  – Data perhaps grows much faster
Designing Large-scale Deep Learning
• Computation: model depth, GPU throughput, faster algorithms
• Communication: gradient count, network BW, faster algorithms
• Good balance between the two is set by the mini-batch size
• Model/application
  – Deeper/wider model to increase compute time
  – Smaller gradient count to reduce communication time
• System
  – Balance network and computing resources
  – Select mini-batch size to adjust the computation/communication ratio (see the sketch below)
  – Larger mini-batch size lowers the relative cost of communication
  – Too big a mini-batch size can hurt convergence and accuracy
  – Network-topology-aware communication
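A hedged back-of-envelope of how mini-batch size shifts the computation/communication balance. The gradient size, network bandwidth, and the 2 x data / BW ring-allreduce approximation are assumptions for illustration; the 300 ms per 32-image compute rate is taken from the earlier motivation slide:

```python
# Illustrative only: how the communication-to-computation ratio changes with batch size.
grad_bytes = 100e6           # ~100 MB of gradients (ResNet-50-scale, assumed)
net_bw = 10e9 / 8            # 10 Gb/s link -> 1.25 GB/s (assumed)
time_per_image = 0.300 / 32  # compute time per image, from the 300 ms / 32-image slide

comm_time = 2 * grad_bytes / net_bw   # rough ring-allreduce cost, nearly independent of batch size
for batch in (32, 64, 128, 256):
    comp_time = batch * time_per_image
    print(f"batch {batch:4d}: compute {comp_time:.2f} s, "
          f"allreduce ~{comm_time:.2f} s, comm/comp = {comm_time / comp_time:.2f}")
```

With these assumed numbers the ratio drops from roughly 0.5 at batch 32 to under 0.1 at batch 256, which is exactly the lever the slide describes, and also why too large a mini-batch trades communication efficiency against convergence.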
IBM PowerAI DDL (Distributed Deep Learning Library)
• Collective communication library for distributed deep learning
  – MPI-like interface for easy integration
  – Enables deep learning software to scale to 100s of servers with CPUs/GPUs
  – Works across a variety of system sizes
  – Works with a variety of network types and switch topologies
• DDL orchestrates the data communication
  – Plans an efficient communication pattern on a hierarchical network environment
  – Actual point-to-point data transfer via NCCL or MPI
• Currently integrated into
  – Supported: Caffe, TensorFlow, Chainer, Torch
  – Ongoing: Caffe2, PyTorch, Keras (TF backend)
• Currently US patent-pending
DDL : Topology-aware communication
• Setup: boxes A, B, C, D connected under two switches (switch0, switch1); max bandwidth 10 Gbytes/sec on one tier of the hierarchy, max sustained bandwidth 100 Gbytes/sec on the other
• Example: A, B, C, D each broadcast to all others over the ring A→B→C→D→A, so every step sends A→B, B→C, C→D, D→A in parallel
  – Suffers from congestion
  – Suffers from lower BW on the cross-switch links
DDL : Topology-aware communication
• It's a mapping problem
  – System-specific network (boxes box0-box3 under switch0/switch1)
  – Application-specific traffic
  – The naive ring schedule above (A→B, B→C, C→D, D→A at every step) is suboptimal
• DDL does it differently: it picks the schedule labeled optimal in the figure
  – To minimize bus contention
  – To minimize crossings of the lower-BW links
DDL : Problem Definition and Solution
• Assumption
  – A network topology with various bandwidths
• Problem definition
  – A min-cost multi-commodity flow problem
  – NP-hard in general, but easy to solve when the graph is small (e.g., 4 vertices)
• DDL solves a typical case/topology offline
  – If the cluster/cloud provides such a topology, it performs very well
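A toy sketch of the "easy when the graph is small" point: with four learners it is cheap to enumerate candidate ring orderings offline and count how often each crosses the slow inter-switch link. The placement of A, B under switch0 and C, D under switch1, and the restriction to ring schedules, are illustrative simplifications, not DDL's actual multi-commodity-flow formulation:

```python
# Brute-force enumeration of ring orderings on a tiny 2-switch topology
# (illustration of offline scheduling on a small graph, not DDL's solver).
from itertools import permutations

switch_of = {"A": 0, "B": 0, "C": 1, "D": 1}   # assumed placement, for illustration only

def cross_switch_hops(ring):
    # Count ring edges whose endpoints sit under different switches.
    return sum(switch_of[a] != switch_of[b]
               for a, b in zip(ring, ring[1:] + ring[:1]))

# Fix A as the start and enumerate the 3! orderings of the remaining nodes.
candidates = [list(("A",) + p) for p in permutations("BCD")]
for ring in sorted(candidates, key=cross_switch_hops):
    print(" -> ".join(ring + [ring[0]]), "| slow-link crossings:", cross_switch_hops(ring))
# Orderings such as A -> B -> C -> D -> A cross the inter-switch link twice,
# while A -> C -> B -> D -> A crosses it four times; for 4 vertices the whole
# search space is trivial, which is why an offline solve is practical.
```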
DDL : How well it performs on Caffe2
• 48 IBM S822LC servers with PPC64LE RHEL
  – 3 racks with 16 hosts each, connected through 10 GB/s InfiniBand
  – Each host has 4 P100-SXM2 GPUs with CUDA 8 and cuDNN 5
• Comparing DDL against other algorithms on ResNet-50 + ImageNet-1K (preloaded to RAM disk), mini-batch size = 32
  – MPI_Allreduce
  – Ring (allreduce from Baidu, Feb 2017)
  – GLOO (from Facebook): NCCL + ib_verbs
Comparison with NCCL 2.1.x Allreduce (POWER)
• NCCL exploits the in-system topology; DDL exploits both in-system and cross-system topology
• IBM P9 Newell systems (NVLink) with V100s
• 100 Gbps InfiniBand
Comparison with NCCL 2.1.x Allreduce (x86)
• No in-system topology to exploit (PCIe); DDL still exploits the cross-system topology
• x86 systems (PCIe) with P100s
• 10 Gbps Ethernet
Conclusion
• DDL is a topology-aware communication library in PowerAI
• DDL delivers industry-best performance by exploiting
  – Network hierarchy
  – Multi-tier bandwidth
• DDL is well suited for common distributed training in cloud environments
BACKUP