Distributed Deep Learning at Scale Soumith Chintala Facebook AI Research
Overview • Deep Learning Research at FAIR • Deep Learning on GPUs • Deep Learning at Scale • Emerging Trends
Deep Learning Research at Facebook AI Research
Image Intelligence: Classification
Image Intelligence: Language Translation from Visual Learning
Image Intelligence: Detection
Image Intelligence: Detection [Figure: VGG-based detection network — input x (3x224x224) passes through VGG to 512x14x14 features; a segmentation head f_segm(x) produces a 224x224 mask and a scoring head f_score(x) produces a 1x1 score via 2x2 pooling and 1x1 convolutions.]
Image Intelligence https://code.facebook.com/posts/accessibility/
Video Intelligence
Image and Video Generation Predicting the Future
Natural Language Understanding chatbots, personal assistants • Memory Networks • Language Translation • Reading, Writing and Answering Questions
Deep Learning at Scale
Deep Learning at Scale GPU-powered Convolutional Neural Networks
Deep Learning at Scale GPU-powered Convolutional Neural Networks Alex Krizhevsky
Deep Learning at Scale GPU-powered Convolutional Neural Networks • Convolutions and GEMM take all the time • Faster convolutions = faster research
Deep Learning at Scale GPU-powered Convolutional Neural Networks Winograd-transform-based convolutions
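As a rough illustration of why Winograd convolutions are faster (this is a toy sketch, not FAIR's or cuDNN's actual implementation): the 1-D Winograd transform F(2,3) computes two outputs of a 3-tap filter with 4 multiplications instead of the 6 a direct computation needs, and the same idea tiles up to the 3x3 convolutions that dominate CNNs.

```python
def winograd_f23(d, g):
    """Winograd F(2,3): two outputs of a 3-tap filter g over
    input window d[0..3], using 4 multiplies instead of 6."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return [m1 + m2 + m3, m2 - m3 - m4]

# Agrees with the direct sliding-window computation:
d, g = [1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0]
direct = [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]
```

The filter-side transforms (the terms involving only g) are computed once and reused across the whole image, so the per-output multiply count is what matters.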
Deep Learning at Scale GPU-powered Convolutional Neural Networks • The standard in deep learning: NVIDIA GPUs + CUDA + cuDNN
Deep Learning at Scale GPU-powered Convolutional Neural Networks • Exotic new hardware! • Custom chips (Yunji Chen et al., Nervana Systems)
Deep Learning at Scale Multi-GPU Training • Use multiple GPUs on single machine
Deep Learning at Scale Multi-GPU Training • Data parallel
Deep Learning at Scale Multi-GPU Training • Model parallel
Deep Learning at Scale Multi-GPU Training • Pipeline-parallel
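The data-parallel scheme from the slides can be sketched as follows (a minimal toy, not a real multi-GPU framework: each "GPU" is just a gradient computation on its shard of the batch, and the all-reduce is a plain average — all names here are illustrative).

```python
# Data-parallel SGD sketch on a toy linear model y = w * x.
# Each replica computes gradients on its shard; gradients are
# averaged (the all-reduce step) and the same update is applied
# everywhere, keeping all replicas in sync.

def grad(w, shard):
    # dL/dw for squared loss L = mean((w*x - y)^2) over the shard
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(w, batch, num_gpus, lr):
    shards = [batch[i::num_gpus] for i in range(num_gpus)]
    grads = [grad(w, s) for s in shards]   # computed in parallel per GPU
    avg = sum(grads) / num_gpus            # all-reduce: average gradients
    return w - lr * avg                    # identical update on every replica

# Fit w toward 2.0 on data generated from y = 2x.
batch = [(x, 2.0 * x) for x in [1.0, 2.0, 3.0, 4.0]]
w = 0.0
for _ in range(100):
    w = data_parallel_step(w, batch, num_gpus=2, lr=0.05)
```

Model parallelism would instead split the parameters of one model across devices, and pipeline parallelism splits consecutive layers, streaming micro-batches through them.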
Deep Learning at Scale Multi-GPU Training Bottleneck: interconnects
Deep Learning at Scale Multi-Machine Training • Multi-machine SGD Send gradients
Deep Learning at Scale Multi-Machine Training • Multi-machine SGD Send weights
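The two arrows on these slides ("send gradients" up, "send weights" back down) are the classic parameter-server pattern. A hedged in-process sketch, where real network messages are replaced by method calls and all names are illustrative:

```python
# Parameter-server sketch: workers pull the latest weights, compute a
# gradient on their local shard, and push it back; the server applies
# the SGD update. In a real system pull/push are network round-trips.

class ParamServer:
    def __init__(self, w, lr):
        self.w, self.lr = w, lr

    def push_gradient(self, g):   # worker -> server: "send gradients"
        self.w -= self.lr * g

    def pull_weights(self):       # server -> worker: "send weights"
        return self.w

def worker_step(server, shard):
    w = server.pull_weights()
    g = sum(2 * (w * x - y) * x for x, y in shard) / len(shard)
    server.push_gradient(g)

server = ParamServer(w=0.0, lr=0.1)
shards = [[(1.0, 2.0)], [(2.0, 4.0)]]   # toy data from y = 2x
for _ in range(200):
    for shard in shards:
        worker_step(server, shard)
```

Run asynchronously, workers may push gradients computed against stale weights — one motivation for the elastic-averaging variant on the next slides.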
Deep Learning at Scale Multi-Machine Training • Elastic Averaging SGD! (Sixin Zhang, Anna Choromanska, Yann LeCun)
Deep Learning at Scale Multi-Machine Training • Elastic Averaging SGD! Train synchronously Occasionally, check with master Don't go too far from everyone else
Deep Learning at Scale Multi-Machine Training • Elastic Averaging SGD! Train synchronously Occasionally, check with neighbors Don't go too far from everyone else
Deep Learning at Scale Multi-Machine Training • Elastic Averaging SGD! • Empirical speedup of √N • N = number of nodes • No communication overhead with pre-fetching • 128 GPUs (32 clients × 4 GPUs) • Sharded parameters over 64 CPU servers • Tau = 10, prefetch = 5 • Zero overhead
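The EASGD rule behind these slides: workers take local SGD steps, and every tau steps an elastic force pulls each worker and a shared center variable toward each other — that is the "don't go too far from everyone else" constraint. A toy single-process sketch (minimizing f(x) = (x-3)^2; all hyperparameter values here are illustrative, not the ones from the 128-GPU run):

```python
# Elastic Averaging SGD (Zhang, Choromanska, LeCun), toy 1-D version.
# Each worker minimizes f(x) = (x - 3)^2 locally; every tau steps the
# elastic term alpha * (x_i - center) nudges worker and center together.

def easgd(workers, center, steps=200, eta=0.1, rho=0.5, tau=10):
    alpha = eta * rho                       # elastic coefficient
    for t in range(steps):
        for i in range(len(workers)):
            grad = 2 * (workers[i] - 3)     # local gradient step
            workers[i] -= eta * grad
        if t % tau == 0:                    # occasional sync with master
            for i in range(len(workers)):
                diff = workers[i] - center
                workers[i] -= alpha * diff  # worker pulled toward center...
                center += alpha * diff      # ...center pulled toward worker
    return center, workers

center, workers = easgd([0.0, 10.0, -5.0, 6.0], center=0.0)
```

Because syncs happen only every tau steps and the center lives on sharded parameter servers, the communication can be prefetched and overlapped — the "zero overhead" claim above.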
Deep Learning at Scale Multi-Machine Training • Elastic Averaging SGD! • Fun fact: Trained AlexNet in 5 epochs of ImageNet data • Good success in training vision and text networks
Big Sur Open Compute for Deep Learning • Serviceability • Thermal Efficiency • Performance
Big Sur Open Compute for Deep Learning • Hot-swappable fan modules • Removable GPU baseboard • GPU removal using 2 thumb screws • Cables to swap PCI-e topologies with incredible ease • Removable motherboard tray • Rails for in-rack servicing • 2.5" drive carriers
Big Sur PCI-e Topologies — Matter!
Torch
Emerging Trends
Emerging Trends Efficient Collectives + Imperative Programs • Data / Model / Pipeline parallel seems sufficient • Torch (nn / autograd / distlearn) • Caffe
Emerging Trends Computational Graph Toolkits • Intel CnC, Caffe, TensorFlow, MXNet, Theano • Graph placement hints + execution • DSLs to write the computation graphs
Silver Bullet Imperative Language + Graph Compiler • Best of both worlds • Hard problem of automatic graph placement • Limited heuristic-driven success
Presence at GTC 2016 If you want to chat in person, drop us an email • Big Sur Hardware • Kevin Lee kevinlee@fb.com • Doug Wimer dwimer@fb.com • Soumith Chintala soumith@fb.com • Multi-GPU / Multi-machine Training • Nicolas Vasilache ntv@fb.com • Jeff Johnson jhj@fb.com • Soumith Chintala soumith@fb.com • Computation Graphs, Automatic Placement • Jeff Johnson jhj@fb.com • Andrew Tulloch tulloch@fb.com • Yangqing Jia jiayq@fb.com • Soumith Chintala soumith@fb.com