Distributed Deep Learning: Methods and Resources


  1. Distributed Deep Learning: Methods and Resources. Sergey Nikolenko (Chief Research Officer, Neuromation; Researcher, Steklov Institute of Mathematics at St. Petersburg) and Maxim Prasolov (CEO, Neuromation). September 23, 2017, AI Ukraine, Kharkiv

  2. Outline
  ● Bird's-eye overview of deep learning
  ● SGD and how to parallelize it
  ● Data parallelism and model parallelism
  ● Neuromation: developing a worldwide marketplace for knowledge mining

  3. ● 10 years ago, machine learning underwent a deep learning revolution
  ● Neural networks are one of the oldest techniques in ML
  ● But since 2007-2008, we can train large and deep neural networks (in part due to distributed computations)
  ● And now deep NNs yield state-of-the-art results in many fields

  4. What is a deep neural network
  ● A deep neural network is a huge composition of simple functions implemented by artificial neurons
  ● Usually a linear combination followed by a nonlinearity, but it can be anything as long as you can take derivatives
  ● These functions are combined into a computational graph that computes the loss function for the model (see the sketch below)
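
To make the "composition of simple functions" idea concrete, here is a minimal sketch (my illustration, not the slides' code) of a tiny two-layer network written as plain functions ending in a loss value; the shapes and data are made up:

```python
# A minimal sketch (not from the slides): a two-layer network written as a
# composition of simple functions that ends in a loss value. Shapes are made up.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def forward(x, y, W1, b1, W2, b2):
    h = relu(x @ W1 + b1)                 # linear combination + nonlinearity
    y_hat = h @ W2 + b2                   # another simple function in the graph
    return np.mean((y_hat - y) ** 2)      # the loss node of the computational graph

rng = np.random.default_rng(0)
x, y = rng.normal(size=(32, 10)), rng.normal(size=(32, 1))
W1, b1 = rng.normal(size=(10, 64)), np.zeros(64)
W2, b2 = rng.normal(size=(64, 1)), np.zeros(1)
print(forward(x, y, W1, b1, W2, b2))
```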

  5. Backpropagation
  ● To train the model (learn the weights), you take the gradient of the loss function w.r.t. the weights
  ● Gradients can be efficiently computed with backpropagation
  ● And then you can do (stochastic) gradient descent and all of its wonderful modifications, from Nesterov momentum to Adam
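
As a hedged illustration of the backprop-plus-SGD loop (not code from the talk), here is manual gradient computation and a plain gradient descent update for a one-layer linear model with a squared loss; sizes and the learning rate are arbitrary:

```python
# A hedged sketch (not the talk's code): manual backpropagation and plain
# gradient descent for a one-layer linear model with a squared loss.
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=(32, 10)), rng.normal(size=(32, 1))
W, b = rng.normal(size=(10, 1)), np.zeros(1)
lr = 0.01                                    # learning rate (arbitrary)

for step in range(200):
    y_hat = x @ W + b                        # forward pass
    grad_out = 2.0 * (y_hat - y) / len(x)    # gradient of the mean squared loss
    grad_W = x.T @ grad_out                  # chain rule through the linear layer
    grad_b = grad_out.sum(axis=0)
    W -= lr * grad_W                         # gradient descent update
    b -= lr * grad_b                         # (momentum, Adam, etc. would go here)

print("final loss:", float(np.mean((x @ W + b - y) ** 2)))
```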

  6. Gradient descent is used for all kinds of neural networks: feedforward networks, convolutional networks, recurrent networks

  7. Distributed Deep Learning: The Problem
  ● One component of the DL revolution was the use of GPUs
  ● GPUs are highly parallel (hundreds of cores) and optimized for matrix computations
  ● Which is perfect for backprop (and the forward pass too)
  ● But what if your model does not fit on a GPU? Or what if you have multiple GPUs?
  ● Can we parallelize further?

  8. What Can Be Parallel
  ● Model parallelism vs. data parallelism
  ● We will discuss both
  ● Data parallelism is much more common
  ● And you can unite the two [pictures from (Black, Kokorin, 2016)]

  9. Examples of data parallelism
  ● Make every worker do its thing and then average the results
  ● Parameter averaging: average w from all workers
  ○ but how often?
  ○ and what do we do with advanced SGD variants?
  ● Asynchronous SGD: average updates from workers
  ○ much more interesting without synchronization
  ○ but the stale gradient problem
  ● Hogwild (2011): very simple asynchronous SGD; just read and write to shared memory, lock-free; whatever happens, happens
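
A minimal sketch of the parameter-averaging variant, under assumed shapes and a toy linear model (not the talk's code): each worker runs local SGD on its own shard, and the averaged weights become the new global model.

```python
# A minimal sketch of synchronous parameter averaging on a toy linear model;
# worker counts, shapes, and hyperparameters are assumptions for illustration.
import numpy as np

def local_sgd(w, x, y, lr=0.01, steps=10):
    """Plain SGD on one worker's data shard (least-squares linear model)."""
    for _ in range(steps):
        grad = 2 * x.T @ (x @ w - y) / len(x)
        w = w - lr * grad
    return w

rng = np.random.default_rng(0)
x, y = rng.normal(size=(1024, 10)), rng.normal(size=(1024, 1))
w = np.zeros((10, 1))                      # global weights on the parameter server

num_workers = 4
shards = list(zip(np.array_split(x, num_workers), np.array_split(y, num_workers)))

for epoch in range(5):
    # each worker starts from the current global weights and trains locally...
    local_ws = [local_sgd(w.copy(), xs, ys) for xs, ys in shards]
    # ...then the server averages the workers' weights ("but how often?")
    w = np.mean(local_ws, axis=0)
```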

  10. Examples of data parallelism
  ● FireCaffe: DP on a GPU cluster
  ○ communication through reduction trees
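
FireCaffe's actual implementation is not shown here; the following is only a toy sketch of the reduction-tree idea, summing per-worker gradients pairwise so that the number of communication rounds grows logarithmically in the number of workers:

```python
# A toy sketch of the reduction-tree idea (not FireCaffe's code): per-worker
# gradients are summed pairwise, level by level, so the number of communication
# rounds is logarithmic in the number of workers.
import numpy as np

def tree_reduce(grads):
    """Sum a list of per-worker gradient arrays through a binary reduction tree."""
    while len(grads) > 1:
        next_level = []
        for i in range(0, len(grads) - 1, 2):
            next_level.append(grads[i] + grads[i + 1])   # one pairwise exchange
        if len(grads) % 2 == 1:
            next_level.append(grads[-1])                  # odd worker is carried over
        grads = next_level
    return grads[0]

worker_grads = [np.full(3, i, dtype=float) for i in range(8)]
print(tree_reduce(worker_grads))   # [28. 28. 28.], the sum over all 8 workers
```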

  11. Model parallelism
  ● In model parallelism, different weights are distributed across different workers
  ● Pictures from the DistBelief paper (Dean et al., 2012)
  ● Difference in communication:
  ○ DP: workers exchange weight updates
  ○ MP: workers exchange data (activation) updates
  ● DP in DistBelief: Downpour SGD vs. Sandblaster L-BFGS
  ● Now, DistBelief has been completely replaced by...

  12. Distributed Learning in TensorFlow
  ● TensorFlow has both DP (right) and MP (bottom)
  ● Workers and parameter servers
  ● MP usually works as a pipeline between layers (see the sketch below):
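
A minimal sketch of layer-pipeline model parallelism using TF 1.x-era APIs (device names and layer sizes are assumptions, not the slide's code): consecutive layers are pinned to different GPUs, and activations flow between devices.

```python
# A minimal sketch of model parallelism as a pipeline between layers
# (TF 1.x-style API; device names and sizes are assumptions).
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 784])

with tf.device("/gpu:0"):                 # first part of the model on GPU 0
    h1 = tf.layers.dense(x, 512, activation=tf.nn.relu)

with tf.device("/gpu:1"):                 # second part on GPU 1; activations
    h2 = tf.layers.dense(h1, 256, activation=tf.nn.relu)   # cross the device boundary
    logits = tf.layers.dense(h2, 10)
```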

  13. Example of Data Parallelism in TensorFlow
  ● First specify the structure of the cluster
  ● Then assign (parts of) the computational graph to workers and the weights to parameter servers
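
A minimal sketch of this between-graph pattern with TF 1.x APIs; the host names, ports, and model below are made up for illustration and are not the slide's actual code:

```python
# A minimal sketch of between-graph data parallelism in TF 1.x; all host names,
# ports, and the model itself are made up for illustration.
import tensorflow as tf

# First specify the structure of the cluster:
cluster = tf.train.ClusterSpec({
    "ps":     ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# Then assign the computational graph to this worker and the weights (variables)
# to the parameter servers; replica_device_setter does the variable placement.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:0", cluster=cluster)):
    x = tf.placeholder(tf.float32, [None, 784])
    y = tf.placeholder(tf.float32, [None, 10])
    logits = tf.layers.dense(x, 10)
    loss = tf.losses.softmax_cross_entropy(onehot_labels=y, logits=logits)
    train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)
```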

  14. Interesting variations
  ● (Zhang et al., 2016), staleness-aware SGD: weight the updates depending on their staleness (how long ago they were computed)
  ● Elephas: distributed Keras that runs on Spark
  ● (Xie et al., 2015), sufficient factor broadcasting: represent a fully connected layer's gradient as an outer product and send only the factors u and v
  ● (Zhang et al., 2017), Poseidon: a new architecture with wait-free backprop and hybrid communication
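
To clarify the sufficient-factor idea, here is a toy sketch (my illustration, not the paper's code): for a fully connected layer, one sample's gradient is the rank-1 outer product of two vectors, so only those vectors need to be communicated.

```python
# A toy sketch of sufficient factor broadcasting (my illustration): one sample's
# gradient for a fully connected layer is dW = outer(u, v), so workers broadcast
# only the two vectors instead of the full matrix. Sizes are made up.
import numpy as np

rng = np.random.default_rng(0)
v = rng.normal(size=128)     # layer input (activation) for one sample
u = rng.normal(size=64)      # gradient of the loss w.r.t. the layer's output

payload = (u, v)             # what gets sent: 64 + 128 numbers

dW = np.outer(*payload)      # receivers reconstruct the full 64 x 128 update locally
print(dW.shape)
```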

  15. Distributed reinforcement learning
  ● Special mention: reinforcement learning; async RL is great!
  ● And standard (by now) DQN tricks are perfect for parallelization:
  ○ experience replay: store experiences in replay memory and serve them for learning
  ○ the target Q-network is separate from the Q-network which is learning now; its updates are rare
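
A minimal sketch of the two DQN tricks mentioned above (standard components with assumed sizes, not DeepMind's code): a replay memory that serves random minibatches, and a target network that is synced only occasionally.

```python
# A minimal sketch of the two tricks above: a replay memory serving random
# minibatches, and a target network updated only occasionally (assumed sizes).
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)

    def store(self, transition):
        # transition = (state, action, reward, next_state, done)
        self.buffer.append(transition)

    def sample(self, batch_size=32):
        # served to learners, possibly on other machines, for Q-learning updates
        return random.sample(self.buffer, batch_size)

def maybe_sync_target(step, online_params, target_params, every=10000):
    """Copy the online Q-network's weights into the target network only rarely."""
    if step % every == 0:
        target_params.update(online_params)   # here params are plain dicts
    return target_params
```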

  16. Gorila from DeepMind: everything is parallel and asynchronous

  17. Recap
  ● Data parallelism lets you process lots of data in parallel, copying the model
  ● Model parallelism lets you break down a large model into parts
  ● Distributed architectures are usually based on parameter servers and workers
  ● Especially in reinforcement learning, where distributed architectures rule
  ● And this all works out of the box in TensorFlow and other modern frameworks
  ● Distributed deep learning works
  ● But how is it relevant to us? Isn't that for the likes of Google and/or DeepMind?
  ● Where do we get the computational power, and why do we need so much data?

  18. The "bottleneck" of automation of every industry: not enough labeled data for neural network training. Bitcoin or Ether mining: $7-8 USD per day; Amazon deep learning: $3-4 USD per hour.

  19. Image recognition in retail.

  20. Image recognition in retail. To automate the retail industry, 170,000 objects must be recognized on the shelves (by ECR research: about 40 bln images per year). OSA HP contracted Neuromation to produce labeled data and to recognize them: more than 7 mln euro revenue, and 30% of the cost is computational power.

  21. More than 1 bln labeled photos are required to train image recognition models. Where can we get this huge amount of labeled data?

  22. Data labeling has been manual work till now: 1 man = 8 hours x 50 images, at $0.2 per image; 1 bln labeled photos = $240 mln and years of mechanical work.

  23. WE KNOW HOW TO GENERATE SYNTHETIC LABELED DATA FOR DEEP LEARNING

  24. SYNTHETIC DATA: A BREAKTHROUGH IN DEEP LEARNING
  ● Labeled data with 100% accuracy
  ● Automated data generation with no limits
  ● Cheaper and faster than manual labor
  BUT REQUIRES HUGE COMPUTATIONAL POWER TO RENDER DATA AND TRAIN NEURAL NETWORKS

  25. The AI industry is ready to pay miners for their computational resources more than they can ever get from mining Ether (x6): Bitcoin or Ether mining brings $7-8 USD per GPU per day, while Amazon deep learning costs $3-4 USD per GPU per hour. WE CAN BRIDGE THIS GAP. KNOWLEDGE MINING IS MORE PROFITABLE. DEEP LEARNING NEEDS YOUR COMPUTATION POWER!

  26. BLOCKCHAIN + DEEP LEARNING

  27. NEUROMATION PLATFORM will combine in one place all the components necessary to build deep learning solutions with synthetic data. TokenAI: the universal marketplace of neural network development.

  28. [Image slide: Bitcoin or Ether mining, $7-8 USD per day, vs. Amazon deep learning, $3-4 USD per hour.]

  29. NEUROMATION PLATFORM will be extending Ethereum with TokenAI. Neuromation needs to quickly deploy a massive network of computation nodes (converted from crypto miners).
  ● the network must be geographically distributed and keep track of massive amounts of transactions
  ● the payment method for completed work should be highly liquid and politically independent
  ● network nodes have to understand the model of "mining" a resource for a bounty: transparency is required to build trust
  ● transactions need to be transparently auditable to prevent fraud and mitigate disputes
  Blockchain is the only technology that can realistically accomplish this. Extending Ethereum instead of building our own blockchain is an obvious first step.

  30. DEEP LEARNING RESEARCH GRANTS
  ● for R&D teams and start-ups of the DL/ML industry
  ● in cooperation with frontier institutions
  ● 1000-GPU pool (+100,000 GPUs are coming)
  WE ARE OPEN FOR COOPERATION: partnerships@neuromation.io

  31. NEUROMATION LABS: vast applications of synthetic data
  ● Retail Automation Lab (live): 170,000+ items for training (Eastern European retail market only); about 50 euro per object; contract for >7 mln euro revenue
  ● Pharma and Biotech Lab, synthetic data for: medical imaging (classify tumors and melanomas); health applications (smart cameras)
  ● Enterprise Automation Lab, synthetic data for: training flying drones, self-driving cars, and industrial robots in virtual environments; manufacturing and supply-chain solutions

  32. OUR TEAM: Maxim Prasolov, CEO; Fedor Savchenko, CTO; Sergey Nikolenko, Chief Research Officer; Denis Popov, VP of Engineering; Constantine Goltsev, Investor / Chairman; Esther Katz, VP Communication; Kiryl Truskovskyi, Lead Researcher; Aleksey Spizhevoi, Researcher; Yuri Kundin, ICO Compliance Adviser; Andrew Rabinovich, Adviser.

  33. ICO ROADMAP
  ● October 15th, 2017: presale of TokenAI starts
  ● November 2017: public sale starts
  ● Unknown date: the secret cap is reached, and the token sale ends in 7 days
  ● January 1st, 2018: token sale ends (if the secret cap is not reached)
