Optimization Challenges for Deep Learning

Yoshua Bengio, U. Montreal
December 12th, 2014
OPT'2014: NIPS Workshop on Optimization for Machine Learning

Deep Representation Learning

Learn multiple levels of representation of increasing complexity/abstraction

(Figure: stack of layers x, h1, h2, h3, …)

• theory: exponential gain
• brains are deep
• cognition is compositional
• Better mixing (Bengio et al, ICML 2013)
• They work! SOTA on industrial-scale AI tasks (object recognition, speech recognition, language modeling, music modeling)

Deep Learning Challenges
(Bengio, arXiv 1305.0445, "Deep learning of representations: looking forward")

• Computational Scaling
• Optimization & Underfitting
• Intractable Marginalization, Approximate Inference & Sampling
• Disentangling Factors of Variation
• Reasoning & One-Shot Learning of Facts

Challenge: Computational Scaling

• Recent breakthroughs in speech, object recognition and NLP hinged on faster computing, GPUs, and large datasets
• In speech, vision and NLP applications we tend to find that, as Ilya Sutskever would say, BIGGER IS BETTER, because deep learning is EASY TO REGULARIZE while it is MORE DIFFICULT TO AVOID UNDERFITTING

We still have a long way to go in raw computational power

Computation / Capacity Ratio

• N-grams, decision trees, etc.: poor generalization but capacity (and memory) can grow a lot while computation remains constant or grows as log(capacity).
• Neural nets / deep learning: very good generalization, but computation grows linearly with capacity (number of parameters). Each parameter is used for every example.
• To build much higher-capacity models, we need to break that linear relationship while keeping the compositional structure that makes deep learning generalize so well.
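The contrast between linear and logarithmic cost can be made concrete with a small counting sketch. It is only an illustration under simple assumptions: the dense layer has no bias, the tree is a balanced binary tree with one stored value per leaf, and the function names are hypothetical, not taken from the talk.

```python
import math

def dense_layer_cost(n_in, n_out):
    """Return (parameters, multiply-adds per example) for a bias-free dense layer."""
    params = n_in * n_out
    mult_adds = n_in * n_out  # every weight is used once for every example
    return params, mult_adds

def balanced_tree_cost(n_leaves):
    """Return (leaf values, comparisons per example) for a balanced binary tree."""
    comparisons = math.ceil(math.log2(n_leaves))  # one root-to-leaf path
    return n_leaves, comparisons

# Growing capacity: the dense layer's work grows in step with its parameters,
# while the tree's per-example work grows only with the depth.
for scale in (1, 4, 16):
    layer_params, layer_work = dense_layer_cost(256 * scale, 256)
    tree_params, tree_work = balanced_tree_cost(1024 * scale)
    print(f"layer: {layer_params} params, {layer_work} mult-adds | "
          f"tree: {tree_params} leaves, {tree_work} comparisons")
```

The third bullet asks for models that sit between these two regimes: more parameters than are touched per example, without giving up the layered structure.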

Machine Translation Examples

• n-gram based English-French MT: ~26 Gbytes (zipped), 80 G unzipped?
  • Moses phrase-based baseline: 33.3 BLEU
  • Edinburgh: 37 BLEU (using very large LM dataset)
• SOTA deep-learning based English-French MT:
  • Montreal:
    • Single model, 285M (unzipped): published 28.5 BLEU, latest 33.2 BLEU
  • Google:
    • Single large model, 1.7G: 32.7 BLEU
    • Ensemble of 8 models, 13.5G: 36.9 BLEU

New Results on Deep Machine Translation

• Handles long sentences by introducing an attention mechanism
• Learns to choose which part of the input sentence to pay most attention to when predicting the next output word, as a function of the output RNN state and input bi-RNN state
• Single GPU trained over 2 weeks
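A minimal sketch of such an attention step is given below. The slide only says that the weights are a function of the output RNN state and the input bi-RNN states; the additive scoring form `v · tanh(W s + U h_j)`, the toy dimensions and all parameter values here are assumptions made for illustration.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states, W, U, v):
    """Return (attention weights, context vector) for one output step."""
    Ws = matvec(W, decoder_state)
    scores = []
    for h in encoder_states:
        hidden = [math.tanh(a + b) for a, b in zip(Ws, matvec(U, h))]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)  # one weight per input position, summing to 1
    dim = len(encoder_states[0])
    context = [sum(w * h[k] for w, h in zip(weights, encoder_states))
               for k in range(dim)]
    return weights, context

# Toy parameters (hypothetical values, not trained).
W = [[0.5, -0.2], [0.1, 0.3]]
U = [[0.4, 0.0], [-0.3, 0.2]]
v = [1.0, -1.0]
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one state per input word
weights, context = attend([0.2, -0.1], encoder_states, W, U, v)
print(weights, context)
```

The context vector is a weighted average of the input states, so the decoder can look at different input positions for each output word instead of relying on a single fixed-size sentence encoding.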

Predicted Alignments

(Figure: predicted alignments, panels (a) to (d).)

Improvements over Pure AE Model

(Figure: BLEU score versus sentence length for RNNsearch-50, RNNsearch-30, RNNenc-50 and RNNenc-30.)

• RNNenc: encode whole sentence
• RNNsearch: predict alignment
• BLEU score on full test set (including UNK)
• We have now reached SOTA on En-Fr (37 BLEU) and En-Ge (21 BLEU)

Conditional Computation: only visit a small fraction of parameters / example

Bengio, Leonard & Courville, arXiv 1305.2982

• Deep nets vs decision trees
• Hard mixtures of experts (Collobert, Bengio & Bengio 2002)
• Conditional computation for deep nets: sparse distributed gaters selecting combinatorial subsets of a deep net
• Challenges:
  • Credit assignment for hard decisions
  • Gated architectures exploration
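The idea of visiting only a fraction of the parameters per example can be sketched with a hard top-k gater over a set of linear experts. This is an illustrative assumption, not the method of the cited papers: the linear gater, the top-k rule, the sizes and the name `gated_forward` are all hypothetical.

```python
import random

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def gated_forward(x, gater, experts, k):
    """Run only the k experts the gater scores highest for this input."""
    scores = matvec(gater, x)  # one score per expert
    chosen = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:k]
    out = [0.0] * len(experts[0])
    expert_params_visited = 0
    for i in chosen:
        y = matvec(experts[i], x)
        out = [o + yi for o, yi in zip(out, y)]
        expert_params_visited += len(experts[i]) * len(experts[i][0])
    return out, chosen, expert_params_visited

rng = random.Random(0)
n_in, n_out, n_experts, k = 4, 3, 8, 2
experts = [[[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
           for _ in range(n_experts)]
gater = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_experts)]

x = [0.5, -1.0, 0.25, 2.0]
out, chosen, visited = gated_forward(x, gater, experts, k)
total = n_experts * n_in * n_out
print(f"experts used: {chosen}, expert parameters visited: {visited} of {total}")
```

The top-k selection is a hard decision with no useful gradient, which is exactly the credit assignment challenge listed above. The count covers expert parameters only; the gater's own weights are still used for every example.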

Issues with Back-Prop

• Over very deep nets or recurrent nets with many steps, non-linearities compose and yield sharp non-linearity → gradients vanish or explode
• Training deeper nets: harder optimization
• In the extreme of non-linearity: discrete functions, can't use back-prop
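A scalar example makes the vanishing and exploding behaviour easy to see. The sketch below assumes a one-unit recurrence `s_t = tanh(w * s_{t-1})` with no input, chosen only because its gradient through time is a simple product of local derivatives; it is not a model from the talk.

```python
import math

def grad_through_time(w, s0, steps):
    """Return d s_T / d s_0 for s_t = tanh(w * s_{t-1}), via the chain rule."""
    s, grad = s0, 1.0
    for _ in range(steps):
        s = math.tanh(w * s)
        grad *= w * (1.0 - s * s)  # local derivative: w * tanh'(w * s_prev)
    return grad

# |w| < 1: every factor is below 1 in magnitude, so the product vanishes.
print(grad_through_time(0.5, 0.5, 50))
# w > 1 at the unsaturated point s = 0: every factor equals w, so it explodes.
print(grad_through_time(1.5, 0.0, 50))
# Same w, different starting state: the unit saturates and the gradient vanishes.
print(grad_through_time(1.5, 0.5, 50))
```

The last two calls use the same weight and differ only in the starting state, which shows how sharply the composed function can change from one region of the input to another.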

Issues with Undirected Graphical Models & Boltzmann Machines

• Sampling from the MCMC of the model is required in the inner loop of training
• As the model gets sharper, mixing between well-separated modes stalls

(Figure: vicious circle between training updates and mixing.)
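The mixing problem can be reproduced on a toy target. The sketch below is an assumption-laden stand-in for a Boltzmann machine: a one-dimensional double-well density `p(x) ∝ exp(-beta * (x^2 - 1)^2)` sampled with random-walk Metropolis, where `beta` plays the role of sharpness. The function names and settings are hypothetical.

```python
import math
import random

def energy(x):
    """Double-well energy with modes at x = -1 and x = +1."""
    return (x * x - 1.0) ** 2

def count_mode_switches(beta, n_steps=20000, step=0.5, seed=0):
    """Run random-walk Metropolis and count how often the chain changes mode."""
    rng = random.Random(seed)
    x = 1.0
    switches = 0
    for _ in range(n_steps):
        proposal = x + rng.uniform(-step, step)
        log_accept = -beta * (energy(proposal) - energy(x))
        if log_accept >= 0 or rng.random() < math.exp(log_accept):
            if (proposal > 0) != (x > 0):
                switches += 1  # crossed the barrier at x = 0
            x = proposal
    return switches

smooth = count_mode_switches(beta=1.0)   # broad modes, low barrier
sharp = count_mode_switches(beta=20.0)   # sharp modes, high barrier
print(f"mode switches: smooth={smooth}, sharp={sharp}")
```

When training relies on such samples, a sharper model yields a chain that rarely leaves its starting mode, and the resulting updates come from poor samples. That is the vicious circle named on the slide.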

Recurrent Neural Networks

• Selectively summarize an input sequence in a fixed-size state vector via a recursive update

(Figure: recurrent graph with state s, input x and update function F_θ, unfolded in time over s_{t-1}, s_t, s_{t+1} and x_{t-1}, x_t, x_{t+1}.)
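The recursive update can be sketched as a loop that applies the same function F with the same parameters at every step. The tanh form of F, the parameter values and the name `summarize` are assumptions for illustration; the slide only specifies a shared update F_θ.

```python
import math

def F(state, x, theta):
    """One recursive update: new_state = tanh(W @ state + U @ x + b)."""
    W, U, b = theta
    return [
        math.tanh(
            sum(w * s for w, s in zip(W_row, state))
            + sum(u * xi for u, xi in zip(U_row, x))
            + b_i
        )
        for W_row, U_row, b_i in zip(W, U, b)
    ]

def summarize(xs, theta, state_size):
    """Fold a sequence of any length into one fixed-size state vector."""
    state = [0.0] * state_size
    for x in xs:
        state = F(state, x, theta)  # same F and same theta at every step
    return state

theta = (
    [[0.5, -0.3, 0.0], [0.1, 0.4, -0.2], [0.0, 0.2, 0.3]],  # W: state to state
    [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]],                   # U: input to state
    [0.0, 0.1, -0.1],                                         # b: bias
)
short_seq = [[1.0, 0.0], [0.0, 1.0]]
long_seq = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.9], [1.0, 1.0], [0.0, -1.0]]
short_summary = summarize(short_seq, theta, 3)
long_summary = summarize(long_seq, theta, 3)
print(short_summary, long_summary)
```

Both sequences end up as three-number summaries even though their lengths differ, which is what "fixed-size state vector" means here.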

Recurrent Neural Networks

• Can produce an output at each time step: unfolding the graph tells us how to back-prop through time.

(Figure: recurrent graph unfolded in time, with inputs x_t, states s_t and outputs o_t, and weights U (input to state), W (state to state) and V (state to output).)
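Back-propagation through time on the unfolded graph can be sketched for a scalar RNN. The model below is an assumption chosen for brevity: `s_t = tanh(W * s_{t-1} + U * x_t)`, `o_t = V * s_t`, and a squared-error loss summed over time steps; the slide fixes neither the non-linearity nor the loss.

```python
import math

def forward(U, W, V, xs, ys, s0=0.0):
    """Unfold the recurrence; return all states and the summed loss."""
    states, loss = [s0], 0.0
    for x, y in zip(xs, ys):
        s = math.tanh(W * states[-1] + U * x)
        states.append(s)
        loss += 0.5 * (V * s - y) ** 2
    return states, loss

def bptt(U, W, V, xs, ys, s0=0.0):
    """Return the loss and its gradients (dU, dW, dV) by walking the graph backwards."""
    states, loss = forward(U, W, V, xs, ys, s0)
    gU = gW = gV = 0.0
    ds_future = 0.0  # gradient reaching s_t from later time steps
    for t in range(len(xs) - 1, -1, -1):
        s_prev, s = states[t], states[t + 1]
        do = V * s - ys[t]            # dL/do_t
        gV += do * s
        ds = do * V + ds_future       # output path plus recurrent path
        da = ds * (1.0 - s * s)       # through tanh
        gW += da * s_prev
        gU += da * xs[t]
        ds_future = da * W            # passed back to s_{t-1}
    return loss, (gU, gW, gV)

xs = [0.5, -1.0, 0.3, 0.8]
ys = [0.1, -0.2, 0.0, 0.4]
U, W, V = 0.7, -0.4, 1.2
loss, (gU, gW, gV) = bptt(U, W, V, xs, ys)

# Finite-difference check of dL/dW.
eps = 1e-6
numeric_gW = (forward(U, W + eps, V, xs, ys)[1]
              - forward(U, W - eps, V, xs, ys)[1]) / (2 * eps)
print(gW, numeric_gW)
```

Because U, W and V are shared across time steps, each gradient is a sum of contributions from every step of the unfolded graph.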

Generative RNNs

• An RNN can represent a fully-connected directed generative model: every variable predicted from all previous ones.

(Figure: unfolded RNN with inputs x_{t-1}, …, x_{t+2}, states s_t, outputs o_t and per-step losses L_{t-1}, L_t, L_{t+1}, with weights U, W and V.)
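In equation form, "every variable predicted from all previous ones" is the chain-rule factorization of the joint distribution, with the state s_t acting as the summary of x_1, …, x_t. The negative log-likelihood form of the per-step loss L_t is an assumption here; the slide only shows that each L_t depends on o_t and x_{t+1}.

```latex
P_\theta(x_1, \dots, x_T) = \prod_{t=1}^{T} P_\theta(x_t \mid x_1, \dots, x_{t-1}),
\qquad
P_\theta(x_{t+1} \mid x_1, \dots, x_t) = P(x_{t+1} \mid o_t),
\qquad
L_t = -\log P(x_{t+1} \mid o_t)
```

Here o_t is computed from s_t, and the first factor P_θ(x_1) is predicted from the initial state alone. Summing L_t over time then gives the negative log-likelihood of the whole sequence.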

Generative Stochastic Nets

• Recurrent nets with noise injected and trained to reconstruct the visible variables (inputs, targets) are called GSNs
• ICML 2014 paper: they estimate the joint distribution of the visible variables via the stationary distribution of the Markov chain
• Can be trained via back-prop, no need to get reliable samples from the chain as part of training

