Neural Networks
Hugo Larochelle (@hugo_larochelle), Google Brain
NEURAL NETWORKS
• What we'll cover
‣ types of learning problems
- definitions of popular learning problems
- how to define an architecture for a learning problem
‣ unintuitive properties of neural networks
- adversarial examples
- optimization landscape of neural networks
[diagram: a feed-forward neural network computing f(x) from inputs x_1, ..., x_d]
Neural Networks
Types of learning problems
SUPERVISED LEARNING
Topics: supervised learning
• Training time
‣ data: {x^(t), y^(t)}
‣ setting: x^(t), y^(t) ~ p(x, y)
• Test time
‣ data: {x^(t), y^(t)}
‣ setting: x^(t), y^(t) ~ p(x, y)
• Example
‣ classification
‣ regression
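A minimal sketch of this setting in PyTorch (my own illustrative toy data and model, not from the slides): we train on pairs (x, y) drawn from p(x, y) and evaluate on fresh pairs from the same distribution.

```python
import torch
import torch.nn as nn

# Toy data from a fixed p(x, y): x in R^10, y in {0, 1} depends on x so there is something to learn.
def sample_batch(n=64, d=10):
    x = torch.randn(n, d)
    y = (x.sum(dim=1) > 0).long()
    return x, y

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# Training time: labeled pairs {x^(t), y^(t)} drawn from p(x, y).
for step in range(200):
    x, y = sample_batch()
    loss = loss_fn(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Test time: fresh pairs from the same p(x, y); we only measure performance.
x_test, y_test = sample_batch(n=1000)
accuracy = (model(x_test).argmax(dim=1) == y_test).float().mean().item()
print(f"test accuracy: {accuracy:.3f}")
```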
UNSUPERVISED LEARNING
Topics: unsupervised learning
• Training time
‣ data: {x^(t)}
‣ setting: x^(t) ~ p(x)
• Test time
‣ data: {x^(t)}
‣ setting: x^(t) ~ p(x)
• Example
‣ distribution estimation
‣ dimensionality reduction
SEMI-SUPERVISED LEARNING
Topics: semi-supervised learning
• Training time
‣ data: {x^(t), y^(t)} and {x^(t)}
‣ setting: x^(t), y^(t) ~ p(x, y) and x^(t) ~ p(x)
• Test time
‣ data: {x^(t), y^(t)}
‣ setting: x^(t), y^(t) ~ p(x, y)
MULTITASK LEARNING
Topics: multitask learning
• Training time
‣ data: {x^(t), y_1^(t), ..., y_M^(t)}
‣ setting: x^(t), y_1^(t), ..., y_M^(t) ~ p(x, y_1, ..., y_M)
• Test time
‣ data: {x^(t), y_1^(t), ..., y_M^(t)}
‣ setting: x^(t), y_1^(t), ..., y_M^(t) ~ p(x, y_1, ..., y_M)
• Example
‣ object recognition in images with multiple objects
MULTITASK LEARNING
Topics: multitask learning
[diagram: a feed-forward network with shared hidden layers h^(1)(x) and h^(2)(x) (weights W^(1), W^(2)) over inputs x_1, ..., x_d, feeding separate task-specific output units y_1, y_2, y_3]
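A minimal sketch of the architecture in the diagram (my own illustrative PyTorch code, assuming M = 3 binary tasks): a shared trunk computes the hidden representation, and each task gets its own output head.

```python
import torch
import torch.nn as nn

class MultitaskNet(nn.Module):
    """Shared hidden layers h^(1), h^(2); one output head per task y_1..y_M."""
    def __init__(self, d_in=10, d_hidden=64, n_tasks=3):
        super().__init__()
        self.trunk = nn.Sequential(                      # shared by all tasks
            nn.Linear(d_in, d_hidden), nn.ReLU(),        # h^(1)(x)
            nn.Linear(d_hidden, d_hidden), nn.ReLU(),    # h^(2)(x)
        )
        self.heads = nn.ModuleList(                      # task-specific outputs
            [nn.Linear(d_hidden, 1) for _ in range(n_tasks)]
        )

    def forward(self, x):
        h = self.trunk(x)
        return [head(h) for head in self.heads]          # one logit per task

model = MultitaskNet()
loss_fn = nn.BCEWithLogitsLoss()
x = torch.randn(32, 10)
ys = [torch.randint(0, 2, (32, 1)).float() for _ in range(3)]   # targets y_1..y_3
# Total loss is the sum of per-task losses; gradients from every task flow into the shared trunk.
loss = sum(loss_fn(logit, y) for logit, y in zip(model(x), ys))
loss.backward()
```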
TRANSFER LEARNING
Topics: transfer learning
• Training time
‣ data: {x^(t), y_1^(t), ..., y_M^(t)}
‣ setting: x^(t), y_1^(t), ..., y_M^(t) ~ p(x, y_1, ..., y_M)
• Test time
‣ data: {x^(t), y_1^(t)}
‣ setting: x^(t), y_1^(t) ~ p(x, y_1)
STRUCTURED OUTPUT PREDICTION
Topics: structured output prediction
• Training time
‣ data: {x^(t), y^(t)}, where y^(t) has arbitrary structure (vector, sequence, graph)
‣ setting: x^(t), y^(t) ~ p(x, y)
• Test time
‣ data: {x^(t), y^(t)}
‣ setting: x^(t), y^(t) ~ p(x, y)
• Example
‣ image caption generation
‣ machine translation
DOMAIN ADAPTATION
Topics: domain adaptation, covariate shift
• Training time
‣ data: {x^(t), y^(t)} and {x̄^(t')}
‣ setting: x^(t) ~ p(x), y^(t) ~ p(y | x^(t)), x̄^(t') ~ q(x) ≈ p(x)
• Test time
‣ data: {x̄^(t), y^(t)}
‣ setting: x̄^(t) ~ q(x), y^(t) ~ p(y | x̄^(t))
• Example
‣ classify sentiment in reviews of different products
‣ training on synthetic data but testing on real data (sim2real)
DOMAIN ADAPTATION
Topics: domain adaptation, covariate shift
• Domain-adversarial networks (Ganin et al. 2015) train the hidden layer representation h(x) to be
1. predictive of the target class
2. indiscriminate of the domain
• Trained by stochastic gradient descent
‣ for each random pair x^(t), x̄^(t'):
1. update W, V, b, c in the opposite direction of the gradient
2. update w, d in the direction of the gradient
• May also be used to promote fair and unbiased models
[diagram: a shared layer h(x) (weights W, bias b) over input x feeds both a class predictor f(x) (parameters V, c) and a domain classifier o(h(x)) (parameters w, d)]
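A minimal sketch of the idea in PyTorch (my own illustrative code, not Ganin et al.'s implementation): a gradient-reversal layer lets a single backward pass update the feature extractor against the domain classifier while the domain classifier itself still descends its own loss, which is the standard way the two opposing updates above are implemented.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; flips the sign of the gradient on the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x
    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

d_in, d_h = 10, 32
feature = nn.Sequential(nn.Linear(d_in, d_h), nn.ReLU())    # h(x), parameters W, b
label_clf = nn.Linear(d_h, 2)                                # f(x), parameters V, c
domain_clf = nn.Linear(d_h, 2)                               # o(h(x)), parameters w, d
opt = torch.optim.SGD(list(feature.parameters()) + list(label_clf.parameters())
                      + list(domain_clf.parameters()), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# One update on a random pair: a labeled source example x (label y) and an unlabeled target example x_bar.
x, y = torch.randn(16, d_in), torch.randint(0, 2, (16,))
x_bar = torch.randn(16, d_in)

h_src, h_tgt = feature(x), feature(x_bar)
class_loss = loss_fn(label_clf(h_src), y)
domain_labels = torch.cat([torch.zeros(16), torch.ones(16)]).long()   # 0 = source, 1 = target
h_all = GradReverse.apply(torch.cat([h_src, h_tgt]))                  # reverse the gradient flowing into the features
domain_loss = loss_fn(domain_clf(h_all), domain_labels)

opt.zero_grad()
(class_loss + domain_loss).backward()   # features: descend class loss, ascend domain loss; domain_clf: descend domain loss
opt.step()
```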
ONE-SHOT LEARNING
Topics: one-shot learning
• Training time
‣ data: {x^(t), y^(t)}
‣ setting: x^(t), y^(t) ~ p(x, y), subject to y^(t) ∈ {1, ..., C}
• Test time
‣ data: {x^(t), y^(t)}
‣ setting: x^(t), y^(t) ~ p(x, y), subject to y^(t) ∈ {C+1, ..., C+M}
‣ side information:
- a single labeled example from each of the M new classes
• Example
‣ recognizing a person based on a single picture of that person
ONE-SHOT LEARNING
Topics: one-shot learning
• Siamese architecture (figure taken from Salakhutdinov and Hinton, 2007)
[diagram: two networks with tied weights W_1 ... W_4 (layers of 500, 500, 2000, 30 units) map inputs X^a and X^b to codes y^a and y^b, which are compared with a distance D[y^a, y^b]]
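A minimal sketch of a Siamese setup (my own illustrative PyTorch code, much smaller than the network in the figure): both inputs go through the same weights, a contrastive-style loss pulls same-class codes together and pushes different-class codes apart, and one-shot prediction is nearest-neighbour in code space against the single labeled example per new class.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# One encoder, applied to both inputs (the weights are shared, i.e. "tied").
encoder = nn.Sequential(nn.Linear(784, 500), nn.ReLU(),
                        nn.Linear(500, 30))   # 30-dimensional code, as in the figure

def contrastive_loss(x_a, x_b, same, margin=1.0):
    """same = 1 if x_a and x_b share a class, 0 otherwise."""
    d = F.pairwise_distance(encoder(x_a), encoder(x_b))
    return (same * d.pow(2) + (1 - same) * F.relu(margin - d).pow(2)).mean()

def one_shot_predict(query, support_x, support_y):
    """Label each query by its nearest single labeled example in code space."""
    d = torch.cdist(encoder(query), encoder(support_x))   # distances to the M support examples
    return support_y[d.argmin(dim=1)]

x_a, x_b = torch.randn(8, 784), torch.randn(8, 784)
same = torch.randint(0, 2, (8,)).float()
loss = contrastive_loss(x_a, x_b, same)
loss.backward()
```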
ZERO-SHOT LEARNING
Topics: zero-shot learning, zero-data learning
• Training time
‣ data: {x^(t), y^(t)}
‣ setting: x^(t), y^(t) ~ p(x, y), subject to y^(t) ∈ {1, ..., C}
‣ side information:
- description vector z_c of each of the C classes
• Test time
‣ data: {x^(t), y^(t)}
‣ setting: x^(t), y^(t) ~ p(x, y), subject to y^(t) ∈ {C+1, ..., C+M}
‣ side information:
- description vector z_c of each of the M new classes
• Example
‣ recognizing an object based on a worded description of it
ZERO-SHOT LEARNING
Topics: zero-shot learning, zero-data learning
• Ba, Swersky, Fidler, Salakhutdinov, arXiv 2015
[diagram: a CNN f maps the image to a 1xk embedding, an MLP g maps the TF-IDF vector of each class's Wikipedia article to a Cxk matrix of class embeddings, and their dot product gives a 1xC vector of class scores; example article: "The Cardinals or Cardinalidae are a family of passerine birds found in North and South America. The South American cardinals in the genus..."]
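A minimal sketch of that scoring scheme (my own illustrative PyTorch code with made-up dimensions, not the authors' implementation): class scores are dot products between an image embedding f(x) and class-description embeddings g(z_c), so unseen classes can be scored as long as a description vector z_c is available.

```python
import torch
import torch.nn as nn

k = 64                       # shared embedding size
d_img, d_txt = 512, 1000     # image feature size and TF-IDF vocabulary size (made-up)

f = nn.Linear(d_img, k)                                   # image embedding (stand-in for a CNN)
g = nn.Sequential(nn.Linear(d_txt, 256), nn.ReLU(),
                  nn.Linear(256, k))                      # MLP on class description vectors z_c

def class_scores(x_feat, z):
    """x_feat: (B, d_img) image features; z: (C, d_txt) description vectors.
    Returns (B, C) scores: the dot product of f(x) with g(z_c) for every class c."""
    return f(x_feat) @ g(z).t()

# Training: score against the C seen classes; test: swap in the M new classes' descriptions.
x_feat = torch.randn(4, d_img)
z_seen = torch.randn(10, d_txt)     # C = 10 seen classes
z_new = torch.randn(5, d_txt)       # M = 5 unseen classes, described but never trained on
loss = nn.functional.cross_entropy(class_scores(x_feat, z_seen), torch.randint(0, 10, (4,)))
pred_new = class_scores(x_feat, z_new).argmax(dim=1)      # zero-shot prediction among the new classes
```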
DESIGNING NEW ARCHITECTURES
Topics: designing new architectures
• Tackling a new learning problem often requires designing a neural architecture adapted to it
• Approach 1: use our intuition for how a human would reason about the problem
• Approach 2: take an existing algorithm/procedure and turn it into a neural network
DESIGNING NEW ARCHITECTURES
Topics: designing new architectures
• Many other examples
‣ structured prediction by unrolling probabilistic inference in an MRF
‣ planning by unrolling the value iteration algorithm (Tamar et al., NIPS 2016)
‣ few-shot learning by unrolling gradient descent on a small training set (Ravi and Larochelle, ICLR 2017); a sketch of this unrolling idea follows below
[diagram: the steps of a learning algorithm mapped onto the layers of a neural network]
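A minimal sketch of the "turn an algorithm into a network" idea (my own illustrative PyTorch code, loosely in the spirit of unrolled gradient descent for few-shot learning, not Ravi and Larochelle's actual method): a few gradient steps on a small support set are unrolled inside the forward computation, so the whole procedure stays differentiable end to end.

```python
import torch
import torch.nn as nn

def unrolled_adaptation(w, x_support, y_support, x_query, lr=0.1, steps=3):
    """Unroll `steps` gradient-descent updates of a linear classifier w on the support set.
    Each update is an ordinary differentiable operation, so the unrolled procedure
    is itself a (deep) computation graph we can backpropagate through."""
    for _ in range(steps):
        loss = nn.functional.cross_entropy(x_support @ w, y_support)
        (grad,) = torch.autograd.grad(loss, w, create_graph=True)   # keep the graph: the step is part of the network
        w = w - lr * grad
    return x_query @ w   # predictions of the adapted classifier

d, n_classes = 10, 3
w0 = torch.zeros(d, n_classes, requires_grad=True)   # initial weights (could themselves be meta-learned)
x_s, y_s = torch.randn(6, d), torch.randint(0, n_classes, (6,))
x_q = torch.randn(4, d)
query_logits = unrolled_adaptation(w0, x_s, y_s, x_q)
query_logits.sum().backward()   # gradients flow back through all the unrolled steps into w0
```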
Neural Networks
Unintuitive properties of neural networks
THEY CAN MAKE DUMB ERRORS
Topics: adversarial examples
• Intriguing Properties of Neural Networks. Szegedy, Zaremba, Sutskever, Bruna, Erhan, Goodfellow, Fergus, ICLR 2014
[figure: correctly classified images, the (magnified) difference, and the resulting badly classified adversarial images]
THEY CAN MAKE DUMB ERRORS
Topics: adversarial examples
• Humans have adversarial examples too
• However, they don't match those of neural networks
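A minimal sketch of how such adversarial examples can be constructed (my own illustrative PyTorch code using the fast gradient sign method from later work by Goodfellow et al., 2015, not the exact optimization used by Szegedy et al.): a small perturbation aligned with the sign of the loss gradient is often enough to change a trained model's prediction while the input looks unchanged.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))   # stand-in classifier
loss_fn = nn.CrossEntropyLoss()

def fgsm(x, y, eps=0.1):
    """Fast gradient sign method: perturb x by eps in the direction that increases the loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).detach()

x, y = torch.rand(1, 784), torch.tensor([3])
x_adv = fgsm(x, y)
# With a trained model, the prediction typically flips even though the perturbation is tiny.
print("clean prediction:      ", model(x).argmax(dim=1).item())
print("adversarial prediction:", model(x_adv).argmax(dim=1).item())
print("max pixel change:      ", (x_adv - x).abs().max().item())
```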
THEY ARE STRANGELY NON-CONVEX
Topics: non-convexity, saddle points
• Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. Dauphin, Pascanu, Gulcehre, Cho, Ganguli, Bengio, NIPS 2014
[plot: average loss as a function of the parameters θ, illustrating saddle points]
THEY ARE STRANGELY NON-CONVEX
Topics: non-convexity, saddle points
• Qualitatively Characterizing Neural Network Optimization Problems. Goodfellow, Vinyals, Saxe, ICLR 2015
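That paper's central experiment is easy to sketch (my own illustrative PyTorch code, assuming a generic model and loss): evaluate the loss along the straight line between the initial parameters θ_init and the final parameters θ_final; in practice this one-dimensional slice of the loss surface is often surprisingly well behaved.

```python
import copy
import torch
import torch.nn as nn

def loss_along_line(model_init, model_final, loss_fn, x, y, n_points=21):
    """Evaluate the loss at θ(α) = (1 - α) θ_init + α θ_final for α in [0, 1]."""
    probe = copy.deepcopy(model_init)
    losses = []
    for alpha in torch.linspace(0.0, 1.0, n_points):
        with torch.no_grad():
            for p, p0, p1 in zip(probe.parameters(),
                                 model_init.parameters(),
                                 model_final.parameters()):
                p.copy_((1 - alpha) * p0 + alpha * p1)
            losses.append(loss_fn(probe(x), y).item())
    return losses

# Hypothetical usage: model_init is a copy kept from before training, model_final the same architecture after training.
model_init = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
model_final = copy.deepcopy(model_init)   # ... train model_final here ...
x, y = torch.randn(128, 10), torch.randint(0, 2, (128,))
curve = loss_along_line(model_init, model_final, nn.CrossEntropyLoss(), x, y)
```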
THEY ARE STRANGELY NON-CONVEX
Topics: non-convexity, saddle points
• If a dataset is created by labeling points using a neural network with N hidden units
‣ training another network with N hidden units on it is likely to fail
‣ but training a larger neural network is more likely to work! (saddle points seem to be a blessing)
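A minimal sketch of this kind of experiment (my own illustrative PyTorch code, not a specific paper's setup): labels come from a fixed "teacher" network with N hidden units, and we compare how well student networks with N and with 4N hidden units fit them. With such a short training run this only illustrates the setup, not the conclusion.

```python
import torch
import torch.nn as nn

N, d = 16, 10
teacher = nn.Sequential(nn.Linear(d, N), nn.Tanh(), nn.Linear(N, 1))
x = torch.randn(2000, d)
with torch.no_grad():
    y = teacher(x)   # dataset labeled by the N-hidden-unit teacher network

def train_student(n_hidden, steps=2000, lr=0.01):
    student = nn.Sequential(nn.Linear(d, n_hidden), nn.Tanh(), nn.Linear(n_hidden, 1))
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(steps):
        loss = nn.functional.mse_loss(student(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()

print("same-size student (N hidden units):", train_student(N))
print("larger student (4N hidden units):  ", train_student(4 * N))
```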
THEY WORK BEST WHEN BADLY TRAINED
Topics: sharp vs. flat minima
• Flat Minima. Hochreiter, Schmidhuber, Neural Computation 1997
[plot: average loss vs. θ for the training function and the testing function, comparing a flat minimum with a sharp minimum]
THEY WORK BEST WHEN BADLY TRAINED
Topics: sharp vs. flat minima
• On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. Keskar, Mudigere, Nocedal, Smelyanskiy, Tang, ICLR 2017
‣ found that using large batch sizes tends to find sharper minima and generalize worse
• This means that we can't talk about generalization without taking the training algorithm into account
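One crude way to probe sharpness (my own illustrative sketch, much simpler than the sharpness measure used by Keskar et al.): perturb the trained parameters with small random noise and see how much the training loss increases; at a sharp minimum the loss degrades quickly, at a flat one it barely moves.

```python
import torch
import torch.nn as nn

def sharpness_probe(model, loss_fn, x, y, eps=0.01, n_samples=10):
    """Average increase in loss when every parameter is perturbed by Gaussian noise of scale eps.
    Larger increases suggest a sharper minimum (a crude proxy, not Keskar et al.'s measure)."""
    base = loss_fn(model(x), y).item()
    increases = []
    with torch.no_grad():
        for _ in range(n_samples):
            noise = [eps * torch.randn_like(p) for p in model.parameters()]
            for p, n in zip(model.parameters(), noise):
                p.add_(n)
            increases.append(loss_fn(model(x), y).item() - base)
            for p, n in zip(model.parameters(), noise):
                p.sub_(n)   # undo the perturbation
    return sum(increases) / n_samples

# Hypothetical usage: compare a model trained with a large batch vs. one trained with a small batch.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
x, y = torch.randn(256, 10), torch.randint(0, 2, (256,))
print("loss increase under perturbation:", sharpness_probe(model, nn.CrossEntropyLoss(), x, y))
```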