Neural Networks
Hugo Larochelle (@hugo_larochelle), Google Brain
NEURAL NETWORKS
• What we'll cover
‣ types of learning problems
- definitions of popular learning problems
- how to define an architecture for a learning problem
‣ unintuitive properties of neural networks
- adversarial examples
- optimization landscape of neural networks
[diagram: a feed-forward neural network computing f(x) from inputs x_1, ..., x_d]
Neural Networks
Types of learning problems
SUPERVISED LEARNING
Topics: supervised learning
• Training time
‣ data: {x^(t), y^(t)}
‣ setting: x^(t), y^(t) ~ p(x, y)
• Test time
‣ data: {x^(t), y^(t)}
‣ setting: x^(t), y^(t) ~ p(x, y)
• Example
‣ classification
‣ regression
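A minimal sketch of this setting in PyTorch (my own illustrative toy data and model, not from the slides): we train on pairs (x, y) drawn from p(x, y) and evaluate on fresh pairs from the same distribution.

```python
import torch
import torch.nn as nn

# Toy data from a fixed p(x, y): x in R^10, y in {0, 1} depends on x so there is something to learn.
def sample_batch(n=64, d=10):
    x = torch.randn(n, d)
    y = (x.sum(dim=1) > 0).long()
    return x, y

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# Training time: labeled pairs {x^(t), y^(t)} drawn from p(x, y).
for step in range(200):
    x, y = sample_batch()
    loss = loss_fn(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Test time: fresh pairs from the same p(x, y); we only measure performance.
x_test, y_test = sample_batch(n=1000)
accuracy = (model(x_test).argmax(dim=1) == y_test).float().mean().item()
print(f"test accuracy: {accuracy:.3f}")
```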
UNSUPERVISED LEARNING
Topics: unsupervised learning
• Training time
‣ data: {x^(t)}
‣ setting: x^(t) ~ p(x)
• Test time
‣ data: {x^(t)}
‣ setting: x^(t) ~ p(x)
• Example
‣ distribution estimation
‣ dimensionality reduction
SEMI-SUPERVISED LEARNING
Topics: semi-supervised learning
• Training time
‣ data: {x^(t), y^(t)} and {x^(t)}
‣ setting: x^(t), y^(t) ~ p(x, y) and x^(t) ~ p(x)
• Test time
‣ data: {x^(t), y^(t)}
‣ setting: x^(t), y^(t) ~ p(x, y)
MULTITASK LEARNING
Topics: multitask learning
• Training time
‣ data: {x^(t), y_1^(t), ..., y_M^(t)}
‣ setting: x^(t), y_1^(t), ..., y_M^(t) ~ p(x, y_1, ..., y_M)
• Test time
‣ data: {x^(t), y_1^(t), ..., y_M^(t)}
‣ setting: x^(t), y_1^(t), ..., y_M^(t) ~ p(x, y_1, ..., y_M)
• Example
‣ object recognition in images with multiple objects
MULTITASK LEARNING
Topics: multitask learning
[diagram: a feed-forward network with shared hidden layers h^(1)(x) and h^(2)(x) (weights W^(1), W^(2)) over inputs x_1, ..., x_d, feeding separate task-specific output units y_1, y_2, y_3]
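A minimal sketch of the architecture in the diagram (my own illustrative PyTorch code, assuming M = 3 binary tasks): a shared trunk computes the hidden representation, and each task gets its own output head.

```python
import torch
import torch.nn as nn

class MultitaskNet(nn.Module):
    """Shared hidden layers h^(1), h^(2); one output head per task y_1..y_M."""
    def __init__(self, d_in=10, d_hidden=64, n_tasks=3):
        super().__init__()
        self.trunk = nn.Sequential(                      # shared by all tasks
            nn.Linear(d_in, d_hidden), nn.ReLU(),        # h^(1)(x)
            nn.Linear(d_hidden, d_hidden), nn.ReLU(),    # h^(2)(x)
        )
        self.heads = nn.ModuleList(                      # task-specific outputs
            [nn.Linear(d_hidden, 1) for _ in range(n_tasks)]
        )

    def forward(self, x):
        h = self.trunk(x)
        return [head(h) for head in self.heads]          # one logit per task

model = MultitaskNet()
loss_fn = nn.BCEWithLogitsLoss()
x = torch.randn(32, 10)
ys = [torch.randint(0, 2, (32, 1)).float() for _ in range(3)]   # targets y_1..y_3
# Total loss is the sum of per-task losses; gradients from every task flow into the shared trunk.
loss = sum(loss_fn(logit, y) for logit, y in zip(model(x), ys))
loss.backward()
```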
TRANSFER LEARNING
Topics: transfer learning
• Training time
‣ data: {x^(t), y_1^(t), ..., y_M^(t)}
‣ setting: x^(t), y_1^(t), ..., y_M^(t) ~ p(x, y_1, ..., y_M)
• Test time
‣ data: {x^(t), y_1^(t)}
‣ setting: x^(t), y_1^(t) ~ p(x, y_1)
STRUCTURED OUTPUT PREDICTION
Topics: structured output prediction
• Training time
‣ data: {x^(t), y^(t)}, where y^(t) has arbitrary structure (vector, sequence, graph)
‣ setting: x^(t), y^(t) ~ p(x, y)
• Test time
‣ data: {x^(t), y^(t)}
‣ setting: x^(t), y^(t) ~ p(x, y)
• Example
‣ image caption generation
‣ machine translation
DOMAIN ADAPTATION
Topics: domain adaptation, covariate shift
• Training time
‣ data: {x^(t), y^(t)} and {x̄^(t')}
‣ setting: x^(t) ~ p(x), y^(t) ~ p(y | x^(t)), x̄^(t') ~ q(x) ≈ p(x)
• Test time
‣ data: {x̄^(t), y^(t)}
‣ setting: x̄^(t) ~ q(x), y^(t) ~ p(y | x̄^(t))
• Example
‣ classify sentiment in reviews of different products
‣ training on synthetic data but testing on real data (sim2real)
DOMAIN ADAPTATION
Topics: domain adaptation, covariate shift
• Domain-adversarial networks (Ganin et al. 2015) train the hidden layer representation h(x) to be
1. predictive of the target class
2. indiscriminate of the domain
• Trained by stochastic gradient descent
‣ for each random pair x^(t), x̄^(t'):
1. update W, V, b, c in the opposite direction of the gradient
2. update w, d in the direction of the gradient
• May also be used to promote fair and unbiased models
[diagram: a shared layer h(x) (weights W, bias b) over input x feeds both a class predictor f(x) (parameters V, c) and a domain classifier o(h(x)) (parameters w, d)]
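A minimal sketch of the idea in PyTorch (my own illustrative code, not Ganin et al.'s implementation): a gradient-reversal layer lets a single backward pass update the feature extractor against the domain classifier while the domain classifier itself still descends its own loss, which is the standard way the two opposing updates above are implemented.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; flips the sign of the gradient on the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x
    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

d_in, d_h = 10, 32
feature = nn.Sequential(nn.Linear(d_in, d_h), nn.ReLU())    # h(x), parameters W, b
label_clf = nn.Linear(d_h, 2)                                # f(x), parameters V, c
domain_clf = nn.Linear(d_h, 2)                               # o(h(x)), parameters w, d
opt = torch.optim.SGD(list(feature.parameters()) + list(label_clf.parameters())
                      + list(domain_clf.parameters()), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# One update on a random pair: a labeled source example x (label y) and an unlabeled target example x_bar.
x, y = torch.randn(16, d_in), torch.randint(0, 2, (16,))
x_bar = torch.randn(16, d_in)

h_src, h_tgt = feature(x), feature(x_bar)
class_loss = loss_fn(label_clf(h_src), y)
domain_labels = torch.cat([torch.zeros(16), torch.ones(16)]).long()   # 0 = source, 1 = target
h_all = GradReverse.apply(torch.cat([h_src, h_tgt]))                  # reverse the gradient flowing into the features
domain_loss = loss_fn(domain_clf(h_all), domain_labels)

opt.zero_grad()
(class_loss + domain_loss).backward()   # features: descend class loss, ascend domain loss; domain_clf: descend domain loss
opt.step()
```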
ONE-SHOT LEARNING
Topics: one-shot learning
• Training time
‣ data: {x^(t), y^(t)}
‣ setting: x^(t), y^(t) ~ p(x, y), subject to y^(t) ∈ {1, ..., C}
• Test time
‣ data: {x^(t), y^(t)}
‣ setting: x^(t), y^(t) ~ p(x, y), subject to y^(t) ∈ {C+1, ..., C+M}
‣ side information:
- a single labeled example from each of the M new classes
• Example
‣ recognizing a person based on a single picture of that person
ONE-SHOT LEARNING
Topics: one-shot learning
• Siamese architecture (figure taken from Salakhutdinov and Hinton, 2007)
[diagram: two networks with tied weights W_1 ... W_4 (layers of 500, 500, 2000, 30 units) map inputs X^a and X^b to codes y^a and y^b, which are compared with a distance D[y^a, y^b]]
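A minimal sketch of a Siamese setup (my own illustrative PyTorch code, much smaller than the network in the figure): both inputs go through the same weights, a contrastive-style loss pulls same-class codes together and pushes different-class codes apart, and one-shot prediction is nearest-neighbour in code space against the single labeled example per new class.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# One encoder, applied to both inputs (the weights are shared, i.e. "tied").
encoder = nn.Sequential(nn.Linear(784, 500), nn.ReLU(),
                        nn.Linear(500, 30))   # 30-dimensional code, as in the figure

def contrastive_loss(x_a, x_b, same, margin=1.0):
    """same = 1 if x_a and x_b share a class, 0 otherwise."""
    d = F.pairwise_distance(encoder(x_a), encoder(x_b))
    return (same * d.pow(2) + (1 - same) * F.relu(margin - d).pow(2)).mean()

def one_shot_predict(query, support_x, support_y):
    """Label each query by its nearest single labeled example in code space."""
    d = torch.cdist(encoder(query), encoder(support_x))   # distances to the M support examples
    return support_y[d.argmin(dim=1)]

x_a, x_b = torch.randn(8, 784), torch.randn(8, 784)
same = torch.randint(0, 2, (8,)).float()
loss = contrastive_loss(x_a, x_b, same)
loss.backward()
```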
ZERO-SHOT LEARNING
Topics: zero-shot learning, zero-data learning
• Training time
‣ data: {x^(t), y^(t)}
‣ setting: x^(t), y^(t) ~ p(x, y), subject to y^(t) ∈ {1, ..., C}
‣ side information:
- description vector z_c of each of the C classes
• Test time
‣ data: {x^(t), y^(t)}
‣ setting: x^(t), y^(t) ~ p(x, y), subject to y^(t) ∈ {C+1, ..., C+M}
‣ side information:
- description vector z_c of each of the M new classes
• Example
‣ recognizing an object based on a worded description of it
ZERO-SHOT LEARNING
Topics: zero-shot learning, zero-data learning
• Ba, Swersky, Fidler, Salakhutdinov, arXiv 2015
[diagram: a CNN f maps the image to a 1xk embedding, an MLP g maps the TF-IDF vector of each class's Wikipedia article to a Cxk matrix of class embeddings, and their dot product gives a 1xC vector of class scores; example article: "The Cardinals or Cardinalidae are a family of passerine birds found in North and South America. The South American cardinals in the genus..."]
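A minimal sketch of that scoring scheme (my own illustrative PyTorch code with made-up dimensions, not the authors' implementation): class scores are dot products between an image embedding f(x) and class-description embeddings g(z_c), so unseen classes can be scored as long as a description vector z_c is available.

```python
import torch
import torch.nn as nn

k = 64                       # shared embedding size
d_img, d_txt = 512, 1000     # image feature size and TF-IDF vocabulary size (made-up)

f = nn.Linear(d_img, k)                                   # image embedding (stand-in for a CNN)
g = nn.Sequential(nn.Linear(d_txt, 256), nn.ReLU(),
                  nn.Linear(256, k))                      # MLP on class description vectors z_c

def class_scores(x_feat, z):
    """x_feat: (B, d_img) image features; z: (C, d_txt) description vectors.
    Returns (B, C) scores: the dot product of f(x) with g(z_c) for every class c."""
    return f(x_feat) @ g(z).t()

# Training: score against the C seen classes; test: swap in the M new classes' descriptions.
x_feat = torch.randn(4, d_img)
z_seen = torch.randn(10, d_txt)     # C = 10 seen classes
z_new = torch.randn(5, d_txt)       # M = 5 unseen classes, described but never trained on
loss = nn.functional.cross_entropy(class_scores(x_feat, z_seen), torch.randint(0, 10, (4,)))
pred_new = class_scores(x_feat, z_new).argmax(dim=1)      # zero-shot prediction among the new classes
```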
DESIGNING NEW ARCHITECTURES
Topics: designing new architectures
• Tackling a new learning problem often requires designing a neural architecture adapted to it
• Approach 1: use our intuition for how a human would reason about the problem
• Approach 2: take an existing algorithm/procedure and turn it into a neural network
DESIGNING NEW ARCHITECTURES
Topics: designing new architectures
• Many other examples
‣ structured prediction by unrolling probabilistic inference in an MRF
‣ planning by unrolling the value iteration algorithm (Tamar et al., NIPS 2016)
‣ few-shot learning by unrolling gradient descent on a small training set (Ravi and Larochelle, ICLR 2017); a sketch of this unrolling idea follows below
[diagram: the steps of a learning algorithm mapped onto the layers of a neural network]
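A minimal sketch of the "turn an algorithm into a network" idea (my own illustrative PyTorch code, loosely in the spirit of unrolled gradient descent for few-shot learning, not Ravi and Larochelle's actual method): a few gradient steps on a small support set are unrolled inside the forward computation, so the whole procedure stays differentiable end to end.

```python
import torch
import torch.nn as nn

def unrolled_adaptation(w, x_support, y_support, x_query, lr=0.1, steps=3):
    """Unroll `steps` gradient-descent updates of a linear classifier w on the support set.
    Each update is an ordinary differentiable operation, so the unrolled procedure
    is itself a (deep) computation graph we can backpropagate through."""
    for _ in range(steps):
        loss = nn.functional.cross_entropy(x_support @ w, y_support)
        (grad,) = torch.autograd.grad(loss, w, create_graph=True)   # keep the graph: the step is part of the network
        w = w - lr * grad
    return x_query @ w   # predictions of the adapted classifier

d, n_classes = 10, 3
w0 = torch.zeros(d, n_classes, requires_grad=True)   # initial weights (could themselves be meta-learned)
x_s, y_s = torch.randn(6, d), torch.randint(0, n_classes, (6,))
x_q = torch.randn(4, d)
query_logits = unrolled_adaptation(w0, x_s, y_s, x_q)
query_logits.sum().backward()   # gradients flow back through all the unrolled steps into w0
```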
Neural Networks
Unintuitive properties of neural networks
THEY CAN MAKE DUMB ERRORS
Topics: adversarial examples
• Intriguing Properties of Neural Networks. Szegedy, Zaremba, Sutskever, Bruna, Erhan, Goodfellow, Fergus, ICLR 2014
[figure: correctly classified images, the (magnified) difference, and the resulting badly classified adversarial images]
THEY CAN MAKE DUMB ERRORS
Topics: adversarial examples
• Humans have adversarial examples too
• However, they don't match those of neural networks
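A minimal sketch of how such adversarial examples can be constructed (my own illustrative PyTorch code using the fast gradient sign method from later work by Goodfellow et al., 2015, not the exact optimization used by Szegedy et al.): a small perturbation aligned with the sign of the loss gradient is often enough to change a trained model's prediction while the input looks unchanged.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))   # stand-in classifier
loss_fn = nn.CrossEntropyLoss()

def fgsm(x, y, eps=0.1):
    """Fast gradient sign method: perturb x by eps in the direction that increases the loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).detach()

x, y = torch.rand(1, 784), torch.tensor([3])
x_adv = fgsm(x, y)
# With a trained model, the prediction typically flips even though the perturbation is tiny.
print("clean prediction:      ", model(x).argmax(dim=1).item())
print("adversarial prediction:", model(x_adv).argmax(dim=1).item())
print("max pixel change:      ", (x_adv - x).abs().max().item())
```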
THEY ARE STRANGELY NON-CONVEX
Topics: non-convexity, saddle points
• Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. Dauphin, Pascanu, Gulcehre, Cho, Ganguli, Bengio, NIPS 2014
[plot: average loss as a function of the parameters θ, illustrating saddle points]
THEY ARE STRANGELY NON-CONVEX
Topics: non-convexity, saddle points
• Qualitatively Characterizing Neural Network Optimization Problems. Goodfellow, Vinyals, Saxe, ICLR 2015
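That paper's central experiment is easy to sketch (my own illustrative PyTorch code, assuming a generic model and loss): evaluate the loss along the straight line between the initial parameters θ_init and the final parameters θ_final; in practice this one-dimensional slice of the loss surface is often surprisingly well behaved.

```python
import copy
import torch
import torch.nn as nn

def loss_along_line(model_init, model_final, loss_fn, x, y, n_points=21):
    """Evaluate the loss at θ(α) = (1 - α) θ_init + α θ_final for α in [0, 1]."""
    probe = copy.deepcopy(model_init)
    losses = []
    for alpha in torch.linspace(0.0, 1.0, n_points):
        with torch.no_grad():
            for p, p0, p1 in zip(probe.parameters(),
                                 model_init.parameters(),
                                 model_final.parameters()):
                p.copy_((1 - alpha) * p0 + alpha * p1)
            losses.append(loss_fn(probe(x), y).item())
    return losses

# Hypothetical usage: model_init is a copy kept from before training, model_final the same architecture after training.
model_init = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
model_final = copy.deepcopy(model_init)   # ... train model_final here ...
x, y = torch.randn(128, 10), torch.randint(0, 2, (128,))
curve = loss_along_line(model_init, model_final, nn.CrossEntropyLoss(), x, y)
```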
THEY ARE STRANGELY NON-CONVEX
Topics: non-convexity, saddle points
• If a dataset is created by labeling points using a neural network with N hidden units
‣ training another network with N hidden units on it is likely to fail
‣ but training a larger neural network is more likely to work! (saddle points seem to be a blessing)
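A minimal sketch of this kind of experiment (my own illustrative PyTorch code, not a specific paper's setup): labels come from a fixed "teacher" network with N hidden units, and we compare how well student networks with N and with 4N hidden units fit them. With such a short training run this only illustrates the setup, not the conclusion.

```python
import torch
import torch.nn as nn

N, d = 16, 10
teacher = nn.Sequential(nn.Linear(d, N), nn.Tanh(), nn.Linear(N, 1))
x = torch.randn(2000, d)
with torch.no_grad():
    y = teacher(x)   # dataset labeled by the N-hidden-unit teacher network

def train_student(n_hidden, steps=2000, lr=0.01):
    student = nn.Sequential(nn.Linear(d, n_hidden), nn.Tanh(), nn.Linear(n_hidden, 1))
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(steps):
        loss = nn.functional.mse_loss(student(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()

print("same-size student (N hidden units):", train_student(N))
print("larger student (4N hidden units):  ", train_student(4 * N))
```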
THEY WORK BEST WHEN BADLY TRAINED
Topics: sharp vs. flat minima
• Flat Minima. Hochreiter, Schmidhuber, Neural Computation 1997
[plot: average loss vs. θ for the training function and the testing function, comparing a flat minimum with a sharp minimum]
THEY WORK BEST WHEN BADLY TRAINED
Topics: sharp vs. flat minima
• On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. Keskar, Mudigere, Nocedal, Smelyanskiy, Tang, ICLR 2017
‣ found that using large batch sizes tends to find sharper minima and generalize worse
• This means that we can't talk about generalization without taking the training algorithm into account
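One crude way to probe sharpness (my own illustrative sketch, much simpler than the sharpness measure used by Keskar et al.): perturb the trained parameters with small random noise and see how much the training loss increases; at a sharp minimum the loss degrades quickly, at a flat one it barely moves.

```python
import torch
import torch.nn as nn

def sharpness_probe(model, loss_fn, x, y, eps=0.01, n_samples=10):
    """Average increase in loss when every parameter is perturbed by Gaussian noise of scale eps.
    Larger increases suggest a sharper minimum (a crude proxy, not Keskar et al.'s measure)."""
    base = loss_fn(model(x), y).item()
    increases = []
    with torch.no_grad():
        for _ in range(n_samples):
            noise = [eps * torch.randn_like(p) for p in model.parameters()]
            for p, n in zip(model.parameters(), noise):
                p.add_(n)
            increases.append(loss_fn(model(x), y).item() - base)
            for p, n in zip(model.parameters(), noise):
                p.sub_(n)   # undo the perturbation
    return sum(increases) / n_samples

# Hypothetical usage: compare a model trained with a large batch vs. one trained with a small batch.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
x, y = torch.randn(256, 10), torch.randint(0, 2, (256,))
print("loss increase under perturbation:", sharpness_probe(model, nn.CrossEntropyLoss(), x, y))
```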