Image Classification with Deep Networks Ronan Collobert Facebook AI Research Feb 11, 2015
Overview • Origins of Deep Learning • Shallow vs Deep • Perceptron • Multi Layer Perceptrons • Going Deeper • Why? • Issues (and fix)? • Convolutional Neural Networks • Fancier Architectures • Applications 2 / 65
Acknowledgement Some of these slides have been cut-and-pasted from Marc’Aurelio Ranzato’s original presentation 3 / 65
Shallow vs Deep 4 / 65
Shallow Learning (1/2) 5 / 65
Shallow Learning (2/2) Typical example 6 / 65
Deep Learning (1/2) 7 / 65
Deep Learning (2/2) 8 / 65
Perceptrons (shallow) 10 / 65
Biological Neuron • Dendrites connected to other neurons through synapses • Excitatory and inhibitory signals are integrated • If stimulus reaches a threshold, neuron fires along the axon 11 / 65
McCulloch and Pitts (1943)
• Neurons as linear threshold units
• Binary inputs x ∈ {0, 1}^d, binary output, vector of weights w ∈ R^d:
  $$f(x) = \begin{cases} 1 & \text{if } w \cdot x > T \\ 0 & \text{otherwise} \end{cases}$$
• A unit can perform OR and AND operations
• Combine these units to represent any boolean function
• How to train them?
12 / 65
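As a quick illustration (not from the slides), a minimal NumPy sketch of such threshold units computing OR and AND:

```python
import numpy as np

def threshold_unit(x, w, T):
    """McCulloch-Pitts unit: outputs 1 iff w . x > T, else 0."""
    return int(np.dot(w, x) > T)

inputs = [np.array(p) for p in [(0, 0), (0, 1), (1, 0), (1, 1)]]
# with weights (1, 1): threshold 0.5 gives OR, threshold 1.5 gives AND
print([threshold_unit(x, np.array([1, 1]), 0.5) for x in inputs])  # [0, 1, 1, 1]
print([threshold_unit(x, np.array([1, 1]), 1.5) for x in inputs])  # [0, 0, 0, 1]
```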
Perceptron: Rosenblatt (1957)
• Input: retina x ∈ R^n
• Associative area: any kind of (fixed) function ϕ(x) ∈ R^d
• Decision function:
  $$f(x) = \begin{cases} 1 & \text{if } w \cdot \varphi(x) > 0 \\ -1 & \text{otherwise} \end{cases}$$
13 / 65
Perceptron: Rosenblatt (1957)
• Training update rule: given (x_t, y_t) ∈ R^d × {−1, 1},
  $$w_{t+1} = w_t + \begin{cases} y_t\,\varphi(x_t) & \text{if } y_t\, w_t \cdot \varphi(x_t) \le 0 \\ 0 & \text{otherwise} \end{cases}$$
• Note that $w_{t+1} \cdot \varphi(x_t) = w_t \cdot \varphi(x_t) + y_t \underbrace{\|\varphi(x_t)\|^2}_{> 0}$
• Corresponds to minimizing $w \mapsto \sum_t \max(0, -y_t\, w \cdot \varphi(x_t))$
14 / 65
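As an aside (not part of the original slides), a minimal NumPy sketch of this update rule, assuming the identity feature map ϕ(x) = x and a hypothetical toy dataset:

```python
import numpy as np

def train_perceptron(X, y, epochs=10):
    """Rosenblatt perceptron with phi(x) = x (identity features)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_t, y_t in zip(X, y):
            # update only on mistakes, i.e. when y_t * (w . phi(x_t)) <= 0
            if y_t * np.dot(w, x_t) <= 0:
                w += y_t * x_t
    return w

# toy linearly separable data, labels in {-1, +1}
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = train_perceptron(X, y)
print(w, np.sign(X @ w))  # the predicted signs should match y
```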
Multi Layer Perceptrons (deeper) 15 / 65
Going Non-Linear • How to train a “good” ϕ(·) in w · ϕ(x)? • Many approaches have been tried! • Neocognitron (Fukushima, 1980) 16 / 65
Going Non-Linear
• Madaline: Winter & Widrow, 1988
• Multi Layer Perceptron:
  x → W¹ × • → tanh(•) → W² × • → score
• Matrix-vector multiplications interleaved with non-linearities
• Each row of W¹ corresponds to a hidden unit
• The number of hidden units must be chosen carefully
17 / 65
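A minimal sketch of this forward pass (illustrative shapes and names; biases omitted, as in the diagram above):

```python
import numpy as np

def mlp_forward(x, W1, W2):
    """Two-layer perceptron: score = W2 @ tanh(W1 @ x)."""
    h = np.tanh(W1 @ x)   # each row of W1 is one hidden unit
    return W2 @ h         # output scores

rng = np.random.default_rng(0)
d, n_hidden, n_out = 5, 10, 3
W1 = rng.normal(size=(n_hidden, d))
W2 = rng.normal(size=(n_out, n_hidden))
print(mlp_forward(rng.normal(size=d), W1, W2))
```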
Universal Approximator (Cybenko, 1989)
• Any function g: R^d → R can be approximated (on a compact set) by a two-layer neural network
  x → W¹ × • → tanh(•) → W² × • → score
• Note:
  • It does not say how to train it
  • It does not say anything about the generalization capabilities
18 / 65
Training a Neural Network
• Given a network f_W(·) with parameters W, “input” examples x_t and “targets” y_t, we want to minimize a loss
  $$W \mapsto \sum_{(x_t, y_t)} C(f_W(x_t), y_t)$$
• View the network + loss as a “stack” of layers:
  x → f_1(•) → f_2(•) → f_3(•) → f_4(•), i.e. $f(x) = f_L(f_{L-1}(\dots f_1(x)))$
• Optimization problem: use some sort of gradient descent
  $$w_l \leftarrow w_l - \lambda\, \frac{\partial f}{\partial w_l} \quad \forall l$$
• How to compute $\frac{\partial f}{\partial w_l}$ for all $l$?
19 / 65
Gradient Backpropagation (1/2)
• In the neural network field: (Rumelhart et al., 1986)
• However, earlier references exist, including (Leibniz, 1675) and (Newton, 1687)
• E.g., in the Adaline (L = 2):
  x → w¹ × • → ½(y − •)²
  • $f_1(x) = w_1 \cdot x$
  • $f_2(f_1) = \frac{1}{2}(y - f_1)^2$
  • $\frac{\partial f}{\partial w_1} = \underbrace{\frac{\partial f_2}{\partial f_1}}_{= -(y - f_1)} \underbrace{\frac{\partial f_1}{\partial w_1}}_{= x}$
20 / 65
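A small numerical check of this Adaline example (toy values, not from the talk); the chain-rule gradient is compared with a finite-difference estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
w1 = rng.normal(size=4)
x, y = rng.normal(size=4), 1.0

def f(w):
    f1 = np.dot(w, x)           # f1(x) = w1 . x
    return 0.5 * (y - f1) ** 2  # f2(f1) = 1/2 (y - f1)^2

# chain rule: df/dw1 = (df2/df1) * (df1/dw1) = -(y - f1) * x
f1 = np.dot(w1, x)
grad = -(y - f1) * x

# finite-difference estimate of the same gradient
eps = 1e-6
num = np.array([(f(w1 + eps * e) - f(w1 - eps * e)) / (2 * eps)
                for e in np.eye(4)])
print(np.allclose(grad, num))   # True
```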
Gradient Backpropagation (2/2)
x → f_1(•) → f_2(•) → f_3(•) → f_4(•)
• Chain rule:
  $$\frac{\partial f}{\partial w_l} = \frac{\partial f}{\partial f_l}\,\frac{\partial f_l}{\partial w_l} = \frac{\partial f_L}{\partial f_{L-1}}\,\frac{\partial f_{L-1}}{\partial f_{L-2}} \cdots \frac{\partial f_{l+1}}{\partial f_l}\,\frac{\partial f_l}{\partial w_l}$$
• In the backprop way, each module f_l(·):
  • receives the gradient w.r.t. its own output f_l
  • computes the gradient w.r.t. its own input f_{l−1} (backward):
    $$\frac{\partial f}{\partial f_{l-1}} = \frac{\partial f}{\partial f_l}\,\frac{\partial f_l}{\partial f_{l-1}}$$
  • computes the gradient w.r.t. its own parameters w_l (if any):
    $$\frac{\partial f}{\partial w_l} = \frac{\partial f}{\partial f_l}\,\frac{\partial f_l}{\partial w_l}$$
21 / 65
Examples Of Modules
• Notation:
  • x: the input of a module
  • z: the target of a loss module
  • y: the output of a module, y = f_l(x)
  • ỹ: the gradient w.r.t. the output of the module
• Modules (forward; backward, i.e. gradient w.r.t. the input; gradient w.r.t. the parameters):
  • Linear: y = W x; backward Wᵀ ỹ; gradient ỹ xᵀ
  • Tanh: y = tanh(x); backward (1 − y²) ỹ
  • Sigmoid: y = 1 / (1 + e⁻ˣ); backward y (1 − y) ỹ
  • ReLU: y = max(0, x); backward 1_{x ≥ 0} ỹ
  • Perceptron Loss: y = max(0, −z x); backward −z · 1_{z x ≤ 0}
  • MSE Loss: y = ½ (x − z)²; backward x − z
22 / 65
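To make this concrete, here is a toy NumPy sketch (not the actual implementation behind the talk) of a few of these modules and of the forward/backward chaining described on the previous slide:

```python
import numpy as np

class Linear:
    def __init__(self, n_out, n_in, rng):
        self.W = rng.normal(scale=0.1, size=(n_out, n_in))
    def forward(self, x):
        self.x = x
        return self.W @ x                        # y = W x
    def backward(self, grad_y):
        self.grad_W = np.outer(grad_y, self.x)   # gradient w.r.t. W: y~ x^T
        return self.W.T @ grad_y                 # gradient w.r.t. input: W^T y~

class Tanh:
    def forward(self, x):
        self.y = np.tanh(x)
        return self.y
    def backward(self, grad_y):
        return (1.0 - self.y ** 2) * grad_y      # (1 - y^2) y~

class MSELoss:
    def forward(self, x, z):
        self.x, self.z = x, z
        return 0.5 * np.sum((x - z) ** 2)        # 1/2 (x - z)^2
    def backward(self):
        return self.x - self.z                   # x - z

# forward left-to-right, backward right-to-left
rng = np.random.default_rng(0)
layers = [Linear(8, 4, rng), Tanh(), Linear(2, 8, rng)]
loss = MSELoss()

x, z = rng.normal(size=4), np.array([0.5, -0.5])
for layer in layers:
    x = layer.forward(x)
print("loss:", loss.forward(x, z))

grad = loss.backward()
for layer in reversed(layers):
    grad = layer.backward(grad)   # each Linear now holds its grad_W
```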
Typical Classification Loss (er, Likelihood)
• Given a set of examples (x_t, y_t) ∈ R^d × N, t = 1 ... T, we want to maximize the (log-)likelihood
  $$\prod_{t=1}^{T} p(y_t \mid x_t), \qquad \text{i.e.} \quad \sum_{t=1}^{T} \log p(y_t \mid x_t)$$
• The network outputs a score f_y(x) per class y
• Interpret scores as conditional probabilities using a softmax:
  $$p(y \mid x) = \frac{e^{f_y(x)}}{\sum_i e^{f_i(x)}}$$
• In practice we consider only log-probabilities:
  $$\log p(y \mid x) = f_y(x) - \log\Big(\sum_i e^{f_i(x)}\Big)$$
23 / 65
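A short sketch of the log-softmax (the max subtraction is the usual numerical-stability trick, not something the slide mentions):

```python
import numpy as np

def log_softmax(scores):
    """log p(y|x) = f_y(x) - log sum_i exp(f_i(x)), computed stably."""
    m = scores.max()
    return scores - (m + np.log(np.sum(np.exp(scores - m))))

scores = np.array([2.0, -1.0, 0.5])   # f_i(x) for 3 classes
log_p = log_softmax(scores)
print(np.exp(log_p).sum())            # probabilities sum to 1
y = 0                                 # true class
print(-log_p[y])                      # negative log-likelihood to minimize
```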
Optimization Techniques
Minimize
  $$W \mapsto \sum_{(x_t, y_t)} C(f_W(x_t), y_t)$$
• Gradient descent (“batch”):
  $$W \leftarrow W - \lambda \sum_{(x_t, y_t)} \frac{\partial C(f_W(x_t), y_t)}{\partial W}$$
• Stochastic gradient descent:
  $$W \leftarrow W - \lambda\, \frac{\partial C(f_W(x_t), y_t)}{\partial W}$$
• Many variants, including second order techniques (where the Hessian is approximated)
24 / 65
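A schematic comparison of the two update rules on a hypothetical least-squares toy problem, where C(f_W(x), y) = ½(W·x − y)² (the batch gradient is averaged here for a stable step size, whereas the slide writes a plain sum):

```python
import numpy as np

def grad_C(W, x_t, y_t):
    # gradient of C(f_W(x_t), y_t) = 1/2 (W . x_t - y_t)^2 w.r.t. W
    return (np.dot(W, x_t) - y_t) * x_t

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=100)
lam = 0.1

# "batch" gradient descent: one update uses all examples
W = np.zeros(3)
for _ in range(100):
    W -= lam * sum(grad_C(W, x_t, y_t) for x_t, y_t in zip(X, y)) / len(X)

# stochastic gradient descent: one update per example
W_sgd = np.zeros(3)
for _ in range(10):
    for x_t, y_t in zip(X, y):
        W_sgd -= lam * grad_C(W_sgd, x_t, y_t)

print(W, W_sgd)   # both approach the true weights [1, -2, 0.5]
```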
Going Deeper 25 / 65
Deeper: What is the Point? (1/3)
x → f_1(•) → f_2(•) → f_3(•) → f_4(•)
• Share features across the “deep” hierarchy
• Compose these features
• Efficiency: intermediate computations are re-used
[ 0 0 1 0 0 0 0 1 0 0 1 0 1 0 0 . . . ] truck feature
26 / 65
Deeper: What is the Point? (2/3) Sharing [ 1 0 0 0 0 1 0 1 0 0 0 0 0 1 0 . . . ] motorbike [ 0 0 1 0 0 0 0 1 0 0 1 0 1 0 0 . . . ] truck 27 / 65
Deeper: What is the Point? (3/3) Composing (Lee et al., 2009) 28 / 65
Deeper: What are the Issues? (1/2)
Vanishing Gradients
• Chain rule:
  $$\frac{\partial f}{\partial w_l} = \frac{\partial f_L}{\partial f_{L-1}}\,\frac{\partial f_{L-1}}{\partial f_{L-2}} \cdots \frac{\partial f_{l+1}}{\partial f_l}\,\frac{\partial f_l}{\partial w_l}$$
• Because of the transfer-function non-linearities, some $\frac{\partial f_{l+1}}{\partial f_l}$ will be very small, or zero, when back-propagating
• E.g. with ReLU: $y = \max(0, x)$, so $\frac{\partial y}{\partial x} = 1_{x \ge 0}$
29 / 65
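A tiny numerical illustration (toy weights, not from the slides) of how these factors shrink the gradient in a deep stack of saturating tanh units:

```python
import numpy as np

# a chain of tanh layers with a fixed weight of 3: once the units saturate,
# each layer multiplies the gradient by 3 * (1 - y^2), which is close to 0
x = 0.1
grad = 1.0
for layer in range(20):
    y = np.tanh(3.0 * x)
    grad *= 3.0 * (1.0 - y ** 2)   # chain-rule factor contributed by this layer
    x = y
print(grad)   # shrinks towards 0 as depth grows
```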
Deeper: What are the Issues? (2/2)
Number of Parameters
• A 200 × 200 image fully connected to 1000 hidden units already leads to 40M parameters (200 × 200 × 1000 = 4 × 10⁷ weights)
• We would need a lot of training examples
• Spatial correlation is local anyway
30 / 65
Fix Vanishing Gradient Issue with Unsupervised Training (1/2)
• Leverage unlabeled data (when there is no y)?
• Popular way to pretrain each layer
• “Auto-encoder/bottleneck” network:
  x → W¹ × • → tanh(•) → W² × • → tanh(•) → W³ × •
• Learn to reconstruct the input: minimize ||f(x) − x||²
• Caveats:
  • Reduces to PCA if there is no W² layer (Bourlard & Kamp, 1988)
  • The projected intermediate space must be of lower dimension
31 / 65
Fix Vanishing Gradient Issue with Unsupervised Training (2/2)
  x → W¹ × • → tanh(•) → W² × • → tanh(•) → W³ × •
• Possible improvements:
  • No W² layer, W³ = (W¹)ᵀ (Bengio et al., 2006)
  • Noise injection in x, reconstruct the true x (Bengio et al., 2008)
  • Impose sparsity constraints on the projection (Kavukcuoglu et al., 2008)
32 / 65
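Below is a minimal sketch of such an auto-encoder trained with the reconstruction loss ||f(x) − x||² (a simplified two-layer, untied-weight version of the diagrams above; sizes and learning rate are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 10, 3                                   # input dim, bottleneck dim (k < d)
X = rng.normal(size=(500, d)) @ rng.normal(size=(d, d)) * 0.3   # correlated toy data
W1 = rng.normal(scale=0.1, size=(k, d))        # encoder
W2 = rng.normal(scale=0.1, size=(d, k))        # decoder
lr = 0.01

for epoch in range(50):
    total = 0.0
    for x in X:
        h = np.tanh(W1 @ x)                    # low-dimensional code
        r = W2 @ h                             # reconstruction f(x)
        e = r - x
        total += np.sum(e ** 2)                # ||f(x) - x||^2
        # backpropagate the reconstruction error through both layers
        gW2 = 2 * np.outer(e, h)
        gh = W2.T @ (2 * e)
        gW1 = np.outer((1 - h ** 2) * gh, x)
        W2 -= lr * gW2
        W1 -= lr * gW1

print("mean squared reconstruction error:", total / len(X))
```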
Fix Number of Parameters Issue by Generating Examples (1/2)
• Capacity h is too large? Find more training examples!
33 / 65
Fix Number of Parameters Issue by Generating Examples (2/2)
• Concrete example: digit recognition
• Add an (infinite) number of random deformations (Simard et al., 2003)
• State-of-the-art with 9 layers with 1000 hidden units and... a GPU (Ciresan et al., 2010)
• In general, data augmentation includes:
  • random translation or rotation
  • random left/right flipping
  • random scaling
34 / 65
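A rough sketch of such random deformations using only NumPy (names and parameters are made up for illustration; note that left/right flipping would not be appropriate for digits, only for natural images):

```python
import numpy as np

def augment(img, rng, max_shift=2, scale_range=(0.9, 1.1), flip=True):
    """Random translation, optional horizontal flip, and nearest-neighbour rescale."""
    h, w = img.shape
    # random translation
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    out = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
    # random left/right flip
    if flip and rng.random() < 0.5:
        out = out[:, ::-1]
    # random scaling, resampled back to the original size
    s = rng.uniform(*scale_range)
    ys = np.clip((np.arange(h) / s).astype(int), 0, h - 1)
    xs = np.clip((np.arange(w) / s).astype(int), 0, w - 1)
    return out[np.ix_(ys, xs)]

rng = np.random.default_rng(0)
image = rng.random((28, 28))                  # stand-in for a training image
print(augment(image, rng, flip=False).shape)  # (28, 28), a new "deformed" example
```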
Convolutional Neural Networks 35 / 65
2D Convolutions (1/4)
• Share parameters across different locations (Fukushima, 1980) (LeCun, 1987)
36 / 65
2D Convolutions (2/4)
• It is like applying a filter to the image...
• ...but the filter is trained
39 / 65
2D Convolutions (3/4)
• It is again a matrix-vector operation, but where the weights are spatially “shared”:
  x → W¹ × • → W² × • → W³ × •
• As for normal linear layers, convolutions can be stacked for higher-level representations
40 / 65
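A naive sketch of such a shared-weight operation (really a cross-correlation, as in most deep learning libraries; the 3×3 filter here is fixed for illustration, whereas a ConvNet learns its filters):

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2D convolution: the same weights are applied at every location."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.default_rng(0).random((8, 8))
edge_filter = np.array([[-1., 0., 1.],
                        [-2., 0., 2.],
                        [-1., 0., 1.]])
print(conv2d(image, edge_filter).shape)   # (6, 6) feature map
```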
2D Convolutions (4/4) 41 / 65
Spatial Pooling (1/2)
• “Pooling” (e.g. with a max() operation) increases robustness w.r.t. spatial location
42 / 65
Spatial Pooling (2/2)
Controls the capacity
• A unit will see “more” of the image, for the same number of parameters
• Adding pooling decreases the size of subsequent fully connected layers!
43 / 65
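A minimal sketch of non-overlapping max pooling (pool size and input are illustrative):

```python
import numpy as np

def max_pool(fmap, size=2):
    """Keep the maximum over each non-overlapping size x size region."""
    H, W = fmap.shape
    H, W = H - H % size, W - W % size                 # drop incomplete borders
    blocks = fmap[:H, :W].reshape(H // size, size, W // size, size)
    return blocks.max(axis=(1, 3))

fmap = np.arange(36, dtype=float).reshape(6, 6)
print(max_pool(fmap))   # 3 x 3 output: small shifts of the input barely change it
```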