Image Classification with Deep Networks
Ronan Collobert, Facebook AI Research


  1. Image Classification with Deep Networks. Ronan Collobert, Facebook AI Research. Feb 11, 2015

  2. Overview • Origins of Deep Learning • Shallow vs Deep • Perceptron • Multi Layer Perceptrons • Going Deeper • Why? • Issues (and fixes) • Convolutional Neural Networks • Fancier Architectures • Applications

  3. Acknowledgement Part of these slides has been cut-and-pasted from Marc’Aurelio Ranzato’s original presentation

  4. Shallow vs Deep

  5. Shallow Learning (1/2)

  6. Shallow Learning (2/2) Typical example

  7. Deep Learning (1/2)

  8. Deep Learning (2/2)

  9. Deep Learning (2/2)

  10. Perceptrons (shallow)

  11. Biological Neuron • Dendrites connected to other neurons through synapses • Excitatory and inhibitory signals are integrated • If stimulus reaches a threshold, neuron fires along the axon

  12. McCulloch and Pitts (1943) • Neuron as a linear threshold unit • Binary inputs $x \in \{0,1\}^d$, binary output, vector of weights $w \in \mathbb{R}^d$: $f(x) = 1$ if $w \cdot x > T$, $0$ otherwise • A unit can perform OR and AND operations • Combine these units to represent any boolean function • How to train them?
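
This unit is easy to play with in code. Below is a minimal NumPy sketch (mine, not from the slides; the weights and thresholds are illustrative choices) showing that the same unit computes OR or AND depending only on the threshold $T$:

    import numpy as np

    def threshold_unit(x, w, T):
        # McCulloch-Pitts linear threshold unit: f(x) = 1 if w . x > T, 0 otherwise
        return 1 if np.dot(w, x) > T else 0

    inputs = [np.array(p) for p in [(0, 0), (0, 1), (1, 0), (1, 1)]]
    w = np.array([1.0, 1.0])

    # OR: fires as soon as at least one input is active (threshold between 0 and 1)
    print([threshold_unit(x, w, T=0.5) for x in inputs])   # [0, 1, 1, 1]
    # AND: fires only when both inputs are active (threshold between 1 and 2)
    print([threshold_unit(x, w, T=1.5) for x in inputs])   # [0, 0, 0, 1]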

  13. Perceptron: Rosenblatt (1957) • Input: retina $x \in \mathbb{R}^n$ • Associative area: any kind of (fixed) function $\varphi(x) \in \mathbb{R}^d$ • Decision function: $f(x) = 1$ if $w \cdot \varphi(x) > 0$, $-1$ otherwise

  14. Perceptron: Rosenblatt (1957) [figure: separating hyperplane $w \cdot x + b = 0$] • Training update rule: given $(x_t, y_t) \in \mathbb{R}^d \times \{-1, 1\}$, set $w_{t+1} = w_t + y_t\, \varphi(x_t)$ if $y_t\, w_t \cdot \varphi(x_t) \le 0$, and $w_{t+1} = w_t$ otherwise • Note that $w_{t+1} \cdot \varphi(x_t) = w_t \cdot \varphi(x_t) + y_t\, \|\varphi(x_t)\|^2$, i.e. $y_t\, w \cdot \varphi(x_t)$ increases by $\|\varphi(x_t)\|^2 > 0$ • Corresponds to minimizing $w \mapsto \sum_t \max(0, -y_t\, w \cdot \varphi(x_t))$
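
The update rule translates almost line for line into code. A minimal sketch, assuming $\varphi$ is the identity and using a made-up linearly separable toy set:

    import numpy as np

    def train_perceptron(X, y, epochs=10):
        """Rosenblatt update: w <- w + y_t * phi(x_t) whenever y_t * (w . phi(x_t)) <= 0.
        Here phi is the identity; X has shape (T, d), y contains labels in {-1, +1}."""
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            for x_t, y_t in zip(X, y):
                if y_t * np.dot(w, x_t) <= 0:   # mistake (or zero margin): update
                    w = w + y_t * x_t
        return w

    # Toy linearly separable data (illustrative only)
    X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
    y = np.array([1, 1, -1, -1])
    w = train_perceptron(X, y)
    print(np.sign(X @ w))   # should match y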

  15. Multi Layer Perceptrons (deeper)

  16. Going Non-Linear • How to train a “good” $\varphi(\cdot)$ in $w \cdot \varphi(x)$? • Many attempts have been made! • Neocognitron (Fukushima, 1980)

  17. Going Non-Linear • Madaline: Winter & Widrow, 1988 • Multi Layer Perceptron: $x \rightarrow W_1 \times \cdot \rightarrow \tanh(\cdot) \rightarrow W_2 \times \cdot \rightarrow$ score • Matrix-vector multiplications interleaved with non-linearities • Each row of $W_1$ corresponds to a hidden unit • The number of hidden units must be chosen carefully
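
A minimal sketch of the forward pass of such a two-layer network (matrix multiply, tanh, matrix multiply); the layer sizes and random initialization below are arbitrary illustrative choices:

    import numpy as np

    rng = np.random.default_rng(0)
    d_in, n_hidden, n_classes = 10, 32, 3              # arbitrary sizes

    W1 = rng.standard_normal((n_hidden, d_in)) * 0.1   # each row of W1 is one hidden unit
    W2 = rng.standard_normal((n_classes, n_hidden)) * 0.1

    def mlp_forward(x):
        # matrix-vector multiplication, non-linearity, matrix-vector multiplication
        h = np.tanh(W1 @ x)        # hidden representation
        return W2 @ h              # one score per class

    x = rng.standard_normal(d_in)
    print(mlp_forward(x))          # 3 class scores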

  18. Universal Approximator (Cybenko, 1989) • Any function $g: \mathbb{R}^d \rightarrow \mathbb{R}$ can be approximated (on a compact set) by a two-layer neural network $x \rightarrow W_1 \times \cdot \rightarrow \tanh(\cdot) \rightarrow W_2 \times \cdot \rightarrow$ score • Note: • It does not say how to train it • It does not say anything about the generalization capabilities

  19. Training a Neural Network • Given a network $f_W(\cdot)$ with parameters $W$, “input” examples $x_t$ and “targets” $y_t$, we want to minimize a loss $W \mapsto \sum_{(x_t, y_t)} C(f_W(x_t), y_t)$ • View the network + loss as a “stack” of layers $x \rightarrow f_1(\cdot) \rightarrow f_2(\cdot) \rightarrow f_3(\cdot) \rightarrow f_4(\cdot)$, i.e. $f(x) = f_L(f_{L-1}(\ldots f_1(x)))$ • Optimization problem: use some sort of gradient descent, $w_l \leftarrow w_l - \lambda\, \frac{\partial f}{\partial w_l}\ \forall l$ • How to compute $\frac{\partial f}{\partial w_l}\ \forall l$?

  20. Gradient Backpropagation (1/2) • In the neural network field: (Rumelhart et al., 1986) • However, earlier possible references exist, including (Leibniz, 1675) and (Newton, 1687) • E.g., in the Adaline ($L = 2$): $x \rightarrow w_1 \times \cdot \rightarrow \frac{1}{2}(y - \cdot)^2$, with $f_1(x) = w_1 \cdot x$ and $f_2(f_1) = \frac{1}{2}(y - f_1)^2$, so $\frac{\partial f}{\partial w_1} = \frac{\partial f_2}{\partial f_1}\, \frac{\partial f_1}{\partial w_1}$, where $\frac{\partial f_2}{\partial f_1} = f_1 - y$ and $\frac{\partial f_1}{\partial w_1} = x$

  21. Gradient Backpropagation (2/2) $x \rightarrow f_1(\cdot) \rightarrow f_2(\cdot) \rightarrow f_3(\cdot) \rightarrow f_4(\cdot)$ • Chain rule: $\frac{\partial f}{\partial w_l} = \frac{\partial f_L}{\partial f_{L-1}}\, \frac{\partial f_{L-1}}{\partial f_{L-2}} \cdots \frac{\partial f_{l+1}}{\partial f_l}\, \frac{\partial f_l}{\partial w_l} = \frac{\partial f}{\partial f_l}\, \frac{\partial f_l}{\partial w_l}$ • In the backprop way, each module $f_l(\cdot)$: • Receives the gradient w.r.t. its own output $f_l$ • Computes the gradient w.r.t. its own input $f_{l-1}$ (backward): $\frac{\partial f}{\partial f_{l-1}} = \frac{\partial f}{\partial f_l}\, \frac{\partial f_l}{\partial f_{l-1}}$ • Computes the gradient w.r.t. its own parameters $w_l$ (if any): $\frac{\partial f}{\partial w_l} = \frac{\partial f}{\partial f_l}\, \frac{\partial f_l}{\partial w_l}$

  22. Examples Of Modules • We denote: $x$ the input of a module, $z$ the target of a loss module, $y$ the output of a module $f_l(x)$, and $\tilde y$ the gradient w.r.t. the output of each module

    Module           | Forward                      | Backward (w.r.t. input)   | Gradient (w.r.t. parameters)
    Linear           | $y = W x$                    | $W^T \tilde y$            | $\tilde y\, x^T$
    Tanh             | $y = \tanh(x)$               | $(1 - y^2)\, \tilde y$    |
    Sigmoid          | $y = 1/(1 + e^{-x})$         | $y (1 - y)\, \tilde y$    |
    ReLU             | $y = \max(0, x)$             | $1_{x \ge 0}\, \tilde y$  |
    Perceptron Loss  | $y = \max(0, -z\, x)$        | $-z\, 1_{z x \le 0}$      |
    MSE Loss         | $y = \frac{1}{2}(x - z)^2$   | $x - z$                   |
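
The table maps directly onto small modules with a forward and a backward method. Below is a hedged sketch of the Linear and Tanh modules plus the MSE loss following these conventions; the class names, shapes and the tiny usage example are my own choices, not an existing library API:

    import numpy as np

    class Linear:
        def __init__(self, n_in, n_out, rng):
            self.W = rng.standard_normal((n_out, n_in)) * 0.1
        def forward(self, x):
            self.x = x
            return self.W @ x                         # y = W x
        def backward(self, grad_y):
            self.grad_W = np.outer(grad_y, self.x)    # gradient w.r.t. parameters: ~y x^T
            return self.W.T @ grad_y                  # gradient w.r.t. input: W^T ~y

    class Tanh:
        def forward(self, x):
            self.y = np.tanh(x)
            return self.y
        def backward(self, grad_y):
            return (1.0 - self.y ** 2) * grad_y       # (1 - y^2) ~y

    def mse_loss(x, z):
        return 0.5 * np.sum((x - z) ** 2), x - z      # forward: 1/2 (x - z)^2, backward: x - z

    # Tiny usage: forward through Linear + Tanh, then backward from the loss
    rng = np.random.default_rng(0)
    lin, act = Linear(4, 3, rng), Tanh()
    x, z = rng.standard_normal(4), np.zeros(3)
    loss, grad = mse_loss(act.forward(lin.forward(x)), z)
    lin.backward(act.backward(grad))                  # leaves the parameter gradient in lin.grad_W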

  23. Typical Classification Loss (euh, Likelihood) • Given a set of examples $(x_t, y_t) \in \mathbb{R}^d \times \mathbb{N}$, $t = 1 \ldots T$, we want to maximize the (log-)likelihood $\log \prod_{t=1}^{T} p(y_t | x_t) = \sum_{t=1}^{T} \log p(y_t | x_t)$ • The network outputs a score $f_y(x)$ per class $y$ • Interpret scores as conditional probabilities using a softmax: $p(y | x) = \frac{e^{f_y(x)}}{\sum_i e^{f_i(x)}}$ • In practice we consider only log-probabilities: $\log p(y | x) = f_y(x) - \log\left(\sum_i e^{f_i(x)}\right)$
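
A small sketch of this log-probability computation, written in the numerically stable way (subtracting the maximum score before exponentiating); the scores and target below are made up:

    import numpy as np

    def log_softmax(scores):
        # log p(y|x) = f_y(x) - log(sum_i exp(f_i(x))), computed stably
        shifted = scores - np.max(scores)
        return shifted - np.log(np.sum(np.exp(shifted)))

    def nll_loss(scores, target):
        # negative log-likelihood of the target class (to be minimized)
        return -log_softmax(scores)[target]

    scores = np.array([2.0, -1.0, 0.5])   # f_y(x) for 3 classes (made-up values)
    print(nll_loss(scores, target=0))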

  24. Optimization Techniques Minimize $W \mapsto \sum_{(x_t, y_t)} C(f_W(x_t), y_t)$ • Gradient descent (“batch”): $W \leftarrow W - \lambda \sum_{(x_t, y_t)} \frac{\partial C(f_W(x_t), y_t)}{\partial W}$ • Stochastic gradient descent: $W \leftarrow W - \lambda\, \frac{\partial C(f_W(x_t), y_t)}{\partial W}$ • Many variants, including second-order techniques (where the Hessian is approximated)
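
To make the two variants concrete, here is a sketch of plain stochastic gradient descent on a toy problem where the gradient has a closed form (a linear model with squared loss, an illustrative stand-in rather than the networks of the slides); the batch version would sum these gradients over all examples before each update:

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy regression problem: y = x . w_true + noise (illustrative data)
    w_true = np.array([1.0, -2.0, 0.5])
    X = rng.standard_normal((200, 3))
    Y = X @ w_true + 0.01 * rng.standard_normal(200)

    W = np.zeros(3)
    lam = 0.05                                  # learning rate (the lambda of the slides)
    for epoch in range(20):
        for x_t, y_t in zip(X, Y):              # stochastic: one example at a time
            grad = (W @ x_t - y_t) * x_t        # dC/dW for C = 1/2 (f_W(x_t) - y_t)^2
            W = W - lam * grad
    print(W)                                    # should be close to w_true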

  25. Going Deeper

  26. Deeper: What is the Point? (1/3) $x \rightarrow f_1(\cdot) \rightarrow f_2(\cdot) \rightarrow f_3(\cdot) \rightarrow f_4(\cdot)$ • Share features across the “deep” hierarchy • Compose these features • Efficiency: intermediate computations are re-used [figure: sparse binary feature vector [ 0 0 1 0 0 0 0 1 0 0 1 0 1 0 0 . . . ] for “truck”]

  27. Deeper: What is the Point? (2/3) Sharing [figure: feature vectors for motorbike [ 1 0 0 0 0 1 0 1 0 0 0 0 0 1 0 . . . ] and truck [ 0 0 1 0 0 0 0 1 0 0 1 0 1 0 0 . . . ]]

  28. Deeper: What is the Point? (3/3) Composing (Lee et al., 2009)

  29. Deeper: What are the Issues? (1/2) Vanishing Gradients • Chain rule: $\frac{\partial f}{\partial w_l} = \frac{\partial f_L}{\partial f_{L-1}}\, \frac{\partial f_{L-1}}{\partial f_{L-2}} \cdots \frac{\partial f_{l+1}}{\partial f_l}\, \frac{\partial f_l}{\partial w_l}$ • Because of the transfer-function non-linearities, some $\frac{\partial f_{l+1}}{\partial f_l}$ will be very small, or zero, when back-propagating • E.g. with ReLU: $y = \max(0, x)$, $\frac{\partial y}{\partial x} = 1_{x \ge 0}$
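
A small numeric illustration of the effect (my own, not from the slides): back-propagating a gradient through a stack of random saturating tanh layers multiplies it by a factor of at most one at every layer, and its norm tends to decay with depth. Width, depth and initialization are arbitrary choices:

    import numpy as np

    rng = np.random.default_rng(0)
    dim, depth = 100, 20                        # arbitrary width and depth

    # Forward pass through a stack of random tanh layers, storing activations
    Ws, ys = [], []
    x = rng.standard_normal(dim)
    for _ in range(depth):
        W = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        x = np.tanh(W @ x)
        Ws.append(W)
        ys.append(x)

    # Backward pass: each tanh layer scales the gradient by (1 - y^2) <= 1
    grad = np.ones(dim)                         # pretend gradient arriving from the loss
    norms = []
    for W, y in zip(reversed(Ws), reversed(ys)):
        grad = W.T @ ((1.0 - y ** 2) * grad)
        norms.append(np.linalg.norm(grad))
    print(norms[0], norms[-1])                  # the norm typically shrinks toward the lower layers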

  30. Deeper: What are the Issues? (2/2) Number of Parameters • A fully connected layer on a 200 × 200 image with 1000 hidden units already has 200 · 200 · 1000 = 40M parameters • We would need a lot of training examples • Spatial correlation is local anyway

  31. Fix Vanishing Gradient Issue with Unsupervised Training (1/2) • Leverage unlabeled data (when there is no $y$)? • Popular way to pretrain each layer • “Auto-encoder/bottleneck” network: $x \rightarrow W_1 \times \cdot \rightarrow \tanh(\cdot) \rightarrow W_2 \times \cdot \rightarrow \tanh(\cdot) \rightarrow W_3 \times \cdot$ • Learn to reconstruct the input: minimize $\|f(x) - x\|^2$ • Caveats: • Reduces to PCA if there is no $W_2$ layer (Bourlard & Kamp, 1988) • The projected intermediate space must be of lower dimension

  32. Fix Vanishing Gradient Issue with Unsupervised Training (2/2) $x \rightarrow W_1 \times \cdot \rightarrow \tanh(\cdot) \rightarrow W_2 \times \cdot \rightarrow \tanh(\cdot) \rightarrow W_3 \times \cdot$ • Possible improvements: • No $W_2$ layer, tied weights $W_3 = W_1^T$ (Bengio et al., 2006) • Inject noise into $x$ and reconstruct the true $x$ (Bengio et al., 2008) • Impose sparsity constraints on the projection (Kavukcuoglu et al., 2008)
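
A minimal sketch of the auto-encoder idea with a low-dimensional bottleneck and the tied-weights variant listed above ($W_3 = W_1^T$, no $W_2$), trained by stochastic gradient descent on the reconstruction error; sizes, data and learning rate are illustrative choices:

    import numpy as np

    rng = np.random.default_rng(0)
    d, n_hidden = 20, 5                            # bottleneck smaller than the input (illustrative)
    W1 = rng.standard_normal((n_hidden, d)) * 0.1  # tied decoder: W3 = W1^T, no W2 layer

    # Toy data living on a low-dimensional subspace, so reconstruction is possible
    X = rng.standard_normal((500, 3)) @ (rng.standard_normal((3, d)) * 0.5)

    lam = 0.01
    for epoch in range(50):
        for x in X:
            h = np.tanh(W1 @ x)                    # encode
            r = W1.T @ h                           # decode with tied weights
            err = r - x                            # gradient of 1/2 ||f(x) - x||^2 w.r.t. r
            delta = (1.0 - h ** 2) * (W1 @ err)    # back through tanh to the encoder pre-activation
            W1 -= lam * (np.outer(h, err) + np.outer(delta, x))

    R = np.tanh(X @ W1.T) @ W1
    print(np.mean((R - X) ** 2), np.mean(X ** 2))  # reconstruction error vs. data variance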

  33. Fix Number of Parameters Issue by Generating Examples (1/2) • Capacity $h$ is too large? Find more training examples $L$!

  34. Fix Number of Parameters Issue by Generating Examples (2/2) • Concrete example: digit recognition • Add an (infinite) number of random deformations (Simard et al., 2003) • State-of-the-art with 9 layers of 1000 hidden units and... a GPU (Ciresan et al., 2010) • In general, data augmentation includes • random translation or rotation • random left/right flipping • random scaling
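
A sketch of these random deformations for a single 2D image, assuming scipy.ndimage is available for the geometric transforms; all parameter ranges are arbitrary illustrative choices (for digits, flipping would usually be left out):

    import numpy as np
    from scipy import ndimage   # assumed available for shift / rotate / zoom

    rng = np.random.default_rng(0)

    def augment(img):
        """Random deformation of a 2D image: translation, rotation, left/right flip, scaling."""
        out = ndimage.shift(img, shift=rng.uniform(-2, 2, size=2), mode="nearest")
        out = ndimage.rotate(out, angle=rng.uniform(-10, 10), reshape=False, mode="nearest")
        if rng.random() < 0.5:
            out = np.fliplr(out)                   # random left/right flip
        zoomed = ndimage.zoom(out, zoom=rng.uniform(0.9, 1.1), mode="nearest")
        # crop or pad back to the original shape after scaling
        h, w = img.shape
        zh, zw = zoomed.shape
        if zh >= h:
            top, left = (zh - h) // 2, (zw - w) // 2
            return zoomed[top:top + h, left:left + w]
        pad_h, pad_w = h - zh, w - zw
        return np.pad(zoomed, ((pad_h // 2, pad_h - pad_h // 2),
                               (pad_w // 2, pad_w - pad_w // 2)))

    img = rng.random((28, 28))          # stand-in for a digit image
    print(augment(img).shape)           # (28, 28)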

  35. Convolutional Neural Networks

  36. 2D Convolutions (1/4) • Share parameters across different locations (Fukushima, 1980) (LeCun, 1987)

  37. 2D Convolutions (1/4) • Share parameters across different locations (Fukushima, 1980) (LeCun, 1987)

  38. 2D Convolutions (1/4) • Share parameters across different locations (Fukushima, 1980) (LeCun, 1987)

  39. 2D Convolutions (2/4) • It is like applying a filter to the image... • ...but the filter is trained [figure: output = filter ⋆ image]

  40. 2D Convolutions (3/4) • It is again a matrix-vector operation, but where the weights are spatially “shared”: $x \rightarrow W_1 \cdot \rightarrow W_2 \cdot \rightarrow W_3 \cdot$ • As for normal linear layers, convolutions can be stacked for higher-level representations

  41. 2D Convolutions (4/4)
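
A naive loop-based sketch of a 2D convolution in "valid" mode, to make the weight sharing explicit: the same small kernel is applied at every spatial location. The kernel values below are illustrative; in a convolutional network they would be learned by backpropagation, and real implementations use much faster routines:

    import numpy as np

    def conv2d(image, kernel):
        """Naive 2D cross-correlation ("valid" mode): the same kernel weights
        are shared across every spatial location of the image."""
        H, W = image.shape
        kh, kw = kernel.shape
        out = np.zeros((H - kh + 1, W - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return out

    rng = np.random.default_rng(0)
    image = rng.random((8, 8))
    kernel = np.array([[1.0, 0.0, -1.0],    # an edge-detector-like filter; in a
                       [2.0, 0.0, -2.0],    # convolutional network these weights
                       [1.0, 0.0, -1.0]])   # would be learned by backpropagation
    print(conv2d(image, kernel).shape)      # (6, 6)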

  42. Spatial Pooling (1/2) • “Pooling” (e.g. with a max() operation) increases robustness w.r.t. spatial location

  43. Spatial Pooling (2/2) Controls the capacity • A unit will see “more” of the image, for the same number of parameters • Adding pooling decreases the size of subsequent fully connected layers!
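
A sketch of non-overlapping 2 × 2 max pooling, which halves each spatial dimension (and therefore shrinks any subsequent fully connected layer); the window size is an illustrative choice:

    import numpy as np

    def max_pool(feature_map, k=2):
        """Non-overlapping k x k max pooling; assumes the input dimensions are divisible by k."""
        H, W = feature_map.shape
        reshaped = feature_map.reshape(H // k, k, W // k, k)
        return reshaped.max(axis=(1, 3))

    x = np.arange(16, dtype=float).reshape(4, 4)
    print(max_pool(x))
    # [[ 5.  7.]
    #  [13. 15.]]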
