Image Classification with Deep Networks Ronan Collobert Facebook AI Research Feb 11, 2015
Overview • Origins of Deep Learning • Shallow vs Deep • Perceptron • Multi Layer Perceptrons • Going Deeper • Why? • Issues (and fix)? • Convolutional Neural Networks • Fancier Architectures • Applications 2 / 65
Acknowledgement Some of these slides have been cut-and-pasted from Marc’Aurelio Ranzato’s original presentation 3 / 65
Shallow vs Deep 4 / 65
Shallow Learning (1/2) 5 / 65
Shallow Learning (2/2) Typical example 6 / 65
Deep Learning (1/2) 7 / 65
Deep Learning (2/2) 8 / 65
Perceptrons (shallow) 10 / 65
Biological Neuron • Dendrites connected to other neurons through synapses • Excitatory and inhibitory signals are integrated • If stimulus reaches a threshold, neuron fires along the axon 11 / 65
McCulloch and Pitts (1943)
• Neurons as linear threshold units
• Binary inputs x ∈ {0, 1}^d, binary output, vector of weights w ∈ R^d:
  $$f(x) = \begin{cases} 1 & \text{if } w \cdot x > T \\ 0 & \text{otherwise} \end{cases}$$
• A unit can perform OR and AND operations
• Combine these units to represent any boolean function
• How to train them?
12 / 65
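As a quick illustration (not from the slides), a minimal NumPy sketch of such threshold units computing OR and AND:

```python
import numpy as np

def threshold_unit(x, w, T):
    """McCulloch-Pitts unit: outputs 1 iff w . x > T, else 0."""
    return int(np.dot(w, x) > T)

inputs = [np.array(p) for p in [(0, 0), (0, 1), (1, 0), (1, 1)]]
# with weights (1, 1): threshold 0.5 gives OR, threshold 1.5 gives AND
print([threshold_unit(x, np.array([1, 1]), 0.5) for x in inputs])  # [0, 1, 1, 1]
print([threshold_unit(x, np.array([1, 1]), 1.5) for x in inputs])  # [0, 0, 0, 1]
```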
Perceptron: Rosenblatt (1957)
• Input: retina x ∈ R^n
• Associative area: any kind of (fixed) function ϕ(x) ∈ R^d
• Decision function:
  $$f(x) = \begin{cases} 1 & \text{if } w \cdot \varphi(x) > 0 \\ -1 & \text{otherwise} \end{cases}$$
13 / 65
Perceptron: Rosenblatt (1957)
• Training update rule: given (x_t, y_t) ∈ R^d × {−1, 1},
  $$w_{t+1} = w_t + \begin{cases} y_t\,\varphi(x_t) & \text{if } y_t\, w_t \cdot \varphi(x_t) \le 0 \\ 0 & \text{otherwise} \end{cases}$$
• Note that $w_{t+1} \cdot \varphi(x_t) = w_t \cdot \varphi(x_t) + y_t \underbrace{\|\varphi(x_t)\|^2}_{> 0}$
• Corresponds to minimizing $w \mapsto \sum_t \max(0, -y_t\, w \cdot \varphi(x_t))$
14 / 65
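As an aside (not part of the original slides), a minimal NumPy sketch of this update rule, assuming the identity feature map ϕ(x) = x and a hypothetical toy dataset:

```python
import numpy as np

def train_perceptron(X, y, epochs=10):
    """Rosenblatt perceptron with phi(x) = x (identity features)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_t, y_t in zip(X, y):
            # update only on mistakes, i.e. when y_t * (w . phi(x_t)) <= 0
            if y_t * np.dot(w, x_t) <= 0:
                w += y_t * x_t
    return w

# toy linearly separable data, labels in {-1, +1}
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = train_perceptron(X, y)
print(w, np.sign(X @ w))  # the predicted signs should match y
```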
Multi Layer Perceptrons (deeper) 15 / 65
Going Non-Linear • How to train a “good” ϕ(·) in w · ϕ(x)? • Many approaches have been tried! • Neocognitron (Fukushima, 1980) 16 / 65
Going Non-Linear
• Madaline: Winter & Widrow, 1988
• Multi Layer Perceptron:
  x → W¹ × • → tanh(•) → W² × • → score
• Matrix-vector multiplications interleaved with non-linearities
• Each row of W¹ corresponds to a hidden unit
• The number of hidden units must be chosen carefully
17 / 65
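A minimal sketch of this forward pass (illustrative shapes and names; biases omitted, as in the diagram above):

```python
import numpy as np

def mlp_forward(x, W1, W2):
    """Two-layer perceptron: score = W2 @ tanh(W1 @ x)."""
    h = np.tanh(W1 @ x)   # each row of W1 is one hidden unit
    return W2 @ h         # output scores

rng = np.random.default_rng(0)
d, n_hidden, n_out = 5, 10, 3
W1 = rng.normal(size=(n_hidden, d))
W2 = rng.normal(size=(n_out, n_hidden))
print(mlp_forward(rng.normal(size=d), W1, W2))
```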
Universal Approximator (Cybenko, 1989)
• Any function g: R^d → R can be approximated (on a compact set) by a two-layer neural network
  x → W¹ × • → tanh(•) → W² × • → score
• Note:
  • It does not say how to train it
  • It does not say anything about the generalization capabilities
18 / 65
Training a Neural Network
• Given a network f_W(·) with parameters W, “input” examples x_t and “targets” y_t, we want to minimize a loss
  $$W \mapsto \sum_{(x_t, y_t)} C(f_W(x_t), y_t)$$
• View the network + loss as a “stack” of layers:
  x → f_1(•) → f_2(•) → f_3(•) → f_4(•), i.e. $f(x) = f_L(f_{L-1}(\dots f_1(x)))$
• Optimization problem: use some sort of gradient descent
  $$w_l \leftarrow w_l - \lambda\, \frac{\partial f}{\partial w_l} \quad \forall l$$
• How to compute $\frac{\partial f}{\partial w_l}$ for all $l$?
19 / 65
Gradient Backpropagation (1/2)
• In the neural network field: (Rumelhart et al., 1986)
• However, earlier references exist, including (Leibniz, 1675) and (Newton, 1687)
• E.g., in the Adaline (L = 2):
  x → w¹ × • → ½(y − •)²
  • $f_1(x) = w_1 \cdot x$
  • $f_2(f_1) = \frac{1}{2}(y - f_1)^2$
  • $\frac{\partial f}{\partial w_1} = \underbrace{\frac{\partial f_2}{\partial f_1}}_{= -(y - f_1)} \underbrace{\frac{\partial f_1}{\partial w_1}}_{= x}$
20 / 65
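A small numerical check of this Adaline example (toy values, not from the talk); the chain-rule gradient is compared with a finite-difference estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
w1 = rng.normal(size=4)
x, y = rng.normal(size=4), 1.0

def f(w):
    f1 = np.dot(w, x)           # f1(x) = w1 . x
    return 0.5 * (y - f1) ** 2  # f2(f1) = 1/2 (y - f1)^2

# chain rule: df/dw1 = (df2/df1) * (df1/dw1) = -(y - f1) * x
f1 = np.dot(w1, x)
grad = -(y - f1) * x

# finite-difference estimate of the same gradient
eps = 1e-6
num = np.array([(f(w1 + eps * e) - f(w1 - eps * e)) / (2 * eps)
                for e in np.eye(4)])
print(np.allclose(grad, num))   # True
```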
Gradient Backpropagation (2/2)
x → f_1(•) → f_2(•) → f_3(•) → f_4(•)
• Chain rule:
  $$\frac{\partial f}{\partial w_l} = \frac{\partial f}{\partial f_l}\,\frac{\partial f_l}{\partial w_l} = \frac{\partial f_L}{\partial f_{L-1}}\,\frac{\partial f_{L-1}}{\partial f_{L-2}} \cdots \frac{\partial f_{l+1}}{\partial f_l}\,\frac{\partial f_l}{\partial w_l}$$
• In the backprop way, each module f_l(·):
  • receives the gradient w.r.t. its own output f_l
  • computes the gradient w.r.t. its own input f_{l−1} (backward):
    $$\frac{\partial f}{\partial f_{l-1}} = \frac{\partial f}{\partial f_l}\,\frac{\partial f_l}{\partial f_{l-1}}$$
  • computes the gradient w.r.t. its own parameters w_l (if any):
    $$\frac{\partial f}{\partial w_l} = \frac{\partial f}{\partial f_l}\,\frac{\partial f_l}{\partial w_l}$$
21 / 65
Examples Of Modules
• Notation:
  • x: the input of a module
  • z: the target of a loss module
  • y: the output of a module, y = f_l(x)
  • ỹ: the gradient w.r.t. the output of the module
• Modules (forward; backward, i.e. gradient w.r.t. the input; gradient w.r.t. the parameters):
  • Linear: y = W x; backward Wᵀ ỹ; gradient ỹ xᵀ
  • Tanh: y = tanh(x); backward (1 − y²) ỹ
  • Sigmoid: y = 1 / (1 + e⁻ˣ); backward y (1 − y) ỹ
  • ReLU: y = max(0, x); backward 1_{x ≥ 0} ỹ
  • Perceptron Loss: y = max(0, −z x); backward −z · 1_{z x ≤ 0}
  • MSE Loss: y = ½ (x − z)²; backward x − z
22 / 65
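To make this concrete, here is a toy NumPy sketch (not the actual implementation behind the talk) of a few of these modules and of the forward/backward chaining described on the previous slide:

```python
import numpy as np

class Linear:
    def __init__(self, n_out, n_in, rng):
        self.W = rng.normal(scale=0.1, size=(n_out, n_in))
    def forward(self, x):
        self.x = x
        return self.W @ x                        # y = W x
    def backward(self, grad_y):
        self.grad_W = np.outer(grad_y, self.x)   # gradient w.r.t. W: y~ x^T
        return self.W.T @ grad_y                 # gradient w.r.t. input: W^T y~

class Tanh:
    def forward(self, x):
        self.y = np.tanh(x)
        return self.y
    def backward(self, grad_y):
        return (1.0 - self.y ** 2) * grad_y      # (1 - y^2) y~

class MSELoss:
    def forward(self, x, z):
        self.x, self.z = x, z
        return 0.5 * np.sum((x - z) ** 2)        # 1/2 (x - z)^2
    def backward(self):
        return self.x - self.z                   # x - z

# forward left-to-right, backward right-to-left
rng = np.random.default_rng(0)
layers = [Linear(8, 4, rng), Tanh(), Linear(2, 8, rng)]
loss = MSELoss()

x, z = rng.normal(size=4), np.array([0.5, -0.5])
for layer in layers:
    x = layer.forward(x)
print("loss:", loss.forward(x, z))

grad = loss.backward()
for layer in reversed(layers):
    grad = layer.backward(grad)   # each Linear now holds its grad_W
```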
Typical Classification Loss (er, Likelihood)
• Given a set of examples (x_t, y_t) ∈ R^d × N, t = 1 ... T, we want to maximize the (log-)likelihood
  $$\prod_{t=1}^{T} p(y_t \mid x_t), \qquad \text{i.e.} \quad \sum_{t=1}^{T} \log p(y_t \mid x_t)$$
• The network outputs a score f_y(x) per class y
• Interpret scores as conditional probabilities using a softmax:
  $$p(y \mid x) = \frac{e^{f_y(x)}}{\sum_i e^{f_i(x)}}$$
• In practice we consider only log-probabilities:
  $$\log p(y \mid x) = f_y(x) - \log\Big(\sum_i e^{f_i(x)}\Big)$$
23 / 65
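A short sketch of the log-softmax (the max subtraction is the usual numerical-stability trick, not something the slide mentions):

```python
import numpy as np

def log_softmax(scores):
    """log p(y|x) = f_y(x) - log sum_i exp(f_i(x)), computed stably."""
    m = scores.max()
    return scores - (m + np.log(np.sum(np.exp(scores - m))))

scores = np.array([2.0, -1.0, 0.5])   # f_i(x) for 3 classes
log_p = log_softmax(scores)
print(np.exp(log_p).sum())            # probabilities sum to 1
y = 0                                 # true class
print(-log_p[y])                      # negative log-likelihood to minimize
```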
Optimization Techniques
Minimize
  $$W \mapsto \sum_{(x_t, y_t)} C(f_W(x_t), y_t)$$
• Gradient descent (“batch”):
  $$W \leftarrow W - \lambda \sum_{(x_t, y_t)} \frac{\partial C(f_W(x_t), y_t)}{\partial W}$$
• Stochastic gradient descent:
  $$W \leftarrow W - \lambda\, \frac{\partial C(f_W(x_t), y_t)}{\partial W}$$
• Many variants, including second order techniques (where the Hessian is approximated)
24 / 65
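A schematic comparison of the two update rules on a hypothetical least-squares toy problem, where C(f_W(x), y) = ½(W·x − y)² (the batch gradient is averaged here for a stable step size, whereas the slide writes a plain sum):

```python
import numpy as np

def grad_C(W, x_t, y_t):
    # gradient of C(f_W(x_t), y_t) = 1/2 (W . x_t - y_t)^2 w.r.t. W
    return (np.dot(W, x_t) - y_t) * x_t

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=100)
lam = 0.1

# "batch" gradient descent: one update uses all examples
W = np.zeros(3)
for _ in range(100):
    W -= lam * sum(grad_C(W, x_t, y_t) for x_t, y_t in zip(X, y)) / len(X)

# stochastic gradient descent: one update per example
W_sgd = np.zeros(3)
for _ in range(10):
    for x_t, y_t in zip(X, y):
        W_sgd -= lam * grad_C(W_sgd, x_t, y_t)

print(W, W_sgd)   # both approach the true weights [1, -2, 0.5]
```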
Going Deeper 25 / 65
Deeper: What is the Point? (1/3)
x → f_1(•) → f_2(•) → f_3(•) → f_4(•)
• Share features across the “deep” hierarchy
• Compose these features
• Efficiency: intermediate computations are re-used
[ 0 0 1 0 0 0 0 1 0 0 1 0 1 0 0 . . . ] truck feature
26 / 65
Deeper: What is the Point? (2/3) Sharing [ 1 0 0 0 0 1 0 1 0 0 0 0 0 1 0 . . . ] motorbike [ 0 0 1 0 0 0 0 1 0 0 1 0 1 0 0 . . . ] truck 27 / 65
Deeper: What is the Point? (3/3) Composing (Lee et al., 2009) 28 / 65
Deeper: What are the Issues? (1/2)
Vanishing Gradients
• Chain rule:
  $$\frac{\partial f}{\partial w_l} = \frac{\partial f_L}{\partial f_{L-1}}\,\frac{\partial f_{L-1}}{\partial f_{L-2}} \cdots \frac{\partial f_{l+1}}{\partial f_l}\,\frac{\partial f_l}{\partial w_l}$$
• Because of the transfer-function non-linearities, some $\frac{\partial f_{l+1}}{\partial f_l}$ will be very small, or zero, when back-propagating
• E.g. with ReLU: $y = \max(0, x)$, so $\frac{\partial y}{\partial x} = 1_{x \ge 0}$
29 / 65
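A tiny numerical illustration (toy weights, not from the slides) of how these factors shrink the gradient in a deep stack of saturating tanh units:

```python
import numpy as np

# a chain of tanh layers with a fixed weight of 3: once the units saturate,
# each layer multiplies the gradient by 3 * (1 - y^2), which is close to 0
x = 0.1
grad = 1.0
for layer in range(20):
    y = np.tanh(3.0 * x)
    grad *= 3.0 * (1.0 - y ** 2)   # chain-rule factor contributed by this layer
    x = y
print(grad)   # shrinks towards 0 as depth grows
```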
Deeper: What are the Issues? (2/2)
Number of Parameters
• A 200 × 200 image fully connected to 1000 hidden units already leads to 40M parameters (200 × 200 × 1000 = 4 × 10⁷ weights)
• We would need a lot of training examples
• Spatial correlation is local anyway
30 / 65
Fix Vanishing Gradient Issue with Unsupervised Training (1/2)
• Leverage unlabeled data (when there is no y)?
• Popular way to pretrain each layer
• “Auto-encoder/bottleneck” network:
  x → W¹ × • → tanh(•) → W² × • → tanh(•) → W³ × •
• Learn to reconstruct the input: minimize ||f(x) − x||²
• Caveats:
  • Reduces to PCA if there is no W² layer (Bourlard & Kamp, 1988)
  • The projected intermediate space must be of lower dimension
31 / 65
Fix Vanishing Gradient Issue with Unsupervised Training (2/2)
  x → W¹ × • → tanh(•) → W² × • → tanh(•) → W³ × •
• Possible improvements:
  • No W² layer, W³ = (W¹)ᵀ (Bengio et al., 2006)
  • Noise injection in x, reconstruct the true x (Bengio et al., 2008)
  • Impose sparsity constraints on the projection (Kavukcuoglu et al., 2008)
32 / 65
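Below is a minimal sketch of such an auto-encoder trained with the reconstruction loss ||f(x) − x||² (a simplified two-layer, untied-weight version of the diagrams above; sizes and learning rate are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 10, 3                                   # input dim, bottleneck dim (k < d)
X = rng.normal(size=(500, d)) @ rng.normal(size=(d, d)) * 0.3   # correlated toy data
W1 = rng.normal(scale=0.1, size=(k, d))        # encoder
W2 = rng.normal(scale=0.1, size=(d, k))        # decoder
lr = 0.01

for epoch in range(50):
    total = 0.0
    for x in X:
        h = np.tanh(W1 @ x)                    # low-dimensional code
        r = W2 @ h                             # reconstruction f(x)
        e = r - x
        total += np.sum(e ** 2)                # ||f(x) - x||^2
        # backpropagate the reconstruction error through both layers
        gW2 = 2 * np.outer(e, h)
        gh = W2.T @ (2 * e)
        gW1 = np.outer((1 - h ** 2) * gh, x)
        W2 -= lr * gW2
        W1 -= lr * gW1

print("mean squared reconstruction error:", total / len(X))
```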
Fix Number of Parameters Issue by Generating Examples (1/2)
• Capacity h is too large? Find more training examples!
33 / 65
Fix Number of Parameters Issue by Generating Examples (2/2)
• Concrete example: digit recognition
• Add an (infinite) number of random deformations (Simard et al., 2003)
• State-of-the-art with 9 layers with 1000 hidden units and... a GPU (Ciresan et al., 2010)
• In general, data augmentation includes:
  • random translation or rotation
  • random left/right flipping
  • random scaling
34 / 65
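A rough sketch of such random deformations using only NumPy (names and parameters are made up for illustration; note that left/right flipping would not be appropriate for digits, only for natural images):

```python
import numpy as np

def augment(img, rng, max_shift=2, scale_range=(0.9, 1.1), flip=True):
    """Random translation, optional horizontal flip, and nearest-neighbour rescale."""
    h, w = img.shape
    # random translation
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    out = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
    # random left/right flip
    if flip and rng.random() < 0.5:
        out = out[:, ::-1]
    # random scaling, resampled back to the original size
    s = rng.uniform(*scale_range)
    ys = np.clip((np.arange(h) / s).astype(int), 0, h - 1)
    xs = np.clip((np.arange(w) / s).astype(int), 0, w - 1)
    return out[np.ix_(ys, xs)]

rng = np.random.default_rng(0)
image = rng.random((28, 28))                  # stand-in for a training image
print(augment(image, rng, flip=False).shape)  # (28, 28), a new "deformed" example
```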
Convolutional Neural Networks 35 / 65
2D Convolutions (1/4)
• Share parameters across different locations (Fukushima, 1980) (LeCun, 1987)
36 / 65
2D Convolutions (2/4)
• It is like applying a filter to the image...
• ...but the filter is trained
39 / 65
2D Convolutions (3/4)
• It is again a matrix-vector operation, but where the weights are spatially “shared”:
  x → W¹ × • → W² × • → W³ × •
• As for normal linear layers, convolutions can be stacked for higher-level representations
40 / 65
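A naive sketch of such a shared-weight operation (really a cross-correlation, as in most deep learning libraries; the 3×3 filter here is fixed for illustration, whereas a ConvNet learns its filters):

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2D convolution: the same weights are applied at every location."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.default_rng(0).random((8, 8))
edge_filter = np.array([[-1., 0., 1.],
                        [-2., 0., 2.],
                        [-1., 0., 1.]])
print(conv2d(image, edge_filter).shape)   # (6, 6) feature map
```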
2D Convolutions (4/4) 41 / 65
Spatial Pooling (1/2)
• “Pooling” (e.g. with a max() operation) increases robustness w.r.t. spatial location
42 / 65
Spatial Pooling (2/2)
Controls the capacity
• A unit will see “more” of the image, for the same number of parameters
• Adding pooling decreases the size of subsequent fully connected layers!
43 / 65
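A minimal sketch of non-overlapping max pooling (pool size and input are illustrative):

```python
import numpy as np

def max_pool(fmap, size=2):
    """Keep the maximum over each non-overlapping size x size region."""
    H, W = fmap.shape
    H, W = H - H % size, W - W % size                 # drop incomplete borders
    blocks = fmap[:H, :W].reshape(H // size, size, W // size, size)
    return blocks.max(axis=(1, 3))

fmap = np.arange(36, dtype=float).reshape(6, 6)
print(max_pool(fmap))   # 3 x 3 output: small shifts of the input barely change it
```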