Deep Learning for Mobile: Part I. Instructor: Simon Lucey. 16-623 - Designing Computer Vision Apps
Today • Single-Layer Perceptron • Multi-Layer Perceptron • Convolutional Neural Network
Linear Binary Classification ("Perceptron", also called a "Linear Discriminant")
The input image is flattened into a raw pixel vector $x = [65, 09, 67, \ldots, 78, 66, 76, 215]^T \in \mathbb{R}^D$. The classifier assigns a class from the sign of a linear score:
$w^T x + w_0 \ge 0 \Rightarrow x \in C_1$, otherwise ($w^T x + w_0 < 0$) $x \in C_2$.
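As a concrete illustration (not from the slides), here is a minimal NumPy sketch of this decision rule; the pixel values and weights below are hypothetical:

```python
import numpy as np

def classify_linear(x, w, w0):
    """Assign class C1 if w^T x + w0 >= 0, else class C2."""
    score = np.dot(w, x) + w0
    return "C1" if score >= 0 else "C2"

# Hypothetical example: a short flattened pixel vector and arbitrary weights.
x = np.array([65, 9, 67, 78, 66, 76, 215], dtype=float)
w = np.random.randn(x.size)
w0 = -0.5
print(classify_linear(x, w, w0))
```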
Why Linear?
• Linear discriminant functions are attractive because the number of required training samples grows only linearly with the dimensionality D of the input.
[Plot: number of samples versus dimensionality (D).]
Perceptron
• Rosenblatt simulated the perceptron on an IBM 704 computer at Cornell in 1957.
• The input scene (i.e. a printed character) was illuminated by powerful lights and captured on a 20x20 array of cadmium sulphide photocells.
• The weights of the perceptron were applied using variable rotary resistors.
• Often referred to as the very first neural network.
"Frank Rosenblatt"
Perceptron
Linear Discriminant Functions
[Figure: geometry of a linear discriminant in two dimensions $(x_1, x_2)$. The decision surface $y = 0$ separates region $R_1$ ($y > 0$, class $C_1$) from region $R_2$ ($y < 0$, class $C_2$). The weight vector $w$ is orthogonal to the decision surface; the signed distance of a point $x$ from the surface is $y(x)/\|w\|$, the surface's distance from the origin is $-w_0/\|w\|$, and $x_\perp$ denotes the orthogonal projection of $x$ onto the surface.]
Linear Binary Classification
The bias can be absorbed into the weight vector by augmenting the input with a constant 1:
$\begin{bmatrix} w \\ w_0 \end{bmatrix}^T \begin{bmatrix} x \\ 1 \end{bmatrix} \ge 0 \Rightarrow x \in C_1$, otherwise $x \in C_2$.
With this augmentation understood, the rule for the raw pixel vector $x \in \mathbb{R}^D$ is written simply as
$w^T x \ge 0 \Rightarrow x \in C_1$, otherwise $x \in C_2$.
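A small sketch (my own illustration, not from the slides) of how the bias is folded into the weight vector; the dimensions and values are made up:

```python
import numpy as np

def augment(x):
    """Append a constant 1 so the bias w0 becomes part of the weight vector."""
    return np.concatenate((x, [1.0]))

# With w_aug = [w; w0], the rule w_aug^T [x; 1] >= 0 matches w^T x + w0 >= 0.
x = np.random.randn(5)                 # hypothetical D = 5 input
w, w0 = np.random.randn(5), 0.3
w_aug = np.concatenate((w, [w0]))
assert np.isclose(w_aug @ augment(x), w @ x + w0)
```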
Perceptron Linear Discriminant
$t_i \in \{+1, -1\}$ = binary labels, $x_i$ = $i$-th training example, $w$ = weight vector.
The perceptron criterion penalizes only misclassified samples:
$\arg\min_w \sum_{n=1}^{N} \max(0, -t_n \cdot x_n^T w)$
More generally, the perceptron loss can be replaced by a generic per-sample objective $E(\cdot)$:
$\arg\min_w \sum_{n=1}^{N} E(t_n \cdot x_n^T w)$
Adding an $\ell_2$ regularizer encourages a large margin (margin $\propto (w^T w)^{-1}$):
$\arg\min_w \sum_{n=1}^{N} E(t_n \cdot x_n^T w) + \frac{\lambda}{2} \|w\|_2^2$
Other Objectives
• Other objectives $E(z)$ are possible:
  least-squares: $E(z) = \|z - 1\|_2^2$
  hinge: $E(z) = \max(0, 1 - z)$
  sigmoid: $E(z) = \dfrac{1}{1 + \exp(-z)}$
[Plot: the objectives $E(z)$ for $z \in [-2, 2]$.]
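A minimal sketch of these per-sample objectives as functions of the score $z = t_n x_n^T w$ (my own code, not from the slides; the sigmoid form is written exactly as on the slide):

```python
import numpy as np

def least_squares(z):
    return (z - 1.0) ** 2

def hinge(z):
    return np.maximum(0.0, 1.0 - z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Evaluate each objective over the same range shown in the plot.
z = np.linspace(-2, 2, 5)
for E in (least_squares, hinge, sigmoid):
    print(E.__name__, E(z))
```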
Optimizing Weights
• Expressing the final objective as
$f(w) = \sum_{n=1}^{N} E(t_n \cdot x_n^T w) + \frac{\lambda}{2} \|w\|_2^2$
• The simplest strategy is to employ gradient-descent optimization,
$w \leftarrow w - \eta \frac{\partial f(w)}{\partial w}$
where $\eta$ is the "learning rate". A code sketch of this update follows.
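A minimal sketch of this gradient-descent loop, assuming the hinge loss from the previous slide; the data `X`, labels `t`, and hyper-parameters `eta`, `lam` are all hypothetical:

```python
import numpy as np

def grad_f(w, X, t, lam):
    """Gradient of sum_n max(0, 1 - t_n x_n^T w) + (lam/2) ||w||^2."""
    z = t * (X @ w)                          # scores t_n * x_n^T w
    active = (z < 1.0).astype(float)         # samples inside the margin
    grad_loss = -(X * (active * t)[:, None]).sum(axis=0)
    return grad_loss + lam * w

X = np.random.randn(100, 10)                 # hypothetical data (N x D)
t = np.sign(np.random.randn(100))            # labels in {-1, +1}
w, eta, lam = np.zeros(10), 0.01, 0.1

for _ in range(200):                         # plain full-batch gradient descent
    w -= eta * grad_f(w, X, t, lam)
```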
Gradient-Descent Optimization
• Works for any function whose gradient can be estimated.
• Guaranteed to converge towards a local minimum.
• Scales well to extremely large amounts of data.
• Notoriously slow (linear convergence).
• There is often guesswork associated with tuning the learning rate.
Optimizing Weights
Written element-wise, the update adjusts each weight along its own partial derivative:
$\begin{bmatrix} w_1 \\ \vdots \\ w_K \end{bmatrix} \leftarrow \begin{bmatrix} w_1 \\ \vdots \\ w_K \end{bmatrix} - \eta \begin{bmatrix} \partial f(w)/\partial w_1 \\ \vdots \\ \partial f(w)/\partial w_K \end{bmatrix}$
Optimizing Weights - Per Sample
• The objective is nearly always a summation over $N$ samples,
$f(w) = \sum_{n=1}^{N} f_n(w)$
• So one can update the weights per sample,
$w \rightarrow w - \eta N \frac{\partial f_n(w)}{\partial w}$
where $\eta$ is the "learning rate".
Single Layer - Example
$f_n(w) = \frac{1}{2} \|1 - t_n \cdot x_n^T w\|_2^2 + \frac{\lambda}{2N} \|w\|_2^2$
$\frac{\partial f_n(w)}{\partial w} = (x_n^T w - t_n)\, x_n + \frac{\lambda}{N} w$
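A minimal sketch of the corresponding per-sample (stochastic) update using this gradient; the dataset and hyper-parameters are made up for illustration:

```python
import numpy as np

def sgd_step(w, x_n, t_n, eta, lam, N):
    """One per-sample update using grad = (x_n^T w - t_n) x_n + (lam/N) w."""
    grad = (x_n @ w - t_n) * x_n + (lam / N) * w
    return w - eta * N * grad                # scale by N as in the per-sample rule

# Hypothetical training loop over a small random dataset.
N, D = 100, 10
X = np.random.randn(N, D)
t = np.sign(np.random.randn(N))
w, eta, lam = np.zeros(D), 1e-3, 0.1

for epoch in range(10):
    for n in np.random.permutation(N):
        w = sgd_step(w, X[n], t[n], eta, lam, N)
```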
Today • Single-Layer Perceptron • Multi-Layer Perceptron • Convolutional Neural Network
Shallow Networks
• Theorem: Gaussian kernel machines need at least k examples to learn a function that has 2k zero-crossings along some line.
• Theorem: For a Gaussian kernel machine to learn some maximally varying functions over d inputs requires O(2^d) examples.
Y. Bengio, O. Delalleau, and N. Le Roux, "The Curse of Highly Variable Functions for Local Kernel Machines", NIPS 2006.
Hierarchical Learning
[Figure: the ventral visual stream (V1 to V2/V4 to IT), progressing from simple cells to complex cells to view-tuned cells. Illustration: Bob Crimi]
Hierarchical Learning (Lee, Grosse, Ranganath & Ng, ICML 2009)
Successive model layers learn deeper intermediate representations.
[Figure: learned feature hierarchy from Layer 1 to Layer 3; parts combine to form objects, up to high-level representations.]
Prior: underlying factors and concepts are compactly expressed with multiple levels of abstraction.
Why Deep?
• A deep network can be considered an MLP with several or more hidden layers.
• Deeper nets are exponentially more expressive than shallow ones.
[Figure: a shallow network versus a deep network.]
Montufar, Guido F., et al. "On the number of linear regions of deep neural networks." NIPS 2014.
Shallow Computer Program
[Figure: a "shallow" program. main calls subroutine1 and subroutine2, each of which inlines the code of the lower-level routines (subsub1, subsub2, subsub3, subsubsub1, subsubsub3, ...) rather than reusing it.]
Deep Computer Program
[Figure: a "deep" program. main calls sub1, sub2, sub3, which call subsub1, subsub2, subsub3, which in turn call subsubsub1, subsubsub2, subsubsub3, so routines are shared and reused across levels.]
Multi-Layer Perceptron
The input $x$ is first mapped through the first-layer weights $W^{(1)}$ ($M \times D$) to give $W^{(1)} x$.
[Plot: the element-wise non-linearity $h(x)$, a sigmoidal curve with outputs in $[-1, 1]$ for $x \in [-4, 4]$.]
Applying the non-linearity element-wise gives the hidden activations $z = h(W^{(1)} x)$.
A second layer of weights $w^{(2)}$ ($1 \times M$) then produces the decision:
$[w^{(2)}]^T z \ge 0 \Rightarrow x \in C_1$, otherwise $x \in C_2$.
[Figure: network diagram of a two-layer MLP. Inputs $x_0, x_1, \ldots, x_D$ feed hidden units $z_0, z_1, \ldots, z_M$ through first-layer weights $w^{(1)}_{MD}$; the hidden units feed outputs $y_1, \ldots, y_K$ through second-layer weights $w^{(2)}_{KM}$ (including the bias weight $w^{(2)}_{10}$).]
Layer 1 - MLP
$z = \begin{bmatrix} z_1 \\ \vdots \\ z_M \end{bmatrix} \leftarrow \begin{bmatrix} h[x^T w^{(1)}_1] \\ \vdots \\ h[x^T w^{(1)}_M] \end{bmatrix}$
where $h(\cdot)$ = non-linear function, $[w^{(1)}_1, \ldots, w^{(1)}_M]$ = the 1st layer's $D \times M$ weights, and $x$ = the $D \times 1$ raw input.
Layer 2 - MLP
The raw pixel vector $x = [65, 09, 67, \ldots, 78, 66, 76, 215]^T \in \mathbb{R}^D$ is mapped by layer 1 to $z \in \mathbb{R}^M$, and the second layer classifies it:
$z^T w^{(2)} \ge 0 \Rightarrow z \in C_1$, otherwise $z \in C_2$,
where $z$ = the $M \times 1$ output of layer 1 and $w^{(2)}$ = the 2nd layer's $M \times 1$ weight vector.
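A minimal NumPy sketch of this two-layer forward pass (my own illustration; `tanh` stands in for the unspecified non-linearity $h$, and all sizes and weights are hypothetical):

```python
import numpy as np

def mlp_forward(x, W1, w2):
    """Two-layer MLP: z = h(W1 x), then classify by the sign of z^T w2."""
    z = np.tanh(W1 @ x)                      # layer 1: M hidden activations
    score = z @ w2                           # layer 2: scalar score
    return "C1" if score >= 0 else "C2"

D, M = 400, 32                               # e.g. a flattened 20x20 input, 32 hidden units
W1 = 0.01 * np.random.randn(M, D)            # first-layer weights (M x D)
w2 = 0.01 * np.random.randn(M)               # second-layer weights (M x 1)
x = np.random.rand(D)                        # hypothetical raw input
print(mlp_forward(x, W1, w2))
```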
Obvious Questions?
• How many layers?
• Is the solution globally optimal?
• What non-linearity should you use?
• What learning rate?
• How should I estimate my gradients?
How Deep?
• Recent work has suggested that network depth is crucial for good performance (e.g. on ImageNet).
• Counter-intuitively, naively trained deeper networks tend to have higher training error than shallow networks.
• The innovation of residual learning has greatly helped with this.
[Figure 2, He et al.: a residual learning building block. The input x passes through a weight layer, a ReLU, and a second weight layer to produce F(x); an identity shortcut adds x back, giving F(x) + x, followed by a ReLU.]
He, Kaiming, et al. "Deep residual learning for image recognition." arXiv preprint arXiv:1512.03385 (2015).
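A minimal sketch of the residual building block described in Figure 2, assuming two hypothetical weight matrices and ReLU non-linearities (a simplified fully-connected stand-in, not the paper's convolutional implementation):

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def residual_block(x, W1, W2):
    """Compute relu(F(x) + x), where F(x) = W2 relu(W1 x)."""
    F = W2 @ relu(W1 @ x)                    # residual branch
    return relu(F + x)                       # identity shortcut added back

D = 64                                       # hypothetical feature dimension
W1 = 0.1 * np.random.randn(D, D)
W2 = 0.1 * np.random.randn(D, D)
x = np.random.randn(D)
y = residual_block(x, W1, W2)
```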
How Deep?
[Plot: error (%) versus training iteration (x 1e4) for ResNet-20, ResNet-32, ResNet-44, ResNet-56 and ResNet-110; dashed lines denote training error and bold lines denote testing error. The 110-layer network reaches lower error than the 20-layer network.]
He, Kaiming, et al. "Deep residual learning for image recognition." arXiv preprint arXiv:1512.03385 (2015).
Obvious Questions?
• How many layers?
• Is the solution globally optimal?
• What non-linearity should you use?
• What learning rate?
• How should I estimate my gradients?