

  1. Deep Learning for Mobile, Part I. Instructor: Simon Lucey. 16-623: Designing Computer Vision Apps.

  2. Today • Single-Layer Perceptron • Multi-Layer Perceptron • Convolutional Neural Network

  3. Linear Binary Classification: x = [65, 09, 67, ..., 78, 66, 76, 215]^T ∈ R^D. Decision rule: w^T x + w_0 ≥ 0 ⇒ x ∈ C_1, otherwise x ∈ C_2.

  4. Linear Binary Classification ("Perceptron"): x = [65, 09, 67, ..., 78, 66, 76, 215]^T ∈ R^D. Decision rule: w^T x + w_0 ≥ 0 ⇒ x ∈ C_1, otherwise x ∈ C_2.

  5. Linear Binary Classification ("Linear Discriminant"): x = [65, 09, 67, ..., 78, 66, 76, 215]^T ∈ R^D. Decision rule: w^T x + w_0 ≥ 0 ⇒ x ∈ C_1, otherwise x ∈ C_2.
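
A minimal NumPy sketch of this decision rule; the weights, bias, and pixel values below are placeholders, not taken from the slides:

```python
import numpy as np

def classify(x, w, w0):
    """Return class 1 if w^T x + w0 >= 0, otherwise class 2."""
    return 1 if w @ x + w0 >= 0 else 2

# Toy example: a flattened D-dimensional input with made-up values.
x = np.array([65, 9, 67, 12, 78, 66, 76, 215], dtype=float)
w = np.random.randn(x.size)   # weight vector (would normally be learned)
w0 = 0.0                      # bias term
print("class:", classify(x, w, w0))
```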

  6. Why Linear? • Linear discriminant functions are useful here because the number of training samples required grows only linearly with the dimensionality D. [Plot: number of samples vs. dimensionality D.]

  8. Perceptron • Rosenblatt simulated the perceptron on an IBM 704 computer at Cornell in 1957. • The input scene (i.e. a printed character) was illuminated by powerful lights and captured on a 20x20 array of cadmium sulphide photocells. • The weights of the perceptron were applied using variable rotary resistors. • Often referred to as the very first neural network. [Photo: Frank Rosenblatt]

  9. Perceptron

  10. Linear Discriminant Functions. [Figure: geometry of a 2D linear discriminant y(x) = w^T x + w_0. The decision surface y = 0 separates region R_1 (y > 0, class C_1) from R_2 (y < 0, class C_2); w is normal to the surface, the signed distance of a point x from it is y(x)/∥w∥, and the surface lies a distance −w_0/∥w∥ from the origin.]

  11. Linear Binary Classification: x = [65, 09, 67, ..., 78, 66, 76, 215]^T ∈ R^D. Absorbing the bias into the weights, [w; w_0]^T [x; 1] ≥ 0 ⇒ x ∈ C_1, otherwise x ∈ C_2.

  12. Linear Binary Classification: x = [65, 09, 67, ..., 78, 66, 76, 215]^T ∈ R^D. With the augmented vectors above, the rule simplifies to w^T x ≥ 0 ⇒ x ∈ C_1, otherwise x ∈ C_2.
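
A short sketch of the bias-absorption trick: appending a constant 1 to x lets a single augmented weight vector stand in for the separate w and w_0 (all values are illustrative):

```python
import numpy as np

x = np.array([65, 9, 67, 12, 78, 66, 76, 215], dtype=float)
w = np.random.randn(x.size)   # placeholder weights
w0 = 0.5                      # placeholder bias

x_aug = np.append(x, 1.0)     # augmented input   [x; 1]
w_aug = np.append(w, w0)      # augmented weights [w; w0]

# The augmented inner product reproduces w^T x + w0 exactly.
assert np.isclose(w @ x + w0, w_aug @ x_aug)
label = 1 if w_aug @ x_aug >= 0 else 2
```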

  13. Perceptron Linear Discriminant. Binary labels t_i ∈ {+1, −1}; x_i = i-th training example; w = weight vector. Objective: arg min_w Σ_{n=1}^N max(0, −t_n · x_n^T w).
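
A sketch of evaluating this perceptron criterion on toy data, assuming the usual sign convention in which only misclassified samples (t_n · x_n^T w < 0) contribute to the sum:

```python
import numpy as np

def perceptron_criterion(w, X, t):
    """Sum over samples of max(0, -t_n * x_n^T w): zero for correctly
    classified points, positive for misclassified ones."""
    margins = t * (X @ w)              # t_n * x_n^T w for every sample
    return np.maximum(0.0, -margins).sum()

# Toy data: N samples of dimension D with labels in {-1, +1}.
N, D = 100, 8
X = np.random.randn(N, D)
t = np.sign(np.random.randn(N))
w = np.random.randn(D)
print(perceptron_criterion(w, X, t))
```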

  15. Perceptron Linear Discriminant. Binary labels t_i ∈ {+1, −1}; x_i = i-th training example; w = weight vector. Generalized objective: arg min_w Σ_{n=1}^N E(t_n · x_n^T w).

  16. Perceptron Linear Discriminant. The margin is ∝ (w^T w)^{−1}, motivating the regularized objective: arg min_w Σ_{n=1}^N E(t_n · x_n^T w) + (λ/2)∥w∥_2^2.

  17. Other Objectives • Other objectives E(z) are possible: least-squares: ∥z − 1∥_2^2; hinge: max(0, 1 − z); sigmoid: 1/(1 + exp(−z)). [Plot: the three E(z) curves over z ∈ [−2, 2].]
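
A sketch of the three objectives listed above, written as plain functions of the margin z = t_n · x_n^T w:

```python
import numpy as np

def least_squares(z):
    return (z - 1.0) ** 2              # ||z - 1||_2^2 for a scalar margin z

def hinge(z):
    return np.maximum(0.0, 1.0 - z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Evaluate each objective over a small range of margins.
z = np.linspace(-2.0, 2.0, 5)
for name, E in [("least-squares", least_squares), ("hinge", hinge), ("sigmoid", sigmoid)]:
    print(f"{name:13s}", E(z))
```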

  18. Optimizing Weights • Expressing the final objective as f(w) = Σ_{n=1}^N E(t_n · x_n^T w) + (λ/2)∥w∥_2^2. • The simplest strategy is to employ gradient-descent optimization: w → w − η ∂f(w)/∂w.

  19. Optimizing Weights • Expressing the final objective as f(w) = Σ_{n=1}^N E(t_n · x_n^T w) + (λ/2)∥w∥_2^2. • The simplest strategy is to employ gradient-descent optimization: w → w − η ∂f(w)/∂w, where η is the "Learning Rate".
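
A gradient-descent sketch for this objective, using the hinge form of E(z) from slide 17; the data, learning rate η, and regularizer λ below are placeholder choices:

```python
import numpy as np

def grad_f(w, X, t, lam):
    """(Sub)gradient of f(w) = sum_n max(0, 1 - t_n x_n^T w) + (lam/2)*||w||_2^2."""
    margins = t * (X @ w)
    active = margins < 1.0                              # samples inside the margin
    grad_data = -(t[active, None] * X[active]).sum(axis=0)
    return grad_data + lam * w

# Toy data and placeholder hyperparameters.
N, D = 200, 8
X, t = np.random.randn(N, D), np.sign(np.random.randn(N))
w, eta, lam = np.zeros(D), 0.001, 0.1

for _ in range(100):
    w -= eta * grad_f(w, X, t, lam)                     # w <- w - eta * df(w)/dw
```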

  20. Gradient-Descent Optimization • Works for any function whose gradient can be estimated. • Guaranteed to converge towards a local minimum. • Scales well to extremely large amounts of data. • Notoriously slow (linear convergence). • Tuning the learning rate often involves guesswork.

  22. Optimizing Weights • Written out per component: [w_1; ...; w_K] ← [w_1; ...; w_K] − η [∂f(w)/∂w_1; ...; ∂f(w)/∂w_K].

  24. Optimizing Weights - Per Sample • The objective is nearly always a summation over N samples: f(w) = Σ_{n=1}^N f_n(w). • So one can update the weights per sample: w → w − η N ∂f_n(w)/∂w, where η is the "Learning Rate".

  25. Single Layer - Example: f_n(w) = (1/2)∥1 − t_n · x_n^T w∥_2^2 + (λ/(2N))∥w∥_2^2.

  26. Single Layer - Example: f_n(w) = (1/2)∥1 − t_n · x_n^T w∥_2^2 + (λ/(2N))∥w∥_2^2, with gradient ∂f_n(w)/∂w = (x_n^T w − t_n) x_n + (λ/N) w.
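
A per-sample (stochastic) update sketch implementing the f_n and gradient above; the data and hyperparameters are made up for illustration:

```python
import numpy as np

def grad_fn(w, x_n, t_n, lam, N):
    """Gradient of f_n(w) = 0.5*(1 - t_n x_n^T w)^2 + (lam/(2N))*||w||_2^2,
    which simplifies to (x_n^T w - t_n) x_n + (lam/N) w  (using t_n^2 = 1)."""
    return (x_n @ w - t_n) * x_n + (lam / N) * w

# Toy data and placeholder hyperparameters.
N, D = 200, 8
X, t = np.random.randn(N, D), np.sign(np.random.randn(N))
w, eta, lam = np.zeros(D), 1e-4, 0.1

for epoch in range(10):
    for n in np.random.permutation(N):                  # visit samples in random order
        w -= eta * N * grad_fn(w, X[n], t[n], lam, N)   # w <- w - eta*N * d f_n(w)/d w
```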

  27. Today • Single-Layer Perceptron • Multi-Layer Perceptron • Convolutional Neural Network

  28. Shallow Networks • Theorem: Gaussian kernel machines need at least k examples to learn a function that has 2k zero-crossings along some line. • Theorem: For a Gaussian kernel machine to learn some maximally varying functions over d inputs requires O(2^d) examples. Y. Bengio, O. Delalleau, and N. Le Roux, "The Curse of Highly Variable Functions for Local Kernel Machines", NIPS 2006.

  29. Hierarchical Learning. [Figure: hierarchy from simple cells to complex cells to view-tuned cells. Illustration: Bob Crimi.]

  30. Hierarchical Learning. [Figure: the ventral visual stream (V1 → V2/V4 → IT), with simple cells, complex cells, and view-tuned cells. Illustration: Bob Crimi.]

  31. Hierarchical Learning (Lee, Grosse, Ranganath & Ng, ICML 2009). Successive model layers learn deeper intermediate representations. [Figure: Layer 1; Layer 2: parts combine to form objects; Layer 3: high-level linguistic representations.] Prior: underlying factors and concepts are compactly expressed with multiple levels of abstraction.

  32. Why Deep? • A deep network can be considered an MLP with several or more hidden layers. • Deeper nets are exponentially more expressive than shallow ones. [Figure: shallow network vs. deep network.] Montufar, Guido F., et al. "On the number of linear regions of deep neural networks." NIPS 2014.

  33. Shallow Computer Program. [Diagram: main calls subroutine1 and subroutine2, each of which includes copies of the code of lower-level routines (subsub1, subsub2, subsub3, subsubsub1, subsubsub3, ...), so code is duplicated rather than reused.]

  34. Deep Computer Program. [Diagram: main calls sub1, sub2, sub3; these call subsub1, subsub2, subsub3; which in turn call subsubsub1, subsubsub2, subsubsub3, so routines are shared and reused across the hierarchy.]

  35. Multi-Layer Perceptron

  36. Multi-Layer Perceptron: the first layer computes W^(1) x, where W^(1) is an (M × D) weight matrix.

  37. Multi-Layer Perceptron: the hidden activation is h(W^(1) x), where W^(1) is (M × D). [Plot: the nonlinearity h(x), ranging from −1 to 1 over x ∈ [−4, 4].]

  39. Multi-Layer Perceptron: the output is [w^(2)]^T z with z = h(W^(1) x), where W^(1) is (M × D) and [w^(2)]^T is (1 × M). Decision rule: [w^(2)]^T z ≥ 0 ⇒ x ∈ C_1, otherwise x ∈ C_2.

  40. Multi-Layer Perceptron. [Figure: two-layer network diagram. Inputs x_0, x_1, ..., x_D connect to hidden units z_0, z_1, ..., z_M via first-layer weights w^(1)_MD; hidden units connect to outputs y_1, ..., y_K via second-layer weights w^(2)_KM; x_0 and z_0 are bias units, with w^(2)_10 a bias weight.]

  41. Layer 1 - MLP: z = [z_1; ...; z_M] ← [h(x^T w^(1)_1); ...; h(x^T w^(1)_M)], where h() is a non-linear function, [w^(1)_1, ..., w^(1)_M] are the 1st layer's D × M weights, and x is the D × 1 raw input.

  42. Layer 2 - MLP: x = [65, 09, 67, ..., 78, 66, 76, 215]^T ∈ R^D maps to z ∈ R^M. Decision rule: z^T w^(2) ≥ 0 ⇒ z ∈ C_1, otherwise z ∈ C_2, where z is the M × 1 output of layer 1 and w^(2) is the 2nd layer's M × 1 weight vector.
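
Putting the two layers together, here is a forward-pass sketch of the classifier described in slides 41 and 42; tanh is assumed for h, and all weights are random placeholders:

```python
import numpy as np

D, M = 8, 16                            # input and hidden dimensions
W1 = np.random.randn(M, D)              # layer 1: (M x D) weight matrix W^(1)
w2 = np.random.randn(M)                 # layer 2: (M x 1) weight vector w^(2)

def mlp_classify(x):
    z = np.tanh(W1 @ x)                 # layer 1: z = h(W^(1) x), h assumed to be tanh
    score = w2 @ z                      # layer 2: z^T w^(2)
    return 1 if score >= 0 else 2       # C1 if the score is >= 0, otherwise C2

x = np.random.randn(D)                  # stand-in for a flattened input vector
print("class:", mlp_classify(x))
```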

  43. Obvious Questions? • How many layers? • Is the solution globally optimal? • What non-linearity should you use? • What learning rate? • How should I estimate my gradients?

  45. How Deep? • Recent work has suggested that network depth is crucial for good performance (e.g. on ImageNet). • Counterintuitively, naively trained deeper networks tend to have higher training error than shallow networks. • The innovation of residual learning has greatly helped with this. [Figure 2: Residual learning: a building block. The input x passes through a weight layer, a ReLU, and a second weight layer to produce F(x); the identity shortcut adds x, giving F(x) + x, followed by a final ReLU.] He, Kaiming, et al. "Deep residual learning for image recognition." arXiv preprint arXiv:1512.03385 (2015).
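
A NumPy sketch of the residual building block in the figure; the fully connected weight matrices W_a and W_b here stand in for the convolutional layers used in the actual ResNet:

```python
import numpy as np

def relu(v):
    return np.maximum(0.0, v)

def residual_block(x, W_a, W_b):
    """relu(F(x) + x), where F(x) = W_b @ relu(W_a @ x) is the residual branch."""
    F = W_b @ relu(W_a @ x)             # two weight layers with a ReLU in between
    return relu(F + x)                  # add the identity shortcut, then apply ReLU

d = 32
x = np.random.randn(d)
W_a, W_b = np.random.randn(d, d), np.random.randn(d, d)
y = residual_block(x, W_a, W_b)
```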

  46. How Deep? [Plot: error (%) vs. iterations (×10^4) on CIFAR for ResNet-20/32/44/56/110, with the 20-layer and 110-layer curves annotated; thin lines denote training error and bold lines denote testing error.] He, Kaiming, et al. "Deep residual learning for image recognition." arXiv preprint arXiv:1512.03385 (2015).

  47. Obvious Questions? • How many layers? • Is the solution globally optimal? • What non-linearity should you use? • What learning rate? • How should I estimate my gradients?
