
Deep Learning. Jiseob Kim (jkim@bi.snu.ac.kr), Artificial Intelligence Class of 2016 Spring, Dept. of Computer Science and Engineering, Seoul National University. Slide 1: History of Neural Network Research (neural network, deep belief net, back ...)


  1. Effect of Unsupervised Pre-Training in DBN (2/2)
    [figure: results without pre-training vs. with pre-training]

  2. Internal Representation of DBN

  3. Representation of Higher Layers
    • Higher layers have more abstract representations
    • Interpolating between different images is not desirable in lower layers, but natural in higher layers
    (Bengio et al., ICML 2013)

  4. Inference Algorithm of DBN
    • As DBN is a generative model, we can also regenerate the data
    • From the top layer to the bottom, conduct Gibbs sampling to generate the data samples
    [figure: occluded input, generated data, regenerated output]  (Lee, Ng et al., ICML 2009)
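To make the top-down generation procedure concrete, here is a rough numpy sketch (not the authors' code) of sampling from a DBN with binary units: alternating Gibbs sampling in the top-level RBM, followed by a single ancestral pass down the directed layers. The parameter layout (`top_rbm`, `down_weights`, `down_biases`) and the number of Gibbs steps are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_bernoulli(p, rng):
    return (rng.random(p.shape) < p).astype(float)

def dbn_generate(top_rbm, down_weights, down_biases, n_gibbs=200, rng=None):
    """Generate a visible sample from a DBN (sketch).

    top_rbm: dict with 'W', 'b_vis', 'b_hid' for the top-level RBM.
    down_weights / down_biases: lists of (W, b) for the directed
    top-down layers, ordered from the layer below the top RBM to x.
    """
    rng = rng or np.random.default_rng(0)
    W, b_vis, b_hid = top_rbm['W'], top_rbm['b_vis'], top_rbm['b_hid']
    # 1) Alternating Gibbs sampling in the top-level RBM.
    v = sample_bernoulli(np.full(b_vis.shape, 0.5), rng)
    for _ in range(n_gibbs):
        h = sample_bernoulli(sigmoid(b_hid + W @ v), rng)
        v = sample_bernoulli(sigmoid(b_vis + W.T @ h), rng)
    # 2) Propagate the sample down through the directed layers.
    activity = v
    for W_down, b_down in zip(down_weights, down_biases):
        activity = sample_bernoulli(sigmoid(b_down + W_down @ activity), rng)
    return activity  # a sample of the visible units x
```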

  5. Applications
    • Nowadays, CNN outperforms DBN for image or speech data
    • However, if there is no topological information, DBN is still a good choice
    • Also, if a generative model is needed, DBN is used
    [figure: generated face patches]  (Tang, Srivastava, Salakhutdinov, NIPS 2014)

  6. CONVOLUTIONAL NEURAL NETWORKS  (Slides by Jiseob Kim, jkim@bi.snu.ac.kr)

  7. Motivation
    • Idea: a fully connected structure has too many parameters to learn
    • It is efficient to learn local patterns when there is geometrical or topological structure between features, such as image data or voice data (spectrograms)
    • DBN: different data; CNN: same data
    [figure: Image 1 vs. Image 2]

  8. Structure of Convolutional Neural Network (CNN)
    • Higher features are formed by repeated Convolution and Pooling (subsampling)
    • Convolution obtains a certain feature from a local area
    • Pooling reduces dimension, while obtaining a translation-invariant feature
    http://parse.ele.tue.nl/education/cluster2
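A minimal sketch of the convolution/pooling stack described above, assuming PyTorch is available; the layer widths, kernel sizes, and 28x28 input are illustrative choices, not taken from the slides.

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=5),   # convolution: local feature detectors
            nn.ReLU(),
            nn.MaxPool2d(2),                  # pooling: subsample, gain translation invariance
            nn.Conv2d(8, 16, kernel_size=5),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(16 * 4 * 4, n_classes)  # for 28x28 inputs

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

# Example: a batch of four 28x28 grayscale images.
logits = SmallCNN()(torch.randn(4, 1, 28, 28))
```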

  9. Convolution Layer
    • The kernel detects a pattern, e.g. the 3x3 kernel
        1 0 1
        0 1 0
        1 0 1
    • The resulting value indicates how much the pattern matches at each region
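To make "how much the pattern matches at each region" concrete, here is a small numpy sketch of a valid 2D convolution (technically cross-correlation, which is what CNN libraries compute under that name); the random binary image is just a stand-in.

```python
import numpy as np

def conv2d_valid(image, kernel):
    # Slide the kernel over the image; high values where the local
    # patch matches the kernel pattern.
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])
image = (np.random.default_rng(0).random((8, 8)) > 0.5).astype(float)
response = conv2d_valid(image, kernel)   # 6x6 map of match scores
```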

  10. Max-Pooling Layer
    • The pooling layer summarizes the results of the convolution layer
    • e.g. a 10x10 result is summarized into 1 cell
    • The result of the pooling layer is translation-invariant
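A short numpy sketch of non-overlapping max pooling, under the assumption that the feature-map size is divisible by the pool size; small shifts of a pattern within a pooling block leave the pooled output unchanged, which is the translation invariance the slide refers to.

```python
import numpy as np

def max_pool(feature_map, pool_size):
    # Summarize each pool_size x pool_size block by its maximum.
    h, w = feature_map.shape
    blocks = feature_map.reshape(h // pool_size, pool_size,
                                 w // pool_size, pool_size)
    return blocks.max(axis=(1, 3))

fmap = np.arange(36, dtype=float).reshape(6, 6)
pooled = max_pool(fmap, 2)   # 3x3 summary of the 6x6 feature map
```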

  11. Remarks
    • Higher layer catches more specific, abstract patterns
    • Lower layer catches more general patterns

  12. Parameter Learning of CNN
    • CNN is just another neural network with sparse connections
    • Learning algorithm: Back Propagation on convolution layers and fully-connected layers
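A rough training-loop sketch, assuming PyTorch autograd; the tiny model and the random images and labels are placeholder assumptions. It only illustrates that back propagation updates convolution kernels and fully-connected weights with the same mechanism.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(8 * 12 * 12, 10),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

images = torch.randn(32, 1, 28, 28)       # stand-in mini-batch
labels = torch.randint(0, 10, (32,))

for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()       # gradients flow through pooling, convolution, FC layers
    optimizer.step()      # shared (tied) convolution kernels get one update each
```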

  13. Applications (Image Classification) (1/2)
    • ImageNet Competition ranking (1000-class, 1 million images): ALL CNN!!
    (From Kyunghyun Cho's DNN tutorial)

  14. Applications (Image Classification) (2/2)
    • Krizhevsky et al.: the winner of the ImageNet 2012 Competition
    • 1000-class problem, top-5 test error rate of 15.3%
    [figure: network architecture with convolutional and fully connected layers]

  15. Application (Speech Recognition)
    • Input: spectrogram of speech, fed to a Convolutional Neural Network
    • CNN outperforms all previous methods that use GMMs on MFCC features

  16. APPENDIX  (Slides from Wanli Ouyang, wlouyang@ee.cuhk.edu.hk)

  17. Good learning resources
    • Webpages:
      • Geoffrey E. Hinton's readings (with source code available for DBN) http://www.cs.toronto.edu/~hinton/csc2515/deeprefs.html
      • Notes on Deep Belief Networks http://www.quantumg.net/dbns.php
      • MLSS Tutorial, October 2010, ANU Canberra, Marcus Frean http://videolectures.net/mlss2010au_frean_deepbeliefnets/
      • Deep Learning Tutorials http://deeplearning.net/tutorial/
      • Hinton's Tutorial http://videolectures.net/mlss09uk_hinton_dbn/
      • Fergus's Tutorial http://cs.nyu.edu/~fergus/presentations/nips2013_final.pdf
      • CUHK MMlab project http://mmlab.ie.cuhk.edu.hk/project_deep_learning.html
    • People:
      • Geoffrey E. Hinton http://www.cs.toronto.edu/~hinton
      • Andrew Ng http://www.cs.stanford.edu/people/ang/index.html
      • Ruslan Salakhutdinov http://www.utstat.toronto.edu/~rsalakhu/
      • Yee-Whye Teh http://www.gatsby.ucl.ac.uk/~ywteh/
      • Yoshua Bengio www.iro.umontreal.ca/~bengioy
      • Yann LeCun http://yann.lecun.com/
      • Marcus Frean http://ecs.victoria.ac.nz/Main/MarcusFrean
      • Rob Fergus http://cs.nyu.edu/~fergus/pmwiki/pmwiki.php
    • Acknowledgement: many materials in this ppt are from these papers, tutorials, etc. (especially Hinton's and Frean's). Sorry for not listing them in full detail.
    Dumitru Erhan, Aaron Courville, Yoshua Bengio. Understanding Representations Learned in Deep Architectures. Technical Report.

  18. Graphical model for Statistics
    • Conditional independence between random variables
    • Given C, A and B are independent: P(A, B|C) = P(A|C)P(B|C)
    • P(A,B,C) = P(A, B|C) P(C) = P(A|C)P(B|C) P(C)
    • Any two nodes are conditionally independent given the values of their parents.
    [figure: C = Smoker?, A = Has lung cancer, B = Has bronchitis]
    http://www.eecs.qmul.ac.uk/~norman/BBNs/Independence_and_conditional_independence.htm
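A tiny numeric check of the factorization above; the probabilities for the smoker / lung cancer / bronchitis example are made up for illustration.

```python
import numpy as np

# C = smoker, A = lung cancer, B = bronchitis, all binary (toy numbers).
p_C = np.array([0.7, 0.3])                 # P(C=0), P(C=1)
p_A_given_C = np.array([[0.99, 0.01],      # P(A | C=0)
                        [0.90, 0.10]])     # P(A | C=1)
p_B_given_C = np.array([[0.95, 0.05],
                        [0.70, 0.30]])

# Build the joint via P(A,B,C) = P(A|C) P(B|C) P(C).
joint = np.einsum('c,ca,cb->abc', p_C, p_A_given_C, p_B_given_C)

# Check conditional independence: P(A,B|C) == P(A|C) P(B|C) for each C.
for c in (0, 1):
    p_AB_given_c = joint[:, :, c] / joint[:, :, c].sum()
    assert np.allclose(p_AB_given_c,
                       np.outer(p_A_given_C[c], p_B_given_C[c]))
```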

  19. Directed and undirected graphical model
    • Directed graphical model: P(A,B,C) = P(A|C)P(B|C)P(C)
      Any two nodes are conditionally independent given the values of their parents.
    • Undirected graphical model: P(A,B,C) = P(B,C)P(A,C) (up to normalization)
      Also called Markov Random Field (MRF)
    • Directed model with an extra node D: P(A,B,C,D) = P(D|A,B)P(B|C)P(A|C)P(C)

  20. Modeling undirected model
    • Probability: P(x; θ) = f(x; θ) / Z(θ),  with partition function Z(θ) = Σ_x f(x; θ)
    • Example: P(A,B,C) = P(B,C)P(A,C) ∝ exp(w1·BC + w2·AC)
        P(A,B,C; θ) = exp(w1·BC + w2·AC) / Σ_{A,B,C} exp(w1·BC + w2·AC) = exp(w1·BC) exp(w2·AC) / Z(w1, w2)
    [figure: C = Is smoker?, A = Has lung cancer, B = Is healthy]
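A short numpy sketch of the same three-variable example: enumerate all binary states, sum the unnormalized potentials to get the partition function Z, and normalize. The weight values w1 and w2 are arbitrary illustrative choices.

```python
import numpy as np
from itertools import product

w1, w2 = 1.5, -0.5

def f(a, b, c):
    return np.exp(w1 * b * c + w2 * a * c)     # unnormalized potential

# Partition function: sum over all 2^3 joint states.
Z = sum(f(a, b, c) for a, b, c in product([0, 1], repeat=3))

def prob(a, b, c):
    return f(a, b, c) / Z                      # normalized probability

total = sum(prob(a, b, c) for a, b, c in product([0, 1], repeat=3))
assert abs(total - 1.0) < 1e-12                # sums to one by construction
```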

  21. More directed and undirected models
    [figure: a 3x3 grid of nodes A-I (MRF in 2D) and a chain y1, y2, y3 with hidden states h1, h2, h3 (Hidden Markov model)]

  22. More directed and undirected models
    • Directed chain (HMM): P(y1, y2, y3, h1, h2, h3) = P(h1) P(h2|h1) P(h3|h2) P(y1|h1) P(y2|h2) P(y3|h3)
    • Directed graph over A, B, C, D: P(A,B,C,D) = P(A) P(B) P(C|B) P(D|A,B,C)

  23. More directed and undirected models
    [figure: (a) RBM, an undirected bipartite model with weights W between visible and hidden units; (b) DBN, directed layers x, h1, h2, h3 with weight matrices W0, W1, W2]

  24. Extended reading on graphical model
    • Zoubin Ghahramani's video lecture on graphical models: http://videolectures.net/mlss07_ghahramani_grafm/

  25. Product of Experts
    • P(x; θ) = Π_m f_m(x; θ_m) / Z(θ) = exp(−E(x; θ)) / Σ_x exp(−E(x; θ)),  Z(θ) = Σ_x Π_m f_m(x; θ_m)
      (Z: partition function;  energy function E(x; θ) = −Σ_m log f_m(x; θ_m))
    • Example (MRF in 2D over nodes A ... I):  −E(x; w) = w1·AB + w2·BC + w3·AD + w4·BE + w5·CF + ...

  26. Product of Experts
    [example from the slide: a product of 15 experts, each depending on (x − u_i)'(x − u_i) through a Gaussian-like term and a constant c_i]

  27. Products of experts versus Mixture model
    • Products of experts: P(x; θ) = Π_m f_m(x; θ_m) / Σ_x Π_m f_m(x; θ_m)
      • an "and" operation
      • sharper than a mixture
      • each expert can constrain a different subset of dimensions
    • Mixture model, e.g. Gaussian Mixture model
      • an "or" operation
      • a weighted sum of many density functions
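A small numerical illustration of the "sharper than a mixture" point, using two 1-D Gaussian experts on a grid (illustrative means and variances): their renormalized product has a smaller standard deviation than their 50/50 mixture.

```python
import numpy as np

x = np.linspace(-6.0, 6.0, 2001)
dx = x[1] - x[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

e1, e2 = gauss(x, -1.0, 1.5), gauss(x, 1.0, 1.5)

product = e1 * e2
product /= product.sum() * dx          # renormalize: the "and" of the experts
mixture = 0.5 * e1 + 0.5 * e2          # the "or" of the components

def std(p):                            # standard deviation of a gridded density
    mean = (x * p).sum() * dx
    return np.sqrt(((x - mean) ** 2 * p).sum() * dx)

assert std(product) < std(mixture)     # the product is sharper than the mixture
```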

  28. Outline
    • Basic background on statistical learning and Graphical model
    • Contrastive divergence and Restricted Boltzmann machine
      • Product of experts
      • Contrastive divergence
      • Restricted Boltzmann Machine
    • Deep belief net

  29. Contrastive Divergence (CD)
    • Probability: P(x; θ) = f(x; θ) / Z(θ),  with partition function Z(θ) = Σ_x f(x; θ)
    • Maximum Likelihood and gradient descent:
        max_θ Π_{k=1..K} P(x^(k); θ)  ⇔  max_θ L(X; θ) = max_θ (1/K) Σ_{k=1..K} log P(x^(k); θ)
        update θ^{t+1} = θ^t + η ∂L(X; θ)/∂θ,  or solve ∂L(X; θ)/∂θ = 0
    • L(X; θ) = (1/K) Σ_{k=1..K} log f(x^(k); θ) − log Z(θ)
    • ∂L(X; θ)/∂θ = (1/K) Σ_{k=1..K} ∂ log f(x^(k); θ)/∂θ − ∫ p(x; θ) ∂ log f(x; θ)/∂θ dx
                  = ⟨∂ log f(x; θ)/∂θ⟩_X − ⟨∂ log f(x; θ)/∂θ⟩_{p(x; θ)}
      (first term: expectation under the data distribution X; second term: expectation under the model distribution)

  30. Contrastive Divergence (CD)
    (example graph: P(A,B,C) = P(A|C)P(B|C)P(C))
    • Gradient of Likelihood:
        ∂L(X; θ)/∂θ = (1/K) Σ_{k=1..K} ∂ log f(x^(k); θ)/∂θ − ∫ p(x; θ) ∂ log f(x; θ)/∂θ dx
        first term: easy to compute, tractable;  second term: intractable
    • Fast contrastive divergence: approximate the intractable term by Gibbs sampling with only T = 1 step (sample p(z1, z2, ..., zM))
    • Update θ^{t+1} = θ^t + η ∂L(X; θ)/∂θ
    • ML gradient: accurate but slow;  CD gradient: approximate but fast

  31. Gibbs Sampling for graphical model
    [figure: a graphical model with hidden nodes h1 ... h5 and visible nodes x1, x2, x3]
    More information on Gibbs sampling: Pattern Recognition and Machine Learning (PRML)

  32. Convergence of Contrastive Divergence (CD)
    • The fixed points of ML are not fixed points of CD and vice versa.
      • CD is a biased learning algorithm.
      • But the bias is typically very small.
      • CD can be used for getting close to the ML solution, and then ML learning can be used for fine-tuning.
    • It is not clear whether CD learning converges (to a stable fixed point). As of 2005, a proof was not available.
      • Further theoretical results? Please inform us
    M. A. Carreira-Perpignan and G. E. Hinton. On Contrastive Divergence Learning. Artificial Intelligence and Statistics, 2005

  33. Outline
    • Basic background on statistical learning and Graphical model
    • Contrastive divergence and Restricted Boltzmann machine
      • Product of experts
      • Contrastive divergence
      • Restricted Boltzmann Machine
    • Deep belief net

  34. Boltzmann Machine
    • Undirected graphical model, with hidden nodes
    • P(x; θ) = Π_m f_m(x; θ_m) / Z(θ) = exp(−E(x; θ)) / Σ_x exp(−E(x; θ))
    • Energy over the visible units: E(x; θ) = Σ_{i<j} w_ij x_i x_j + Σ_i θ_i x_i,  parameters θ: {w_ij, θ_i}
    • Boltzmann machine with hidden units: E(x, h) = b'x + c'h + h'Wx + x'Ux + h'Vh

  35. Restricted Boltzmann Machine (RBM)
    • Undirected, loopy, layered model
      (Boltzmann machine: E(x, h) = b'x + c'h + h'Wx + x'Ux + h'Vh; the RBM keeps only the between-layer term: E(x, h) = b'x + c'h + h'Wx)
    • P(x, h) = exp(−E(x, h)) / Σ_{x,h} exp(−E(x, h))   (the denominator is the partition function)
    • P(x) = Σ_h exp(−E(x, h)) / Σ_{x,h} exp(−E(x, h))
    • The conditionals factorize: P(h|x) = Π_i P(h_i|x),  P(x|h) = Π_j P(x_j|h)
        P(x_j = 1 | h) = σ(b_j + W'_{·j} · h)
        P(h_i = 1 | x) = σ(c_i + W_{i·} · x)
    • Read the manuscript for details
    [figure: hidden units h1 ... h5 and visible units x1, x2, x3 connected by weights W]
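A minimal numpy sketch of the factorized conditionals above, assuming binary units and a weight matrix W of shape (n_hidden, n_visible); the random parameters are placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def p_h_given_x(x, W, c):
    return sigmoid(c + W @ x)        # P(h_i = 1 | x) = sigma(c_i + W_i. x)

def p_x_given_h(h, W, b):
    return sigmoid(b + W.T @ h)      # P(x_j = 1 | h) = sigma(b_j + W'_.j h)

rng = np.random.default_rng(0)
n_vis, n_hid = 6, 4
W = 0.1 * rng.standard_normal((n_hid, n_vis))
b, c = np.zeros(n_vis), np.zeros(n_hid)

x = rng.integers(0, 2, size=n_vis).astype(float)
h_prob = p_h_given_x(x, W, c)        # each hidden unit is independent given x
h_sample = (rng.random(n_hid) < h_prob).astype(float)
x_prob = p_x_given_h(h_sample, W, b)
```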

  36. Restricted Boltzmann Machine (RBM)
    • P(x; θ) = Σ_h exp(−E(x, h)) / Σ_{x,h} exp(−E(x, h)) = f(x; θ) / Z(θ)
      E(x, h) = b'x + c'h + h'Wx,  x = [x1 x2 ...]',  h = [h1 h2 ...]'
    • Parameter learning: Maximum Log-Likelihood
        max_θ Π_{k=1..K} P(x^(k); θ)  ⇔  min_θ −L(X; θ) = min_θ −Σ_{k=1..K} log P(x^(k); θ)
    Geoffrey E. Hinton, "Training Products of Experts by Minimizing Contrastive Divergence." Neural Computation 14, 1771-1800 (2002)

  37. CD for RBM
    • CD for RBM, very fast!
    • ∂L(X; θ)/∂θ = (1/K) Σ_{k=1..K} ∂ log f(x^(k); θ)/∂θ − ∫ p(x; θ) ∂ log f(x; θ)/∂θ dx
    • For a weight w_ij:  ∂L(X; θ)/∂w_ij = ⟨x_j h_i⟩_X − ⟨x_j h_i⟩_{p(x; θ)}  ≈  ⟨x_j h_i⟩_0 − ⟨x_j h_i⟩_1   (CD)
    • Update θ^{t+1} = θ^t + η ∂L(X; θ)/∂θ
    • Sampling uses P(x_j = 1 | h) = σ(b_j + W'_{·j} · h) and P(h_i = 1 | x) = σ(c_i + W_{i·} · x)
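A rough numpy sketch of one CD-1 update following the ⟨x_j h_i⟩_0 − ⟨x_j h_i⟩_1 rule above; the batch size, learning rate, and toy data are illustrative assumptions, and this is a sketch of the rule rather than a tuned RBM trainer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(X, W, b, c, lr=0.1, rng=None):
    """X: (batch, n_visible) binary data; W: (n_hidden, n_visible)."""
    rng = rng or np.random.default_rng(0)
    # Positive phase: hidden probabilities driven by the data.
    h0_prob = sigmoid(c + X @ W.T)
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)
    # One Gibbs step: reconstruct the visibles, then the hiddens again.
    x1_prob = sigmoid(b + h0 @ W)
    x1 = (rng.random(x1_prob.shape) < x1_prob).astype(float)
    h1_prob = sigmoid(c + x1 @ W.T)
    # CD-1 estimates of <h x'> under the data (0) and after one step (1).
    pos = h0_prob.T @ X / len(X)
    neg = h1_prob.T @ x1 / len(X)
    W += lr * (pos - neg)
    b += lr * (X - x1).mean(axis=0)
    c += lr * (h0_prob - h1_prob).mean(axis=0)
    return W, b, c

rng = np.random.default_rng(1)
X = (rng.random((16, 6)) < 0.5).astype(float)       # toy binary batch
W = 0.01 * rng.standard_normal((4, 6))
b, c = np.zeros(6), np.zeros(4)
W, b, c = cd1_update(X, W, b, c)
```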

  38. CD for RBM
    • ∂L(X; θ)/∂w_ij = ⟨x_j h_i⟩_0 − ⟨x_j h_i⟩_1
    • One alternating step: sample h from P(h_i = 1 | x) = σ(c_i + W_{i·} · x), then reconstruct x from P(x_j = 1 | h) = σ(b_j + W'_{·j} · h)
    [figure: one alternating sampling step between visible units x1, x2 and hidden units h1, h2]

  39. RBM for classification
    • y: classification label
    Hugo Larochelle and Yoshua Bengio, Classification using Discriminative Restricted Boltzmann Machines, ICML 2008.

  40. RBM itself has many applications
    • Multiclass classification
    • Collaborative filtering
    • Motion capture modeling
    • Information retrieval
    • Modeling natural images
    • Segmentation
    Y. Li, D. Tarlow, R. Zemel. Exploring compositional high order pattern potentials for structured output learning. CVPR 2013.
    V. Mnih, H. Larochelle, G. E. Hinton. Conditional Restricted Boltzmann Machines for Structured Output Prediction. Uncertainty in Artificial Intelligence, 2011.
    Larochelle, H., & Bengio, Y. (2008). Classification using discriminative restricted Boltzmann machines. ICML 2008.
    Salakhutdinov, R., Mnih, A., & Hinton, G. E. (2007). Restricted Boltzmann machines for collaborative filtering. ICML 2007.
    Salakhutdinov, R., & Hinton, G. E. (2009). Replicated softmax: an undirected topic model. NIPS 2009.
    Osindero, S., & Hinton, G. E. (2008). Modeling image patches with a directed hierarchy of Markov random fields. NIPS 2008.

  41. Outline
    • Basic background on statistical learning and Graphical model
    • Contrastive divergence and Restricted Boltzmann machine
    • Deep belief net (DBN)
      • Why deep learning?
      • Learning and inference
      • Applications

  42. Belief Nets
    • A belief net is a directed acyclic graph composed of random variables.
    [figure: hidden nodes as random causes, visible nodes as effects]

  43. Deep Belief Net
    • Belief net that is deep
    • A generative model: P(x, h1, ..., hl) = p(x|h1) p(h1|h2) ... p(h(l−2)|h(l−1)) p(h(l−1), hl)
    • Used for unsupervised training of multi-layer deep models
    • Pixels => edges => local shapes => object parts
    [figure: layers x, h1, h2, h3;  P(x, h1, h2, h3) = p(x|h1) p(h1|h2) p(h2, h3)]

  44. Why Deep learning?
    Pixels => edges => local shapes => object parts
    • The mammal brain is organized in a deep architecture with a given input percept represented at multiple levels of abstraction, each level corresponding to a different area of cortex.
    • An architecture with insufficient depth can require many more computational elements, potentially exponentially more (with respect to input size), than architectures whose depth is matched to the task.
    • Since the number of computational elements one can afford depends on the number of training examples available to tune or select them, the consequences are not just computational but also statistical: poor generalization may be expected when using an insufficiently deep architecture for representing some functions.
    T. Serre, et al., "A quantitative theory of immediate visual recognition," Progress in Brain Research, Computational Neuroscience: Theoretical Insights into Brain Function, vol. 165, pp. 33-56, 2007.
    Yoshua Bengio, "Learning Deep Architectures for AI," Foundations and Trends in Machine Learning, 2009.

  45. Why Deep learning?
    • Linear regression, logistic regression: depth 1
    • Kernel SVM: depth 2
    • Decision tree: depth 2
    • Boosting: depth 2
    • The basic conclusion these results suggest is that when a function can be compactly represented by a deep architecture, it might need a very large architecture to be represented by an insufficiently deep one. (Example: logic gates, multi-layer NN with linear threshold units and positive weights.)
    Yoshua Bengio, "Learning Deep Architectures for AI," Foundations and Trends in Machine Learning, 2009.

  46. Example: sum product network (SPN)
    [figure: two sum-product networks over X1 ... X5, one with 2^(N−1) parameters and one with O(N) parameters]

  47. Depth of existing approaches
    • Boosting (2 layers)
      • L1: base learner
      • L2: vote or linear combination of layer 1
    • Decision tree, LLE, KNN, Kernel SVM (2 layers)
      • L1: matching degree to a set of local templates, e.g. Σ_i b_i K(x, x_i)
      • L2: combine these degrees
    • Brain: 5-10 layers

  48. Why does the decision tree have depth 2?
    • It relies on a partition of the input space.
    • Local estimator: it relies on a partition of the input space and uses separate parameters for each region. Each region is associated with a leaf.
    • It needs as many training samples as there are variations of interest in the target function, so it is not good for highly varying functions.
    • The number of training samples is exponential in the number of dimensions in order to achieve a fixed error rate.

  49. Deep Belief Net
    • Inference problem: infer the states of the unobserved variables.
    • Learning problem: adjust the interactions between variables to make the network more likely to generate the observed data.
    [figure: layers x, h1, h2, h3;  P(x, h1, h2, h3) = p(x|h1) p(h1|h2) p(h2, h3)]

  50. Deep Belief Net
    • Inference problem (the problem of explaining away):
        P(A, B|C) = P(A|C)P(B|C)
        but P(h11, h12 | x1) ≠ P(h11 | x1) P(h12 | x1)
    • An example from the manuscript
    • Sol: Complementary prior

  51. Deep Belief Net
    • Inference problem (the problem of explaining away)
    • Sol: Complementary prior
    [figure: a deep belief net with layers x, h1 (2000 units), h2 (1000), h3 (500), h4 (30)]

  52. Deep Belief Net
    • Explaining away problem of inference (see the manuscript)
      • Sol: Complementary prior, see the manuscript
    • Learning problem
      • Greedy layer-by-layer RBM training (optimizes a lower bound) and fine tuning
      • Contrastive divergence for RBM training, with P(h_i = 1 | x) = σ(c_i + W_{i·} · x)
    [figure: the stack x, h1, h2, h3 trained one layer at a time]

  53. Deep Belief Net
    • Why does greedy layerwise learning work?
    • Optimizing a lower bound:
        log P(x) = log Σ_{h1} P(x, h1)
                 ≥ Σ_{h1} { Q(h1|x) [log P(h1) + log P(x|h1)] − Q(h1|x) log Q(h1|x) }     (1)
    • When we fix the parameters for layer 1 and optimize the parameters for layer 2, we are optimizing P(h1) in (1)
    [figure: the stack x, h1, h2, h3 with the layer-2 model trained on top of the fixed layer 1]

  54. Deep Belief Net and RBM
    • RBM can be considered as a DBN that has infinitely many layers
    [figure: an RBM with weights W between x0 and h0, and its equivalent unrolled directed net ..., x2, h1, x1, h0, x0 with weights W and W' alternating]

  55. Pretrain, fine-tune and inference – (autoencoder) (BP)

  56. Pretrain, fine-tune and inference - 2
    • y: identity or rotation degree
    [figure: pretraining stage and fine-tuning stage]

  57. How many layers should we use?
    • There might be no universally right depth
      • Bengio suggests that several layers are better than one
      • Results are robust against changes in the size of a layer, but the top layer should be big
      • It is a parameter; it depends on your task
    • With enough narrow layers, we can model any distribution over binary vectors [1]
    [1] Sutskever, I. and Hinton, G. E., Deep Narrow Sigmoid Belief Networks are Universal Approximators. Neural Computation, 2007
    Copied from http://videolectures.net/mlss09uk_hinton_dbn/

  58. Effect of Unsupervised Pre-training  (Erhan et al., AISTATS 2009)

  59. Effect of Depth
    [figure: results without pre-training vs. with pre-training]

  60. Why unsupervised pre-training makes sense
    [figure: two generative stories relating "stuff", the image, and the label through high-bandwidth and low-bandwidth pathways]
    • If image-label pairs are generated this way, it makes sense to first learn to recover the stuff that caused the image by inverting the high-bandwidth pathway.
    • If image-label pairs were generated this way, it would make sense to try to go straight from images to labels. For example, do the pixels have even parity?

  61. Beyond layer-wise pretraining
    • Layer-wise pretraining is efficient but not optimal.
    • It is possible to train parameters for all layers using a wake-sleep algorithm.
      • Bottom-up in a layer-wise manner
      • Top-down and refitting the earlier models

  62. Fine-tuning with a contrastive version of the "wake-sleep" algorithm
    After learning many layers of features, we can fine-tune the features to improve generation.
    1. Do a stochastic bottom-up pass
       Adjust the top-down weights to be good at reconstructing the feature activities in the layer below.
    2. Do a few iterations of sampling in the top-level RBM
       Adjust the weights in the top-level RBM.
    3. Do a stochastic top-down pass
       Adjust the bottom-up weights to be good at reconstructing the feature activities in the layer above.

  63. Include lateral connections
    • The RBM has no connections within a layer
    • This can be generalized.
    • Lateral connections for the first layer [1]
      • Sampling from P(h|x) is still easy, but sampling from P(x|h) is more difficult.
    • Lateral connections at multiple layers [2]
      • Generate more realistic images.
      • CD is still applicable, with a small modification.
    [1] B. A. Olshausen and D. J. Field, "Sparse coding with an overcomplete basis set: a strategy employed by V1?," Vision Research, vol. 37, pp. 3311-3325, December 1997.
    [2] S. Osindero and G. E. Hinton, "Modeling image patches with a directed hierarchy of Markov random fields," in NIPS, 2007.

  64. Without lateral connections

  65. With lateral connections

  66. My data is real valued ...
    • Make it [0, 1] linearly: x = ax + b
    • Use another distribution
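A small sketch of the linear rescaling option (x = ax + b, with a and b chosen per feature from the data range), assuming numpy.

```python
import numpy as np

def to_unit_interval(X):
    # a = 1 / (max - min), b = -min / (max - min), applied per feature.
    lo, hi = X.min(axis=0), X.max(axis=0)
    scale = np.where(hi > lo, hi - lo, 1.0)   # avoid division by zero
    return (X - lo) / scale

X = np.random.default_rng(0).normal(size=(100, 3)) * 5.0 + 2.0
X01 = to_unit_interval(X)
assert X01.min() >= 0.0 and X01.max() <= 1.0
```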

  67. My data has temporal dependency ...
    • Static
    • Temporal

  68. Consider DBN as...
    • A statistical model that is used for unsupervised training of fully connected deep models
    • A directed graphical model that is approximated by fast learning and inference algorithms
    • A directed graphical model that is fine-tuned using a mature neural network learning approach: BP

  69. Outline
    • Basic background on statistical learning and Graphical model
    • Contrastive divergence and Restricted Boltzmann machine
    • Deep belief net (DBN)
      • Why DBN?
      • Learning and inference
      • Applications

  70. Applications of deep learning
    • Hand-written digit recognition
    • Dimensionality reduction
    • Information retrieval
    • Segmentation
    • Denoising
    • Phone recognition
    • Object recognition
    • Object detection
    • ...
    Hinton, G. E., Osindero, S., and Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation.
    Hinton, G. E. and Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science, 2006.
    Welling, M., et al. Exponential Family Harmoniums with an Application to Information Retrieval. NIPS 2004.
    A. R. Mohamed, et al. Deep Belief Networks for phone recognition. NIPS 09 workshop on deep learning for speech recognition.
    Nair, V. and Hinton, G. E. 3-D Object recognition with deep belief nets. NIPS 09.

  71. Object recognition
    • NORB dataset
    • logistic regression 19.6%, kNN (k=1) 18.4%, Gaussian kernel SVM 11.6%, convolutional neural net 6.0%, convolutional net + SVM hybrid 5.9%, DBN 6.5%
    • With the extra unlabeled data (and the same amount of labeled data as before), DBN achieves 5.2%

  72. Learning to extract the orientation of a face patch (Salakhutdinov & Hinton, NIPS 2007)
