Effect of Unsupervised Pre-Training in DBN (2/2)
(Figure: results without pre-training vs. with pre-training.)
Internal Representation of DBN
Representation of Higher Layers
Higher layers have more abstract representations.
Interpolating between different images is not desirable in lower layers, but natural in higher layers.
Bengio et al., ICML 2013
Inference Algorithm of DBN
Since a DBN is a generative model, we can also regenerate the data.
From the top layer down to the bottom, conduct Gibbs sampling to generate data samples.
(Figure: an occluded input, the generated data, and the regenerated image. Lee, Ng et al., ICML 2009)
Applications
Nowadays CNNs outperform DBNs on image and speech data.
However, if there is no topological structure in the data, a DBN is still a good choice.
A DBN is also used when a generative model is needed.
(Figure: generated face patches. Tang, Srivastava, Salakhutdinov, NIPS 2014)
CONVOLUTIONAL NEURAL NETWORKS
Slides by Jiseob Kim (jkim@bi.snu.ac.kr)
Motivation
Idea: a fully connected structure has too many parameters to learn.
It is more efficient to learn local patterns when the features have geometrical or topological structure, such as image data or voice data (spectrograms).
(Figure comparing how a DBN and a CNN treat two images: "DBN: different data", "CNN: same data"; Image 1, Image 2.)
Structure of Convolutional Neural Network (CNN)
Higher-level features are formed by repeated Convolution and Pooling (subsampling).
Convolution extracts a certain feature from a local area.
Pooling reduces dimensionality while producing a translation-invariant feature.
http://parse.ele.tue.nl/education/cluster2
Convolution Layer
The kernel detects a pattern, e.g.
1 0 1
0 1 0
1 0 1
The resulting value indicates how much the pattern matches at each region.
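In code, this matching is a sliding dot product between the kernel and each image patch. A minimal NumPy sketch (illustrative only, not from the slides; the 3x3 cross-shaped kernel is the one from the example above):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and take the dot
    product with each patch. Strictly this is cross-correlation, which is what
    deep-learning frameworks call 'convolution'."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)   # large value = pattern matches here
    return out

# The cross-shaped kernel from the slide
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

image = np.random.randint(0, 2, size=(10, 10))   # toy binary "image"
feature_map = conv2d_valid(image, kernel)        # 8x8 map of match scores
```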
Max-Pooling Layer
The pooling layer summarizes the results of the convolution layer, e.g. a 10x10 block of the result is summarized into 1 cell.
The result of the pooling layer is translation-invariant.
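A minimal NumPy sketch of non-overlapping max pooling (the block size here is arbitrary and just for illustration):

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling: each size x size block is summarized by its maximum."""
    H, W = feature_map.shape
    H2, W2 = H // size, W // size
    blocks = feature_map[:H2 * size, :W2 * size].reshape(H2, size, W2, size)
    return blocks.max(axis=(1, 3))

fm = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool(fm, size=2))   # 2x2 summary; small shifts of the input barely change it
```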
Remarks
• Higher layers catch more specific, abstract patterns.
• Lower layers catch more general patterns.
(Figure: feature hierarchy from lower to higher layers.)
Parameter Learning of CNN
A CNN is just another neural network, with sparse connections.
Learning algorithm: back-propagation through the convolution layers and the fully-connected layers.
Applications (Image Classification) (1/2)
ImageNet competition ranking (1000 classes, 1 million images): the top entries are all CNNs!
From Kyunghyun Cho's DNN tutorial
Applications (Image Classification) (2/2)
Krizhevsky et al.: the winner of the ImageNet 2012 competition.
1000-class problem; top-5 test error rate of 15.3%.
(Figure: the network architecture, ending in fully connected layers.)
Application (Speech Recognition)
Input: spectrogram of speech.
The convolutional neural network outperforms all previous methods that use GMMs on MFCC features.
(Figure: the CNN architecture used for speech.)
APPENDIX Slides from Wanli Ouyang wlouyang@ee.cuhk.edu.hk
Good learning resources
Webpages:
• Geoffrey E. Hinton's readings (with source code available for DBN): http://www.cs.toronto.edu/~hinton/csc2515/deeprefs.html
• Notes on Deep Belief Networks: http://www.quantumg.net/dbns.php
• MLSS Tutorial, October 2010, ANU Canberra, Marcus Frean: http://videolectures.net/mlss2010au_frean_deepbeliefnets/
• Deep Learning Tutorials: http://deeplearning.net/tutorial/
• Hinton's Tutorial: http://videolectures.net/mlss09uk_hinton_dbn/
• Fergus's Tutorial: http://cs.nyu.edu/~fergus/presentations/nips2013_final.pdf
• CUHK MMlab project: http://mmlab.ie.cuhk.edu.hk/project_deep_learning.html
People:
• Geoffrey E. Hinton: http://www.cs.toronto.edu/~hinton
• Andrew Ng: http://www.cs.stanford.edu/people/ang/index.html
• Ruslan Salakhutdinov: http://www.utstat.toronto.edu/~rsalakhu/
• Yee-Whye Teh: http://www.gatsby.ucl.ac.uk/~ywteh/
• Yoshua Bengio: www.iro.umontreal.ca/~bengioy
• Yann LeCun: http://yann.lecun.com/
• Marcus Frean: http://ecs.victoria.ac.nz/Main/MarcusFrean
• Rob Fergus: http://cs.nyu.edu/~fergus/pmwiki/pmwiki.php
Acknowledgement: many materials in this presentation are from these papers and tutorials (especially Hinton's and Frean's); sorry for not listing them in full detail.
Dumitru Erhan, Aaron Courville, Yoshua Bengio. Understanding Representations Learned in Deep Architectures. Technical Report.
Graphical model for Statistics
Conditional independence between random variables: given C, A and B are independent.
P(A, B | C) = P(A | C) P(B | C)
P(A, B, C) = P(A, B | C) P(C) = P(A | C) P(B | C) P(C)
Any two nodes are conditionally independent given the values of their parents.
(Figure: C = smoker?, A = has lung cancer, B = has bronchitis.)
http://www.eecs.qmul.ac.uk/~norman/BBNs/Independence_and_conditional_independence.htm
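The distinction can be checked numerically on this example. A small sketch with made-up probabilities (the numbers are illustrative only): A and B are dependent marginally, but become independent once C is given.

```python
import itertools

# Hypothetical numbers for binary variables: C = smoker, A = lung cancer, B = bronchitis
P_C = {1: 0.3, 0: 0.7}
P_A_given_C = {1: 0.4, 0: 0.05}   # P(A = 1 | C)
P_B_given_C = {1: 0.5, 0: 0.1}    # P(B = 1 | C)

bern = lambda p, v: p if v == 1 else 1.0 - p

# Joint built from the factorization P(A,B,C) = P(A|C) P(B|C) P(C)
joint = {(a, b, c): bern(P_A_given_C[c], a) * bern(P_B_given_C[c], b) * P_C[c]
         for a, b, c in itertools.product([0, 1], repeat=3)}

def marg(**fix):
    """Sum the joint over all states consistent with the fixed values, e.g. marg(a=1, c=1)."""
    return sum(p for (a, b, c), p in joint.items()
               if all({'a': a, 'b': b, 'c': c}[k] == v for k, v in fix.items()))

# Marginally, A and B are dependent ...
print(marg(a=1, b=1), marg(a=1) * marg(b=1))          # not equal
# ... but conditioned on C they are independent: P(A,B|C) = P(A|C) P(B|C)
print(marg(a=1, b=1, c=1) / marg(c=1),
      marg(a=1, c=1) / marg(c=1) * marg(b=1, c=1) / marg(c=1))   # equal
```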
Directed and undirected graphical models
Directed graphical model: P(A, B, C) = P(A | C) P(B | C) P(C). Any two nodes are conditionally independent given the values of their parents.
Undirected graphical model, also called a Markov Random Field (MRF): P(A, B, C) ∝ f(B, C) f(A, C), a product of potential functions normalized by a constant.
Another directed example: P(A, B, C, D) = P(D | A, B) P(B | C) P(A | C) P(C).
Modeling undirected models
Probability:
P(x; θ) = (1/Z(θ)) ∏_m f_m(x; θ),   Z(θ) = Σ_x ∏_m f_m(x; θ)   (the partition function)
Example (the smoker model above), P(A, B, C) ∝ f(B, C) f(A, C):
P(A, B, C; θ) = exp(w1·B·C + w2·A·C) / Z(w1, w2),   Z(w1, w2) = Σ_{A,B,C} exp(w1·B·C + w2·A·C)
(Figure: node C = is smoker?, with has-lung-cancer and is-healthy nodes attached by edge weights w1 and w2.)
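For a binary model this small, the partition function can be computed by brute-force enumeration of all 2^3 states. A minimal sketch (the weights w1, w2 are hypothetical):

```python
import itertools
import numpy as np

w1, w2 = 1.5, -0.8          # hypothetical weights on the B-C and A-C edges

def f(a, b, c):
    """Unnormalized potential f(x; w) = exp(w1*B*C + w2*A*C)."""
    return np.exp(w1 * b * c + w2 * a * c)

# Partition function: sum of the unnormalized potentials over all 2^3 states
Z = sum(f(a, b, c) for a, b, c in itertools.product([0, 1], repeat=3))

# Normalized probability of one configuration
P_111 = f(1, 1, 1) / Z
print(Z, P_111)
```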
More directed and undirected models
(Figure: a Hidden Markov Model with hidden states h1, h2, h3 emitting observations y1, y2, y3, and a 2-D MRF over a grid of nodes A–I.)
More directed and undirected models
Directed example: P(A, B, C, D) = P(A) P(B) P(C | B) P(D | A, B, C)
Hidden Markov Model: P(y1, y2, y3, h1, h2, h3) = P(h1) P(h2 | h1) P(h3 | h2) P(y1 | h1) P(y2 | h2) P(y3 | h3)
More directed and undirected models
(Figure: (a) an RBM with visible units v/x and hidden units h connected by weights W; (b) a DBN with input x, hidden layers h1, h2, h3 and weight matrices W0, W1, W2.)
Extended reading on graphical models
Zoubin Ghahramani's video lecture on graphical models: http://videolectures.net/mlss07_ghahramani_grafm/
Product of Experts
P(x; θ) = ∏_m f_m(x; θ_m) / Z(θ) = e^{-E(x; θ)} / Σ_x e^{-E(x; θ)},   where   E(x; θ) = -Σ_m log f_m(x; θ_m)
Z(θ) is the partition function and E(x; θ) is the energy function.
Example (the 2-D MRF over nodes A–I):
E(x; w) = w1·AB + w2·BC + w3·AD + w4·BE + w5·CF + …
Product of Experts
(Figure: an example density formed as a product of 15 experts, each expert a function of a weight c_i and the squared distance (x − u_i)'(x − u_i) to a center u_i.)
Products of experts versus mixture models
Product of experts: P(x; θ) = ∏_m f_m(x; θ_m) / Σ_x ∏_m f_m(x; θ_m)
• An "and" operation: sharper than a mixture.
• Each expert can constrain a different subset of dimensions.
Mixture model, e.g. Gaussian mixture model:
• An "or" operation: a weighted sum of many density functions.
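The "and" versus "or" intuition can be made concrete with two 1-D Gaussian experts: their renormalized product is a sharper (lower-variance) density, while their mixture is broader. A small numerical sketch (the means and variances are made up):

```python
import numpy as np

x = np.linspace(-6, 6, 1201)
dx = x[1] - x[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

e1, e2 = gauss(x, -1.0, 1.5), gauss(x, 1.0, 1.5)      # two experts

mixture = 0.5 * e1 + 0.5 * e2                         # "or": weighted sum of densities
product = e1 * e2
product /= product.sum() * dx                         # "and": renormalize (divide by Z)

# The product is sharper: its variance is smaller than either expert's
var = lambda p: np.sum(x**2 * p) * dx - (np.sum(x * p) * dx) ** 2
print(var(mixture), var(product))                     # roughly 3.25 vs 1.125
```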
Outline
Basic background on statistical learning and graphical models
Contrastive divergence and Restricted Boltzmann Machines
• Product of experts
• Contrastive divergence
• Restricted Boltzmann Machine
Deep belief nets
Contrastive Divergence (CD)
Probability: P(x; θ) = ∏_m f_m(x; θ) / Z(θ),   Z(θ) = Σ_x ∏_m f_m(x; θ)
Maximum likelihood and gradient ascent:
max_θ ∏_{k=1}^K P(x^(k); θ)  ⇔  max_θ L(X; θ),   L(X; θ) = (1/K) Σ_{k=1}^K log P(x^(k); θ)
θ_{t+1} = θ_t + η ∂L(X; θ)/∂θ,   or solve ∂L(X; θ)/∂θ = 0
∂L(X; θ)/∂θ = (1/K) Σ_{k=1}^K ∂ log f(x^(k); θ)/∂θ − ∂ log Z(θ)/∂θ
            = ⟨∂ log f(x; θ)/∂θ⟩_X − ∫ p(x; θ) ∂ log f(x; θ)/∂θ dx
i.e. an expectation under the data distribution minus an expectation under the model distribution.
Contrastive Divergence (CD)
Gradient of the likelihood:
∂L(X; θ)/∂θ = (1/K) Σ_{k=1}^K ∂ log f(x^(k); θ)/∂θ − ∫ p(x; θ) ∂ log f(x; θ)/∂θ dx
The first term is easy to compute; the second (the model expectation) is intractable.
Gibbs sampling of p(z1, z2, …, zM) makes it tractable: an accurate but slow gradient.
Contrastive divergence runs only T = 1 step of Gibbs sampling starting from the data: an approximate but fast gradient.
Update: θ_{t+1} = θ_t + η ∂L(X; θ)/∂θ
(Figure: both gradients move the parameters toward the optimum; the CD gradient is approximate but fast, the full Gibbs/ML gradient accurate but slow.)
Gibbs Sampling for graphical models
Each variable is resampled in turn from its conditional distribution given the current values of all the others.
(Figure: a graphical model with hidden nodes h1–h5 and visible nodes x1–x3.)
More information on Gibbs sampling: Pattern Recognition and Machine Learning (PRML).
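A minimal sketch of Gibbs sampling for the three-node binary model P(A, B, C) ∝ exp(w1·B·C + w2·A·C) from the earlier slide (weights are hypothetical); each conditional reduces to a sigmoid of the relevant terms:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
w1, w2 = 1.5, -0.8                      # same hypothetical weights as before

def gibbs(num_samples, burn_in=100):
    a, b, c = 0, 0, 0
    samples = []
    for t in range(burn_in + num_samples):
        # Resample each variable from its conditional given the others
        a = int(rng.random() < sigmoid(w2 * c))            # P(A=1 | C)
        b = int(rng.random() < sigmoid(w1 * c))            # P(B=1 | C)
        c = int(rng.random() < sigmoid(w1 * b + w2 * a))   # P(C=1 | A, B)
        if t >= burn_in:
            samples.append((a, b, c))
    return samples

samples = gibbs(10000)
# Empirical frequencies approximate P(A,B,C) = exp(w1*B*C + w2*A*C) / Z
```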
Convergence of Contrastive Divergence (CD)
The fixed points of ML are not fixed points of CD, and vice versa: CD is a biased learning algorithm.
But the bias is typically very small. CD can be used to get close to the ML solution, and then ML learning can be used for fine-tuning.
It is not clear whether CD learning converges (to a stable fixed point). As of 2005, a proof was not available. Further theoretical results? Please inform us.
M. A. Carreira-Perpignan and G. E. Hinton. On Contrastive Divergence Learning. Artificial Intelligence and Statistics, 2005.
Outline
Basic background on statistical learning and graphical models
Contrastive divergence and Restricted Boltzmann Machines
• Product of experts
• Contrastive divergence
• Restricted Boltzmann Machine
Deep belief nets
Boltzmann Machine
An undirected graphical model with hidden nodes.
P(x; θ) = ∏_m f_m(x; θ) / Z(θ) = e^{-E(x; θ)} / Σ_x e^{-E(x; θ)},   E(x; θ) = -Σ_m log f_m(x; θ)
For binary units x_i ∈ {0, 1}:   E(x; w) = -Σ_{i<j} w_ij x_i x_j - Σ_i θ_i x_i
Boltzmann machine with hidden units h:   E(x, h) = -b'x - c'h - h'Wx - x'Ux - h'Vh
Restricted Boltzmann Machine (RBM)
An undirected, loopy, layered model: visible units x, hidden units h, and no connections within a layer.
E(x, h) = -b'x - c'h - h'Wx
P(x, h) = e^{-E(x, h)} / Z,   Z = Σ_{x,h} e^{-E(x, h)}   (the partition function)
P(x) = Σ_h e^{-E(x, h)} / Z
Because of the bipartite structure the conditionals factorize:
P(h | x) = ∏_i P(h_i | x),   P(x | h) = ∏_j P(x_j | h)
P(x_j = 1 | h) = σ(b_j + W'_{•j} · h),   P(h_i = 1 | x) = σ(c_i + W_{i•} · x)
Read the manuscript for details.
Restricted Boltzmann Machine (RBM)
E(x, h) = -b'x - c'h - h'Wx,   x = [x1 x2 …]', h = [h1 h2 …]'
P(x; θ) = f(x; θ) / Z(θ),   f(x; θ) = Σ_h e^{b'x + c'h + h'Wx},   Z(θ) = Σ_{x,h} e^{b'x + c'h + h'Wx}
Parameter learning: maximum (log-)likelihood
max_θ ∏_{k=1}^K P(x^(k); θ)  ⇔  max_θ L(X; θ),   L(X; θ) = Σ_{k=1}^K log P(x^(k); θ)
Geoffrey E. Hinton, "Training Products of Experts by Minimizing Contrastive Divergence." Neural Computation 14, 1771–1800 (2002).
CD for RBM
CD for an RBM is very fast!
θ_{t+1} = θ_t + η ∂L(X; θ)/∂θ,   P(x; θ) = f(x; θ) / Z(θ) = Σ_h e^{b'x + c'h + h'Wx} / Σ_{x,h} e^{b'x + c'h + h'Wx}
∂L(X; θ)/∂w_ij = (1/K) Σ_{k=1}^K ∂ log f(x^(k); θ)/∂w_ij − ∫ p(x; θ) ∂ log f(x; θ)/∂w_ij dx
              = ⟨x_j h_i⟩_X − ⟨x_j h_i⟩_{p(x; θ)}  ≈  ⟨x_j h_i⟩_0 − ⟨x_j h_i⟩_1   (the CD-1 approximation)
using the conditionals P(x_j = 1 | h) = σ(b_j + W'_{•j} · h) and P(h_i = 1 | x) = σ(c_i + W_{i•} · x).
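Putting the update rule and the two conditionals together, one CD-1 step for a binary RBM can be sketched in a few lines of NumPy (an illustrative sketch, not the code behind the slides; sizes and learning rate are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def cd1_update(x0, W, b, c, lr=0.1):
    """One CD-1 step for a binary RBM on a single training vector x0.
    W[i, j] connects hidden unit i to visible unit j; b, c are visible/hidden biases."""
    # Positive phase: P(h_i = 1 | x) = sigmoid(c_i + W_{i.} x)
    p_h0 = sigmoid(c + W @ x0)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)

    # One Gibbs step: reconstruct the visibles, P(x_j = 1 | h) = sigmoid(b_j + W_{.j}' h)
    p_x1 = sigmoid(b + W.T @ h0)
    x1 = (rng.random(p_x1.shape) < p_x1).astype(float)
    p_h1 = sigmoid(c + W @ x1)

    # CD-1 approximation of the gradient: <x_j h_i>_0 - <x_j h_i>_1
    W = W + lr * (np.outer(p_h0, x0) - np.outer(p_h1, x1))
    b = b + lr * (x0 - x1)
    c = c + lr * (p_h0 - p_h1)
    return W, b, c

# Toy usage with random binary "data" (replace with real training vectors)
n_visible, n_hidden = 6, 4
W = 0.01 * rng.standard_normal((n_hidden, n_visible))
b, c = np.zeros(n_visible), np.zeros(n_hidden)
for _ in range(1000):
    x = (rng.random(n_visible) < 0.5).astype(float)
    W, b, c = cd1_update(x, W, b, c)
```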
CD for RBM
∂L(X; θ)/∂w_ij ≈ ⟨x_j h_i⟩_0 − ⟨x_j h_i⟩_1
(Figure: one step of alternating Gibbs sampling between the visible and hidden layers, x → h → x → h, which supplies the ⟨x_j h_i⟩_1 statistics.)
P(x_j = 1 | h) = σ(b_j + W'_{•j} · h),   P(h_i = 1 | x) = σ(c_i + W_{i•} · x)
RBM for classification
y: classification label.
Hugo Larochelle and Yoshua Bengio, Classification using Discriminative Restricted Boltzmann Machines, ICML 2008.
RBM itself has many applications
• Multiclass classification
• Collaborative filtering
• Motion capture modeling
• Information retrieval
• Modeling natural images
• Segmentation
Y. Li, D. Tarlow, R. Zemel. Exploring compositional high order pattern potentials for structured output learning. CVPR 2013.
V. Mnih, H. Larochelle, G. E. Hinton. Conditional Restricted Boltzmann Machines for Structured Output Prediction. Uncertainty in Artificial Intelligence, 2011.
Larochelle, H., & Bengio, Y. Classification using discriminative restricted Boltzmann machines. ICML 2008.
Salakhutdinov, R., Mnih, A., & Hinton, G. E. Restricted Boltzmann machines for collaborative filtering. ICML 2007.
Salakhutdinov, R., & Hinton, G. E. Replicated softmax: an undirected topic model. NIPS 2009.
Osindero, S., & Hinton, G. E. Modeling image patches with a directed hierarchy of Markov random fields. NIPS 2008.
Outline
Basic background on statistical learning and graphical models
Contrastive divergence and Restricted Boltzmann Machines
Deep belief nets (DBN)
• Why deep learning?
• Learning and inference
• Applications
Belief Nets
A belief net is a directed acyclic graph composed of random variables.
(Figure: hidden random causes at the top generate visible effects at the bottom.)
Deep Belief Net
A belief net that is deep; a generative model:
P(x, h1, …, hl) = p(x | h1) p(h1 | h2) … p(hl-2 | hl-1) p(hl-1, hl)
Used for unsupervised training of multi-layer deep models.
Pixels => edges => local shapes => object parts
(Figure: a DBN with layers x, h1, h2, h3.)
P(x, h1, h2, h3) = p(x | h1) p(h1 | h2) p(h2, h3)
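Generation follows this factorization directly: run Gibbs sampling in the top-level RBM p(hl-1, hl), then sample each lower layer from p(hk-1 | hk) down to x. A rough sketch assuming sigmoid units and already-trained weights (all parameter names here are placeholders, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
sample = lambda p: (rng.random(p.shape) < p).astype(float)

def dbn_generate(top_rbm, down_weights, down_biases, gibbs_steps=200):
    """Draw one sample x ~ P(x, h1, ..., hl).
    top_rbm = (W_top, b_top, c_top) is the RBM over the two top layers;
    down_weights/down_biases parameterize p(h^(k-1) | h^k) for the directed layers."""
    W_top, b_top, c_top = top_rbm

    # 1. Gibbs sampling in the top-level RBM p(h^(l-1), h^l)
    h_low = sample(np.full(b_top.shape, 0.5))
    for _ in range(gibbs_steps):
        h_top = sample(sigmoid(c_top + W_top @ h_low))
        h_low = sample(sigmoid(b_top + W_top.T @ h_top))

    # 2. Top-down ancestral pass through the directed layers p(h^(k-1) | h^k)
    h = h_low
    for W, bias in zip(down_weights, down_biases):
        h = sample(sigmoid(bias + W @ h))
    return h   # the generated visible vector x
```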
Why Deep learning?
Pixels => edges => local shapes => object parts
The mammal brain is organized in a deep architecture, with a given input percept represented at multiple levels of abstraction, each level corresponding to a different area of cortex.
An architecture with insufficient depth can require many more computational elements, potentially exponentially more (with respect to input size), than architectures whose depth is matched to the task.
Since the number of computational elements one can afford depends on the number of training examples available to tune or select them, the consequences are not just computational but also statistical: poor generalization may be expected when using an insufficiently deep architecture for representing some functions.
T. Serre et al., "A quantitative theory of immediate visual recognition," Progress in Brain Research, Computational Neuroscience: Theoretical Insights into Brain Function, vol. 165, pp. 33–56, 2007.
Yoshua Bengio, "Learning Deep Architectures for AI," Foundations and Trends in Machine Learning, 2009.
Why Deep learning?
Linear regression, logistic regression: depth 1
Kernel SVM: depth 2
Decision tree: depth 2
Boosting: depth 2
The basic conclusion these results suggest is that when a function can be compactly represented by a deep architecture, it might need a very large architecture to be represented by an insufficiently deep one (example: logic gates, multi-layer NNs with linear threshold units and positive weights).
Yoshua Bengio, "Learning Deep Architectures for AI," Foundations and Trends in Machine Learning, 2009.
Example: sum-product network (SPN)
(Figure: representing the same function over X1–X5 needs on the order of 2^(N-1) parameters with a shallow representation, but only O(N) parameters with a deep SPN.)
Depth of existing approaches
Boosting (2 layers):
• Layer 1: base learners
• Layer 2: a vote or linear combination of layer 1
Decision trees, LLE, KNN, kernel SVM (2 layers):
• Layer 1: matching degree to a set of local templates, e.g. Σ_i b_i K(x, x_i)
• Layer 2: combine these degrees
Brain: 5–10 layers
Why do decision trees have depth 2?
They rely on a partition of the input space, i.e. a local estimator: separate parameters are used for each region, and each region is associated with a leaf.
This needs as many training samples as there are variations of interest in the target function, so it is not good for highly varying functions: the number of training samples required grows exponentially with the number of input dimensions to achieve a fixed error rate.
Deep Belief Net
Inference problem: infer the states of the unobserved variables.
Learning problem: adjust the interactions between variables to make the network more likely to generate the observed data.
(Figure: a DBN with layers x, h1, h2, h3.)
P(x, h1, h2, h3) = p(x | h1) p(h1 | h2) p(h2, h3)
Deep Belief Net
Inference problem (the problem of explaining away):
In the directed model above, A and B are independent given their common parent C: P(A, B | C) = P(A | C) P(B | C).
In a DBN, however, the hidden units share common children in the layer below, so they are not conditionally independent given the data (explaining away):
P(h11, h12 | x1) ≠ P(h11 | x1) P(h12 | x1)
(An example is given in the manuscript.)
Solution: complementary priors.
Deep Belief Net
Inference problem (the problem of explaining away). Solution: complementary priors.
(Figure: a DBN with layers x, h1 (2000 units), h2 (1000 units), h3 (500 units), h4 (30 units).)
Deep Belief Net
Explaining-away problem of inference (see the manuscript). Solution: complementary priors (see the manuscript).
Learning problem: greedy layer-by-layer RBM training (which optimizes a lower bound), followed by fine-tuning; contrastive divergence is used for the RBM training.
(Figure: the stack x–h1–h2–h3 is trained one pair of adjacent layers at a time, each pair as an RBM with P(h_i = 1 | x) = σ(c_i + W_{i•} · x).)
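Greedy layer-wise training can be sketched as: train an RBM on the data with CD, feed its hidden activations to the next RBM as if they were data, and repeat; the stack is fine-tuned afterwards. A rough sketch (it reuses the hypothetical cd1_update function from the CD-for-RBM sketch above; all sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def pretrain_dbn(data, layer_sizes, epochs=10, lr=0.1):
    """Greedy layer-wise pretraining of a stack of RBMs.
    data: (N, n_visible) binary matrix; layer_sizes: hidden-layer widths, bottom to top.
    Assumes cd1_update(x, W, b, c, lr) from the earlier CD sketch is in scope."""
    layers = []
    X = data
    for n_hidden in layer_sizes:
        n_visible = X.shape[1]
        W = 0.01 * rng.standard_normal((n_hidden, n_visible))
        b, c = np.zeros(n_visible), np.zeros(n_hidden)
        for _ in range(epochs):
            for x in X:
                W, b, c = cd1_update(x, W, b, c, lr)   # CD-1 step on one training vector
        layers.append((W, b, c))
        # Hidden activations (probabilities) of this RBM become the "data" for the next one
        X = sigmoid(X @ W.T + c)
    return layers
```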
Deep Belief Net
Why does greedy layerwise learning work? It optimizes a lower bound:
log P(x) = log Σ_{h1} P(x, h1)
         ≥ Σ_{h1} Q(h1 | x) [log P(h1) + log P(x | h1)] − Σ_{h1} Q(h1 | x) log Q(h1 | x)    (1)
When we fix the parameters for layer 1 and optimize the parameters for layer 2, we are optimizing the P(h1) term in (1).
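The bound in (1) is an instance of Jensen's inequality applied to an arbitrary distribution Q(h1 | x); a compact derivation:

```latex
\begin{aligned}
\log P(x) &= \log \sum_{h^1} P(x, h^1)
           = \log \sum_{h^1} Q(h^1 \mid x)\,\frac{P(x \mid h^1)\,P(h^1)}{Q(h^1 \mid x)} \\
          &\ge \sum_{h^1} Q(h^1 \mid x)\,\bigl[\log P(x \mid h^1) + \log P(h^1)\bigr]
             \;-\; \sum_{h^1} Q(h^1 \mid x)\,\log Q(h^1 \mid x),
\end{aligned}
```

with equality when Q(h1 | x) = P(h1 | x). Once the layer-1 parameters are frozen, Q(h1 | x) and P(x | h1) are fixed, so only the Σ Q(h1 | x) log P(h1) term can change; training the next RBM on samples of h1 drawn from Q(h1 | x) increases exactly that term, and hence the bound.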
Deep Belief Net and RBM
An RBM can be viewed as a DBN with infinitely many layers.
(Figure: an RBM with weights W unrolled into an infinite directed belief net with tied weights W, W', W, W', … over layers x0, h0, x1, h1, x2, ….)
Pretrain, fine-tune and inference - 1
(Figure: a pretrained stack of RBMs unrolled into a deep autoencoder, then fine-tuned with back-propagation (BP).)
Pretrain, fine-tune and inference - 2
(Figure: pretraining followed by fine-tuning; y is the identity or the rotation degree.)
How many layers should we use?
There might be no universally right depth.
• Bengio suggests that several layers are better than one.
• Results are robust against changes in the size of a layer, but the top layer should be big.
• It is a parameter: it depends on your task.
• With enough narrow layers, we can model any distribution over binary vectors [1].
[1] Sutskever, I. and Hinton, G. E., Deep Narrow Sigmoid Belief Networks are Universal Approximators. Neural Computation, 2007.
Copied from http://videolectures.net/mlss09uk_hinton_dbn/
Effect of Unsupervised Pre-training
(Figure: results from Erhan et al., AISTATS 2009.)
Effect of Depth
(Figure: results with pre-training vs. without pre-training as the depth varies.)
Why unsupervised pre-training makes sense
(Figure: two ways image-label pairs could be generated. In one, the label is produced from the image through a low-bandwidth pathway; in the other, hidden "stuff" produces the image through a high-bandwidth pathway and also determines the label.)
If image-label pairs were generated the first way, it would make sense to try to go straight from images to labels; for example, do the pixels have even parity?
If image-label pairs are generated the second way, it makes sense to first learn to recover the stuff that caused the image, by inverting the high-bandwidth pathway.
Beyond layer-wise pretraining
Layer-wise pretraining is efficient but not optimal.
It is possible to train the parameters of all layers jointly using a wake-sleep algorithm:
• bottom-up, in a layer-wise manner;
• top-down, refitting the earlier layers.
Fine-tuning with a contrastive version of the "wake-sleep" algorithm
After learning many layers of features, we can fine-tune the features to improve generation.
1. Do a stochastic bottom-up pass: adjust the top-down weights to be good at reconstructing the feature activities in the layer below.
2. Do a few iterations of sampling in the top-level RBM: adjust the weights in the top-level RBM.
3. Do a stochastic top-down pass: adjust the bottom-up weights to be good at reconstructing the feature activities in the layer above.
Include lateral connections
An RBM has no connections among the nodes within a layer; this can be generalized.
Lateral connections for the first layer [1]: sampling from P(h | x) is still easy, but sampling from P(x | h) becomes more difficult.
Lateral connections at multiple layers [2]: generates more realistic images; CD is still applicable, with small modifications.
[1] B. A. Olshausen and D. J. Field, "Sparse coding with an overcomplete basis set: a strategy employed by V1?," Vision Research, vol. 37, pp. 3311–3325, December 1997.
[2] S. Osindero and G. E. Hinton, "Modeling image patches with a directed hierarchy of Markov random fields," in NIPS, 2007.
Without lateral connection
With lateral connection
My data is real valued …
• Make it lie in [0, 1] linearly: x ← ax + b.
• Use another conditional distribution for the visible units.
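One common choice of "another distribution" is the Gaussian-Bernoulli RBM: binary hidden units with Gaussian visible units. A minimal sketch of its two conditionals, assuming unit-variance visibles (a standard simplification, not spelled out on the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def sample_h_given_x(x, W, c):
    """Binary hidden units: P(h_i = 1 | x) = sigmoid(c_i + W_{i.} x)."""
    p = sigmoid(c + W @ x)
    return (rng.random(p.shape) < p).astype(float), p

def sample_x_given_h(h, W, b, sigma=1.0):
    """Real-valued visibles: x | h ~ N(b + W' h, sigma^2 I)."""
    mean = b + W.T @ h
    return mean + sigma * rng.standard_normal(mean.shape)
```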
My data has temporal dependency …
(Figure: a static model vs. a model with temporal dependencies.)
Consider DBN as…
• A statistical model that is used for unsupervised training of fully connected deep models.
• A directed graphical model that is approximated by fast learning and inference algorithms.
• A directed graphical model that is fine-tuned using a mature neural network learning approach: back-propagation (BP).
Outline
Basic background on statistical learning and graphical models
Contrastive divergence and Restricted Boltzmann Machines
Deep belief nets (DBN)
• Why DBN?
• Learning and inference
• Applications
Applications of deep learning
• Handwritten digit recognition
• Dimensionality reduction
• Information retrieval
• Segmentation
• Denoising
• Phone recognition
• Object recognition
• Object detection
• …
Hinton, G. E., Osindero, S., and Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation.
Hinton, G. E. and Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science, 2006.
Welling, M. et al. Exponential Family Harmoniums with an Application to Information Retrieval. NIPS 2004.
A. R. Mohamed et al. Deep Belief Networks for phone recognition. NIPS 2009 workshop on deep learning for speech recognition.
Nair, V. and Hinton, G. E. 3-D Object recognition with deep belief nets. NIPS 2009.
Object recognition
NORB results: logistic regression 19.6%, kNN (k=1) 18.4%, Gaussian-kernel SVM 11.6%, convolutional neural net 6.0%, convolutional net + SVM hybrid 5.9%, DBN 6.5%.
With extra unlabeled data (and the same amount of labeled data as before), the DBN achieves 5.2%.
Learning to extract the orientation of a face patch
(Salakhutdinov & Hinton, NIPS 2007)