Deep Learning (jkim@bi.snu.ac.kr) 2015/05/7 - PowerPoint PPT Presentation

Deep Belief Network (DBN)
• Deep Belief Network (Deep Bayesian Network)
  - A Bayesian network with a structure similar to a neural network
  - A generative model
  - Can also be used as a classifier (with an additional classifier at the top layer)
  - Mitigates the vanishing-gradient problem through pre-training
  - There are two modes (classifier and auto-encoder); only the classifier is considered here

Learning Algorithm of DBN
• DBN as a stack of RBMs
[Figure: a DBN (input x, hidden layers h1, h2, h3, classifier on top) shown as a stack of RBMs]
1. Regard each layer as an RBM.
2. Pre-train each RBM layer-wise in an unsupervised way.
3. Attach the classifier and fine-tune the whole network in a supervised way.

Viewing Learning as Wake-Sleep Algorithm

Effect of Unsupervised Pre-Training in DBN (1/2) (Erhan et al., AISTATS 2009)

Effect of Unsupervised Pre-Training in DBN (2/2)
[Figure: panels "without pre-training" and "with pre-training"]

Internal Representation of DBN

Representation of Higher Layers
• Higher layers have more abstract representations
  - Interpolating between different images is not desirable in lower layers, but natural in higher layers
[Figure: (a) interpolating between an example and its 200th nearest neighbor; (c) sequences of points interpolated at different depths (Bengio et al., ICML 2013)]

Inference Algorithm of DBN
• Because a DBN is a generative model, it can also regenerate the data
  - Conduct Gibbs sampling from the top layer down to the bottom to generate data samples
[Figure: occluded input, generated data, and regenerated output (Lee, Ng et al., ICML 2009)]

Applications l Nowadays, CNN outperforms DBN for Image or Speech data l However, if there is no topological information, DBN is still a good choice l Also, if the generative model is needed, DBN is used Generate Face patches Tang, Srivastava, Salakhutdinov, NIPS 2014

CONVOLUTIONAL NEURAL NETWORKS
Slides by Jiseob Kim (jkim@bi.snu.ac.kr)

Motivation l Idea: § Fully connected 네트워크 구조는 학습해야할 파라미터 수가 너무 많음 § 이미지 데이터 , 음성 데이터 (spectrogram) 과 같이 각 feature 들 간의 위상적 , 기하적 구조가 있는 경우 Local 한 패턴을 학습하 는 것이 효과적 n DBN 의 경우 다른 data n CNN 의 경우 같은 data Image 1 Image 2

Structure of Convolutional Neural Network (CNN)
• Higher-level features are built by repeating convolution and pooling (subsampling)
• Convolution extracts a particular feature from a local region
• Pooling reduces the dimension while producing translation-invariant features
http://parse.ele.tue.nl/education/cluster2

Convolution Layer l The Kernel Detects pattern: 1 0 0 0 1 0 1 0 1 l The Resulting value Indicates: § How much the pattern matches at each region

Max-Pooling Layer l The Pooling Layer summarizes the results of Convolution Layer § e.g.) 10x10 result is summarized into 1 cell l The Result of Pooling Layer is Trans lation-invariant

Remarks
• Higher layers catch more specific, abstract patterns
• Lower layers catch more general patterns

Parameter Learning of CNN
• A CNN is just another neural network with sparse connections
• Learning algorithm:
  - Backpropagation through the convolution layers and the fully connected layers

Applications (Image Classification) (1/2)
ImageNet competition ranking (1000 classes, 1 million images). Top rankers (error rate in parentheses where given):
1. Clarifai (0.117): Deep Convolutional Neural Networks (Zeiler)
2. NUS: Deep Convolutional Neural Networks
3. ZF: Deep Convolutional Neural Networks
4. Andrew Howard: Deep Convolutional Neural Networks
5. OverFeat: Deep Convolutional Neural Networks
6. UvA-Euvision: Deep Convolutional Neural Networks
7. Adobe: Deep Convolutional Neural Networks
8. VGG: Deep Convolutional Neural Networks
9. CognitiveVision: Deep Convolutional Neural Networks
10. decaf: Deep Convolutional Neural Networks
11. IBM Multimedia Team: Deep Convolutional Neural Networks
12. Deep Punx (0.209): Deep Convolutional Neural Networks
13. MIL (0.244): Local image descriptors + FV + linear classifier (Hidaka et al.)
14. Minerva-MSRA: Deep Convolutional Neural Networks
All entries except MIL are CNNs. (From Kyunghyun Cho's DNN tutorial)

Applications (Image Classification) (2/2)
• Krizhevsky et al.: winner of the ImageNet 2012 competition
  - 1000-class problem, top-5 test error rate of 15.3%
[Figure: network architecture ending in fully connected layers]

Application (Speech Recognition)
[Figure: input spectrogram of speech fed into a convolutional neural network]
• The CNN outperforms all previous methods that use GMMs on MFCC features

APPENDIX Slides from Wanli Ouyang wlouyang@ee.cuhk.edu.hk

Good learning resources
• Webpages:
  - Geoffrey E. Hinton's readings (with source code available for DBN): http://www.cs.toronto.edu/~hinton/csc2515/deeprefs.html
  - Notes on Deep Belief Networks: http://www.quantumg.net/dbns.php
  - MLSS Tutorial, October 2010, ANU Canberra, Marcus Frean: http://videolectures.net/mlss2010au_frean_deepbeliefnets/
  - Deep Learning Tutorials: http://deeplearning.net/tutorial/
  - Hinton's Tutorial: http://videolectures.net/mlss09uk_hinton_dbn/
  - Fergus's Tutorial: http://cs.nyu.edu/~fergus/presentations/nips2013_final.pdf
  - CUHK MMlab project: http://mmlab.ie.cuhk.edu.hk/project_deep_learning.html
• People:
  - Geoffrey E. Hinton: http://www.cs.toronto.edu/~hinton
  - Andrew Ng: http://www.cs.stanford.edu/people/ang/index.html
  - Ruslan Salakhutdinov: http://www.utstat.toronto.edu/~rsalakhu/
  - Yee-Whye Teh: http://www.gatsby.ucl.ac.uk/~ywteh/
  - Yoshua Bengio: www.iro.umontreal.ca/~bengioy
  - Yann LeCun: http://yann.lecun.com/
  - Marcus Frean: http://ecs.victoria.ac.nz/Main/MarcusFrean
  - Rob Fergus: http://cs.nyu.edu/~fergus/pmwiki/pmwiki.php
• Acknowledgement
  - Many materials in these slides are from these papers, tutorials, etc. (especially Hinton's and Frean's). Sorry for not listing them in full detail.
Dumitru Erhan, Aaron Courville, Yoshua Bengio. Understanding Representations Learned in Deep Architectures. Technical Report.

Graphical model for Statistics
• Conditional independence between random variables
• Given C, A and B are independent:
  - P(A, B | C) = P(A | C) P(B | C)
• P(A, B, C) = P(A, B | C) P(C)
  - = P(A | C) P(B | C) P(C)
• Any two nodes are conditionally independent given the values of their parents.
[Figure: network with parent C ("Smoker?") and children A and B ("Has lung cancer", "Has bronchitis")]
http://www.eecs.qmul.ac.uk/~norman/BBNs/Independence_and_conditional_independence.htm
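The factorization P(A, B, C) = P(A | C) P(B | C) P(C) can be checked numerically. The sketch below uses made-up probability tables for three binary variables (all numbers are hypothetical) and verifies that A and B are independent given C, although they are dependent once C is marginalized out.

```python
import numpy as np

# Hypothetical conditional probability tables for binary variables (rows indexed by C).
p_c = np.array([0.7, 0.3])                      # P(C)
p_a_given_c = np.array([[0.9, 0.1],             # P(A | C=0)
                        [0.6, 0.4]])            # P(A | C=1)
p_b_given_c = np.array([[0.95, 0.05],           # P(B | C=0)
                        [0.8, 0.2]])            # P(B | C=1)

# joint[a, b, c] = P(A=a | C=c) P(B=b | C=c) P(C=c)
joint = np.einsum("ca,cb,c->abc", p_a_given_c, p_b_given_c, p_c)

# P(A, B | C): divide the joint by P(C), obtained by summing out A and B.
p_ab_given_c = joint / joint.sum(axis=(0, 1), keepdims=True)
product_given_c = np.einsum("ca,cb->abc", p_a_given_c, p_b_given_c)

# Marginal P(A, B) and the product of marginals P(A) P(B).
p_ab = joint.sum(axis=2)
p_a, p_b = p_ab.sum(axis=1), p_ab.sum(axis=0)

conditionally_independent = np.allclose(p_ab_given_c, product_given_c)
marginally_independent = np.allclose(p_ab, np.outer(p_a, p_b))
```

With these tables the first flag is true and the second is false: knowing C removes the dependence between A and B.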

Directed and undirected graphical model
• Directed graphical model
  - P(A, B, C) = P(A | C) P(B | C) P(C)
  - Any two nodes are conditionally independent given the values of their parents.
• Undirected graphical model
  - P(A, B, C) = P(B, C) P(A, C)
  - Also called a Markov Random Field (MRF)
• Directed example with four nodes: P(A, B, C, D) = P(D | A, B) P(B | C) P(A | C) P(C)

Modeling undirected model
• Probability:
  $$P(x;\theta) = \frac{f(x;\theta)}{Z(\theta)}, \qquad Z(\theta) = \sum_x f(x;\theta) \ \text{(partition function)}, \qquad \sum_x P(x;\theta) = 1$$
• Example: P(A, B, C) = P(B, C) P(A, C)
  $$P(A,B,C;\theta) = \frac{\exp(w_1 BC + w_2 AC)}{\sum_{A,B,C} \exp(w_1 BC + w_2 AC)} = \frac{\exp(w_1 BC)\,\exp(w_2 AC)}{Z(w_1, w_2)}$$
[Figure: undirected graph in which C ("Is smoker?") is linked to A by weight w_2 and to B by weight w_1; the other two nodes are labeled "Is healthy" and "Has lung cancer"]
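For a model this small the partition function can be computed by brute-force enumeration. The sketch below assumes binary variables A, B, C in {0, 1}; the function names and the weight values are hypothetical and chosen only for illustration.

```python
import itertools
import math

def unnormalized(a, b, c, w1, w2):
    """f(x; theta) = exp(w1 * B * C + w2 * A * C) for one configuration."""
    return math.exp(w1 * b * c + w2 * a * c)

def partition(w1, w2):
    """Z(w1, w2): sum of f over all 2^3 configurations of (A, B, C)."""
    return sum(unnormalized(a, b, c, w1, w2)
               for a, b, c in itertools.product([0, 1], repeat=3))

def prob(a, b, c, w1, w2):
    """P(A, B, C; theta) = f / Z."""
    return unnormalized(a, b, c, w1, w2) / partition(w1, w2)

# Hypothetical weights; the probabilities over all configurations sum to 1.
w1, w2 = 1.0, 2.0
total = sum(prob(a, b, c, w1, w2) for a, b, c in itertools.product([0, 1], repeat=3))
```

The enumeration costs 2^n terms for n binary variables, which is why larger models need sampling-based approximations.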

More directed and undirected models
[Figure: a hidden Markov model with observations y1, y2, y3 and hidden states h1, h2, h3; an MRF in 2D on a 3x3 grid of nodes A to I]

More directed and undirected models
• Hidden Markov model: $P(y_1, y_2, y_3, h_1, h_2, h_3) = P(h_1)\,P(h_2|h_1)\,P(h_3|h_2)\,P(y_1|h_1)\,P(y_2|h_2)\,P(y_3|h_3)$
• Directed model over A, B, C, D: $P(A,B,C,D) = P(A)\,P(B)\,P(C|B)\,P(D|A,B,C)$

More directed and undirected models
[Figure: graphical-model diagrams of an HMM, an RBM (visible layer v, hidden layer h, weights W), and a DBN (input x, hidden layers h1, h2, h3, weights W0, W1, W2), labeled as panels (a) to (c)]

Extended reading on graphical model
• Zoubin Ghahramani's video lecture on graphical models: http://videolectures.net/mlss07_ghahramani_grafm/

Product of Experts
$$P(x;\theta) = \frac{\prod_m f_m(x;\theta_m)}{\sum_x \prod_m f_m(x;\theta_m)} = \frac{e^{-E(x;\theta)}}{\sum_x e^{-E(x;\theta)}}$$
The denominator $Z(\theta) = \sum_x \prod_m f_m(x;\theta_m)$ is the partition function, and the energy function is
$$E(x;\theta) = -\sum_m \log f_m(x;\theta_m)$$
Example (MRF in 2D on a 3x3 grid of nodes A to I): $E(x;w) = w_1 AB + w_2 BC + w_3 AD + w_4 BE + w_3 CF + \dots$

Product of Experts
$$\prod_{i=1}^{15} \left[ \lambda_i\, e^{-(x-u_i)^T \Sigma\, (x-u_i)} + (1-\lambda_i)\, c \right]$$

Products of experts versus Mixture model
• Products of experts:
  $$P(x;\theta) = \frac{\prod_m f_m(x;\theta_m)}{\sum_x \prod_m f_m(x;\theta_m)}$$
  - "and" operation
  - Sharper than a mixture
  - Each expert can constrain a different subset of dimensions.
• Mixture model, e.g. Gaussian mixture model
  - "or" operation
  - A weighted sum of many density functions
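The "sharper than a mixture" point can be illustrated with two one-dimensional Gaussian experts, where both combinations have closed forms. This is a sketch under that assumption; the function names and the chosen means and variances are hypothetical. A renormalized product of Gaussians is again Gaussian with added precisions, while a mixture has the variance given by the law of total variance.

```python
def product_of_gaussians(mu1, var1, mu2, var2):
    """Mean and variance of the renormalized product N(mu1, var1) * N(mu2, var2)."""
    precision = 1.0 / var1 + 1.0 / var2      # precisions add under a product
    var = 1.0 / precision
    mu = var * (mu1 / var1 + mu2 / var2)
    return mu, var

def mixture_of_gaussians(mu1, var1, mu2, var2, weight1=0.5):
    """Mean and variance of the mixture weight1 * N(mu1, var1) + (1 - weight1) * N(mu2, var2)."""
    w1, w2 = weight1, 1.0 - weight1
    mu = w1 * mu1 + w2 * mu2
    var = w1 * (var1 + mu1 ** 2) + w2 * (var2 + mu2 ** 2) - mu ** 2
    return mu, var

# Two unit-variance experts centered at -1 and +1.
poe_mu, poe_var = product_of_gaussians(-1.0, 1.0, 1.0, 1.0)
mix_mu, mix_var = mixture_of_gaussians(-1.0, 1.0, 1.0, 1.0)
```

In this example the product is narrower than either expert, whereas the mixture is broader than both.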

Outline
• Basic background on statistical learning and graphical models
• Contrastive divergence and Restricted Boltzmann Machine
  - Product of experts
  - Contrastive divergence
  - Restricted Boltzmann Machine
• Deep belief net

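Contrastive Divergence (CD)
• Probability: $P(x;\theta) = f(x;\theta)/Z(\theta)$, with $Z(\theta) = \sum_x f(x;\theta)$
• Maximum likelihood and gradient ascent:
  $$\max_\theta \prod_{k=1}^{K} P(x^{(k)};\theta) \iff \max_\theta L(X;\theta), \qquad L(X;\theta) = \log \prod_{k=1}^{K} P(x^{(k)};\theta)$$
  $$\theta_{t+1} = \theta_t + \lambda \frac{\partial L(X;\theta)}{\partial \theta} \quad \text{or} \quad \frac{\partial L(X;\theta)}{\partial \theta} = 0$$
• Gradient of the log-likelihood:
  $$\frac{1}{K}\frac{\partial L(X;\theta)}{\partial \theta} = \frac{\partial}{\partial \theta}\left\{ \frac{1}{K}\sum_{k=1}^{K} \log f(x^{(k)};\theta) - \log Z(\theta) \right\}$$
  $$= \frac{1}{K}\sum_{k=1}^{K} \frac{\partial \log f(x^{(k)};\theta)}{\partial \theta} - \int p(x;\theta)\, \frac{\partial \log f(x;\theta)}{\partial \theta}\, dx$$
  $$= \left\langle \frac{\partial \log f(x;\theta)}{\partial \theta} \right\rangle_{X} - \left\langle \frac{\partial \log f(x;\theta)}{\partial \theta} \right\rangle_{p(x;\theta)}$$
  The first term is an expectation over the data distribution X, the second an expectation over the model distribution.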
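The model-distribution expectation in the gradient comes from differentiating the log partition function. A short derivation, assuming only $Z(\theta) = \sum_x f(x;\theta)$ and $p(x;\theta) = f(x;\theta)/Z(\theta)$ as defined on this slide (for continuous x the sums become integrals):

$$\frac{\partial \log Z(\theta)}{\partial \theta} = \frac{1}{Z(\theta)} \sum_x \frac{\partial f(x;\theta)}{\partial \theta} = \sum_x \frac{f(x;\theta)}{Z(\theta)}\, \frac{\partial \log f(x;\theta)}{\partial \theta} = \sum_x p(x;\theta)\, \frac{\partial \log f(x;\theta)}{\partial \theta}$$

The middle step uses $\partial f / \partial \theta = f \cdot \partial \log f / \partial \theta$. The sum over all x is what makes this term expensive, and it is the term that contrastive divergence approximates by sampling.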

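Contrastive Divergence (CD)
• Gradient of the log-likelihood:
  $$\frac{1}{K}\frac{\partial L(X;\theta)}{\partial \theta} = \underbrace{\frac{1}{K}\sum_{k=1}^{K} \frac{\partial \log f(x^{(k)};\theta)}{\partial \theta}}_{\text{easy to compute}} - \underbrace{\int p(x;\theta)\, \frac{\partial \log f(x;\theta)}{\partial \theta}\, dx}_{\text{intractable}}$$
• The intractable term becomes tractable with Gibbs sampling: sample from $p(z_1, z_2, \dots, z_M)$
  - T → ∞ sampling steps: accurate but slow gradient
  - T = 1 (fast contrastive divergence): approximate but fast gradient
• Update: $\theta_{t+1} = \theta_t + \lambda\, \partial L(X;\theta) / \partial \theta$
[Figure: example graph with P(A, B, C) = P(A | C) P(B | C) P(C); plot with points labeled "CD" and "Minimum"]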

Gibbs Sampling for graphical model
[Figure: graphical model with hidden nodes h1 to h5 and visible nodes x1, x2, x3]
More information on Gibbs sampling: Pattern Recognition and Machine Learning (PRML)

Convergence of Contrastive Divergence (CD)
• The fixed points of ML are not fixed points of CD, and vice versa.
  - CD is a biased learning algorithm.
  - But the bias is typically very small.
  - CD can be used to get close to the ML solution, and then ML learning can be used for fine-tuning.
• It is not clear whether CD learning converges (to a stable fixed point). As of 2005, no proof was available.
• Further theoretical results? Please inform us.
M. A. Carreira-Perpignan and G. E. Hinton. On Contrastive Divergence Learning. Artificial Intelligence and Statistics, 2005.


Boltzmann Machine
• Undirected graphical model, with hidden nodes.
  $$P(x;\theta) = \frac{\prod_m f_m(x;\theta_m)}{\sum_x \prod_m f_m(x;\theta_m)} = \frac{e^{-E(x;\theta)}}{\sum_x e^{-E(x;\theta)}} = \frac{f(x;\theta)}{Z(\theta)}$$
  $$E(x;\theta) = -\sum_{i<j} w_{ij}\, x_i x_j - \sum_i \lambda_i x_i, \qquad \theta: \{w_{ij}, \lambda_i\}$$
• Boltzmann machine with visible units x and hidden units h: $E(x,h) = -b'x - c'h - h'Wx - x'Ux - h'Vh$

Restricted Boltzmann Machine (RBM)
• Undirected, loopy, layered
[Figure: hidden units h1 to h5 connected through W to visible units x1, x2, x3]
  $$P(x,h) = \frac{e^{-E(x,h)}}{\sum_{x,h} e^{-E(x,h)}}, \qquad P(x) = \frac{\sum_h e^{-E(x,h)}}{\sum_{x,h} e^{-E(x,h)}}$$
  The denominator is the partition function.
• Energy: the Boltzmann machine energy $E(x,h) = -b'x - c'h - h'Wx - x'Ux - h'Vh$ without the lateral terms U and V, i.e. $E(x,h) = -b'x - c'h - h'Wx$
• The conditionals factorize:
  $$P(h|x) = \prod_i P(h_i|x), \qquad P(x|h) = \prod_j P(x_j|h)$$
  $$P(x_j = 1|h) = \sigma(b_j + W'_{\bullet j} \cdot h), \qquad P(h_i = 1|x) = \sigma(c_i + W_{i \bullet} \cdot x)$$
Read the manuscript for details.
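Because the conditionals factorize, one step of block Gibbs sampling in an RBM is two matrix products followed by coin flips. The sketch below assumes binary units, a weight matrix W of shape (hidden, visible), and the energy convention E(x, h) = -b'x - c'h - h'Wx under which the sigmoid conditionals hold; the function names, sizes, and initial values are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_h_given_x(x, W, c, rng):
    """P(h_i = 1 | x) = sigmoid(c_i + W[i, :] . x); returns (probabilities, binary sample)."""
    p = sigmoid(c + W @ x)
    return p, (rng.random(p.shape) < p).astype(float)

def sample_x_given_h(h, W, b, rng):
    """P(x_j = 1 | h) = sigmoid(b_j + W[:, j] . h); returns (probabilities, binary sample)."""
    p = sigmoid(b + W.T @ h)
    return p, (rng.random(p.shape) < p).astype(float)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(5, 3))   # 5 hidden and 3 visible units, as in the slide's figure
b, c = np.zeros(3), np.zeros(5)

x0 = np.array([1.0, 0.0, 1.0])
p_h, h0 = sample_h_given_x(x0, W, c, rng)   # all hidden units sampled in parallel
p_x, x1 = sample_x_given_h(h0, W, b, rng)   # all visible units sampled in parallel
```

All units in one layer can be sampled at once because, given the other layer, they are independent of each other.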

Restricted Boltzmann Machine (RBM)
  $$P(x;\theta) = \frac{f(x;\theta)}{Z(\theta)} = \frac{\sum_h e^{-E(x,h)}}{\sum_{x,h} e^{-E(x,h)}}$$
• $E(x,h) = -b'x - c'h - h'Wx$
• $x = [x_1\ x_2\ \dots]^T$, $h = [h_1\ h_2\ \dots]^T$
• Parameter learning
  - Maximum log-likelihood
  $$\max_\theta \prod_{k=1}^{K} P(x^{(k)};\theta) \iff \max_\theta L(X;\theta), \qquad L(X;\theta) = \log \prod_{k=1}^{K} P(x^{(k)};\theta)$$
Geoffrey E. Hinton, "Training Products of Experts by Minimizing Contrastive Divergence." Neural Computation 14, 1771–1800 (2002)

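CD for RBM
• CD for RBM, very fast!
  $$\theta_{t+1} = \theta_t + \lambda \frac{\partial L(X;\theta)}{\partial \theta}, \qquad P(x;\theta) = \frac{f(x;\theta)}{Z(\theta)} = \frac{\sum_h e^{b'x + c'h + h'Wx}}{\sum_{x,h} e^{b'x + c'h + h'Wx}}$$
  $$\frac{1}{K}\frac{\partial L(X;\theta)}{\partial w_{ij}} = \frac{1}{K}\sum_{k=1}^{K} \frac{\partial \log f(x^{(k)};\theta)}{\partial w_{ij}} - \int p(x;\theta)\, \frac{\partial \log f(x;\theta)}{\partial w_{ij}}\, dx = \langle h_i x_j \rangle_{X} - \langle h_i x_j \rangle_{p(x;\theta)} = \langle h_i x_j \rangle_{0} - \langle h_i x_j \rangle_{\infty}$$
  $$\text{CD:} \quad \frac{1}{K}\frac{\partial L(X;\theta)}{\partial w_{ij}} \approx \langle h_i x_j \rangle_{0} - \langle h_i x_j \rangle_{1}$$
  Here L is the log-likelihood; the subscript 0 denotes statistics on the data, ∞ statistics of the model at equilibrium, and 1 statistics after one Gibbs step.
  $$P(x_j = 1|h) = \sigma(b_j + W'_{\bullet j} \cdot h), \qquad P(h_i = 1|x) = \sigma(c_i + W_{i \bullet} \cdot x)$$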

CD for RBM
  $$\frac{1}{K}\frac{\partial L(X;\theta)}{\partial w_{ij}} \approx \langle h_i x_j \rangle_{0} - \langle h_i x_j \rangle_{1}$$
[Figure: Gibbs chain x1 → h1 → x2 → h2, alternating $P(h_i = 1|x) = \sigma(c_i + W_{i \bullet} \cdot x)$ and $P(x_j = 1|h) = \sigma(b_j + W'_{\bullet j} \cdot h)$]
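A minimal sketch of one CD-1 parameter update on a batch of binary data. It assumes the standard convention E(x, h) = -b'x - c'h - h'Wx, under which the log-likelihood gradient is the data statistics minus the statistics after one Gibbs step. The function name `cd1_update`, the learning rate, and the toy data are hypothetical, and using hidden probabilities instead of samples in the statistics is a common variance-reduction choice rather than something stated on the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(X, W, b, c, lr, rng):
    """One CD-1 step on a batch X of shape (K, n_visible); returns updated (W, b, c)."""
    # Positive phase: hidden probabilities driven by the data (statistics <.>_0).
    ph0 = sigmoid(c + X @ W.T)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # One Gibbs step: reconstruct the visibles, then recompute the hiddens (statistics <.>_1).
    px1 = sigmoid(b + h0 @ W)
    x1 = (rng.random(px1.shape) < px1).astype(float)
    ph1 = sigmoid(c + x1 @ W.T)
    K = X.shape[0]
    # Gradient ascent along <h_i x_j>_0 - <h_i x_j>_1, with the analogous bias terms.
    W = W + lr * (ph0.T @ X - ph1.T @ x1) / K
    b = b + lr * (X - x1).mean(axis=0)
    c = c + lr * (ph0 - ph1).mean(axis=0)
    return W, b, c

rng = np.random.default_rng(0)
X = np.tile(np.array([1.0, 1.0, 0.0, 0.0]), (10, 1))   # toy batch: one repeated pattern
W = rng.normal(scale=0.01, size=(3, 4))                 # 3 hidden, 4 visible units
b, c = np.zeros(4), np.zeros(3)
for _ in range(20):
    W, b, c = cd1_update(X, W, b, c, lr=0.1, rng=rng)
```

On this toy batch the visible biases of the units that are always on can only grow and those of the units that are always off can only shrink, which pulls reconstructions toward the data.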

RBM for classification
• y: classification label
Hugo Larochelle and Yoshua Bengio, Classification using Discriminative Restricted Boltzmann Machines, ICML 2008.

RBM itself has many applications
• Multiclass classification
• Collaborative filtering
• Motion capture modeling
• Information retrieval
• Modeling natural images
• Segmentation
Y. Li, D. Tarlow, R. Zemel, Exploring compositional high order pattern potentials for structured output learning, CVPR 2013.
V. Mnih, H. Larochelle, G. E. Hinton, Conditional Restricted Boltzmann Machines for Structured Output Prediction, Uncertainty in Artificial Intelligence, 2011.
Larochelle, H., & Bengio, Y. (2008). Classification using discriminative restricted Boltzmann machines. ICML 2008.
Salakhutdinov, R., Mnih, A., & Hinton, G. E. (2007). Restricted Boltzmann machines for collaborative filtering. ICML 2007.
Salakhutdinov, R., & Hinton, G. E. (2009). Replicated softmax: an undirected topic model. NIPS 2009.
Osindero, S., & Hinton, G. E. (2008). Modeling image patches with a directed hierarchy of Markov random fields. NIPS 2008.

Outline
• Basic background on statistical learning and graphical models
• Contrastive divergence and Restricted Boltzmann Machine
• Deep belief net (DBN)
  - Why deep learning?
  - Learning and inference
  - Applications

Belief Nets
• A belief net is a directed acyclic graph composed of random variables.
[Figure: random hidden causes pointing to visible effects]

Deep Belief Net
• Belief net that is deep
• A generative model
  - $P(x, h_1, \dots, h_l) = p(x|h_1)\, p(h_1|h_2) \cdots p(h_{l-2}|h_{l-1})\, p(h_{l-1}, h_l)$
• Used for unsupervised training of multi-layer deep models.
[Figure: layers x, h1, h2, h3; pixels => edges => local shapes => object parts]
Example with three hidden layers: $P(x, h_1, h_2, h_3) = p(x|h_1)\, p(h_1|h_2)\, p(h_2, h_3)$

Why Deep learning?
[Figure: pixels => edges => local shapes => object parts]
• The mammal brain is organized in a deep architecture, with a given input percept represented at multiple levels of abstraction, each level corresponding to a different area of cortex.
• An architecture with insufficient depth can require many more computational elements, potentially exponentially more (with respect to input size), than architectures whose depth is matched to the task.
• Since the number of computational elements one can afford depends on the number of training examples available to tune or select them, the consequences are not just computational but also statistical: poor generalization may be expected when using an insufficiently deep architecture for representing some functions.
T. Serre et al., "A quantitative theory of immediate visual recognition," Progress in Brain Research, Computational Neuroscience: Theoretical Insights into Brain Function, vol. 165, pp. 33–56, 2007.
Yoshua Bengio, "Learning Deep Architectures for AI," Foundations and Trends in Machine Learning, 2009.

Why Deep learning?
• Linear regression, logistic regression: depth 1
• Kernel SVM: depth 2
• Decision tree: depth 2
• Boosting: depth 2
• The basic conclusion that these results suggest is that when a function can be compactly represented by a deep architecture, it might need a very large architecture to be represented by an insufficiently deep one. (Examples: logic gates, multi-layer NN with linear threshold units and positive weights.)
Yoshua Bengio, "Learning Deep Architectures for AI," Foundations and Trends in Machine Learning, 2009.

Example: sum product network (SPN)
[Figure: a shallow network computes a sum over 2^(N-1) products and needs N·2^(N-1) parameters; a deep network over the inputs X1 to X5 and their negations needs O(N) parameters]

Depth of existing approaches
• Boosting (2 layers)
  - L1: base learners
  - L2: vote or linear combination of layer 1
• Decision tree, LLE, KNN, kernel SVM (2 layers)
  - L1: matching degree to a set of local templates
  - L2: combine these degrees, e.g. $b + \sum_i \alpha_i K(x, x_i)$
• Brain: 5-10 layers

Why does a decision tree have depth 2?
• It is a local estimator: it relies on a partition of the input space and uses separate parameters for each region. Each region is associated with a leaf.
• It needs as many training samples as there are variations of interest in the target function, so it is not good for highly varying functions.
• The number of training samples needed to achieve a fixed error rate is exponential in the number of dimensions.

Deep Belief Net
• Inference problem: infer the states of the unobserved variables.
• Learning problem: adjust the interactions between variables to make the network more likely to generate the observed data.
[Figure: layers x, h1, h2, h3] $P(x, h_1, h_2, h_3) = p(x|h_1)\, p(h_1|h_2)\, p(h_2, h_3)$

Deep Belief Net
• Inference problem (the problem of explaining away):
  - For a parent C with children A and B: P(A, B | C) = P(A | C) P(B | C)
  - But for hidden units h11 and h12 with a shared child x1: P(h11, h12 | x1) ≠ P(h11 | x1) P(h12 | x1)
  - An example from the manuscript
• Solution: complementary prior

Deep Belief Net
• Inference problem (the problem of explaining away)
  - Solution: complementary prior
[Figure: network with layers x, h1 (2000 units), h2 (1000), h3 (500), h4 (30)]

Deep Belief Net
• Explaining-away problem of inference (see the manuscript)
  - Solution: complementary prior (see the manuscript); inference then uses $P(h_i = 1|x) = \sigma(c_i + W_{i \bullet} \cdot x)$
• Learning problem
  - Greedy layer-by-layer RBM training (optimizes a lower bound), followed by fine-tuning
  - Contrastive divergence for RBM training
[Figure: RBMs (x, h1), (h1, h2), (h2, h3) trained one at a time and stacked into a DBN]
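The greedy layer-by-layer procedure can be sketched as a loop that trains one RBM, then feeds its hidden activations to the next RBM as data. This is a simplified illustration, not the authors' implementation: the function names, layer sizes, learning rate, and random toy data are hypothetical; each RBM is trained with full-batch CD-1 using mean-field reconstructions; and the supervised fine-tuning stage is omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_rbm(X, n_hidden, epochs, lr, rng):
    """Train one RBM with full-batch CD-1; returns (W, b, c)."""
    n_visible = X.shape[1]
    W = rng.normal(scale=0.01, size=(n_hidden, n_visible))
    b, c = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(epochs):
        ph0 = sigmoid(c + X @ W.T)                       # data-driven hidden probabilities
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        x1 = sigmoid(b + h0 @ W)                         # mean-field reconstruction
        ph1 = sigmoid(c + x1 @ W.T)
        W += lr * (ph0.T @ X - ph1.T @ x1) / len(X)
        b += lr * (X - x1).mean(axis=0)
        c += lr * (ph0 - ph1).mean(axis=0)
    return W, b, c

def pretrain_dbn(X, layer_sizes, epochs=10, lr=0.1, seed=0):
    """Greedy layer-wise pre-training: one RBM per entry of `layer_sizes`."""
    rng = np.random.default_rng(seed)
    layers, data = [], X
    for n_hidden in layer_sizes:
        W, b, c = train_rbm(data, n_hidden, epochs, lr, rng)
        layers.append((W, b, c))
        # The hidden activations of this RBM become the training data of the next one.
        data = sigmoid(c + data @ W.T)
    return layers

X = (np.random.default_rng(1).random((20, 8)) > 0.5).astype(float)  # toy binary data
layers = pretrain_dbn(X, layer_sizes=[6, 4, 2])
```

After pre-training, the weights in `layers` would typically initialize a feed-forward network that is then fine-tuned with labels.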

Deep Belief Net
• Why does greedy layer-wise learning work?
• It optimizes a lower bound:
  $$\log P(x) = \log \sum_{h_1} P(x, h_1) \ge \sum_{h_1} \Big\{ Q(h_1|x)\big[\log P(h_1) + \log P(x|h_1)\big] - Q(h_1|x) \log Q(h_1|x) \Big\} \quad (1)$$
• When we fix the parameters for layer 1 and optimize the parameters for layer 2, we are optimizing the $P(h_1)$ in (1).
[Figure: layers x, h1, h2, h3, with the RBM (h1, h2) trained on top of the fixed first layer]

Deep Belief Net and RBM
• An RBM can be considered as a DBN that has infinitely many layers
[Figure: an RBM (x0, h0) with weights W, unrolled into a directed network x0, h0, x1, h1, x2, ... whose layers alternate between W and W^T]

Pretrain, fine-tune and inference (autoencoder)
[Figure: pretraining followed by fine-tuning with backpropagation (BP)]

Pretrain, fine-tune and inference - 2
• y: identity or rotation degree
[Figure: pretraining and fine-tuning stages]

How many layers should we use?
• There might be no universally right depth
  - Bengio suggests that several layers are better than one
  - Results are robust against changes in the size of a layer, but the top layer should be big
  - Depth is a parameter that depends on your task
  - With enough narrow layers, we can model any distribution over binary vectors [1]
[1] Sutskever, I. and Hinton, G. E., Deep Narrow Sigmoid Belief Networks are Universal Approximators. Neural Computation, 2007.
Copied from http://videolectures.net/mlss09uk_hinton_dbn/

Effect of Unsupervised Pre-training (Erhan et al., AISTATS 2009)

Effect of Depth
[Figure: panels "without pre-training" and "with pre-training"]

Why unsupervised pre-training makes sense
• [Figure: "stuff" generates the image through a high-bandwidth pathway and the label through a low-bandwidth pathway] If image-label pairs are generated this way, it makes sense to first learn to recover the stuff that caused the image by inverting the high-bandwidth pathway.
• [Figure: the image generates the label directly] If image-label pairs were generated this way, it would make sense to try to go straight from images to labels. For example, do the pixels have even parity?

Beyond layer-wise pretraining
• Layer-wise pretraining is efficient but not optimal.
• It is possible to train the parameters of all layers using a wake-sleep algorithm.
  - Bottom-up in a layer-wise manner
  - Top-down, refitting the earlier models

Fine-tuning with a contrastive version of the "wake-sleep" algorithm
After learning many layers of features, we can fine-tune the features to improve generation.
1. Do a stochastic bottom-up pass
   - Adjust the top-down weights to be good at reconstructing the feature activities in the layer below.
2. Do a few iterations of sampling in the top-level RBM
   - Adjust the weights in the top-level RBM.
3. Do a stochastic top-down pass
   - Adjust the bottom-up weights to be good at reconstructing the feature activities in the layer above.

Include lateral connections
• An RBM has no connections within a layer
• This can be generalized.
• Lateral connections for the first layer [1].
  - Sampling from P(h | x) is still easy, but sampling from p(x | h) is more difficult.
• Lateral connections at multiple layers [2].
  - Generate more realistic images.
  - CD is still applicable, with a small modification.
[1] B. A. Olshausen and D. J. Field, "Sparse coding with an overcomplete basis set: a strategy employed by V1?," Vision Research, vol. 37, pp. 3311–3325, December 1997.
[2] S. Osindero and G. E. Hinton, "Modeling image patches with a directed hierarchy of Markov random fields," in NIPS, 2007.

Without lateral connection

With lateral connection

My data is real valued …
• Scale it linearly to [0, 1]: x' = ax + b
• Use another distribution
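One way to pick a and b in the linear map is per-feature min-max scaling. The slide does not say how a and b are chosen, so this choice, along with the function name and the toy matrix, is an assumption made for illustration.

```python
import numpy as np

def scale_to_unit_interval(X):
    """Map each feature (column) linearly to [0, 1] via x' = a * x + b; returns (X', a, b)."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # avoid division by zero for constant features
    a = 1.0 / span
    b = -lo * a
    return a * X + b, a, b

X = np.array([[2.0, 10.0],
              [4.0, 10.0],
              [6.0, 10.0]])
X_scaled, a, b = scale_to_unit_interval(X)
```

The scaled values can then be treated as probabilities of binary units; constant features map to 0 under this choice.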

My data has temporal dependency …
[Figure: a static model compared with a temporal model]

Consider DBN as…
• A statistical model that is used for unsupervised training of a fully connected deep model
• A directed graphical model that is approximated by fast learning and inference algorithms
• A directed graphical model that is fine-tuned using a mature neural network learning approach: BP.

Outline
• Basic background on statistical learning and graphical models
• Contrastive divergence and Restricted Boltzmann Machine
• Deep belief net (DBN)
  - Why DBN?
  - Learning and inference
  - Applications

Applications of deep learning
• Handwritten digit recognition
• Dimensionality reduction
• Information retrieval
• Segmentation
• Denoising
• Phone recognition
• Object recognition
• Object detection
• …
Hinton, G. E., Osindero, S., and Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation.
Hinton, G. E. and Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science, 2006.
Welling, M. et al., Exponential Family Harmoniums with an Application to Information Retrieval. NIPS 2004.
A. R. Mohamed et al., Deep Belief Networks for phone recognition. NIPS 09 workshop on deep learning for speech recognition.
Nair, V. and Hinton, G. E. 3-D Object recognition with deep belief nets. NIPS 09.

Object recognition
• NORB
  - Error rates: logistic regression 19.6%, kNN (k=1) 18.4%, Gaussian kernel SVM 11.6%, convolutional neural net 6.0%, convolutional net + SVM hybrid 5.9%, DBN 6.5%.
  - With the extra unlabeled data (and the same amount of labeled data as before), DBN achieves 5.2%.

Learning to extract the orientation of a face patch (Salakhutdinov & Hinton, NIPS 2007)
