  1. Introduction to Machine Learning Undirected Graphical Models Barnabás Póczos

  2. Credits Many of these slides are taken from Ruslan Salakhutdinov, Hugo Larochelle, & Eric Xing • http://www.dmi.usherb.ca/~larocheh/neural_networks • http://www.cs.cmu.edu/~rsalakhu/10707/ • http://www.cs.cmu.edu/~epxing/Class/10708/ Reading material: • http://www.cs.cmu.edu/~rsalakhu/papers/Russ_thesis.pdf • Section 30.1 of Information Theory, Inference, and Learning Algorithms by David MacKay • http://www.stat.cmu.edu/~larry/=sml/GraphicalModels.pdf 2

  3. Undirected Graphical Models = Markov Random Fields Probabilistic graphical models: a powerful framework for representing dependency structure between random variables. A Markov network (or undirected graphical model) is a set of random variables whose dependency structure is described by an undirected graph. Example application shown on the slide: semantic labeling. 3

  4. Cliques Clique: a subset of nodes such that there exists a link between every pair of nodes in the subset. Maximal clique: a clique such that it is not possible to include any other node without it ceasing to be a clique. The example graph on the slide has two maximal cliques, along with several other (non-maximal) cliques. 4

  5. Undirected Graphical Models = Markov Random Fields Directed graphs are useful for expressing causal relationships between random variables, whereas undirected graphs are useful for expressing dependencies between random variables. The joint distribution defined by the graph is given by the product of non-negative potential functions over the maximal cliques (fully connected subsets of nodes). In the example on the slide, the joint distribution factorizes as a product of potential functions over its maximal cliques (general form below). 5
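The general form of this factorization (the slide's example-specific equation is not reproduced in this transcript) is the standard MRF product over maximal cliques:

\[
p(\mathbf{x}) \;=\; \frac{1}{Z}\prod_{C}\psi_C(\mathbf{x}_C),
\qquad
Z \;=\; \sum_{\mathbf{x}}\prod_{C}\psi_C(\mathbf{x}_C),
\]

where C ranges over the maximal cliques of the graph, the potential functions \psi_C are non-negative, and Z is the partition function that normalizes the product into a probability distribution.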

  6. Markov Random Fields (MRFs)  Each potential function is a mapping from the joint configurations of the random variables in a maximal clique to non-negative real numbers.  The choice of potential functions is not restricted to having a specific probabilistic interpretation.  It is convenient to write each potential in exponential form (see below), where E(x) is called an energy function. 6
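The exponential form referred to here is presumably the standard Boltzmann representation: writing each potential as an exponentiated negative energy turns the product over cliques into a Gibbs/Boltzmann distribution,

\[
\psi_C(\mathbf{x}_C) = \exp\!\bigl(-E_C(\mathbf{x}_C)\bigr)
\quad\Longrightarrow\quad
p(\mathbf{x}) = \frac{1}{Z}\exp\!\Bigl(-\sum_C E_C(\mathbf{x}_C)\Bigr) = \frac{1}{Z}\,e^{-E(\mathbf{x})}.
\]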

  7. Conditional Independence It follows that the undirected graphical structure represents conditional independence. Theorem: 7
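The theorem on this slide is presumably the global Markov property of undirected models: for disjoint node sets A, B, and S, if every path from A to B passes through S, then

\[
\mathbf{x}_A \;\perp\!\!\!\perp\; \mathbf{x}_B \;\mid\; \mathbf{x}_S,
\]

i.e. the variables in A and B are conditionally independent given the separating set S.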

  8. MRFs with Hidden Variables For many interesting problems, we need to introduce hidden or latent variables.  Our random variables will contain both visible and hidden variables, x = (v, h).  Computing the partition function Z is intractable.  Computing the summation over the hidden variables is intractable.  Parameter learning is very challenging. 8

  9. Boltzmann Machines Definition: [Boltzmann machines] MRFs with maximal clique size two [pairwise (edge) potentials] on binary-valued nodes are called Boltzmann machines. The joint probabilities are given by a pairwise exponential-family model (written out below). The parameter θ_ij measures the dependence of x_i on x_j, conditioned on the other nodes. 9
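In the standard parameterization (the bias terms b_i are an assumption here, since the slide's exact equation is not reproduced), the Boltzmann machine joint distribution is

\[
p(\mathbf{x}) = \frac{1}{Z}\exp\!\Bigl(\sum_{i<j}\theta_{ij}x_i x_j + \sum_i b_i x_i\Bigr),
\qquad x_i \in \{0,1\}.
\]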

  10. Boltzmann Machines Theorem: One can prove that the conditional distribution of one node conditioned on the others is given by the logistic function in Boltzmann Machines: Proof: 10
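Under the parameterization above, the logistic conditional stated by the theorem is

\[
p(x_i = 1 \mid \mathbf{x}_{-i})
= \sigma\!\Bigl(\sum_{j \ne i}\theta_{ij}x_j + b_i\Bigr),
\qquad
\sigma(z) = \frac{1}{1 + e^{-z}},
\]

which follows by normalizing over the two possible values of x_i: every factor of the joint that does not involve x_i cancels.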

  11. Boltzmann Machines Proof [Continued]: Q.E.D. 11

  12. Example: Image Denoising Let us look at the example of noise removal from a binary image. The image is an array of {-1, +1} pixel values.  We take the original noise-free image (x) and randomly flip the sign of pixels with a small probability. This process creates the noisy image (y).  Our goal is to estimate the original image x from the noisy observations y.  We model the joint distribution of x and y with a pairwise MRF over the pixel grid (a standard choice of energy is given below). 12
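A common choice for this model, and presumably the one intended on the slide (it follows the classical treatment, e.g. Bishop, Pattern Recognition and Machine Learning, Sec. 8.3.3), is the Ising-style energy

\[
E(\mathbf{x},\mathbf{y}) = h\sum_i x_i \;-\; \beta\sum_{\{i,j\}} x_i x_j \;-\; \eta\sum_i x_i y_i,
\qquad
p(\mathbf{x},\mathbf{y}) = \frac{1}{Z}\,e^{-E(\mathbf{x},\mathbf{y})},
\]

where the middle sum runs over neighboring pixels, β > 0 rewards smoothness of the reconstruction, and η > 0 rewards agreement between the reconstructed pixel x_i and its noisy observation y_i.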

  13. Inference: Iterated Conditional Modes (ICM) Goal: find the most probable image given the noisy observations, i.e. maximize p(x | y) over x. Solution: coordinate-wise descent on the energy: repeatedly visit each pixel and set it to the value that minimizes E(x, y) with all other pixels held fixed (a sketch is given below). 13
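A minimal Python sketch of ICM for the binary denoising model above, assuming the Ising-style energy with hypothetical coefficients h, beta, eta (an illustration, not the deck's code):

```python
import numpy as np

def icm_denoise(y, h=0.0, beta=1.0, eta=2.0, n_sweeps=10):
    """Iterated Conditional Modes for binary image denoising.

    y : 2D array with entries in {-1, +1} (the noisy image).
    Energy: E(x, y) = h*sum(x) - beta*sum_neighbors(x_i x_j) - eta*sum(x_i y_i).
    Returns a denoised image x with entries in {-1, +1}.
    """
    x = y.copy()                      # initialize with the noisy image
    rows, cols = x.shape
    for _ in range(n_sweeps):
        for i in range(rows):
            for j in range(cols):
                # Sum of the 4-neighborhood of pixel (i, j)
                nb = 0.0
                if i > 0:        nb += x[i - 1, j]
                if i < rows - 1: nb += x[i + 1, j]
                if j > 0:        nb += x[i, j - 1]
                if j < cols - 1: nb += x[i, j + 1]
                # Half the energy difference E(x_ij=-1) - E(x_ij=+1);
                # choose the value with the lower energy.
                delta = 2.0 * (beta * nb + eta * y[i, j] - h)
                x[i, j] = 1 if delta > 0 else -1
    return x
```

Each update is greedy, so ICM converges quickly to a local minimum of the energy; the result depends on the initialization (here, the noisy image itself).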

  14. Gaussian MRFs • The information matrix is sparse, but the covariance matrix is not. 14
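In information (canonical) form, a Gaussian MRF is usually written as

\[
p(\mathbf{x}) \;\propto\; \exp\!\Bigl(-\tfrac{1}{2}\,\mathbf{x}^{\top} J\,\mathbf{x} + \mathbf{h}^{\top}\mathbf{x}\Bigr),
\]

where the information (precision) matrix J satisfies J_{ij} = 0 whenever nodes i and j are not connected by an edge, so J is sparse for a sparse graph, while the covariance Σ = J^{-1} is generally dense.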

  15. Restricted Boltzmann Machines 15

  16. Restricted Boltzmann Machines Restricted = no connections within the hidden layer and no connections within the visible layer x (only visible-hidden connections). The partition function is intractable. (The energy and joint distribution are written out below.) 16
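With binary visible units x and binary hidden units h, the standard RBM energy and joint distribution (the notation W, b, c is assumed here; the slide's own symbols may differ) are

\[
E(\mathbf{x},\mathbf{h}) = -\mathbf{b}^{\top}\mathbf{x} - \mathbf{c}^{\top}\mathbf{h} - \mathbf{x}^{\top} W \mathbf{h},
\qquad
p(\mathbf{x},\mathbf{h}) = \frac{1}{Z}\,e^{-E(\mathbf{x},\mathbf{h})},
\qquad
Z = \sum_{\mathbf{x},\mathbf{h}} e^{-E(\mathbf{x},\mathbf{h})}.
\]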

  17. Gaussian-Bernoulli RBM [The energy is quadratic in the real-valued visible units and linear in the binary hidden units.] 17
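A common form of the Gaussian-Bernoulli energy (a hedged reconstruction; the per-unit variances σ_i are often fixed to 1 after standardizing the data) is

\[
E(\mathbf{x},\mathbf{h}) = \sum_i \frac{(x_i - b_i)^2}{2\sigma_i^2}
\;-\; \sum_{i,j} \frac{x_i}{\sigma_i} W_{ij} h_j
\;-\; \sum_j c_j h_j,
\]

which is indeed quadratic in the real-valued visible units and linear in the binary hidden units.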

  18. Possible Tasks with RBM Tasks:  Inference:  Evaluate the likelihood function:  Sampling from RBM:  Training RBM: 18

  19. Inference 19

  20. Inference Theorem: Inference in an RBM is simple: the conditional distributions p(h | x) and p(x | h) factorize, and each factor is a logistic function (written out below). 20
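With the Bernoulli-Bernoulli energy given earlier, the conditionals referred to by the theorem are

\[
p(h_j = 1 \mid \mathbf{x}) = \sigma\!\Bigl(c_j + \sum_i W_{ij} x_i\Bigr),
\qquad
p(x_i = 1 \mid \mathbf{h}) = \sigma\!\Bigl(b_i + \sum_j W_{ij} h_j\Bigr),
\]

and, because of the bipartite structure, the hidden units are conditionally independent given the visible units (and vice versa), so both conditionals are available in closed form.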

  21. Proof: 21

  22. Proof [Continued]: Q.E.D. 22

  23. Evaluating the Likelihood 23

  24. Calculating the Likelihood of an RBM Theorem: Calculating the likelihood of an RBM is simple (apart from the partition function): summing out the hidden units yields a closed-form expression in terms of the free energy (below). 24
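Summing the hidden units out analytically gives the marginal likelihood in terms of the free energy F(x) (standard form for the Bernoulli-Bernoulli energy above):

\[
p(\mathbf{x}) = \frac{e^{-F(\mathbf{x})}}{Z},
\qquad
F(\mathbf{x}) = -\mathbf{b}^{\top}\mathbf{x}
- \sum_j \log\!\Bigl(1 + \exp\!\bigl(c_j + \sum_i W_{ij} x_i\bigr)\Bigr).
\]

Everything here can be evaluated exactly except the partition function Z.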

  25. Proof : Q.E.D. 25

  26. Sampling 26

  27. Sampling from p(x,h) in RBM Goal: Generate samples from the joint distribution p(x, h). Sampling is tricky in undirected models; it is much easier in directed graphical models. Here we will use Gibbs sampling, alternating between the two conditionals: sample h given x, then sample x given h. 27

  28. Gibbs Sampling: The Problem Suppose that we can generate samples from each full conditional distribution p(x_i | x_{-i}). Our goal is to generate samples from the joint distribution p(x_1, ..., x_d). 28

  29. Gibbs Sampling: Pseudo Code 29
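The pseudocode itself is not reproduced in this transcript; below is a minimal Python sketch of a systematic-sweep Gibbs sampler, assuming we are handed one sampler per full conditional (a hypothetical interface, not the deck's notation):

```python
def gibbs_sample(conditional_samplers, x0, n_sweeps=1000):
    """Generic Gibbs sampler.

    conditional_samplers : list of functions; the i-th function takes the
        current state x (a list) and returns a fresh sample of x[i] drawn
        from the full conditional p(x_i | x_{-i}).
    x0 : initial state (list of the same length).
    Returns one recorded state per full sweep over the coordinates.
    """
    x = list(x0)
    samples = []
    for _ in range(n_sweeps):
        for i, sample_xi in enumerate(conditional_samplers):
            x[i] = sample_xi(x)      # resample coordinate i given the rest
        samples.append(list(x))      # record the state after a full sweep
    return samples
```

In practice an initial burn-in portion of the chain is discarded, and the remaining states are (approximately) samples from the joint distribution. In an RBM the same idea is applied blockwise: alternate between sampling all of h given x and all of x given h.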

  30. Gibbs Sampling 30

  31. Training 31

  32. RBM Training Training is complicated… To train an RBM, we would like to maximize the log-likelihood (equivalently, minimize the negative log-likelihood). To solve this, we use stochastic gradient ascent. Theorem: the gradient decomposes into a negative phase, which is hard to compute, and a positive phase. 32
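Written out (this is the standard result; the deck's exact notation is not reproduced here), the log-likelihood gradient for a training example x is

\[
\frac{\partial \log p(\mathbf{x})}{\partial \theta}
= \underbrace{\mathbb{E}_{(\tilde{\mathbf{x}},\mathbf{h}) \sim p(\tilde{\mathbf{x}},\mathbf{h})}\!\left[\frac{\partial E(\tilde{\mathbf{x}},\mathbf{h})}{\partial \theta}\right]}_{\text{negative phase (intractable)}}
\;-\;
\underbrace{\mathbb{E}_{\mathbf{h} \sim p(\mathbf{h} \mid \mathbf{x})}\!\left[\frac{\partial E(\mathbf{x},\mathbf{h})}{\partial \theta}\right]}_{\text{positive phase (tractable)}} .
\]

The negative phase requires an expectation under the model distribution, which involves the partition function; the positive phase only needs the closed-form conditional p(h | x).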

  33. RBM Training Proof: 33

  34. RBM Training Proof [Continued]: The gradient splits into a first and a second term. First term: difficult to calculate, because it is an expectation under the model distribution (the negative phase). 34

  35. RBM Training Proof [Continued]: 35

  36. RBM Training Proof [Continued]: Second term: easy to calculate, because the conditionals are independent logistic distributions (the positive phase). 36 Q.E.D.

  37. RBM Training We need to calculate the derivative of the energy with respect to each parameter and the corresponding expectations. 37

  38. RBM Training The expectation under the model distribution is the tricky term: we approximate the expectations with a single sample obtained by Gibbs sampling. 38

  39. RBM Training The required conditional expectations are logistic functions of their inputs. 39

  40. CD-k (Contrastive Divergence) Pseudocode 40
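The CD-k pseudocode is not reproduced in this transcript; below is a minimal NumPy sketch for a Bernoulli-Bernoulli RBM, assuming the W, b, c parameterization used above (the learning rate and interface are illustrative choices, not the deck's):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd_k_update(W, b, c, x_data, k=1, lr=0.01, rng=None):
    """One CD-k parameter update for a Bernoulli-Bernoulli RBM.

    W : (n_visible, n_hidden) weight matrix.
    b : (n_visible,) visible biases.   c : (n_hidden,) hidden biases.
    x_data : (n_visible,) training vector with entries in {0, 1}.
    Returns the updated (W, b, c).
    """
    if rng is None:
        rng = np.random.default_rng()

    # Positive phase: exact conditional p(h = 1 | x_data).
    ph_data = sigmoid(c + x_data @ W)

    # Negative phase: k steps of block Gibbs sampling, initialized at the data.
    x_model = x_data.copy()
    for _ in range(k):
        ph = sigmoid(c + x_model @ W)
        h_model = (rng.random(ph.shape) < ph).astype(float)    # sample h | x
        px = sigmoid(b + W @ h_model)
        x_model = (rng.random(px.shape) < px).astype(float)    # sample x | h
    ph_model = sigmoid(c + x_model @ W)

    # Stochastic gradient ascent step: positive statistics minus negative ones.
    W = W + lr * (np.outer(x_data, ph_data) - np.outer(x_model, ph_model))
    b = b + lr * (x_data - x_model)
    c = c + lr * (ph_data - ph_model)
    return W, b, c
```

In practice the update is averaged over a mini-batch and combined with momentum and weight decay; CD-1 (k = 1) is the most common variant.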

  41. Results 41

  42. RBM Training Results (figures from http://deeplearning.net/tutorial/rbm.html): learned filters, original images, and samples generated by the RBM after training. Each row of samples represents a mini-batch of negative particles (samples from independent Gibbs chains); 1000 steps of Gibbs sampling were taken between consecutive rows. 42

  43. Summary Tasks:  Inference:  Evaluate the likelihood function:  Sampling from RBM:  Training RBM: 43

  44. Thanks for your Attention!

  45. RBM Training Results 45

  46. Gaussian-Bernoulli RBM Training Results Each document (story) is represented with a bag of words coming from a multinomial distribution with parameters h (= topics). After training, we can generate words from these topics. 46
