1. CS7015 (Deep Learning) : Lecture 19: Using joint distributions for classification and sampling, Latent Variables, Restricted Boltzmann Machines, Unsupervised Learning, Motivation for Sampling. Mitesh M. Khapra, Department of Computer Science and Engineering, Indian Institute of Technology Madras.

2. Acknowledgments: Probabilistic Graphical Models: Principles and Techniques, Daphne Koller and Nir Friedman; An Introduction to Restricted Boltzmann Machines, Asja Fischer and Christian Igel.

3. Module 19.1: Using joint distributions for classification and sampling

4. Now that we have some understanding of joint probability distributions and efficient ways of representing them, let us see some more practical examples where we can use these joint distributions.

5. Consider a movie critic who writes reviews for movies. For simplicity, let us assume that he always writes reviews containing a maximum of 5 words. Further, let us assume that there are a total of 50 words in his vocabulary. Some example reviews:

M1: An unexpected and necessary masterpiece
M2: Delightfully merged information and comedy
M3: Director's first true masterpiece
M4: Sci-fi perfection, truly mesmerizing film
M5: Waste of time and money
M6: Best Lame Historical Movie Ever

Each of the 5 words in his review can be treated as a random variable which takes one of the 50 values. Given many such reviews written by the reviewer, we could learn the joint probability distribution P(X_1, X_2, ..., X_5).

6. In fact, we can even think of a very simple factorization for this model:

P(X_1, X_2, ..., X_5) = ∏_i P(X_i | X_{i-1}, X_{i-2})

In other words, we are assuming that the i-th word only depends on the previous 2 words and not anything before that. Let us consider one such factor, P(X_i = time | X_{i-2} = waste, X_{i-1} = of). We can estimate this as

count(waste of time) / count(waste of)

and the two counts mentioned above can be computed by going over all the reviews. We could similarly compute the probabilities of all such factors (a small counting sketch follows below).
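For concreteness, here is a minimal counting sketch in Python (not from the lecture; the toy review list and the function name `p` are made up for illustration) of how such factors could be estimated from a corpus of reviews:

```python
from collections import Counter

# Toy corpus standing in for the critic's reviews (hypothetical data).
reviews = [
    "waste of time and money",
    "waste of money and time",
    "an unexpected and necessary masterpiece",
]

trigram_counts = Counter()   # counts of (w_{i-2}, w_{i-1}, w_i)
bigram_counts = Counter()    # counts of (w_{i-2}, w_{i-1})

for review in reviews:
    words = review.split()
    for i in range(2, len(words)):
        trigram_counts[(words[i - 2], words[i - 1], words[i])] += 1
        bigram_counts[(words[i - 2], words[i - 1])] += 1

def p(word, prev2, prev1):
    """Estimate P(X_i = word | X_{i-2} = prev2, X_{i-1} = prev1) by counting."""
    denom = bigram_counts[(prev2, prev1)]
    return trigram_counts[(prev2, prev1, word)] / denom if denom else 0.0

print(p("time", "waste", "of"))  # count(waste of time) / count(waste of) = 0.5 on this toy corpus
```

With enough reviews, p("time", "waste", "of") approaches count(waste of time) / count(waste of), which is exactly the estimate described on the slide.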

7. Okay, so now what can we do with this joint distribution? Given a review, classify if it was written by the reviewer. Generate new reviews which would look like reviews written by this reviewer. How would you do this? By sampling from this distribution! What does that mean? Let us see!

M7: More realistic than real life

Some of the learned factors (each column gives P(X_i = w | context) for the context in its header):

w    | X_{i-2}=more, X_{i-1}=realistic | X_{i-2}=realistic, X_{i-1}=than | X_{i-2}=than, X_{i-1}=real | ...
than | 0.61 | 0.01 | 0.20 | ...
as   | 0.12 | 0.10 | 0.16 | ...
for  | 0.14 | 0.09 | 0.05 | ...
real | 0.01 | 0.50 | 0.01 | ...
the  | 0.02 | 0.12 | 0.12 | ...
life | 0.05 | 0.11 | 0.33 | ...

P(M7) = P(X_1 = more) · P(X_2 = realistic | X_1 = more) · P(X_3 = than | X_1 = more, X_2 = realistic) · P(X_4 = real | X_2 = realistic, X_3 = than) · P(X_5 = life | X_3 = than, X_4 = real)
      = 0.2 × 0.25 × 0.61 × 0.50 × 0.33 = 0.005
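A small illustrative sketch (not part of the slides) of how a review's probability is scored as a product of these factors; the `cond_prob` table below is a hypothetical stand-in containing only the entries needed for M7, taken from the numbers above:

```python
# Hypothetical trigram model: maps (prev2, prev1) -> {word: probability}.
# None is used as padding for positions before the start of the review.
cond_prob = {
    (None, None): {"more": 0.2},            # P(X_1)
    (None, "more"): {"realistic": 0.25},    # P(X_2 | X_1)
    ("more", "realistic"): {"than": 0.61},
    ("realistic", "than"): {"real": 0.50},
    ("than", "real"): {"life": 0.33},
}

def review_probability(words):
    """Score a review as the product of P(X_i = w_i | X_{i-2}, X_{i-1})."""
    prob = 1.0
    padded = [None, None] + list(words)
    for i in range(2, len(padded)):
        context = (padded[i - 2], padded[i - 1])
        prob *= cond_prob.get(context, {}).get(padded[i], 0.0)
    return prob

print(review_probability("more realistic than real life".split()))  # ≈ 0.005
```

The same scoring function serves the classification use case: a review the critic is likely to have written should receive a noticeably higher probability than an arbitrary word sequence.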

8. How does the reviewer start his reviews (what is the first word that he chooses)? We could take the word which has the highest probability and put it as the first word in our review. Having selected this, what is the most likely second word that the reviewer uses? Having selected the first two words, what is the most likely third word that the reviewer uses? And so on... (a greedy-decoding sketch follows below)

w       | P(X_1 = w) | P(X_2 = w | X_1 = the) | P(X_i = w | X_{i-2} = the, X_{i-1} = movie) | ...
the     | 0.62 | 0.01 | 0.01 | ...
movie   | 0.10 | 0.40 | 0.01 | ...
amazing | 0.01 | 0.22 | 0.01 | ...
useless | 0.01 | 0.20 | 0.03 | ...
was     | 0.01 | 0.00 | 0.60 | ...
...     | ...  | ...  | ...  | ...

Example review built this way: The movie was really amazing
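A sketch of this greedy strategy, reusing the hypothetical `cond_prob` convention from the previous sketch (a dict mapping a context `(prev2, prev1)` to a `{word: probability}` table):

```python
def generate_greedy(cond_prob, max_len=5):
    """Pick the single most likely next word at every step."""
    words = []
    for _ in range(max_len):
        context = (
            words[-2] if len(words) >= 2 else None,  # X_{i-2} (None at the start)
            words[-1] if len(words) >= 1 else None,  # X_{i-1}
        )
        dist = cond_prob[context]                    # {word: probability}
        words.append(max(dist, key=dist.get))        # argmax over the next word
    return " ".join(words)
```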

9. But there is a catch here! Selecting the most likely word at each time step will only give us the same review again and again! But we would like to generate different reviews. So instead of taking the max value, we can sample from this distribution. How? Let us see! (The slide repeats the probability table above, along with the example review: The movie was really amazing.)

10. Suppose there are 10 words in the vocabulary. We have computed the probability distribution P(X_1 = word); P(X_1 = the) is the fraction of reviews having "the" as the first word. Similarly, we have computed P(X_2 = word_2 | X_1 = word_1) and P(X_3 = word_3 | X_1 = word_1, X_2 = word_2).

w           | P(X_1 = w) | P(X_2 = w | X_1 = the) | P(X_i = w | X_{i-2} = the, X_{i-1} = movie) | ...
the         | 0.62 | 0.01 | 0.01 | ...
movie       | 0.10 | 0.40 | 0.01 | ...
amazing     | 0.01 | 0.22 | 0.01 | ...
useless     | 0.01 | 0.20 | 0.03 | ...
was         | 0.01 | 0.00 | 0.60 | ...
is          | 0.01 | 0.00 | 0.30 | ...
masterpiece | 0.01 | 0.11 | 0.01 | ...
I           | 0.21 | 0.00 | 0.01 | ...
liked       | 0.01 | 0.01 | 0.01 | ...
decent      | 0.01 | 0.02 | 0.01 | ...

11. Now consider that we want to generate the 3rd word in the review, given the first 2 words of the review (The movie ...). We can think of the 10 words as forming a 10-sided dice where each side corresponds to a word. The probability of each side showing up is not uniform, but as per the values given in the table. We can select the next word by rolling this dice and picking up the word which shows up. You can write a python program to roll such a biased dice (one possible sketch follows below).

Index | Word        | P(X_i = w | X_{i-2} = the, X_{i-1} = movie) | ...
0     | the         | 0.01 | ...
1     | movie       | 0.01 | ...
2     | amazing     | 0.01 | ...
3     | useless     | 0.03 | ...
4     | was         | 0.60 | ...
5     | is          | 0.30 | ...
6     | masterpiece | 0.01 | ...
7     | I           | 0.01 | ...
8     | liked       | 0.01 | ...
9     | decent      | 0.01 | ...
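One possible way to roll such a biased dice in Python (a sketch; the word list and weights are the ones from the table above):

```python
import random

words = ["the", "movie", "amazing", "useless", "was",
         "is", "masterpiece", "I", "liked", "decent"]
# P(X_3 = w | X_1 = the, X_2 = movie), from the table above
probs = [0.01, 0.01, 0.01, 0.03, 0.60, 0.30, 0.01, 0.01, 0.01, 0.01]

# random.choices draws from the list with the given (possibly non-uniform) weights,
# which is exactly a roll of the biased 10-sided dice.
next_word = random.choices(words, weights=probs, k=1)[0]
print(next_word)  # most often "was", sometimes "is", occasionally one of the others
```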

12. Now, at each timestep we do not pick the most likely word; all words are possible depending on their probability (just as in rolling a biased dice or tossing a biased coin). Every run will now give us a different review!

Generated Reviews:
the movie is liked decent
I liked the amazing movie
the movie is masterpiece
the movie I liked useless
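Putting the pieces together, a sketch of generation by sampling (again reusing the hypothetical `cond_prob` convention): the only change from the greedy version is that each word is drawn from its conditional distribution instead of being the argmax.

```python
import random

def generate_by_sampling(cond_prob, max_len=5):
    """Sample a review word by word from P(X_i | X_{i-2}, X_{i-1})."""
    words = []
    for _ in range(max_len):
        context = (
            words[-2] if len(words) >= 2 else None,
            words[-1] if len(words) >= 1 else None,
        )
        dist = cond_prob[context]                  # {word: probability}
        vocab, weights = zip(*dist.items())
        words.append(random.choices(vocab, weights=weights, k=1)[0])  # biased dice roll
    return " ".join(words)

# Every call rolls the biased dice afresh at each step,
# so repeated calls can produce different reviews.
```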

13. Returning to our story...

14. Okay, so now what can we do with this joint distribution? Given a review, classify if it was written by the reviewer. Generate new reviews which would look like reviews written by this reviewer. Correct noisy reviews or help in completing incomplete reviews, for example by computing

argmax_{X_5} P(X_1 = the, X_2 = movie, X_3 = was, X_4 = amazingly, X_5 = ?)

(The slide repeats the probability table and the computation of P(M7) from slide 7; a small completion sketch follows below.)
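As an illustration of the completion use case (a sketch under the same hypothetical `cond_prob` convention): to fill in the missing fifth word, we can evaluate the joint probability for every candidate word and keep the argmax.

```python
def complete_last_word(cond_prob, prefix, vocab):
    """Return argmax over w of P(prefix + [w]) under the trigram factorization."""
    def joint(words):
        prob, padded = 1.0, [None, None] + list(words)
        for i in range(2, len(padded)):
            context = (padded[i - 2], padded[i - 1])
            prob *= cond_prob.get(context, {}).get(padded[i], 0.0)
        return prob
    return max(vocab, key=lambda w: joint(prefix + [w]))

# e.g. complete_last_word(cond_prob, ["the", "movie", "was", "amazingly"], vocab)
```

Since the prefix is fixed, only the last factor P(X_5 = w | X_3, X_4) actually differs across candidates, so scoring just that factor would give the same argmax; evaluating the full joint simply keeps the sketch general.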

15. Let us take an example from another domain.

16. Consider images which contain m × n pixels (say 32 × 32). Each pixel here is a random variable which can take values from 0 to 255 (colors). We thus have a total of 32 × 32 = 1024 random variables (X_1, X_2, ..., X_1024). Together these pixels define the image, and different combinations of pixel values lead to different images. Given many such images, we want to learn the joint distribution P(X_1, X_2, ..., X_1024).

17. We can assume each pixel is dependent only on its neighbors. In this case we could factorize the distribution over a Markov network as

∏_i φ_i(D_i)

where D_i is a set of variables which form a maximal clique (basically, groups of neighboring pixels).
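Written out in full (the slide leaves the normalization implicit), the Markov network defines a Gibbs distribution over the pixels, with Z the partition function obtained by summing over all joint pixel configurations:

```latex
P(X_1, \ldots, X_{1024}) \;=\; \frac{1}{Z}\,\prod_{i} \phi_i(D_i),
\qquad
Z \;=\; \sum_{X_1, \ldots, X_{1024}} \;\prod_{i} \phi_i(D_i)
```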

18. Again, what can we do with this joint distribution? Given a new image, classify if it is indeed a bedroom. Generate new images which would look like bedrooms (say, if you are an interior designer). Correct noisy images or help in completing incomplete images.
