
Department of Computer Science CSCI 5622: Machine Learning, Chenhao Tan



  1. Department of Computer Science
     CSCI 5622: Machine Learning
     Chenhao Tan
     Lecture 19: EM algorithm, Topic modeling
     Slides adapted from Jordan Boyd-Graber, Chris Ketelsen

  2. Administrivia
     • HW4 due, HW5 out
     • Remember that we only count the highest 4 homework scores
     • Final project midpoint presentation
     • For the final project, each person will be asked to summarize what everyone in the team did
     • Contact information for printing

  3. Second Month Survey
     [Figure: survey results, panels labeled "Second survey" and "First survey"]

  4. Second Month Survey
     • Conflicting opinions
     • Wide variety of models, good explanations, good homeworks
     • Clarity of HW grading is the worst I have ever had for a class.
     • Depth of content covered
     • Course is too theory heavy
     • I liked that the instructor not only requested feedback often, but also acted upon the feedback, changing a few things about how the class and slides are presented.

  5. Second Month Survey
     • Increase exam duration
     • The professor needs to slow down, and sacrifice some of the math subtleties and complexities in favor of concrete understanding of the topics.
     • Go into the weeds of the math less

  6. Learning Objectives
     • Learn about the Expectation-Maximization (EM) algorithm
     • Learn about latent Dirichlet allocation (LDA)

  7. Gaussian Mixture Models

  8. Gaussian Mixture Models

  9. Gaussian Mixture Models
     [Figure: scatter plot of data points; axes x1 and x2, each ranging from -4 to 4]

  10. Gaussian Mixture Models

  11. Gaussian Mixture Models

  12. Gaussian Mixture Models

  13. Gaussian Mixture Models

  14. Gaussian Mixture Models

  15. Gaussian Mixture Models

  16. Gaussian Mixture Models

  17. Gaussian Mixture Models

  18. Latent Variables
      • The z's correspond to the latent structure that we try to learn in unsupervised learning
      • From a modeling perspective, they are usually referred to as latent variables
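
For a Gaussian mixture, the role of the latent variable can be written out as a marginalization. The block below is a sketch in notation that is assumed here rather than taken from the slides: \pi_k is the mixing weight of component k, and \mu_k, \Sigma_k are its mean and covariance.

```latex
% Each point x_i has a latent component label z_i in {1, ..., K}.
\begin{align*}
  p(z_i = k) &= \pi_k, \qquad \sum_{k=1}^{K} \pi_k = 1 \\
  p(x_i \mid z_i = k) &= \mathcal{N}(x_i \mid \mu_k, \Sigma_k) \\
  p(x_i) &= \sum_{k=1}^{K} p(z_i = k)\, p(x_i \mid z_i = k)
          = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)
\end{align*}
```

Only x_i is observed; z_i is never seen, which is what makes it latent.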

  19. EM Algorithm

  20. EM Algorithm

  21. EM Algorithm

  22. EM Algorithm

  23. EM Algorithm
      • EM stands for Expectation-Maximization
      • A classic algorithm introduced by Dempster, Laird, and Rubin (1977)
      • An iterative method
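
To make "iterative method" concrete, here is a minimal sketch of EM for a one-dimensional Gaussian mixture, using only the Python standard library. It is an illustration under stated assumptions, not the lecture's own code: the function names, the quantile-based initialization, and the fixed iteration count are all choices made for this example.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, k, n_iter=50):
    """Fit a k-component 1-D Gaussian mixture with EM (illustrative sketch)."""
    n = len(data)
    ordered = sorted(data)
    # Initialization (an assumption of this sketch): means at evenly spaced
    # quantiles, every variance set to the overall variance, uniform weights.
    means = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    variances = [overall_var] * k
    weights = [1.0 / k] * k

    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            joint = [weights[j] * normal_pdf(x, means[j], variances[j])
                     for j in range(k)]
            total = sum(joint) + 1e-300  # guard against division by zero
            resp.append([p / total for p in joint])
        # M-step: re-estimate parameters from the soft assignments.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # keep variances strictly positive
    return weights, means, variances


# Usage: two synthetic clusters centered at -2 and 3.
rng = random.Random(0)
data = ([rng.gauss(-2, 0.5) for _ in range(200)]
        + [rng.gauss(3, 0.5) for _ in range(200)])
weights, means, variances = em_gmm_1d(data, k=2)
print(weights, means, variances)
```

The E-step computes soft assignments given the current parameters, and the M-step re-fits the parameters given those assignments; the two steps alternate until the estimates stop changing (here, simply for a fixed number of iterations).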

  24. EM Algorithm

  25. EM Algorithm

  26. EM Algorithm

  27. EM Algorithm

  28. EM Algorithm

  29. EM Algorithm

  30. EM Algorithm

  31. EM Algorithm

  32. EM Algorithm

  33. EM Algorithm

  34. EM Algorithm

  35. EM Algorithm

  36. GMM and K-means

  37. GMMs and the EM algorithm
      • GMMs with the EM algorithm suffer from some of the same problems as K-means
        • Doesn't really work with categorical data
        • Usually only converges to a local optimum (a local maximum of the likelihood)
        • Have to determine the number of clusters
        • Only generates convex clusters
      • But they also have certain advantages
        • The clusters are allowed different shapes
        • We get a soft partitioning of the data
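
The two advantages listed above can be illustrated with a small sketch that compares a GMM's soft assignment with the hard, nearest-center assignment of K-means. The function names and the parameter values are hypothetical and chosen only for this example.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def soft_assignment(x, weights, means, variances):
    """GMM-style responsibilities: one probability per component."""
    joint = [w * normal_pdf(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    return [p / total for p in joint]


def hard_assignment(x, centers):
    """K-means-style assignment: index of the nearest center."""
    return min(range(len(centers)), key=lambda j: (x - centers[j]) ** 2)


# A wide component centered at 0 and a narrow one centered at 3.
weights, means, variances = [0.5, 0.5], [0.0, 3.0], [4.0, 0.25]
x = 2.0
print(soft_assignment(x, weights, means, variances))
print(hard_assignment(x, means))
```

The point x = 2 is closer to the center at 3, so the nearest-center rule picks that cluster outright. The mixture model instead splits the point between both components and, because the first component is much wider, gives it slightly more responsibility.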

  38. Topic models
      • Discrete count data

  39. Topic models
      • Suppose you have a huge number of documents
      • Want to know what's going on
      • Can't read them all (e.g., every New York Times article from the 1990s)
      • Topic models offer a way to get a corpus-level view of major themes
      • Unsupervised

  40. Conceptual approach
      • Input: a text corpus and number of topics K
      • Output:
        • K topics, each topic is a list of words
        • Topic assignment for each document
      [Figure: example corpus of news headlines]
        • Forget the Bootleg, Just Download the Movie Legally
        • Multiplex Heralded As Linchpin To Growth
        • The Shape of Cinema, Transformed At the Click of a Mouse
        • A Peaceful Crew Puts Muppets Where Its Mouth Is
        • Stock Trades: A Better Deal For Investors Isn't Simple
        • The three big Internet portals begin to distinguish among themselves as shopping malls
        • Red Light, Green Light: A 2-Tone L.E.D. to Simplify Screens

  41. Conceptual approach
      • K topics, each topic is a list of words
        • TOPIC 1: computer, technology, system, service, site, phone, internet, machine
        • TOPIC 2: sell, sale, store, product, business, advertising, market, consumer
        • TOPIC 3: play, film, movie, theater, production, star, director, stage

  42. Conceptual approach
      • Topic assignment for each document
      [Figure: the example headlines arranged around three labeled topics: TOPIC 1 "TECHNOLOGY", TOPIC 2 "BUSINESS", TOPIC 3 "ENTERTAINMENT"]

  43. Topics from Science

  44. Why should you care?
      • Neat way to explore/understand corpus collections
        • E-discovery
        • Social media
        • Scientific data
      • NLP applications
        • Word sense disambiguation
        • Discourse segmentation
      • Psychology: word meaning, polysemy
      • A general way to model count data and a general inference algorithm

  45. Topic models
      • Discrete count data
      • Gaussian distributions are not appropriate
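
As a small illustration of what "discrete count data" means for text, the sketch below turns one of the example headlines into a bag of words. The tokenization rule (split on whitespace, strip punctuation, lowercase) is an assumption made for this example only.

```python
from collections import Counter


def bag_of_words(text):
    """Map a text to word counts, ignoring word order (illustrative tokenizer)."""
    tokens = [t.strip(".,:;!?'\"").lower() for t in text.split()]
    return Counter(t for t in tokens if t)


headline = "Forget the Bootleg, Just Download the Movie Legally"
counts = bag_of_words(headline)
print(counts)
```

Every entry is a non-negative integer count over a fixed vocabulary, which is why a continuous, real-valued Gaussian is a poor fit for this kind of data.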

  46. Generative model: Latent Dirichlet Allocation
      • Generate a document, or a bag of words
      • Blei, Ng, Jordan. Latent Dirichlet Allocation. JMLR, 2003.

  47. Generative model: Latent Dirichlet Allocation
      • Generate a document, or a bag of words
      • Multinomial distribution
        • Distribution over discrete outcomes
        • Represented by non-negative vector that sums to one
        • Picture representation
      [Figure: example points on the simplex: (1,0,0), (0,0,1), (0,1,0), (1/3,1/3,1/3), (1/4,1/4,1/2), (1/2,1/2,0)]

  48. Generative model: Latent Dirichlet Allocation
      • Generate a document, or a bag of words
      • Multinomial distribution
        • Distribution over discrete outcomes
        • Represented by non-negative vector that sums to one
        • Picture representation
        • Come from a Dirichlet distribution
      [Figure: example points on the simplex: (1,0,0), (0,0,1), (0,1,0), (1/3,1/3,1/3), (1/4,1/4,1/2), (1/2,1/2,0)]
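
One standard way to draw such a vector from a Dirichlet distribution is to normalize independent Gamma draws. The sketch below uses only the Python standard library; the function name and the concentration values are assumptions chosen for illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(0)
# Larger concentration values tend to give vectors near (1/3, 1/3, 1/3);
# values below 1 tend to give vectors near a corner such as (1, 0, 0).
print(sample_dirichlet([10.0, 10.0, 10.0], rng))
print(sample_dirichlet([0.5, 0.5, 0.5], rng))
```

Each draw is itself a multinomial parameter vector: non-negative entries that sum to one, like the example points listed on the slide.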

  49. Generative story
      • TOPIC 1: computer, technology, system, service, site, phone, internet, machine
      • TOPIC 2: sell, sale, store, product, business, advertising, market, consumer
      • TOPIC 3: play, film, movie, theater, production, star, director, stage

  50. Generative story
      [Figure: the example headlines arranged around TOPIC 1, TOPIC 2, and TOPIC 3]

  51. Generative story
      [Figure: the three topic word lists from slide 49 shown above a document excerpt]
      • TOPIC 1: computer, technology, system, service, site, phone, internet, machine
      • TOPIC 2: sell, sale, store, product, business, advertising, market, consumer
      • TOPIC 3: play, film, movie, theater, production, star, director, stage
      Document: "Hollywood studios are preparing to let people download and buy electronic copies of movies over the Internet, much as record labels now sell songs for 99 cents through Apple Computer's iTunes music store and other online services ..."

  52. Generative story [same topics and document excerpt as slide 51]

  53. Generative story [same topics and document excerpt as slide 51]

  54. Generative story [same topics and document excerpt as slide 51]
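
The generative story can be sketched in code: draw a topic mixture for the document, then for each word position draw a topic and then a word from that topic. This is a simplified illustration, not the full LDA model. In particular, it assumes each topic is uniform over the eight words listed on the slides (in LDA the topic-word distributions are themselves drawn from a Dirichlet), and the names `TOPICS`, `sample_dirichlet`, and `generate_document` are hypothetical.

```python
import random

# Topic word lists taken from the slides; uniform word probabilities
# within each topic are an assumption of this sketch.
TOPICS = {
    "TOPIC 1": ["computer", "technology", "system", "service",
                "site", "phone", "internet", "machine"],
    "TOPIC 2": ["sell", "sale", "store", "product",
                "business", "advertising", "market", "consumer"],
    "TOPIC 3": ["play", "film", "movie", "theater",
                "production", "star", "director", "stage"],
}


def sample_dirichlet(alpha, rng):
    """Draw a probability vector from Dirichlet(alpha) via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(n_words, alpha, rng):
    """Generate a bag of words following the LDA-style generative story."""
    names = list(TOPICS)
    # 1. Draw the document's topic proportions.
    theta = sample_dirichlet([alpha] * len(names), rng)
    words = []
    for _ in range(n_words):
        # 2. Draw a topic for this word position.
        topic = rng.choices(names, weights=theta)[0]
        # 3. Draw a word from the chosen topic.
        words.append(rng.choice(TOPICS[topic]))
    return theta, words


rng = random.Random(0)
theta, words = generate_document(n_words=20, alpha=0.5, rng=rng)
print(theta)
print(words)
```

Inference runs this story in reverse: given only the words, recover the topics and each document's topic proportions.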

  55. Missing component: how to generate a multinomial distribution

  56. Missing component: how to generate a multinomial distribution

  57. Missing component: how to generate a multinomial distribution
