

  1. CS145: INTRODUCTION TO DATA MINING Text Data: Topic Model Instructor: Yizhou Sun yzsun@cs.ucla.edu December 4, 2017

  2. Methods to be Learnt
     • Classification: Logistic Regression; Decision Tree; KNN; SVM; NN (vector data); Naïve Bayes for Text (text data)
     • Clustering: K-means; hierarchical clustering; DBSCAN; Mixture Models (vector data); pLSA (text data)
     • Prediction: Linear Regression; GLM* (vector data)
     • Frequent Pattern Mining: Apriori; FP-growth (set data); GSP; PrefixSpan (sequence data)
     • Similarity Search: DTW (sequence data)

  3. Text Data: Topic Models • Text Data and Topic Models • Revisit of Mixture Model • Probabilistic Latent Semantic Analysis (pLSA) • Summary 3

  4. Text Data • Word/term • Document • A sequence of words • Corpus • A collection of documents 4

  5. Represent a Document
     • Most common way: Bag-of-Words
     • Ignore the order of words; keep the count of each word
     • (Slide figure: example count vectors for documents c1–c5 and m1–m4 under the vector space model)
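As a small illustration of the bag-of-words idea, a count vector can be built as follows. This is a minimal Python sketch, not from the slides; the toy documents, the vocabulary, and the helper name bag_of_words are invented for illustration.

```python
from collections import Counter

# Toy corpus; the documents and the vocabulary derived from them are invented for illustration.
docs = [
    "data mining finds frequent pattern in data",
    "web information retrieval ranks web pages",
]

# Fixed vocabulary built from the corpus; word order inside a document is ignored.
vocab = sorted({w for d in docs for w in d.split()})

def bag_of_words(doc, vocab):
    """Return the count vector c(w, d) for one document."""
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

for d in docs:
    print(bag_of_words(d, vocab))
```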

  6. Topics • Topic • A topic is represented by a word distribution • Relates to an issue

  7. Topic Models • Topic modeling • Get topics automatically from a corpus • Assign documents to topics automatically • Most frequently used topic models • pLSA • LDA 7

  8. Text Data: Topic Models • Text Data and Topic Models • Revisit of Mixture Model • Probabilistic Latent Semantic Analysis (pLSA) • Summary 8

  9. Mixture Model-Based Clustering
     • A set C of k probabilistic clusters C_1, …, C_k
       • Probability density/mass functions: $f_1, \dots, f_k$
       • Cluster prior probabilities: $w_1, \dots, w_k$, with $\sum_j w_j = 1$
     • Joint probability of an object i and its cluster C_j: $P(x_i, z_i = C_j) = w_j f_j(x_i)$
       • $z_i$: hidden random variable
     • Probability of object i: $P(x_i) = \sum_j w_j f_j(x_i)$
     • (Slide figure: two component densities $f_1(x)$ and $f_2(x)$)

  10. Maximum Likelihood Estimation
     • Since objects are assumed to be generated independently, for a data set D = {x_1, …, x_n} we have:
       $P(D) = \prod_i P(x_i) = \prod_i \sum_j w_j f_j(x_i)$
       $\Rightarrow \log P(D) = \sum_i \log P(x_i) = \sum_i \log \sum_j w_j f_j(x_i)$
     • Task: find a set C of k probabilistic clusters such that P(D) is maximized
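The log-likelihood above can be evaluated directly once the priors and component densities are fixed. A minimal sketch, assuming two Gaussian components and an invented 1-D data set (all numbers below are made up for illustration):

```python
import numpy as np
from scipy.stats import norm

# Invented 1-D data set D and mixture parameters, just to evaluate the formula.
x = np.array([0.1, 0.3, 2.9, 3.2, 3.0])
w = np.array([0.4, 0.6])                  # cluster priors w_j, sum to 1
mu = np.array([0.0, 3.0])                 # component means
sigma = np.array([0.5, 0.5])              # component standard deviations

# P(x_i) = sum_j w_j f_j(x_i), here with Gaussian f_j; log P(D) = sum_i log P(x_i).
weighted_densities = w * norm.pdf(x[:, None], loc=mu, scale=sigma)   # shape (n, k)
log_P_D = np.log(weighted_densities.sum(axis=1)).sum()
print(log_P_D)
```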

  11. Gaussian Mixture Model
     • Generative model: for each object
       • Pick its cluster, i.e., a distribution component: $Z \sim \mathrm{Multinoulli}(w_1, \dots, w_k)$
       • Sample a value from the selected distribution: $X \mid Z \sim N(\mu_Z, \sigma_Z^2)$
     • Overall likelihood function:
       $L(D \mid \theta) = \prod_i \sum_j w_j\, p(x_i \mid \mu_j, \sigma_j^2)$, s.t. $\sum_j w_j = 1$ and $w_j \ge 0$
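A minimal sketch of this generative process, assuming invented parameters for a two-component, one-dimensional mixture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented parameters of a 2-component, 1-D Gaussian mixture.
w = np.array([0.4, 0.6])        # cluster priors, sum to 1
mu = np.array([0.0, 3.0])
sigma = np.array([0.5, 0.5])

def sample_gmm(n):
    """Pick a component Z ~ Multinoulli(w), then draw X | Z ~ N(mu_Z, sigma_Z^2)."""
    z = rng.choice(len(w), size=n, p=w)     # hidden cluster labels
    x = rng.normal(mu[z], sigma[z])         # observed values
    return x, z

x, z = sample_gmm(5)
print(z, x)
```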

  12. Multinomial Mixture Model
     • For documents with bag-of-words representation
       • $\boldsymbol{x}_d = (x_{d1}, x_{d2}, \dots, x_{dN})$, where $x_{dn}$ is the count of the nth word of the vocabulary in document d
     • Generative model: for each document
       • Sample its cluster label $z \sim \mathrm{Multinoulli}(\boldsymbol{\pi})$
         • $\boldsymbol{\pi} = (\pi_1, \pi_2, \dots, \pi_K)$, where $\pi_k$ is the proportion of the kth cluster, i.e., $p(z = k) = \pi_k$
       • Sample its word vector $\boldsymbol{x}_d \sim \mathrm{Multinomial}(\boldsymbol{\beta}_z)$
         • $\boldsymbol{\beta}_z = (\beta_{z1}, \beta_{z2}, \dots, \beta_{zN})$, where $\beta_{zn}$ is the parameter associated with the nth word of the vocabulary
         • $p(\boldsymbol{x}_d \mid z = k) = \frac{(\sum_n x_{dn})!}{\prod_n x_{dn}!} \prod_n \beta_{kn}^{x_{dn}} \propto \prod_n \beta_{kn}^{x_{dn}}$

  13. Likelihood Function
     • For a set of M documents:
       $L = \prod_d p(\boldsymbol{x}_d) = \prod_d \sum_k p(\boldsymbol{x}_d, z = k)$
       $\;\;= \prod_d \sum_k p(\boldsymbol{x}_d \mid z = k)\, p(z = k)$
       $\;\;\propto \prod_d \sum_k p(z = k) \prod_n \beta_{kn}^{x_{dn}}$
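A minimal sketch that evaluates this likelihood in log space, dropping the multinomial coefficient as the $\propto$ does; the count matrix and parameters below are invented for illustration:

```python
import numpy as np
from scipy.special import logsumexp

# Invented toy setup: M=3 documents over a 4-word vocabulary, K=2 clusters.
X = np.array([[2, 1, 0, 0],                 # x_d: word counts of each document
              [0, 0, 3, 1],
              [1, 1, 1, 1]])
pi = np.array([0.5, 0.5])                   # cluster proportions pi_k
beta = np.array([[0.5, 0.3, 0.1, 0.1],      # beta_k: word distribution of cluster k
                 [0.1, 0.1, 0.5, 0.3]])

# log p(x_d | z=k), up to the multinomial coefficient: sum_n x_dn * log(beta_kn)
log_p_x_given_z = X @ np.log(beta).T        # shape (M, K)
# log L = sum_d log sum_k pi_k * p(x_d | z=k), computed stably with logsumexp
log_L = logsumexp(np.log(pi) + log_p_x_given_z, axis=1).sum()
print(log_L)
```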

  14. Mixture of Unigrams
     • For documents represented by a sequence of words
       • $\boldsymbol{w}_d = (w_{d1}, w_{d2}, \dots, w_{dN_d})$, where $N_d$ is the length of document d and $w_{dn}$ is the word at the nth position of the document
     • Generative model: for each document
       • Sample its cluster label $z \sim \mathrm{Multinoulli}(\boldsymbol{\pi})$
         • $\boldsymbol{\pi} = (\pi_1, \pi_2, \dots, \pi_K)$, where $\pi_k$ is the proportion of the kth cluster, i.e., $p(z = k) = \pi_k$
       • For each word in the sequence, sample the word $w_{dn} \sim \mathrm{Multinoulli}(\boldsymbol{\beta}_z)$
         • $p(w_{dn} \mid z = k) = \beta_{k, w_{dn}}$

  15. Likelihood Function
     • For a set of M documents:
       $L = \prod_d p(\boldsymbol{w}_d) = \prod_d \sum_k p(\boldsymbol{w}_d, z = k)$
       $\;\;= \prod_d \sum_k p(\boldsymbol{w}_d \mid z = k)\, p(z = k)$
       $\;\;= \prod_d \sum_k p(z = k) \prod_n \beta_{k, w_{dn}}$
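A minimal sketch that evaluates this likelihood in log space, assuming invented toy documents given as word-index sequences:

```python
import numpy as np
from scipy.special import logsumexp

# Invented toy setup: documents as sequences of word indices into a 4-word vocabulary.
docs = [np.array([0, 1, 0]),                # w_d: the word at each position n
        np.array([2, 3, 2, 2])]
pi = np.array([0.5, 0.5])                   # cluster proportions pi_k
beta = np.array([[0.5, 0.3, 0.1, 0.1],      # beta_k: word distribution of cluster k
                 [0.1, 0.1, 0.5, 0.3]])

# log L = sum_d log sum_k pi_k * prod_n beta_{k, w_dn}
log_L = 0.0
for w_d in docs:
    log_p_w_given_z = np.log(beta[:, w_d]).sum(axis=1)   # shape (K,)
    log_L += logsumexp(np.log(pi) + log_p_w_given_z)
print(log_L)
```

Per document, the only difference from the multinomial-mixture sketch above is the missing multinomial coefficient, which does not depend on the parameters.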

  16. Question • Are multinomial mixture model and mixture of unigrams model equivalent? Why? 16

  17. Text Data: Topic Models • Text Data and Topic Models • Revisit of Mixture Model • Probabilistic Latent Semantic Analysis (pLSA) • Summary 17

  18. Notations
     • Word, document, topic: $w, d, z$
     • Word count in document: $c(w, d)$
     • Word distribution for each topic ($\boldsymbol{\beta}_z$): $\beta_{zw} = p(w \mid z)$
     • Topic distribution for each document ($\boldsymbol{\theta}_d$): $\theta_{dz} = p(z \mid d)$ (yes, this is soft clustering)

  19. Issues of Mixture of Unigrams • All the words in the same document are sampled from the same topic • In practice, people switch topics during their writing

  20. Illustration of pLSA 20

  21. Generative Model for pLSA
     • Describes how a document is generated probabilistically
     • For each position in d, $n = 1, \dots, N_d$:
       • Generate the topic for the position as $z_n \sim \mathrm{Multinoulli}(\boldsymbol{\theta}_d)$, i.e., $p(z_n = k) = \theta_{dk}$ (note: a 1-trial multinomial, i.e., a categorical distribution)
       • Generate the word for the position as $w_n \sim \mathrm{Multinoulli}(\boldsymbol{\beta}_{z_n})$, i.e., $p(w_n = w) = \beta_{z_n w}$
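A minimal sketch of this generative process for a single document, assuming invented topic and word distributions over a 4-word vocabulary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented parameters: K=2 topics over a 4-word vocabulary, one document d.
theta_d = np.array([0.7, 0.3])              # theta_d: topic distribution of document d
beta = np.array([[0.5, 0.3, 0.1, 0.1],      # beta_k: word distribution of topic k
                 [0.1, 0.1, 0.5, 0.3]])
N_d = 10                                    # length of the document

# For each position n: draw a topic z_n ~ Multinoulli(theta_d), then a word w_n ~ Multinoulli(beta_{z_n}).
z = rng.choice(len(theta_d), size=N_d, p=theta_d)
w = np.array([rng.choice(beta.shape[1], p=beta[k]) for k in z])
print(z)
print(w)
```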

  22. Graphical Model
     • Note: sometimes, people add parameters such as $\theta$ and $\beta$ into the graphical model

  23. The Likelihood Function for a Corpus
     • Probability of a word:
       $p(w \mid d) = \sum_k p(w, z = k \mid d) = \sum_k p(w \mid z = k)\, p(z = k \mid d) = \sum_k \beta_{kw} \theta_{dk}$
     • Likelihood of a corpus:
       $L = \prod_d \prod_w \big( \pi_d\, p(w \mid d) \big)^{c(w,d)}$, where $\pi_d = p(d)$ is usually considered as uniform, i.e., 1/M

  24. Re-arrange the Likelihood Function
     • Group the same word from different positions together:
       $\max \log L = \sum_{d,w} c(w, d) \log \sum_z \theta_{dz} \beta_{zw}$
       s.t. $\sum_z \theta_{dz} = 1$ and $\sum_w \beta_{zw} = 1$
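A minimal sketch that evaluates $p(w \mid d)$ and this re-arranged log-likelihood for an invented toy corpus; the counts and parameters below are made up, and the uniform $p(d)$ term is dropped as a constant:

```python
import numpy as np

# Invented toy corpus: D=2 documents, V=4 words, K=2 topics.
C = np.array([[5, 4, 3, 0],                 # c(w, d), one row per document
              [0, 2, 3, 4]])
theta = np.array([[0.7, 0.3],               # theta_{dz} = p(z | d), rows sum to 1
                  [0.2, 0.8]])
beta = np.array([[0.5, 0.3, 0.1, 0.1],      # beta_{zw} = p(w | z), rows sum to 1
                 [0.1, 0.1, 0.5, 0.3]])

# p(w | d) = sum_z theta_{dz} * beta_{zw}: a (D, V) matrix whose rows sum to 1.
p_w_given_d = theta @ beta
# log L = sum_{d,w} c(w, d) * log p(w | d)
log_L = (C * np.log(p_w_given_d)).sum()
print(p_w_given_d.sum(axis=1), log_L)
```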

  25. Optimization: EM Algorithm
     • Repeat until convergence:
       • E-step: for each word in each document, calculate its conditional probability of belonging to each topic:
         $p(z \mid w, d) \propto p(w \mid z, d)\, p(z \mid d) = \beta_{zw} \theta_{dz}$, i.e., $p(z \mid w, d) = \frac{\beta_{zw} \theta_{dz}}{\sum_{z'} \beta_{z'w} \theta_{dz'}}$
       • M-step: given the conditional distribution, find the parameters that maximize the expected likelihood:
         $\beta_{zw} \propto \sum_d p(z \mid w, d)\, c(w, d)$, i.e., $\beta_{zw} = \frac{\sum_d p(z \mid w, d)\, c(w, d)}{\sum_{w', d} p(z \mid w', d)\, c(w', d)}$
         $\theta_{dz} \propto \sum_w p(z \mid w, d)\, c(w, d)$, i.e., $\theta_{dz} = \frac{\sum_w p(z \mid w, d)\, c(w, d)}{N_d}$
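Putting the two steps together, here is a compact sketch of the full EM loop on a word-count matrix. It follows the updates above, with theta standing for $p(z \mid d)$ and beta for $p(w \mid z)$; the count matrix, the number of topics, and the random initialization are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented count matrix c(w, d): rows = documents (D=2), columns = vocabulary words (V=7).
C = np.array([[5.0, 4.0, 3.0, 1.0, 1.0, 2.0, 1.0],
              [1.0, 2.0, 1.0, 0.0, 3.0, 3.0, 4.0]])
D, V = C.shape
K = 2                                            # number of topics

# Random initialization of theta (D x K) and beta (K x V), rows normalized to sum to 1.
theta = rng.random((D, K)); theta /= theta.sum(axis=1, keepdims=True)
beta = rng.random((K, V)); beta /= beta.sum(axis=1, keepdims=True)

for it in range(100):
    # E-step: responsibilities p(z | w, d) proportional to beta_{zw} * theta_{dz}, shape (D, V, K).
    resp = theta[:, None, :] * beta.T[None, :, :]
    resp /= resp.sum(axis=2, keepdims=True)

    # M-step: beta_{zw} ~ sum_d p(z|w,d) c(w,d); theta_{dz} ~ sum_w p(z|w,d) c(w,d).
    weighted = resp * C[:, :, None]              # p(z|w,d) * c(w,d)
    beta = weighted.sum(axis=0).T                # shape (K, V)
    beta /= beta.sum(axis=1, keepdims=True)
    theta = weighted.sum(axis=1)                 # shape (D, K)
    theta /= theta.sum(axis=1, keepdims=True)    # each row sum equals N_d before normalizing

log_L = (C * np.log(theta @ beta)).sum()
print("log-likelihood:", log_L)
```

Printing the objective $\sum_{d,w} c(w,d) \log \sum_z \theta_{dz}\beta_{zw}$ after each iteration is a handy convergence check, since EM never decreases it.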

  26. Example • Two documents, two topics • Vocabulary: {data, mining, frequent, pattern, web, information, retrieval} • At some iteration of EM algorithm, E-step 26

  27. Example (Continued)
     • M-step:
       $\beta_{11} = \frac{0.8 \times 5 + 0.5 \times 2}{17.6} = 5/17.6$, $\beta_{12} = \frac{0.8 \times 4 + 0.5 \times 3}{17.6} = 4.7/17.6$
       $\beta_{13} = 3/17.6$, $\beta_{14} = 1.6/17.6$, $\beta_{15} = 1.3/17.6$, $\beta_{16} = 1.2/17.6$, $\beta_{17} = 0.8/17.6$
       $\theta_{11} = \frac{11.8}{11.8 + 5.2} = 11.8/17$, $\theta_{12} = \frac{5.2}{11.8 + 5.2} = 5.2/17$
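A quick sanity check of the arithmetic above, assuming (as in the slides' example) responsibilities $p(z=1 \mid w, d)$ of 0.8 for document 1 and 0.5 for document 2 on the first two vocabulary words:

```python
# Numbers taken from the example: responsibilities 0.8 (document 1) and 0.5 (document 2),
# with counts 5, 2 for the first word and 4, 3 for the second word.
print(0.8 * 5 + 0.5 * 2)                                # 5.0 -> beta_11 = 5/17.6
print(round(0.8 * 4 + 0.5 * 3, 2))                      # 4.7 -> beta_12 = 4.7/17.6

# The seven numerators sum to the normalizer 17.6, so the beta_1 row sums to 1.
print(round(sum([5, 4.7, 3, 1.6, 1.3, 1.2, 0.8]), 1))   # 17.6
```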

  28. Text Data: Topic Models • Text Data and Topic Models • Revisit of Mixture Model • Probabilistic Latent Semantic Analysis (pLSA) • Summary 28

  29. Summary • Basic Concepts • Word/term, document, corpus, topic • Mixture of unigrams • pLSA • Generative model • Likelihood function • EM algorithm 29

  30. Quiz • Q1: Is Multinomial Naïve Bayes a linear classifier? • Q2: In pLSA, for the same word appearing at different positions in a document, do those positions have the same conditional probability $p(z \mid w, d)$?
