CS145: INTRODUCTION TO DATA MINING
Text Data: Topic Models
Instructor: Yizhou Sun (yzsun@cs.ucla.edu)
December 4, 2017
Methods to be Learnt

| | Vector Data | Set Data | Sequence Data | Text Data |
|---|---|---|---|---|
| Classification | Logistic Regression; Decision Tree; KNN; SVM; NN | | | Naïve Bayes for Text |
| Clustering | K-means; Hierarchical Clustering; DBSCAN; Mixture Models | | | PLSA |
| Prediction | Linear Regression; GLM* | | | |
| Frequent Pattern Mining | | Apriori; FP growth | GSP; PrefixSpan | |
| Similarity Search | | | DTW | |
Text Data: Topic Models
• Text Data and Topic Models
• Revisit of Mixture Model
• Probabilistic Latent Semantic Analysis (pLSA)
• Summary
Text Data
• Word/term
• Document
  • A sequence of words
• Corpus
  • A collection of documents
Represent a Document
• Most common way: Bag-of-Words
  • Ignore the order of words
  • Keep the counts
• (Figure: example documents m1–m4 represented as count vectors over terms c1–c5; the vector space model)
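As an illustration (not part of the original slides), a minimal sketch of turning a document into a bag-of-words count vector over a fixed vocabulary; the vocabulary and the example document are made up for this sketch.

```python
from collections import Counter

# Hypothetical vocabulary; in practice it is built from the corpus.
vocab = ["data", "mining", "frequent", "pattern", "web", "information", "retrieval"]

def bag_of_words(doc, vocab):
    """Return the count vector c(w, d) for one document (word order is ignored)."""
    counts = Counter(doc.lower().split())
    return [counts[w] for w in vocab]

print(bag_of_words("data mining finds frequent pattern from data", vocab))
# -> [2, 1, 1, 1, 0, 0, 0]
```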
Topics
• Topic
  • A topic is represented by a word distribution
  • Relates to an issue
Topic Models
• Topic modeling
  • Get topics automatically from a corpus
  • Assign documents to topics automatically
• Most frequently used topic models
  • pLSA
  • LDA
Text Data: Topic Models
• Text Data and Topic Models
• Revisit of Mixture Model
• Probabilistic Latent Semantic Analysis (pLSA)
• Summary
Mixture Model-Based Clustering
• A set C of k probabilistic clusters $C_1, \dots, C_k$
  • Probability density/mass functions: $f_1, \dots, f_k$
  • Cluster prior probabilities: $w_1, \dots, w_k$, with $\sum_j w_j = 1$
• Joint probability of an object $i$ and its cluster $C_j$:
  • $P(x_i, z_i = C_j) = w_j f_j(x_i)$
  • $z_i$: hidden random variable
• Probability of $i$:
  • $P(x_i) = \sum_j w_j f_j(x_i)$
• (Figure: two component densities $f_1(x)$ and $f_2(x)$)
Maximum Likelihood Estimation
• Since objects are assumed to be generated independently, for a data set $D = \{x_1, \dots, x_n\}$ we have
$$P(D) = \prod_i P(x_i) = \prod_i \sum_j w_j f_j(x_i)$$
$$\Rightarrow \log P(D) = \sum_i \log P(x_i) = \sum_i \log \sum_j w_j f_j(x_i)$$
• Task: Find a set C of k probabilistic clusters such that $P(D)$ is maximized
Gaussian Mixture Model
• Generative model
  • For each object:
    • Pick its cluster, i.e., a distribution component: $Z \sim \text{Multinoulli}(w_1, \dots, w_k)$
    • Sample a value from the selected distribution: $X \mid Z \sim N(\mu_Z, \sigma_Z^2)$
• Overall likelihood function
  • $L(D \mid \theta) = \prod_i \sum_j w_j\, p(x_i \mid \mu_j, \sigma_j^2)$, s.t. $\sum_j w_j = 1$ and $w_j \ge 0$
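A minimal sketch (not from the slides) of this generative process and of evaluating the overall log-likelihood for a 1-dimensional Gaussian mixture; the weights, means, and standard deviations are made-up illustration values.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Made-up parameters: w_j >= 0 and sum_j w_j = 1
w = np.array([0.6, 0.4])       # cluster priors
mu = np.array([0.0, 5.0])      # component means
sigma = np.array([1.0, 2.0])   # component standard deviations

# Generative model: pick a component Z, then sample X | Z ~ N(mu_Z, sigma_Z^2)
z = rng.choice(len(w), size=200, p=w)
x = rng.normal(mu[z], sigma[z])

# Overall log-likelihood: sum_i log( sum_j w_j p(x_i | mu_j, sigma_j^2) )
log_lik = np.sum(np.log(np.sum(w * norm.pdf(x[:, None], mu, sigma), axis=1)))
print(log_lik)
```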
Multinomial Mixture Model
• For documents with bag-of-words representation
  • $\boldsymbol{x}_d = (x_{d1}, x_{d2}, \dots, x_{dN})$, where $x_{dn}$ is the count of the $n$th word in the vocabulary
• Generative model
  • For each document
    • Sample its cluster label $z \sim \text{Multinoulli}(\boldsymbol{\pi})$
      • $\boldsymbol{\pi} = (\pi_1, \pi_2, \dots, \pi_K)$, where $\pi_k$ is the proportion of the $k$th cluster
      • $p(z = k) = \pi_k$
    • Sample its word vector $\boldsymbol{x}_d \sim \text{Multinomial}(\boldsymbol{\beta}_z)$
      • $\boldsymbol{\beta}_z = (\beta_{z1}, \beta_{z2}, \dots, \beta_{zN})$, where $\beta_{zn}$ is the parameter associated with the $n$th word in the vocabulary
      • $p(\boldsymbol{x}_d \mid z = k) = \frac{(\sum_n x_{dn})!}{\prod_n x_{dn}!} \prod_n \beta_{kn}^{x_{dn}} \propto \prod_n \beta_{kn}^{x_{dn}}$
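A small illustrative sketch (not from the slides) of the per-cluster document probability $p(\boldsymbol{x}_d \mid z=k) \propto \prod_n \beta_{kn}^{x_{dn}}$, evaluated in log space, and the resulting soft cluster assignment; the count vector, $\boldsymbol{\pi}$, and the two rows of $\beta$ are made-up values.

```python
import numpy as np

x_d = np.array([2, 1, 1, 1, 0, 0, 0])                            # bag-of-words counts over a 7-word vocabulary
beta = np.array([[0.30, 0.25, 0.15, 0.15, 0.05, 0.05, 0.05],     # word distribution of cluster 1
                 [0.05, 0.05, 0.05, 0.05, 0.30, 0.25, 0.25]])    # word distribution of cluster 2
pi = np.array([0.5, 0.5])                                        # cluster proportions

# log p(x_d | z=k), up to the multinomial coefficient, which does not depend on k
log_p_x_given_z = x_d @ np.log(beta).T                           # shape (K,)

# soft assignment p(z=k | x_d) ∝ pi_k * p(x_d | z=k)
log_post = np.log(pi) + log_p_x_given_z
post = np.exp(log_post - log_post.max())
print(post / post.sum())
```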
Likelihood Function
• For a set of M documents
$$L = \prod_d p(\boldsymbol{x}_d) = \prod_d \sum_k p(\boldsymbol{x}_d, z = k) = \prod_d \sum_k p(\boldsymbol{x}_d \mid z = k)\, p(z = k) \propto \prod_d \sum_k p(z = k) \prod_n \beta_{kn}^{x_{dn}}$$
Mixture of Unigrams
• For documents represented by a sequence of words
  • $\boldsymbol{w}_d = (w_{d1}, w_{d2}, \dots, w_{dN_d})$, where $N_d$ is the length of document d and $w_{dn}$ is the word at the $n$th position of the document
• Generative model
  • For each document
    • Sample its cluster label $z \sim \text{Multinoulli}(\boldsymbol{\pi})$
      • $\boldsymbol{\pi} = (\pi_1, \pi_2, \dots, \pi_K)$, where $\pi_k$ is the proportion of the $k$th cluster
      • $p(z = k) = \pi_k$
    • For each word in the sequence
      • Sample the word $w_{dn} \sim \text{Multinoulli}(\boldsymbol{\beta}_z)$
      • $p(w_{dn} \mid z = k) = \beta_{k w_{dn}}$
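An illustrative sketch (not from the slides) of this generative process: a single cluster label is drawn once per document, and then every word in that document is drawn from that one cluster's word distribution. The vocabulary, $\boldsymbol{\pi}$, and $\beta$ are made-up values.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["data", "mining", "frequent", "pattern", "web", "information", "retrieval"]
pi = np.array([0.5, 0.5])                                        # cluster proportions
beta = np.array([[0.30, 0.25, 0.15, 0.15, 0.05, 0.05, 0.05],     # p(w | z=1)
                 [0.05, 0.05, 0.05, 0.05, 0.30, 0.25, 0.25]])    # p(w | z=2)

def generate_document(length):
    z = rng.choice(len(pi), p=pi)                # one topic/cluster for the whole document
    words = rng.choice(vocab, size=length, p=beta[z])
    return z, list(words)

print(generate_document(6))
```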
Likelihood Function
• For a set of M documents
$$L = \prod_d p(\boldsymbol{w}_d) = \prod_d \sum_k p(\boldsymbol{w}_d, z = k) = \prod_d \sum_k p(\boldsymbol{w}_d \mid z = k)\, p(z = k) = \prod_d \sum_k p(z = k) \prod_n \beta_{k w_{dn}}$$
Question
• Are the multinomial mixture model and the mixture of unigrams model equivalent? Why?
Text Data: Topic Models
• Text Data and Topic Models
• Revisit of Mixture Model
• Probabilistic Latent Semantic Analysis (pLSA)
• Summary
Notations
• Word, document, topic
  • $w, d, z$
• Word count in document
  • $c(w, d)$
• Word distribution for each topic ($\boldsymbol{\beta}_z$)
  • $\beta_{zw} = p(w \mid z)$
• Topic distribution for each document ($\boldsymbol{\theta}_d$)
  • $\theta_{dz} = p(z \mid d)$ (Yes, soft clustering)
Issues of Mixture of Unigrams
• All the words in the same document are sampled from the same topic
• In practice, people switch topics during their writing
Illustration of pLSA
Generative Model for pLSA
• Describes how a document is generated probabilistically
  • For each position in d, $n = 1, \dots, N_d$
    • Generate the topic for the position as $z_n \sim \text{Multinoulli}(\boldsymbol{\theta}_d)$, i.e., $p(z_n = k) = \theta_{dk}$ (Note: a 1-trial multinomial, i.e., categorical distribution)
    • Generate the word for the position as $w_n \sim \text{Multinoulli}(\boldsymbol{\beta}_{z_n})$, i.e., $p(w_n = w) = \beta_{z_n w}$
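A sketch (illustration only, not from the slides) contrasting pLSA with the mixture of unigrams: here each position first draws its own topic from the document's topic distribution $\boldsymbol{\theta}_d$, then a word from that topic's distribution $\boldsymbol{\beta}_z$. All parameter values are made-up.

```python
import numpy as np

rng = np.random.default_rng(2)
vocab = ["data", "mining", "frequent", "pattern", "web", "information", "retrieval"]
theta_d = np.array([0.7, 0.3])                                   # p(z | d): topic distribution of one document
beta = np.array([[0.30, 0.25, 0.15, 0.15, 0.05, 0.05, 0.05],     # p(w | z=1)
                 [0.05, 0.05, 0.05, 0.05, 0.30, 0.25, 0.25]])    # p(w | z=2)

def generate_plsa_document(length):
    doc = []
    for _ in range(length):
        z_n = rng.choice(len(theta_d), p=theta_d)    # the topic can change at every position
        w_n = rng.choice(vocab, p=beta[z_n])
        doc.append(w_n)
    return doc

print(generate_plsa_document(8))
```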
Graphical Model
• (Figure: pLSA plate diagram)
• Note: Sometimes, people add parameters such as $\theta$ and $\beta$ into the graphical model
The Likelihood Function for a Corpus
• Probability of a word
$$p(w \mid d) = \sum_k p(w, z = k \mid d) = \sum_k p(w \mid z = k)\, p(z = k \mid d) = \sum_k \beta_{kw} \theta_{dk}$$
• Likelihood of a corpus
$$\prod_d \prod_w \left( \pi_d \, p(w \mid d) \right)^{c(w,d)}, \text{ where } \pi_d = p(d) \text{ is usually considered as uniform, i.e., } 1/M$$
Re-arrange the Likelihood Function
• Group the same word from different positions together
$$\max \log L = \sum_{d,w} c(w, d) \log \sum_z \theta_{dz} \beta_{zw}$$
$$\text{s.t. } \sum_z \theta_{dz} = 1 \text{ and } \sum_w \beta_{zw} = 1$$
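A small sketch (all numbers made up for illustration) of evaluating this re-arranged objective: with counts $c(w,d)$ grouped, the corpus log-likelihood is $\sum_{d,w} c(w,d) \log \sum_z \theta_{dz} \beta_{zw}$.

```python
import numpy as np

# c[d, w]: word counts for two documents over a 7-word vocabulary (made-up numbers)
c = np.array([[5, 4, 3, 1, 0, 0, 1],
              [2, 3, 0, 0, 2, 3, 1]])
theta = np.array([[0.7, 0.3],                                     # theta[d, z] = p(z | d)
                  [0.2, 0.8]])
beta = np.array([[0.30, 0.25, 0.15, 0.15, 0.05, 0.05, 0.05],      # beta[z, w] = p(w | z)
                 [0.05, 0.05, 0.05, 0.05, 0.30, 0.25, 0.25]])

p_w_given_d = theta @ beta            # p(w | d) = sum_z theta_dz * beta_zw, shape (D, W)
log_L = np.sum(c * np.log(p_w_given_d))
print(log_L)
```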
Optimization: EM Algorithm
• Repeat until convergence
  • E-step: for each word in each document, calculate its conditional probability of belonging to each topic
$$p(z \mid w, d) \propto p(w \mid z, d)\, p(z \mid d) = \beta_{zw} \theta_{dz} \quad \left(\text{i.e., } p(z \mid w, d) = \frac{\beta_{zw} \theta_{dz}}{\sum_{z'} \beta_{z'w} \theta_{dz'}}\right)$$
  • M-step: given the conditional distribution, find the parameters that maximize the expected likelihood
$$\beta_{zw} \propto \sum_d p(z \mid w, d)\, c(w, d) \quad \left(\text{i.e., } \beta_{zw} = \frac{\sum_d p(z \mid w, d)\, c(w, d)}{\sum_{w', d} p(z \mid w', d)\, c(w', d)}\right)$$
$$\theta_{dz} \propto \sum_w p(z \mid w, d)\, c(w, d) \quad \left(\text{i.e., } \theta_{dz} = \frac{\sum_w p(z \mid w, d)\, c(w, d)}{N_d}\right)$$
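A sketch of one EM iteration under the same made-up setup as the previous sketch (counts c, current theta and beta): the E-step computes the per-word responsibilities $p(z \mid w, d)$ and the M-step renormalizes the responsibility-weighted counts.

```python
import numpy as np

def plsa_em_step(c, theta, beta):
    """One EM iteration for pLSA.
    c: (D, W) word counts; theta: (D, K) p(z|d); beta: (K, W) p(w|z)."""
    # E-step: p(z | w, d) ∝ beta[z, w] * theta[d, z]
    resp = theta[:, :, None] * beta[None, :, :]       # shape (D, K, W)
    resp /= resp.sum(axis=1, keepdims=True)           # normalize over z

    weighted = resp * c[:, None, :]                   # p(z|w,d) * c(w,d), shape (D, K, W)

    # M-step: beta[z, w] ∝ sum_d p(z|w,d) c(w,d);  theta[d, z] ∝ sum_w p(z|w,d) c(w,d)
    beta_new = weighted.sum(axis=0)
    beta_new /= beta_new.sum(axis=1, keepdims=True)
    theta_new = weighted.sum(axis=2)
    theta_new /= theta_new.sum(axis=1, keepdims=True)
    return theta_new, beta_new

# Example with the made-up counts and parameters from the previous sketch:
c = np.array([[5, 4, 3, 1, 0, 0, 1],
              [2, 3, 0, 0, 2, 3, 1]])
theta = np.array([[0.7, 0.3], [0.2, 0.8]])
beta = np.array([[0.30, 0.25, 0.15, 0.15, 0.05, 0.05, 0.05],
                 [0.05, 0.05, 0.05, 0.05, 0.30, 0.25, 0.25]])
theta, beta = plsa_em_step(c, theta, beta)
print(theta.round(3)); print(beta.round(3))
```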
Example
• Two documents, two topics
• Vocabulary: {data, mining, frequent, pattern, web, information, retrieval}
• At some iteration of the EM algorithm, E-step:
• (Table: E-step values $p(z \mid w, d)$ for each word in each document)
Example (Continued)
• M-step
$$\beta_{11} = \frac{0.8 \times 5 + 0.5 \times 2}{11.8 + 5.8} = 5/17.6 \qquad\qquad \theta_{11} = \frac{11.8}{17}$$
$$\beta_{12} = \frac{0.8 \times 4 + 0.5 \times 3}{11.8 + 5.8} = 4.7/17.6 \qquad\qquad \theta_{12} = \frac{5.2}{17}$$
$$\beta_{13} = 3/17.6,\quad \beta_{14} = 1.6/17.6,\quad \beta_{15} = 1.3/17.6,\quad \beta_{16} = 1.2/17.6,\quad \beta_{17} = 0.8/17.6$$
Text Data: Topic Models
• Text Data and Topic Models
• Revisit of Mixture Model
• Probabilistic Latent Semantic Analysis (pLSA)
• Summary
Summary
• Basic concepts
  • Word/term, document, corpus, topic
• Mixture of unigrams
• pLSA
  • Generative model
  • Likelihood function
  • EM algorithm
Quiz
• Q1: Is Multinomial Naïve Bayes a linear classifier?
• Q2: In pLSA, does the same word appearing at different positions in a document have the same conditional probability $p(z \mid w, d)$?