

  1. CS145: INTRODUCTION TO DATA MINING Text Data: Topic Model Instructor: Yizhou Sun yzsun@cs.ucla.edu December 4, 2017

  2. Methods to be Learnt
     • Classification: Logistic Regression; Decision Tree; KNN; SVM; NN (vector data); Naïve Bayes for Text (text data)
     • Clustering: K-means; hierarchical clustering; DBSCAN; Mixture Models (vector data); pLSA (text data)
     • Prediction: Linear Regression; GLM* (vector data)
     • Frequent Pattern Mining: Apriori; FP-growth (set data); GSP; PrefixSpan (sequence data)
     • Similarity Search: DTW (sequence data)

  3. Text Data: Topic Models • Text Data and Topic Models • Revisit of Mixture Model • Probabilistic Latent Semantic Analysis (pLSA) • Summary 3

  4. Text Data • Word/term • Document • A sequence of words • Corpus • A collection of documents 4

  5. Represent a Document
     • Most common way: Bag-of-Words
     • Ignore the order of words; keep the count of each word
     • (Slide figure: example count vectors for documents c1–c5 and m1–m4 under the vector space model)
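As a small illustration of the bag-of-words idea, a count vector can be built as follows. This is a minimal Python sketch, not from the slides; the toy documents, the vocabulary, and the helper name bag_of_words are invented for illustration.

```python
from collections import Counter

# Toy corpus; the documents and the vocabulary derived from them are invented for illustration.
docs = [
    "data mining finds frequent pattern in data",
    "web information retrieval ranks web pages",
]

# Fixed vocabulary built from the corpus; word order inside a document is ignored.
vocab = sorted({w for d in docs for w in d.split()})

def bag_of_words(doc, vocab):
    """Return the count vector c(w, d) for one document."""
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

for d in docs:
    print(bag_of_words(d, vocab))
```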

  6. Topics • Topic • A topic is represented by a word distribution • Relates to an issue

  7. Topic Models • Topic modeling • Get topics automatically from a corpus • Assign documents to topics automatically • Most frequently used topic models • pLSA • LDA 7

  8. Text Data: Topic Models • Text Data and Topic Models • Revisit of Mixture Model • Probabilistic Latent Semantic Analysis (pLSA) • Summary 8

  9. Mixture Model-Based Clustering
     • A set C of k probabilistic clusters C_1, …, C_k
       • Probability density/mass functions: $f_1, \dots, f_k$
       • Cluster prior probabilities: $w_1, \dots, w_k$, with $\sum_j w_j = 1$
     • Joint probability of an object i and its cluster C_j: $P(x_i, z_i = C_j) = w_j f_j(x_i)$
       • $z_i$: hidden random variable
     • Probability of object i: $P(x_i) = \sum_j w_j f_j(x_i)$
     • (Slide figure: two component densities $f_1(x)$ and $f_2(x)$)

  10. Maximum Likelihood Estimation
     • Since objects are assumed to be generated independently, for a data set D = {x_1, …, x_n} we have:
       $P(D) = \prod_i P(x_i) = \prod_i \sum_j w_j f_j(x_i)$
       $\Rightarrow \log P(D) = \sum_i \log P(x_i) = \sum_i \log \sum_j w_j f_j(x_i)$
     • Task: find a set C of k probabilistic clusters such that P(D) is maximized
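The log-likelihood above can be evaluated directly once the priors and component densities are fixed. A minimal sketch, assuming two Gaussian components and an invented 1-D data set (all numbers below are made up for illustration):

```python
import numpy as np
from scipy.stats import norm

# Invented 1-D data set D and mixture parameters, just to evaluate the formula.
x = np.array([0.1, 0.3, 2.9, 3.2, 3.0])
w = np.array([0.4, 0.6])                  # cluster priors w_j, sum to 1
mu = np.array([0.0, 3.0])                 # component means
sigma = np.array([0.5, 0.5])              # component standard deviations

# P(x_i) = sum_j w_j f_j(x_i), here with Gaussian f_j; log P(D) = sum_i log P(x_i).
weighted_densities = w * norm.pdf(x[:, None], loc=mu, scale=sigma)   # shape (n, k)
log_P_D = np.log(weighted_densities.sum(axis=1)).sum()
print(log_P_D)
```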

  11. Gaussian Mixture Model
     • Generative model: for each object
       • Pick its cluster, i.e., a distribution component: $Z \sim \mathrm{Multinoulli}(w_1, \dots, w_k)$
       • Sample a value from the selected distribution: $X \mid Z \sim N(\mu_Z, \sigma_Z^2)$
     • Overall likelihood function:
       $L(D \mid \theta) = \prod_i \sum_j w_j\, p(x_i \mid \mu_j, \sigma_j^2)$, s.t. $\sum_j w_j = 1$ and $w_j \ge 0$
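A minimal sketch of this generative process, assuming invented parameters for a two-component, one-dimensional mixture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented parameters of a 2-component, 1-D Gaussian mixture.
w = np.array([0.4, 0.6])        # cluster priors, sum to 1
mu = np.array([0.0, 3.0])
sigma = np.array([0.5, 0.5])

def sample_gmm(n):
    """Pick a component Z ~ Multinoulli(w), then draw X | Z ~ N(mu_Z, sigma_Z^2)."""
    z = rng.choice(len(w), size=n, p=w)     # hidden cluster labels
    x = rng.normal(mu[z], sigma[z])         # observed values
    return x, z

x, z = sample_gmm(5)
print(z, x)
```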

  12. Multinomial Mixture Model
     • For documents with bag-of-words representation
       • $\boldsymbol{x}_d = (x_{d1}, x_{d2}, \dots, x_{dN})$, where $x_{dn}$ is the count of the nth word of the vocabulary in document d
     • Generative model: for each document
       • Sample its cluster label $z \sim \mathrm{Multinoulli}(\boldsymbol{\pi})$
         • $\boldsymbol{\pi} = (\pi_1, \pi_2, \dots, \pi_K)$, where $\pi_k$ is the proportion of the kth cluster, i.e., $p(z = k) = \pi_k$
       • Sample its word vector $\boldsymbol{x}_d \sim \mathrm{Multinomial}(\boldsymbol{\beta}_z)$
         • $\boldsymbol{\beta}_z = (\beta_{z1}, \beta_{z2}, \dots, \beta_{zN})$, where $\beta_{zn}$ is the parameter associated with the nth word of the vocabulary
         • $p(\boldsymbol{x}_d \mid z = k) = \frac{(\sum_n x_{dn})!}{\prod_n x_{dn}!} \prod_n \beta_{kn}^{x_{dn}} \propto \prod_n \beta_{kn}^{x_{dn}}$

  13. Likelihood Function
     • For a set of M documents:
       $L = \prod_d p(\boldsymbol{x}_d) = \prod_d \sum_k p(\boldsymbol{x}_d, z = k)$
       $\;\;= \prod_d \sum_k p(\boldsymbol{x}_d \mid z = k)\, p(z = k)$
       $\;\;\propto \prod_d \sum_k p(z = k) \prod_n \beta_{kn}^{x_{dn}}$
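A minimal sketch that evaluates this likelihood in log space, dropping the multinomial coefficient as the $\propto$ does; the count matrix and parameters below are invented for illustration:

```python
import numpy as np
from scipy.special import logsumexp

# Invented toy setup: M=3 documents over a 4-word vocabulary, K=2 clusters.
X = np.array([[2, 1, 0, 0],                 # x_d: word counts of each document
              [0, 0, 3, 1],
              [1, 1, 1, 1]])
pi = np.array([0.5, 0.5])                   # cluster proportions pi_k
beta = np.array([[0.5, 0.3, 0.1, 0.1],      # beta_k: word distribution of cluster k
                 [0.1, 0.1, 0.5, 0.3]])

# log p(x_d | z=k), up to the multinomial coefficient: sum_n x_dn * log(beta_kn)
log_p_x_given_z = X @ np.log(beta).T        # shape (M, K)
# log L = sum_d log sum_k pi_k * p(x_d | z=k), computed stably with logsumexp
log_L = logsumexp(np.log(pi) + log_p_x_given_z, axis=1).sum()
print(log_L)
```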

  14. Mixture of Unigrams
     • For documents represented by a sequence of words
       • $\boldsymbol{w}_d = (w_{d1}, w_{d2}, \dots, w_{dN_d})$, where $N_d$ is the length of document d and $w_{dn}$ is the word at the nth position of the document
     • Generative model: for each document
       • Sample its cluster label $z \sim \mathrm{Multinoulli}(\boldsymbol{\pi})$
         • $\boldsymbol{\pi} = (\pi_1, \pi_2, \dots, \pi_K)$, where $\pi_k$ is the proportion of the kth cluster, i.e., $p(z = k) = \pi_k$
       • For each word in the sequence, sample the word $w_{dn} \sim \mathrm{Multinoulli}(\boldsymbol{\beta}_z)$
         • $p(w_{dn} \mid z = k) = \beta_{k, w_{dn}}$

  15. Likelihood Function
     • For a set of M documents:
       $L = \prod_d p(\boldsymbol{w}_d) = \prod_d \sum_k p(\boldsymbol{w}_d, z = k)$
       $\;\;= \prod_d \sum_k p(\boldsymbol{w}_d \mid z = k)\, p(z = k)$
       $\;\;= \prod_d \sum_k p(z = k) \prod_n \beta_{k, w_{dn}}$
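A minimal sketch that evaluates this likelihood in log space, assuming invented toy documents given as word-index sequences:

```python
import numpy as np
from scipy.special import logsumexp

# Invented toy setup: documents as sequences of word indices into a 4-word vocabulary.
docs = [np.array([0, 1, 0]),                # w_d: the word at each position n
        np.array([2, 3, 2, 2])]
pi = np.array([0.5, 0.5])                   # cluster proportions pi_k
beta = np.array([[0.5, 0.3, 0.1, 0.1],      # beta_k: word distribution of cluster k
                 [0.1, 0.1, 0.5, 0.3]])

# log L = sum_d log sum_k pi_k * prod_n beta_{k, w_dn}
log_L = 0.0
for w_d in docs:
    log_p_w_given_z = np.log(beta[:, w_d]).sum(axis=1)   # shape (K,)
    log_L += logsumexp(np.log(pi) + log_p_w_given_z)
print(log_L)
```

Per document, the only difference from the multinomial-mixture sketch above is the missing multinomial coefficient, which does not depend on the parameters.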

  16. Question • Are multinomial mixture model and mixture of unigrams model equivalent? Why? 16

  17. Text Data: Topic Models • Text Data and Topic Models • Revisit of Mixture Model • Probabilistic Latent Semantic Analysis (pLSA) • Summary 17

  18. Notations
     • Word, document, topic: $w, d, z$
     • Word count in document: $c(w, d)$
     • Word distribution for each topic ($\boldsymbol{\beta}_z$): $\beta_{zw} = p(w \mid z)$
     • Topic distribution for each document ($\boldsymbol{\theta}_d$): $\theta_{dz} = p(z \mid d)$ (yes, this is soft clustering)

  19. Issues of Mixture of Unigrams • All the words in the same document are sampled from the same topic • In practice, people switch topics during their writing

  20. Illustration of pLSA 20

  21. Generative Model for pLSA
     • Describes how a document is generated probabilistically
     • For each position in d, $n = 1, \dots, N_d$:
       • Generate the topic for the position as $z_n \sim \mathrm{Multinoulli}(\boldsymbol{\theta}_d)$, i.e., $p(z_n = k) = \theta_{dk}$ (note: a 1-trial multinomial, i.e., a categorical distribution)
       • Generate the word for the position as $w_n \sim \mathrm{Multinoulli}(\boldsymbol{\beta}_{z_n})$, i.e., $p(w_n = w) = \beta_{z_n w}$
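A minimal sketch of this generative process for a single document, assuming invented topic and word distributions over a 4-word vocabulary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented parameters: K=2 topics over a 4-word vocabulary, one document d.
theta_d = np.array([0.7, 0.3])              # theta_d: topic distribution of document d
beta = np.array([[0.5, 0.3, 0.1, 0.1],      # beta_k: word distribution of topic k
                 [0.1, 0.1, 0.5, 0.3]])
N_d = 10                                    # length of the document

# For each position n: draw a topic z_n ~ Multinoulli(theta_d), then a word w_n ~ Multinoulli(beta_{z_n}).
z = rng.choice(len(theta_d), size=N_d, p=theta_d)
w = np.array([rng.choice(beta.shape[1], p=beta[k]) for k in z])
print(z)
print(w)
```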

  22. Graphical Model
     • Note: sometimes, people add parameters such as $\theta$ and $\beta$ into the graphical model

  23. The Likelihood Function for a Corpus
     • Probability of a word:
       $p(w \mid d) = \sum_k p(w, z = k \mid d) = \sum_k p(w \mid z = k)\, p(z = k \mid d) = \sum_k \beta_{kw} \theta_{dk}$
     • Likelihood of a corpus:
       $L = \prod_d \prod_w \big( \pi_d\, p(w \mid d) \big)^{c(w,d)}$, where $\pi_d = p(d)$ is usually considered as uniform, i.e., 1/M

  24. Re-arrange the Likelihood Function
     • Group the same word from different positions together:
       $\max \log L = \sum_{d,w} c(w, d) \log \sum_z \theta_{dz} \beta_{zw}$
       s.t. $\sum_z \theta_{dz} = 1$ and $\sum_w \beta_{zw} = 1$
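A minimal sketch that evaluates $p(w \mid d)$ and this re-arranged log-likelihood for an invented toy corpus; the counts and parameters below are made up, and the uniform $p(d)$ term is dropped as a constant:

```python
import numpy as np

# Invented toy corpus: D=2 documents, V=4 words, K=2 topics.
C = np.array([[5, 4, 3, 0],                 # c(w, d), one row per document
              [0, 2, 3, 4]])
theta = np.array([[0.7, 0.3],               # theta_{dz} = p(z | d), rows sum to 1
                  [0.2, 0.8]])
beta = np.array([[0.5, 0.3, 0.1, 0.1],      # beta_{zw} = p(w | z), rows sum to 1
                 [0.1, 0.1, 0.5, 0.3]])

# p(w | d) = sum_z theta_{dz} * beta_{zw}: a (D, V) matrix whose rows sum to 1.
p_w_given_d = theta @ beta
# log L = sum_{d,w} c(w, d) * log p(w | d)
log_L = (C * np.log(p_w_given_d)).sum()
print(p_w_given_d.sum(axis=1), log_L)
```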

  25. Optimization: EM Algorithm
     • Repeat until convergence:
       • E-step: for each word in each document, calculate its conditional probability of belonging to each topic:
         $p(z \mid w, d) \propto p(w \mid z, d)\, p(z \mid d) = \beta_{zw} \theta_{dz}$, i.e., $p(z \mid w, d) = \frac{\beta_{zw} \theta_{dz}}{\sum_{z'} \beta_{z'w} \theta_{dz'}}$
       • M-step: given the conditional distribution, find the parameters that maximize the expected likelihood:
         $\beta_{zw} \propto \sum_d p(z \mid w, d)\, c(w, d)$, i.e., $\beta_{zw} = \frac{\sum_d p(z \mid w, d)\, c(w, d)}{\sum_{w', d} p(z \mid w', d)\, c(w', d)}$
         $\theta_{dz} \propto \sum_w p(z \mid w, d)\, c(w, d)$, i.e., $\theta_{dz} = \frac{\sum_w p(z \mid w, d)\, c(w, d)}{N_d}$
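Putting the two steps together, here is a compact sketch of the full EM loop on a word-count matrix. It follows the updates above, with theta standing for $p(z \mid d)$ and beta for $p(w \mid z)$; the count matrix, the number of topics, and the random initialization are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented count matrix c(w, d): rows = documents (D=2), columns = vocabulary words (V=7).
C = np.array([[5.0, 4.0, 3.0, 1.0, 1.0, 2.0, 1.0],
              [1.0, 2.0, 1.0, 0.0, 3.0, 3.0, 4.0]])
D, V = C.shape
K = 2                                            # number of topics

# Random initialization of theta (D x K) and beta (K x V), rows normalized to sum to 1.
theta = rng.random((D, K)); theta /= theta.sum(axis=1, keepdims=True)
beta = rng.random((K, V)); beta /= beta.sum(axis=1, keepdims=True)

for it in range(100):
    # E-step: responsibilities p(z | w, d) proportional to beta_{zw} * theta_{dz}, shape (D, V, K).
    resp = theta[:, None, :] * beta.T[None, :, :]
    resp /= resp.sum(axis=2, keepdims=True)

    # M-step: beta_{zw} ~ sum_d p(z|w,d) c(w,d); theta_{dz} ~ sum_w p(z|w,d) c(w,d).
    weighted = resp * C[:, :, None]              # p(z|w,d) * c(w,d)
    beta = weighted.sum(axis=0).T                # shape (K, V)
    beta /= beta.sum(axis=1, keepdims=True)
    theta = weighted.sum(axis=1)                 # shape (D, K)
    theta /= theta.sum(axis=1, keepdims=True)    # each row sum equals N_d before normalizing

log_L = (C * np.log(theta @ beta)).sum()
print("log-likelihood:", log_L)
```

Printing the objective $\sum_{d,w} c(w,d) \log \sum_z \theta_{dz}\beta_{zw}$ after each iteration is a handy convergence check, since EM never decreases it.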

  26. Example • Two documents, two topics • Vocabulary: {data, mining, frequent, pattern, web, information, retrieval} • At some iteration of EM algorithm, E-step 26

  27. Example (Continued)
     • M-step:
       $\beta_{11} = \frac{0.8 \times 5 + 0.5 \times 2}{17.6} = 5/17.6$, $\beta_{12} = \frac{0.8 \times 4 + 0.5 \times 3}{17.6} = 4.7/17.6$
       $\beta_{13} = 3/17.6$, $\beta_{14} = 1.6/17.6$, $\beta_{15} = 1.3/17.6$, $\beta_{16} = 1.2/17.6$, $\beta_{17} = 0.8/17.6$
       $\theta_{11} = \frac{11.8}{11.8 + 5.2} = 11.8/17$, $\theta_{12} = \frac{5.2}{11.8 + 5.2} = 5.2/17$
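A quick sanity check of the arithmetic above, assuming (as in the slides' example) responsibilities $p(z=1 \mid w, d)$ of 0.8 for document 1 and 0.5 for document 2 on the first two vocabulary words:

```python
# Numbers taken from the example: responsibilities 0.8 (document 1) and 0.5 (document 2),
# with counts 5, 2 for the first word and 4, 3 for the second word.
print(0.8 * 5 + 0.5 * 2)                                # 5.0 -> beta_11 = 5/17.6
print(round(0.8 * 4 + 0.5 * 3, 2))                      # 4.7 -> beta_12 = 4.7/17.6

# The seven numerators sum to the normalizer 17.6, so the beta_1 row sums to 1.
print(round(sum([5, 4.7, 3, 1.6, 1.3, 1.2, 0.8]), 1))   # 17.6
```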

  28. Text Data: Topic Models • Text Data and Topic Models • Revisit of Mixture Model • Probabilistic Latent Semantic Analysis (pLSA) • Summary 28

  29. Summary • Basic Concepts • Word/term, document, corpus, topic • Mixture of unigrams • pLSA • Generative model • Likelihood function • EM algorithm 29

  30. Quiz • Q1: Is Multinomial Naïve Bayes a linear classifier? • Q2: In pLSA, for the same word appearing at different positions in a document, do those positions have the same conditional probability $p(z \mid w, d)$?
