SLIDE 1

CS6220: DATA MINING TECHNIQUES

Text Data: Topic Models

Instructor: Yizhou Sun
yzsun@ccs.neu.edu

February 17, 2016

SLIDE 2

Methods to Learn

  • Classification: Decision Tree, Naïve Bayes, Logistic Regression, SVM, kNN (matrix data); HMM (sequence data); Label Propagation (graph & network); Neural Network (images)
  • Clustering: K-means, hierarchical clustering, DBSCAN, Mixture Models, kernel k-means* (matrix data); PLSA (text data); SCAN, Spectral Clustering (graph & network)
  • Frequent Pattern Mining: Apriori, FP-growth (set data); GSP, PrefixSpan (sequence data)
  • Prediction: Linear Regression (matrix data); Autoregression (time series); Collaborative Filtering (graph & network)
  • Similarity Search: DTW (time series); P-PageRank (graph & network)
  • Ranking: PageRank (graph & network)

SLIDE 3

Text Data: Topic Models

  • Text Data and Topic Models
  • Probabilistic Latent Semantic Analysis
  • Summary


SLIDE 4

Text Data

  • Word/term
  • Document: a bag of words
  • Corpus: a collection of documents


SLIDE 5

Represent a Document

  • Most common way: Bag-of-Words
  • Ignore the order of words, but keep the counts

[Figure: example term-document count matrix for documents c1 to c5 and m1 to m4]

SLIDE 6

More Details

  • Represent the document as a vector where each entry corresponds to a different word, and the number at that entry corresponds to how many times that word appears in the document (or some function of it)
  • The number of words is huge, so select and use a smaller set of words that are of interest
  • E.g., remove uninteresting words such as 'and', 'the', 'at', 'is'; these are called stop words
  • Stemming: remove word endings. E.g., 'learn', 'learning', 'learnable', and 'learned' can all be replaced by the single stem 'learn'
  • Other simplifications can also be invented and used
  • The set of distinct remaining words is called the dictionary or vocabulary. Fix an ordering of the terms in the dictionary so that you can refer to them by their index
  • Can be extended to bi-grams, tri-grams, and so on (see the sketch below)
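
As a minimal sketch of this pipeline, the Python below builds bag-of-words count vectors using a toy stop-word list and a crude suffix-stripping stemmer; the specific stop words and suffix rules are illustrative assumptions, not part of the slides (real systems use curated lists and a proper stemmer such as Porter's).

from collections import Counter

# Toy stop-word list and suffix rules (illustrative assumptions only)
STOP_WORDS = {"and", "the", "at", "is", "a"}
SUFFIXES = ("ing", "able", "ed")

def stem(word):
    # Crude stemming: strip the first matching suffix, if any
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def tokenize(doc):
    # Lowercase, split on whitespace, drop stop words, stem the rest
    return [stem(w) for w in doc.lower().split() if w not in STOP_WORDS]

docs = ["the cat is learning", "a learned cat watched the dog"]

# Dictionary/vocabulary: the distinct remaining terms, in a fixed order
vocab = sorted({t for d in docs for t in tokenize(d)})

# Bag-of-words vector: entry i counts occurrences of vocab[i] in the doc
for d in docs:
    counts = Counter(tokenize(d))
    print([counts[t] for t in vocab])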


SLIDE 7

Topics

  • Topic
  • A topic is represented by a word distribution
  • Relates to an issue, e.g., a sports topic would put high probability on words like 'game' and 'team'


SLIDE 8

Topic Models

  • Topic modeling
  • Get topics automatically from a corpus
  • Assign documents to topics automatically
  • Most frequently used topic models
  • pLSA
  • LDA


SLIDE 9

Text Data: Topic Models

  • Text Data and Topic Models
  • Probabilistic Latent Semantic Analysis
  • Summary


SLIDE 10

Notations

  • Word, document, topic
  • $w$, $d$, $z$
  • Word count in document
  • $c(w, d)$
  • Word distribution for each topic ($\beta_z$)
  • $\beta_{zw} = p(w \mid z)$
  • Topic distribution for each document ($\theta_d$)
  • $\theta_{dz} = p(z \mid d)$ (Yes, fuzzy clustering)


SLIDE 11

Review of Multinomial Distribution

  • Select $n$ data points from $K$ categories, each with probability $p_k$
  • $n$ trials of independent categorical distribution
  • E.g., roll a die $n$ times, where each face 1-6 appears with probability 1/6
  • When $K = 2$, binomial distribution
  • $n$ trials of independent Bernoulli distribution
  • E.g., flip a coin $n$ times to get heads or tails (the pmf is written out below)
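
For reference, the probability mass function of this multinomial (a standard result, written here in LaTeX since the slide states it only in words):

$$P(X_1 = x_1, \dots, X_K = x_K) = \frac{n!}{x_1! \cdots x_K!} \prod_{k=1}^{K} p_k^{x_k}, \qquad \sum_{k=1}^{K} x_k = n, \quad \sum_{k=1}^{K} p_k = 1$$

Setting $K = 6$ and $p_k = 1/6$ gives the die example; $K = 2$ recovers the binomial.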


SLIDE 12

Generative Model for pLSA

  • Describe how a document is generated probabilistically
  • For each position in $d$, $n = 1, \dots, N_d$:
  • Generate the topic for the position as $z_n \sim \text{mult}(\cdot \mid \theta_d)$, i.e., $p(z_n = k) = \theta_{dk}$ (Note: a 1-trial multinomial, i.e., a categorical distribution)
  • Generate the word for the position as $w_n \sim \text{mult}(\cdot \mid \beta_{z_n})$, i.e., $p(w_n = w) = \beta_{z_n w}$ (a simulation sketch follows below)
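
To make the generative story concrete, here is a small simulation sketch in Python; the two topics, the toy vocabulary, and all parameter values are invented for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Toy parameters: topic 0 is "sports", topic 1 is "finance" (invented values)
vocab = ["game", "team", "stock", "market"]
beta = np.array([[0.50, 0.40, 0.05, 0.05],   # beta[z, w] = p(w | z)
                 [0.05, 0.05, 0.50, 0.40]])
theta_d = np.array([0.7, 0.3])               # theta_d[z] = p(z | d) for one document
N_d = 8                                      # number of positions in the document

doc = []
for n in range(N_d):
    z = rng.choice(2, p=theta_d)             # z_n ~ mult(theta_d): pick a topic
    w = rng.choice(4, p=beta[z])             # w_n ~ mult(beta_{z_n}): pick a word
    doc.append(vocab[w])
print(doc)                                   # e.g., ['team', 'game', 'stock', ...]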


SLIDE 13

The Likelihood Function for a Corpus

  • Probability of a word

$$p(w \mid d) = \sum_k p(w, z = k \mid d) = \sum_k p(w \mid z = k)\, p(z = k \mid d) = \sum_k \beta_{kw}\, \theta_{dk}$$

  • Likelihood of a corpus

$$\max \prod_d \pi_d \prod_w p(w \mid d)^{c(w, d)} = \prod_d \pi_d \prod_w \Big( \sum_k \beta_{kw}\, \theta_{dk} \Big)^{c(w, d)}$$

($\pi_d$ is usually considered as uniform, which can be dropped)
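
In code, this log-likelihood is one matrix product away from the parameters. A minimal numpy sketch, assuming C is the document-by-word count matrix and theta, beta are as defined on the notation slide (with the uniform $\pi_d$ dropped):

import numpy as np

def log_likelihood(C, theta, beta):
    # C[d, w]: count of word w in doc d
    # theta[d, k] = p(z=k | d), beta[k, w] = p(w | z=k)
    p_w_given_d = theta @ beta                       # [D, W]: sum_k theta[d,k] * beta[k,w]
    return np.sum(C * np.log(p_w_given_d + 1e-12))   # small epsilon guards log(0)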

SLIDE 14

Re-arrange the Likelihood Function

  • Group the same word from different positions together

$$\max \log L = \sum_{d, w} c(w, d) \log \sum_z \theta_{dz}\, \beta_{zw} \qquad \text{s.t.}\ \sum_z \theta_{dz} = 1 \ \text{and} \ \sum_w \beta_{zw} = 1$$

SLIDE 15

Optimization: EM Algorithm

  • Repeat until convergence
  • E-step: for each word in each document, calculate its conditional probability of belonging to each topic

$$p(z \mid w, d) \propto p(w \mid z, d)\, p(z \mid d) = \beta_{zw}\, \theta_{dz} \qquad \Big(\text{i.e., } p(z \mid w, d) = \frac{\beta_{zw}\, \theta_{dz}}{\sum_{z'} \beta_{z'w}\, \theta_{dz'}}\Big)$$

  • M-step: given the conditional distribution, find the parameters that maximize the expected likelihood (see the numpy sketch below)

$$\beta_{zw} \propto \sum_d p(z \mid w, d)\, c(w, d) \qquad \Big(\text{i.e., } \beta_{zw} = \frac{\sum_d p(z \mid w, d)\, c(w, d)}{\sum_{w', d} p(z \mid w', d)\, c(w', d)}\Big)$$

$$\theta_{dz} \propto \sum_w p(z \mid w, d)\, c(w, d) \qquad \Big(\text{i.e., } \theta_{dz} = \frac{\sum_w p(z \mid w, d)\, c(w, d)}{N_d}\Big)$$
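
The E- and M-steps translate almost line-for-line into numpy. Below is a compact sketch of the full EM loop; the number of topics K, the random initialization, and the fixed iteration count (in place of a convergence check on the log-likelihood) are illustrative choices, not part of the slides.

import numpy as np

def plsa_em(C, K, n_iter=100, seed=0):
    # C[d, w]: count of word w in document d
    rng = np.random.default_rng(seed)
    D, W = C.shape
    theta = rng.random((D, K))                        # theta[d, z] = p(z | d)
    theta /= theta.sum(axis=1, keepdims=True)
    beta = rng.random((K, W))                         # beta[z, w] = p(w | z)
    beta /= beta.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: post[d, w, z] = p(z | w, d), proportional to beta[z, w] * theta[d, z]
        post = theta[:, None, :] * beta.T[None, :, :]
        post /= post.sum(axis=2, keepdims=True)
        # Weight each observed count by its topic posterior
        weighted = C[:, :, None] * post               # [D, W, K]
        # M-step for beta: beta[z, w] proportional to sum_d p(z | w, d) c(w, d)
        beta = weighted.sum(axis=0).T
        beta /= beta.sum(axis=1, keepdims=True)
        # M-step for theta: theta[d, z] = sum_w p(z | w, d) c(w, d) / N_d
        theta = weighted.sum(axis=1)
        theta /= theta.sum(axis=1, keepdims=True)
    return theta, beta

The returned theta rows are the fuzzy document-topic assignments from the notation slide, and the beta rows are the per-topic word distributions.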


SLIDE 16

Text Data: Topic Models

  • Text Data and Topic Models
  • Probabilistic Latent Semantic Analysis
  • Summary


SLIDE 17

Summary

  • Basic Concepts
  • Word/term, document, corpus, topic
  • How to represent a document
  • pLSA
  • Generative model
  • Likelihood function
  • EM algorithm
