CS6220: DATA MINING TECHNIQUES
Text Data: Topic Models
Instructor: Yizhou Sun (yzsun@ccs.neu.edu)
February 17, 2016
Methods to Learn
- Classification: Decision Tree; Naïve Bayes; Logistic Regression; SVM; kNN (matrix data); HMM (sequence data); Label Propagation (graph & network); Neural Network (images)
- Clustering: K-means; hierarchical clustering; DBSCAN; Mixture Models; kernel k-means* (matrix data); PLSA (text data); SCAN; Spectral Clustering (graph & network)
- Frequent Pattern Mining: Apriori; FP-growth (set data); GSP; PrefixSpan (sequence data)
- Prediction: Linear Regression (matrix data); Autoregression (time series); Collaborative Filtering (graph & network)
- Similarity Search: DTW (time series); P-PageRank (graph & network)
- Ranking: PageRank (graph & network)
Text Data: Topic Models
- Text Data and Topic Models
- Probabilistic Latent Semantic Analysis
- Summary
Text Data
- Word/term
- Document
  - A bag of words
- Corpus
  - A collection of documents
Represent a Document
- Most common way: Bag-of-Words
  - Ignore the order of words
  - Keep the counts
[Example: word-document count matrix for documents c1-c5 and m1-m4]
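The bag-of-words idea can be sketched in a few lines of Python; the example sentence below is made up for illustration:

```python
from collections import Counter

def bag_of_words(doc: str) -> Counter:
    """Lower-case, split on whitespace, and count word occurrences,
    ignoring word order (bag-of-words)."""
    return Counter(doc.lower().split())

counts = bag_of_words("data mining finds patterns in data")
print(counts["data"])   # word order is lost; only the count (2) remains
```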
More Details
- Represent the doc as a vector where each entry corresponds to a different word, and the number at that entry corresponds to how many times that word appears in the document (or some function of it)
  - The number of distinct words is huge
- Select and use a smaller set of words that are of interest
  - E.g., uninteresting words: 'and', 'the', 'at', 'is', etc. These are called stop-words
  - Stemming: remove endings. E.g., 'learn', 'learning', 'learnable', 'learned' could all be substituted by the single stem 'learn'
  - Other simplifications can also be invented and used
- The set of different remaining words is called the dictionary or vocabulary. Fix an ordering of the terms in the dictionary so that you can refer to them by their index.
- Can be extended to bi-grams, tri-grams, and so on
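These preprocessing steps can be sketched as a toy pipeline. The stop-word list and suffix-stripping "stemmer" below are illustrative simplifications, not standard tools:

```python
# Toy preprocessing: stop-word removal, crude suffix stemming, and a
# fixed-order vocabulary mapping terms to vector indices.
# The stop-word list and stemming rules are made up for illustration.
STOP_WORDS = {"and", "the", "at", "is", "in", "a", "of", "to"}

def stem(word):
    # Extremely crude stemmer: strip a few common endings.
    for suffix in ("ing", "able", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def to_vector(doc, vocab):
    """Count vector over the fixed vocabulary ordering."""
    vec = [0] * len(vocab)
    for w in doc.lower().split():
        w = stem(w)
        if w in vocab:               # stop-words never enter the vocabulary
            vec[vocab[w]] += 1
    return vec

docs = ["learning to learn is learnable", "the machine learned patterns"]
# Build the dictionary from all non-stop-word stems, in a fixed order.
terms = sorted({stem(w) for d in docs for w in d.lower().split()} - STOP_WORDS)
vocab = {t: i for i, t in enumerate(terms)}
print(vocab)                              # {'learn': 0, 'machine': 1, 'pattern': 2}
print([to_vector(d, vocab) for d in docs])
```

Note how all four 'learn' variants collapse onto the single stem 'learn', so the first document maps to the vector [3, 0, 0].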
Topics
- Topic
  - A topic is represented by a word distribution
  - Relates to an issue
Topic Models
- Topic modeling
  - Get topics automatically from a corpus
  - Assign documents to topics automatically
- Most frequently used topic models
  - pLSA
  - LDA
Text Data: Topic Models
- Text Data and Topic Models
- Probabilistic Latent Semantic Analysis
- Summary
Notations
- Word, document, topic
  - $w$, $d$, $z$
- Word count in document
  - $c(w, d)$
- Word distribution for each topic ($\beta_z$)
  - $\beta_{zw} = p(w \mid z)$
- Topic distribution for each document ($\theta_d$)
  - $\theta_{dz} = p(z \mid d)$ (yes, this is fuzzy clustering)
Review of Multinomial Distribution
- Select $n$ data points from $K$ categories, each with probability $p_k$
  - $n$ trials of independent categorical distribution
  - E.g., roll a die $n$ times; each face 1-6 comes up with probability 1/6
- When $K = 2$: binomial distribution
  - $n$ trials of independent Bernoulli distribution
  - E.g., flip a coin $n$ times to get heads or tails
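A quick numerical illustration of these two distributions, using NumPy's random generator (the trial counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# n = 60 trials over K = 6 categories (a fair die): multinomial counts.
counts = rng.multinomial(60, [1 / 6] * 6)
print(counts, counts.sum())   # six per-face counts, summing to 60

# K = 2 reduces to the binomial: n coin flips.
heads = rng.binomial(n=10, p=0.5)
print(heads)                  # number of heads in 10 Bernoulli trials
```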
Generative Model for pLSA
- Describes how a document is generated probabilistically
- For each position in $d$, $i = 1, \dots, N_d$:
  - Generate the topic for the position as $z_i \sim \mathrm{mult}(\theta_d)$, i.e., $p(z_i = k) = \theta_{dk}$ (note: a 1-trial multinomial, i.e., categorical distribution)
  - Generate the word for the position as $w_i \sim \mathrm{mult}(\beta_{z_i})$, i.e., $p(w_i = w) = \beta_{z_i w}$
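This generative process can be simulated directly. The vocabulary, topics, and all probabilities below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

vocab = ["data", "mining", "gene", "protein"]
# beta[z, w] = p(w | z): one word distribution per topic (made-up numbers).
beta = np.array([[0.5, 0.4, 0.05, 0.05],   # topic 0: "computer science"
                 [0.05, 0.05, 0.5, 0.4]])  # topic 1: "biology"
theta_d = np.array([0.8, 0.2])             # p(z | d) for this document

N_d = 8
doc = []
for _ in range(N_d):                       # for each position i = 1..N_d
    z = rng.choice(2, p=theta_d)           # draw topic z_i ~ mult(theta_d)
    w = rng.choice(4, p=beta[z])           # draw word w_i ~ mult(beta_{z_i})
    doc.append(vocab[w])
print(doc)
```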
The Likelihood Function for a Corpus
- Probability of a word
  $p(w \mid d) = \sum_k p(w, z = k \mid d) = \sum_k p(w \mid z = k)\, p(z = k \mid d) = \sum_k \beta_{kw}\, \theta_{dk}$
- Likelihood of a corpus
  $L = \prod_d \prod_w \big( p(d)\, p(w \mid d) \big)^{c(w, d)}$
  ($p(d)$ is usually considered as uniform, which can be dropped)
Re-arrange the Likelihood Function
- Group the same word from different positions together:
  $\max \log L = \sum_d \sum_w c(w, d) \log \sum_z \theta_{dz}\, \beta_{zw}$
  $\text{s.t.}\ \sum_z \theta_{dz} = 1\ \text{and}\ \sum_w \beta_{zw} = 1$
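This log-likelihood objective can be evaluated directly: $p(w \mid d)$ is just a matrix product of the parameters. The counts and parameter values below are toy numbers chosen only so that the row-sum constraints hold:

```python
import numpy as np

# Toy corpus: count matrix c[d, w] for 2 documents over 3 words.
c = np.array([[3, 1, 0],
              [0, 2, 4]])
# Parameters (made up): theta[d, z] = p(z | d), beta[z, w] = p(w | z);
# rows sum to one, as the constraints require.
theta = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
beta = np.array([[0.6, 0.3, 0.1],
                 [0.1, 0.2, 0.7]])

# p(w | d) = sum_z theta[d, z] * beta[z, w], i.e., a matrix product.
p_w_given_d = theta @ beta
log_L = np.sum(c * np.log(p_w_given_d))
print(log_L)
```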
Optimization: EM Algorithm
- Repeat until convergence
  - E-step: for each word in each document, calculate its conditional probability of belonging to each topic
    $p(z \mid w, d) \propto p(w \mid z, d)\, p(z \mid d) = \beta_{zw}\, \theta_{dz}$
    (i.e., $p(z \mid w, d) = \frac{\beta_{zw}\, \theta_{dz}}{\sum_{z'} \beta_{z'w}\, \theta_{dz'}}$)
  - M-step: given the conditional distribution, find the parameters that maximize the expected likelihood
    $\beta_{zw} \propto \sum_d p(z \mid w, d)\, c(w, d)$
    (i.e., $\beta_{zw} = \frac{\sum_d p(z \mid w, d)\, c(w, d)}{\sum_{w', d} p(z \mid w', d)\, c(w', d)}$)
    $\theta_{dz} \propto \sum_w p(z \mid w, d)\, c(w, d)$
    (i.e., $\theta_{dz} = \frac{\sum_w p(z \mid w, d)\, c(w, d)}{N_d}$)
Text Data: Topic Models
- Text Data and Topic Models
- Probabilistic Latent Semantic Analysis
- Summary
Summary
- Basic Concepts
  - Word/term, document, corpus, topic
  - How to represent a document
- pLSA
  - Generative model
  - Likelihood function
  - EM algorithm