Data Mining Techniques CS 6220 - Section 2 - Spring 2017, Lecture 7, Jan-Willem van de Meent (credit: David Blei)
Review: K-means Clustering. Objective: sum of squared errors, SSE = \sum_{n=1}^N \sum_{k=1}^K z_{nk} \| x_n - \mu_k \|^2, with one-hot assignments z_n and a center \mu_k for each cluster k (figure: data points with three cluster centers \mu_1, \mu_2, \mu_3). Alternate between two steps: 1. Minimize SSE w.r.t. z_n. 2. Minimize SSE w.r.t. \mu_k.
Review: Probabilistic K-means. Generative Model: z_n ∼ Discrete(π), x_n | z_n = k ∼ Norm(\mu_k, \Sigma_k). Questions: 1. What is \log p(X, z | \mu, \Sigma, \pi)? 2. For what choice of π and Σ do we recover K-means? Same as K-means when \Sigma_k = \sigma^2 I and \pi_k = 1/K.
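As a worked answer to the first review question (the slide's own derivation is not legible in the extraction), the log joint under this generative model factorizes over data points and components:

\log p(X, z \mid \mu, \Sigma, \pi) = \sum_{n=1}^N \sum_{k=1}^K z_{nk} \left[ \log \pi_k + \log \mathrm{Norm}(x_n \mid \mu_k, \Sigma_k) \right]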
Review: Probabilistic K-means. Idea: replace hard assignments with soft assignments. Assignment Update: hard assignments z_{nk}. Parameter Updates: N_k := \sum_{n=1}^N z_{nk}, \quad \pi = (N_1/N, \ldots, N_K/N), \quad \mu_k = \frac{1}{N_k} \sum_{n=1}^N z_{nk}\, x_n, \quad \Sigma_k = \frac{1}{N_k} \sum_{n=1}^N z_{nk}\, (x_n - \mu_k)(x_n - \mu_k)^\top.
Review: Soft K-means. Idea: replace hard assignments with soft assignments \gamma_{nk}. Soft Assignment Update. Parameter Updates: N_k := \sum_{n=1}^N \gamma_{nk}, \quad \pi = (N_1/N, \ldots, N_K/N), \quad \mu_k = \frac{1}{N_k} \sum_{n=1}^N \gamma_{nk}\, x_n, \quad \Sigma_k = \frac{1}{N_k} \sum_{n=1}^N \gamma_{nk}\, (x_n - \mu_k)(x_n - \mu_k)^\top.
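The soft assignment update itself appears only as an image on the slide; for the Gaussian mixture above it is the standard responsibility (which reduces to a distance-based soft assignment under \Sigma_k = \sigma^2 I, \pi_k = 1/K):

\gamma_{nk} = \frac{\pi_k\, \mathrm{Norm}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j\, \mathrm{Norm}(x_n \mid \mu_j, \Sigma_j)}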
Review: Lower Bound on Log Likelihood.
\log p(x \mid \theta) = \sum_z q(z) \log p(x \mid \theta)   (multiplication by 1: \sum_z q(z) = 1)
= \sum_z q(z) \log \frac{p(x \mid \theta)\, q(z)}{q(z)}   (multiplication by 1: q(z)/q(z) = 1)
= \sum_z q(z) \log \frac{p(x, z \mid \theta)\, q(z)}{q(z)\, p(z \mid x, \theta)}   (Bayes rule: p(x \mid \theta) = p(x, z \mid \theta) / p(z \mid x, \theta))
= \mathcal{L}(q, \theta) + \mathrm{KL}\big(q(z) \,\|\, p(z \mid x, \theta)\big), \qquad \mathcal{L}(q, \theta) = \sum_z q(z) \log \frac{p(x, z \mid \theta)}{q(z)}.
Since the KL divergence is non-negative, \log p(x \mid \theta) \geq \mathcal{L}(q, \theta), with equality when q(z) = p(z \mid x, \theta).
Review: EM for Gaussian Mixtures. Generative Model: z_n ∼ Discrete(π), x_n | z_n = k ∼ Norm(\mu_k, \Sigma_k). Expectation Maximization: initialize θ, then repeat until convergence: 1. Expectation Step. 2. Maximization Step.
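A minimal NumPy sketch of this EM loop (not from the slides; names such as em_gmm, X, K, n_iter are illustrative), combining the E-step responsibilities with the soft-count parameter updates reviewed above:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    """EM for a Gaussian mixture: alternate soft assignments (E) and parameter updates (M)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Initialize: uniform weights, means at random data points, identity covariances
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, K, replace=False)].astype(float)
    Sigma = np.stack([np.eye(D) for _ in range(K)])
    for _ in range(n_iter):
        # E-step: responsibilities gamma[n, k] proportional to pi_k * Norm(x_n | mu_k, Sigma_k)
        gamma = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                          for k in range(K)], axis=1)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: soft counts, proportions, means, covariances (as in the soft K-means updates)
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
    return pi, mu, Sigma, gamma
```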
TOPIC MODELS (borrowing from: David Blei, Columbia)
Word Mixtures. Idea: model text as a mixture over words (ignore order). Example topics, shown as their most probable words: gene 0.04, dna 0.02, genetic 0.01, ...; life 0.02, evolve 0.01, organism 0.01, ...; brain 0.04, neuron 0.02, nerve 0.01, ...; data 0.02, number 0.02, computer 0.01, ...
EM for Word Mixtures. Generative Model. Expectation Maximization: initialize θ, then repeat until convergence: 1. Expectation Step. 2. Maximization Step.
EM for Word Mixtures. Generative Model. E-step: update assignments. M-step: update parameters.
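A sketch of one reading of this model (an assumption, since the slide's equations are images): a mixture of unigrams in which each document is softly assigned to a single topic, and each topic is a distribution over the vocabulary. Names such as em_word_mixture and counts are illustrative:

```python
import numpy as np

def em_word_mixture(counts, K, n_iter=100, seed=0):
    """EM for a mixture of unigrams ("word mixture").
    counts: (D, V) matrix of word counts per document."""
    rng = np.random.default_rng(seed)
    D, V = counts.shape
    pi = np.full(K, 1.0 / K)                      # topic proportions
    beta = rng.dirichlet(np.ones(V), size=K)      # topic-word distributions, shape (K, V)
    for _ in range(n_iter):
        # E-step: soft document-topic assignments from per-topic log-likelihoods
        log_resp = np.log(pi)[None, :] + counts @ np.log(beta).T   # (D, K)
        log_resp -= log_resp.max(axis=1, keepdims=True)            # numerical stabilization
        resp = np.exp(log_resp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update mixture proportions and per-topic word distributions
        pi = resp.mean(axis=0)
        beta = resp.T @ counts + 1e-10    # small constant avoids log(0) next iteration
        beta /= beta.sum(axis=1, keepdims=True)
    return pi, beta, resp
```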
Topic Modeling. Figure: topics, documents, and per-document topic proportions and assignments (same example topics as above). • Each topic is a distribution over words. • Each document is a mixture over topics. • Each word is drawn from one topic distribution.
EM for Topic Models (PLSI/PLSA*) Generative Model E-step: Update assignments M-step: Update parameters *(Probabilistic Latent Semantic Indexing, a.k.a. Probabilistic Latent Semantic Analysis)
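The update equations on this slide are images; a sketch of the standard EM updates for this model, written with per-document proportions θ_d and topics β_k (the notation used for LDA below):

E-step: \gamma_{dnk} = \frac{\theta_{dk}\, \beta_{k, w_{dn}}}{\sum_{j} \theta_{dj}\, \beta_{j, w_{dn}}}
M-step: \theta_{dk} = \frac{1}{N_d} \sum_{n} \gamma_{dnk}, \qquad \beta_{kv} \propto \sum_{d} \sum_{n} \gamma_{dnk}\, \mathbb{1}[w_{dn} = v]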
Topic Models with Priors Generative Model (with priors) Maximum a Posteriori E-step: Update assignments M-step: Update parameters
Latent Dirichlet Allocation (a.k.a. PLSI/PLSA with priors). Plate diagram: proportions parameter α → per-document topic proportions θ_d → per-word topic assignment z_{d,n} → observed word w_{d,n} ← topics β_k ← topic parameter η, with plates over N words, D documents, and K topics.
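A small sketch of the generative process this plate diagram encodes (assuming symmetric Dirichlet priors with scalar α and η and a fixed document length N; names like generate_lda_corpus are illustrative):

```python
import numpy as np

def generate_lda_corpus(D, N, K, V, alpha=0.1, eta=0.01, seed=0):
    """Sample a toy corpus from the LDA generative model in the plate diagram."""
    rng = np.random.default_rng(seed)
    beta = rng.dirichlet(np.full(V, eta), size=K)        # topics beta_k ~ Dir(eta), shape (K, V)
    docs, thetas = [], []
    for _ in range(D):
        theta = rng.dirichlet(np.full(K, alpha))         # theta_d ~ Dir(alpha)
        z = rng.choice(K, size=N, p=theta)               # z_{d,n} ~ Discrete(theta_d)
        w = np.array([rng.choice(V, p=beta[k]) for k in z])   # w_{d,n} ~ Discrete(beta_{z_{d,n}})
        docs.append(w)
        thetas.append(theta)
    return np.array(docs), np.array(thetas), beta
```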
Intermezzo: Dirichlet Distribution
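The slide's equations are images; for reference, the Dirichlet density over the probability simplex is:

p(\theta \mid \alpha) = \frac{\Gamma\!\left(\sum_{k=1}^K \alpha_k\right)}{\prod_{k=1}^K \Gamma(\alpha_k)} \prod_{k=1}^K \theta_k^{\alpha_k - 1}, \qquad \theta_k \geq 0, \;\; \sum_{k=1}^K \theta_k = 1.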
Intermezzo: Conjugacy Likelihood (discrete) Prior (Dirichlet) Question: What distribution is the posterior? More examples: https://en.wikipedia.org/wiki/Conjugate_prior
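A worked answer to the question (a standard fact, not transcribed from the slide): with category counts n_k from discrete observations, the Dirichlet prior is conjugate and the posterior is again a Dirichlet distribution:

p(\theta \mid x, \alpha) \propto \prod_{k} \theta_k^{n_k} \prod_{k} \theta_k^{\alpha_k - 1} = \prod_{k} \theta_k^{\alpha_k + n_k - 1} \;\;\Rightarrow\;\; \theta \mid x \sim \mathrm{Dir}(\alpha_1 + n_1, \ldots, \alpha_K + n_K).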
MAP estimation for LDA Generative Model (with priors) Maximum a Posteriori E-step: Update assignments M-step: Update parameters
Variational Inference. Idea: maximize the Evidence Lower Bound (ELBO). Maximizing the ELBO is equivalent to minimizing the KL divergence.
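The decomposition behind this statement (standard, and matching the lower-bound review above; the slide's own formulas are not legible in the extraction):

\log p(x) = \mathcal{L}(q) + \mathrm{KL}\big(q(z, \beta, \theta) \,\|\, p(z, \beta, \theta \mid x)\big), \qquad \mathcal{L}(q) = \mathbb{E}_q[\log p(x, z, \beta, \theta)] - \mathbb{E}_q[\log q(z, \beta, \theta)].

Since \log p(x) does not depend on q, maximizing \mathcal{L}(q) over q minimizes the KL divergence.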
Variational EM. Use a factorized approximation q(z, β, θ) = q(z)\, q(β)\, q(θ), with q(z) Discrete(φ), q(β) Dirichlet(λ), q(θ) Dirichlet(γ). Variational E-step: maximize the ELBO w.r.t. φ (expectations are available in closed form for Dirichlet distributions). Variational M-step: maximize the ELBO w.r.t. λ and γ (analogous to MAP estimation).
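A sketch of the closed-form coordinate updates this factorization yields (the standard mean-field updates for LDA, assuming symmetric scalar α and η; Ψ is the digamma function; not transcribed from the slide):

\phi_{dnk} \propto \exp\!\Big( \Psi(\gamma_{dk}) - \Psi\big(\textstyle\sum_j \gamma_{dj}\big) + \Psi(\lambda_{k, w_{dn}}) - \Psi\big(\textstyle\sum_v \lambda_{kv}\big) \Big)
\gamma_{dk} = \alpha + \sum_n \phi_{dnk}, \qquad \lambda_{kv} = \eta + \sum_d \sum_n \phi_{dnk}\, \mathbb{1}[w_{dn} = v]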
Example Inference. Figure: inferred topic proportions for an example document (probability on the vertical axis, topic index 1-100 on the horizontal axis).
Example Inference. Four inferred topics (top 15 words each):
Topic 1: human, genome, dna, genetic, genes, sequence, gene, molecular, sequencing, map, information, genetics, mapping, project, sequences.
Topic 2: evolution, evolutionary, species, organisms, life, origin, biology, groups, phylogenetic, living, diversity, group, new, two, common.
Topic 3: disease, host, bacteria, diseases, resistance, bacterial, new, strains, control, infectious, malaria, parasite, parasites, united, tuberculosis.
Topic 4: computer, models, information, data, computers, system, network, systems, model, parallel, methods, networks, software, new, simulations.
Example Inference
Example Inference. Four more inferred topics (top 15 words each):
Topic 1: problem, problems, mathematical, number, new, mathematics, university, two, first, numbers, work, time, mathematicians, chaos, chaotic.
Topic 2: model, rate, constant, distribution, time, number, size, values, value, average, rates, data, density, measured, models.
Topic 3: selection, male, males, females, sex, species, female, evolution, populations, population, sexual, behavior, evolutionary, genetic, reproductive.
Topic 4: species, forest, ecology, fish, ecological, conservation, diversity, population, natural, ecosystems, populations, endangered, tropical, forests, ecosystem.
Performance Metric: Perplexity. Figure: held-out perplexity vs. number of topics on two corpora (Nematode abstracts and Associated Press), comparing Smoothed Unigram, Smoothed Mixture of Unigrams, LDA, and Fold-in pLSI.
perplexity = \exp\!\left( - \frac{\sum_d \log p(w_d)}{\sum_d N_d} \right)
where p(w_d) is the marginal likelihood (evidence) of the held-out documents.
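A direct transcription of this formula into code (the inputs are hypothetical per-document held-out log-likelihoods and document lengths):

```python
import numpy as np

def perplexity(log_evidence, doc_lengths):
    """perplexity = exp(-sum_d log p(w_d) / sum_d N_d)
    log_evidence: per-document held-out marginal log-likelihoods log p(w_d)
    doc_lengths:  per-document word counts N_d"""
    return np.exp(-np.sum(log_evidence) / np.sum(doc_lengths))
```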
Extensions of LDA • EM inference (PLSA/PLSI) yields similar results to Variational inference or MAP inference (LDA) on most data • Reason for popularity of LDA: can be embedded in more complicated models
Extensions: Supervised LDA. Plate diagram: α → θ_d → z_{d,n} → w_{d,n} ← β_k, with a per-document response y_d depending on (η, σ²) and the topic assignments.
1. Draw topic proportions θ | α ∼ Dir(α).
2. For each word: draw topic assignment z_n | θ ∼ Mult(θ); draw word w_n | z_n, β_{1:K} ∼ Mult(β_{z_n}).
3. Draw response variable y | z_{1:N}, η, σ² ∼ N(η^⊤ \bar{z}, σ²), where \bar{z} = (1/N) \sum_{n=1}^N z_n.
Extensions: Supervised LDA. Figure: topics estimated from movie reviews, with each topic's top words plotted at its estimated regression coefficient (roughly -30 to 20); negatively weighted topics contain words such as "bad", "awful", "unfortunately", "dull", while positively weighted topics contain words such as "perfect", "fascinating", "effective", "performances".
Extensions: Correlated Topic Model. Plate diagram: (μ, Σ) → η_d → z_{d,n} → w_{d,n} ← β_k, with plates over N, D, K. Non-conjugate (logistic normal) prior on topic proportions; estimate a covariance matrix Σ that parameterizes correlations between topics in a document.
Extensions: Dynamic Topic Models. Inaugural addresses, 1789-2009. 1789: "AMONG the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order..." 2009: "My fellow citizens: I stand here today humbled by the task before us, grateful for the trust you have bestowed, mindful of the sacrifices borne by our ancestors..." Track changes in word distributions associated with a topic over time.
Extensions: Dynamic Topic Models. Plate diagram: one LDA model per time slice (each with α, θ_d, z_{d,n}, w_{d,n} and plates N, D), with topics evolving across slices: β_{k,1} → β_{k,2} → ... → β_{k,T}, for k = 1, ..., K.