CS6220: DATA MINING TECHNIQUES Matrix Data: Clustering: Part 2 Instructor: Yizhou Sun yzsun@ccs.neu.edu October 19, 2014
Methods to Learn (task by data type)
• Classification: Decision Tree; Naïve Bayes; Logistic Regression; SVM; kNN (matrix data); HMM (sequence data); Label Propagation (graph & network)
• Clustering: K-means; hierarchical clustering; DBSCAN; Mixture Models; kernel k-means (matrix data); SCAN; Spectral Clustering (graph & network)
• Frequent Pattern Mining: Apriori; FP-growth (set data); GSP; PrefixSpan (sequence data)
• Prediction: Linear Regression (matrix data); Autoregression (time series)
• Similarity Search: DTW (time series); P-PageRank (graph & network)
• Ranking: PageRank (graph & network)
Matrix Data: Clustering: Part 2 • Revisit K-means • Mixture Model and EM algorithm • Kernel K-means • Summary
Recall K-Means
• Objective function
  • J = Σ_{j=1}^{k} Σ_{C(i)=j} ||x_i − c_j||²
  • Total within-cluster variance
• Re-arrange the objective function
  • J = Σ_{j=1}^{k} Σ_i w_ij ||x_i − c_j||²
  • w_ij ∈ {0, 1}
  • w_ij = 1, if x_i belongs to cluster j; w_ij = 0, otherwise
• Looking for:
  • The best assignment w_ij
  • The best center c_j
Solution of K-Means
• J = Σ_{j=1}^{k} Σ_i w_ij ||x_i − c_j||²
• Iterations
  • Step 1: Fix centers c_j, find assignment w_ij that minimizes J
    • => w_ij = 1, if ||x_i − c_j||² is the smallest
  • Step 2: Fix assignment w_ij, find centers that minimize J
    • => first derivative of J = 0
    • => ∂J/∂c_j = −2 Σ_i w_ij (x_i − c_j) = 0
    • => c_j = Σ_i w_ij x_i / Σ_i w_ij
  • Note: Σ_i w_ij is the total number of objects in cluster j
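A minimal numpy sketch of these two alternating steps, assuming random initialization from the data and a fixed number of iterations (the function name and these choices are illustrative, not from the slides):

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Alternate Step 1 (assignment) and Step 2 (center update) on data X of shape (n, d)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)  # initial centers c_j

    for _ in range(n_iter):
        # Step 1: fix centers, assign each x_i to its closest center (w_ij = 1 for that j)
        dist2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # ||x_i - c_j||^2
        assign = dist2.argmin(axis=1)

        # Step 2: fix assignments, recompute each center as the mean of its cluster
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return centers, assign
```

Empty clusters are handled crudely here (the old center is kept); practical implementations add a convergence test and a smarter initialization such as k-means++.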
Converges! Why?
• Each iteration (re-assignment and re-centering) can only decrease the objective J
• J is bounded below by 0, so the sequence of objective values converges (typically to a local minimum)
Limitations of K-Means • K-means has problems when clusters are of differing • Sizes • Densities • Non-Spherical Shapes
Limitations of K-Means: Different Density and Size
Limitations of K-Means: Non-Spherical Shapes
Demo • http://webdocs.cs.ualberta.ca/~yaling/Cluster/Applet/Code/Cluster.html
Connections of K-means to Other Methods
• Gaussian Mixture Model
• Kernel K-means
Matrix Data: Clustering: Part 2 • Revisit K-means • Mixture Model and EM algorithm • Kernel K-means • Summary
Fuzzy Set and Fuzzy Cluster
• Clustering methods discussed so far
  • Every data object is assigned to exactly one cluster
• Some applications may need fuzzy or soft cluster assignment
  • Ex. an e-game could belong to both entertainment and software
• Methods: fuzzy clusters and probabilistic model-based clusters
• Fuzzy cluster: a fuzzy set S defined by a membership function F_S : X → [0, 1] (value between 0 and 1)
Probabilistic Model-Based Clustering
• Cluster analysis aims to find hidden categories
• A hidden category (i.e., probabilistic cluster) is a distribution over the data space, which can be mathematically represented using a probability density function (or distribution function)
  • Ex. categories for digital cameras sold: consumer line vs. professional line
  • Density functions f_1, f_2 for C_1, C_2 are obtained by probabilistic clustering
• A mixture model assumes that a set of observed objects is a mixture of instances from multiple probabilistic clusters, and conceptually each observed object is generated independently
• Our task: infer a set of k probabilistic clusters that is most likely to generate D under this data generation process
Mixture Model-Based Clustering
• A set C of k probabilistic clusters C_1, …, C_k with probability density functions f_1, …, f_k, respectively, and their probabilities w_1, …, w_k, with Σ_j w_j = 1
• Probability of an object x_i being generated by cluster C_j:
  • P(x_i, z_i = C_j) = w_j f_j(x_i)
• Probability of x_i being generated by the set of clusters C:
  • P(x_i) = Σ_j w_j f_j(x_i)
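A minimal sketch of these two probabilities for a toy 1-D mixture with Gaussian component densities; the weights, means, variances, and the observed value below are illustrative assumptions:

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    """1-D Gaussian density f_j(x)."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# toy mixture: k = 2 clusters with weights w_j and component parameters
w = np.array([0.4, 0.6])      # w_1, w_2, sum to 1
mu = np.array([0.0, 5.0])     # component means
var = np.array([1.0, 2.0])    # component variances

x_i = 1.3                                   # one observed object
joint = w * gaussian_pdf(x_i, mu, var)      # P(x_i, z_i = C_j) = w_j f_j(x_i)
marginal = joint.sum()                      # P(x_i) = sum_j w_j f_j(x_i)
print(joint, marginal)
```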
Maximum Likelihood Estimation
• Since objects are assumed to be generated independently, for a data set D = {x_1, …, x_n} we have
  • P(D) = Π_i P(x_i) = Π_i Σ_j w_j f_j(x_i)
• Task: find a set C of k probabilistic clusters such that P(D) is maximized
The EM (Expectation Maximization) Algorithm
• The EM algorithm: a framework to approach maximum likelihood or maximum a posteriori estimates of parameters in statistical models
• E-step: assigns objects to clusters according to the current fuzzy clustering or parameters of the probabilistic clusters
  • w_ij^t = p(z_i = j | θ^t, x_i) ∝ p(x_i | C_j^t, θ^t)
• M-step: finds the new clustering or parameters that maximize the expected likelihood
Case 1: Gaussian Mixture Model
• Generative model
  • For each object:
    • Pick its distribution component: Z ~ Multi(w_1, …, w_k)
    • Sample a value from the selected distribution: X ~ N(μ_Z, σ_Z²)
• Overall likelihood function
  • L(D | θ) = Π_i Σ_j w_j p(x_i | μ_j, σ_j²)
• Q: What is θ here?
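A minimal sketch of this generative process for 1-D data; the number of components, weights, and component parameters below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# assumed toy parameters theta = {w_j, mu_j, sigma_j^2}
w = np.array([0.3, 0.7])      # mixture weights, sum to 1
mu = np.array([-2.0, 3.0])    # component means
sigma = np.array([1.0, 0.5])  # component standard deviations

n = 1000
z = rng.choice(len(w), size=n, p=w)   # pick component: Z ~ Multi(w_1, ..., w_k)
x = rng.normal(mu[z], sigma[z])       # sample value:   X ~ N(mu_Z, sigma_Z^2)
```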
Estimating Parameters
• L(D; θ) = Σ_i log Σ_j w_j p(x_i | μ_j, σ_j²)   (intractable to maximize directly!)
• Considering the first derivative with respect to μ_j:
  • ∂L/∂μ_j = Σ_i [ w_j / Σ_{j'} w_{j'} p(x_i | μ_{j'}, σ_{j'}²) ] ∂p(x_i | μ_j, σ_j²)/∂μ_j
  • = Σ_i [ w_j p(x_i | μ_j, σ_j²) / Σ_{j'} w_{j'} p(x_i | μ_{j'}, σ_{j'}²) ] · [ 1 / p(x_i | μ_j, σ_j²) ] ∂p(x_i | μ_j, σ_j²)/∂μ_j
  • = Σ_i w_ij ∂ log p(x_i | μ_j, σ_j²)/∂μ_j, where w_ij = P(Z = j | X = x_i, θ)
• Like weighted likelihood estimation; but the weight w_ij is itself determined by the parameters!
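A small numeric check of the last identity, assuming toy 1-D data and parameters (all values below are illustrative): the analytic gradient Σ_i w_ij (x_i − μ_j)/σ_j² should match a finite-difference derivative of the log-likelihood.

```python
import numpy as np

def log_lik(x, w, mu, var):
    """L(D; theta) = sum_i log sum_j w_j p(x_i | mu_j, var_j) for 1-D data."""
    dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return np.log((w * dens).sum(axis=1)).sum()

# toy data and parameters
rng = np.random.default_rng(1)
x = rng.normal(size=20)
w, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.5])

# analytic gradient: dL/dmu_j = sum_i w_ij (x_i - mu_j) / var_j, with w_ij = P(Z=j | x_i, theta)
dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
w_ij = w * dens / (w * dens).sum(axis=1, keepdims=True)
grad = (w_ij * (x[:, None] - mu) / var).sum(axis=0)

# numerical check for mu_1 via forward difference
eps = 1e-6
mu_plus = mu.copy(); mu_plus[0] += eps
print(grad[0], (log_lik(x, w, mu_plus, var) - log_lik(x, w, mu, var)) / eps)
```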
Apply EM Algorithm
• An iterative algorithm (at iteration t+1)
• E (expectation)-step
  • Evaluate the weight w_ij when μ_j, σ_j, w_j are given:
    • w_ij^t = w_j^t p(x_i | μ_j^t, (σ_j²)^t) / Σ_{j'} w_{j'}^t p(x_i | μ_{j'}^t, (σ_{j'}²)^t)
• M (maximization)-step
  • Evaluate μ_j, σ_j, w_j when the w_ij's are given, maximizing the weighted likelihood
  • Equivalent to Gaussian parameter estimation when each point carries a weight of belonging to each distribution:
    • μ_j^{t+1} = Σ_i w_ij^t x_i / Σ_i w_ij^t
    • (σ_j²)^{t+1} = Σ_i w_ij^t (x_i − μ_j^{t+1})² / Σ_i w_ij^t
    • w_j^{t+1} ∝ Σ_i w_ij^t
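A minimal numpy sketch of these E- and M-steps for 1-D data, assuming a simple initialization and a fixed iteration count instead of a convergence test (the function name and these choices are illustrative):

```python
import numpy as np

def em_gmm_1d(x, k, n_iter=100, seed=0):
    """EM for a 1-D Gaussian mixture, following the E/M updates above."""
    rng = np.random.default_rng(seed)
    n = len(x)
    w = np.full(k, 1.0 / k)                    # mixture weights w_j
    mu = rng.choice(x, size=k, replace=False)  # initialize means from the data
    var = np.full(k, np.var(x))                # component variances sigma_j^2

    for _ in range(n_iter):
        # E-step: w_ij proportional to w_j p(x_i | mu_j, var_j)
        dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        resp = w * dens                         # shape (n, k)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: weighted estimates of mu_j, var_j, w_j
        nk = resp.sum(axis=0)                   # effective cluster sizes
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        w = nk / n
    return w, mu, var, resp
```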
K-Means: A Special Case of Gaussian Mixture Model
• When each Gaussian component has covariance matrix σ²I
  • Soft K-means: p(x_i | μ_j, σ²) ∝ exp{ −||x_i − μ_j||² / σ² }   (note the distance!)
• When σ² → 0
  • The soft assignment becomes a hard assignment
  • w_ij → 1, if x_i is closest to μ_j (why?)
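A small numeric illustration of this limit, assuming two toy centers, equal mixture weights, and one toy point: as σ² shrinks, the soft assignment concentrates all its mass on the closest center.

```python
import numpy as np

mu = np.array([0.0, 4.0])   # two centers mu_j (toy values)
x_i = 1.5                   # a point closer to the first center

for var in [10.0, 1.0, 0.1, 0.01]:
    logits = -(x_i - mu) ** 2 / var          # log of exp{-||x_i - mu_j||^2 / sigma^2}
    w_ij = np.exp(logits - logits.max())     # subtract max for numerical stability
    w_ij /= w_ij.sum()                       # soft assignment (equal weights w_j assumed)
    print(var, w_ij)                         # tends to [1, 0] as var -> 0
```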
Case 2: Multinomial Mixture Model
• Generative model
  • For each object:
    • Pick its distribution component: Z ~ Multi(w_1, …, w_k)
    • Sample a value from the selected distribution: X ~ Multi(β_Z1, β_Z2, …, β_Zm)
• Overall likelihood function
  • L(D | θ) = Π_i Σ_j w_j p(x_i | β_j)
  • Σ_j w_j = 1; Σ_l β_jl = 1
• Q: What is θ here?
Application: Document Clustering
• A vocabulary containing m words
• Each document i:
  • An m-dimensional count vector (c_i1, c_i2, …, c_im)
  • c_il is the number of occurrences of word l in document i
• Under the unigram assumption:
  • p(x_i | β_j) = [ (Σ_l c_il)! / (c_i1! … c_im!) ] β_j1^{c_i1} … β_jm^{c_im}
  • Σ_l c_il is the length of document i; the multinomial coefficient is constant with respect to all parameters
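A minimal sketch of this document likelihood, assuming scipy is available for the log-factorial terms; the function name, the toy counts, and the toy word distribution are illustrative:

```python
import numpy as np
from scipy.special import gammaln  # log-factorials via the log-gamma function

def doc_log_likelihood(c_i, beta_j):
    """log p(x_i | beta_j) under the unigram multinomial model above.

    c_i    : length-m array of word counts c_il for one document
    beta_j : length-m word distribution of cluster j (sums to 1)
    """
    log_coef = gammaln(c_i.sum() + 1) - gammaln(c_i + 1).sum()  # log multinomial coefficient
    return log_coef + (c_i * np.log(beta_j)).sum()

# toy example: vocabulary of 4 words, one document, one cluster
c_i = np.array([3, 0, 1, 2])
beta_j = np.array([0.4, 0.1, 0.2, 0.3])
print(doc_log_likelihood(c_i, beta_j))
```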
Example
Estimating Parameters
• l(D; θ) = Σ_i log Σ_j w_j p(x_i | β_j), where log p(x_i | β_j) = Σ_l c_il log β_jl + const
• Apply EM algorithm
  • E-step:
    • w_ij = w_j p(x_i | β_j) / Σ_{j'} w_{j'} p(x_i | β_{j'})
  • M-step: maximize the weighted likelihood Σ_i Σ_j w_ij Σ_l c_il log β_jl
    • β_jl = Σ_i w_ij c_il / Σ_{l'} Σ_i w_ij c_il'   (weighted percentage of word l in cluster j)
    • w_j ∝ Σ_i w_ij
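A minimal numpy sketch of this EM procedure for document clustering, assuming a random Dirichlet initialization of the word distributions, a fixed iteration count, and a small smoothing constant to avoid log(0); the function and variable names are illustrative:

```python
import numpy as np

def em_multinomial_mixture(C, k, n_iter=50, seed=0):
    """EM for the multinomial mixture above.

    C : (n, m) matrix of word counts, C[i, l] = c_il
    k : number of clusters
    Returns mixture weights w_j, word distributions beta_jl, and responsibilities w_ij.
    """
    rng = np.random.default_rng(seed)
    n, m = C.shape
    w = np.full(k, 1.0 / k)
    beta = rng.dirichlet(np.ones(m), size=k)         # random initial word distributions

    for _ in range(n_iter):
        # E-step: log w_j + sum_l c_il log beta_jl (multinomial coefficient cancels out)
        log_resp = np.log(w) + C @ np.log(beta).T    # shape (n, k)
        log_resp -= log_resp.max(axis=1, keepdims=True)
        resp = np.exp(log_resp)
        resp /= resp.sum(axis=1, keepdims=True)      # w_ij

        # M-step: weighted word percentages per cluster and new cluster weights
        counts = resp.T @ C + 1e-10                  # shape (k, m): sum_i w_ij c_il, smoothed
        beta = counts / counts.sum(axis=1, keepdims=True)
        w = resp.sum(axis=0) / n
    return w, beta, resp
```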
Better Way for Topic Modeling
• Topic: a word distribution
• Unigram multinomial mixture model
  • Once the topic of a document is decided, all its words are generated from that topic
• PLSA (probabilistic latent semantic analysis)
  • Every word of a document can be sampled from a different topic
• LDA (Latent Dirichlet Allocation)
  • Assumes priors on the word distributions and/or document cluster distributions