Spectral Clustering on Handwritten Digits Database


  1. Spectral Clustering on Handwritten Digits Database
     Danielle Middlebrooks, dmiddle1@math.umd.edu
     Advisor: Kasso Okoudjou, kasso@umd.edu
     Department of Mathematics, University of Maryland, College Park
     Advanced Scientific Computing I, October 6, 2015

  2. Outline
     1 Introduction
     2 Approach
     3 Validation
     4 Implementation
     5 Project Schedule
     6 Deliverables
     7 References

  3. Background Information
     Spectral clustering is a clustering technique that makes use of the spectrum of the similarity matrix derived from the data set. It runs a clustering algorithm in a reduced dimension.
     Advantages: the algorithm is simple to implement and uses standard linear algebra methods to solve the problem efficiently.
     Motivation: implement an algorithm that groups the objects in a data set with other objects that behave similarly.

  4. Definitions
     A graph $G = (V, E)$ with vertex set $V = \{v_1, \dots, v_n\}$.
     $W$ is the adjacency matrix, with entries
     $$w_{ij} = \begin{cases} 1, & \text{if } v_i, v_j \text{ are connected by an edge} \\ 0, & \text{otherwise.} \end{cases}$$
     The degree of a vertex is $d_i = \sum_{j=1}^{n} w_{ij}$.
     The degree matrix $D$ is the diagonal matrix with $d_1, \dots, d_n$ on the diagonal.
     For a subset of vertices $A \subset V$, denote the complement by $\bar{A} = V \setminus A$, and let
     $|A|$ = number of vertices in $A$,
     $\operatorname{vol}(A) = \sum_{i \in A} d_i$,
     $W(A, B) = \sum_{i \in A,\, j \in B} w_{ij}$.

  5. Example
     $G = (V, E)$ where $V = \{1, \dots, 7\}$. (Figure: the graph drawn on vertices 1 to 7.)
     $$W = \begin{pmatrix}
     0 & 1 & 1 & 0 & 0 & 0 & 0 \\
     1 & 0 & 1 & 0 & 0 & 0 & 0 \\
     1 & 1 & 0 & 0 & 0 & 0 & 0 \\
     0 & 0 & 0 & 0 & 1 & 1 & 1 \\
     0 & 0 & 0 & 1 & 0 & 1 & 1 \\
     0 & 0 & 0 & 1 & 1 & 0 & 1 \\
     0 & 0 & 0 & 1 & 1 & 1 & 0
     \end{pmatrix}, \qquad
     D = \begin{pmatrix}
     2 & 0 & 0 & 0 & 0 & 0 & 0 \\
     0 & 2 & 0 & 0 & 0 & 0 & 0 \\
     0 & 0 & 2 & 0 & 0 & 0 & 0 \\
     0 & 0 & 0 & 3 & 0 & 0 & 0 \\
     0 & 0 & 0 & 0 & 3 & 0 & 0 \\
     0 & 0 & 0 & 0 & 0 & 3 & 0 \\
     0 & 0 & 0 & 0 & 0 & 0 & 3
     \end{pmatrix}$$
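
The example can be reproduced numerically. The following is a minimal sketch, assuming NumPy is available; the edge list is read off the adjacency matrix $W$ of the example, and the variable names are illustrative only. It rebuilds $W$ and the degrees, then evaluates $\operatorname{vol}(A)$ and $W(A, \bar{A})$ for $A = \{1, 2, 3\}$.

```python
import numpy as np

# Edges read off the example adjacency matrix (1-based vertex labels):
# a triangle on {1, 2, 3} and a complete graph on {4, 5, 6, 7}.
edges = [(1, 2), (1, 3), (2, 3),
         (4, 5), (4, 6), (4, 7), (5, 6), (5, 7), (6, 7)]

n = 7
W = np.zeros((n, n))
for i, j in edges:
    W[i - 1, j - 1] = 1.0
    W[j - 1, i - 1] = 1.0  # undirected graph, so W is symmetric

degrees = W.sum(axis=1)   # d_i = sum_j w_ij
D = np.diag(degrees)      # degree matrix

A = [0, 1, 2]             # vertices 1, 2, 3 as 0-based indices
A_bar = [3, 4, 5, 6]      # the complement of A

vol_A = degrees[A].sum()            # vol(A) = sum of degrees in A
cut_A = W[np.ix_(A, A_bar)].sum()   # W(A, A_bar) = weight crossing the cut

print(degrees)  # [2. 2. 2. 3. 3. 3. 3.]
print(vol_A)    # 6.0
print(cut_A)    # 0.0
```

Since no edge joins $\{1, 2, 3\}$ to $\{4, 5, 6, 7\}$, the cut weight is zero: the example graph has two connected components.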

  6. Definitions
     Similarity graph: given a data set $x_1, \dots, x_n$ and a notion of "similar", a similarity graph is a graph in which $x_i$ and $x_j$ have an edge between them if they are considered "similar".
     Some ways to determine whether data points are similar:
     $\epsilon$-neighborhood graph
     $k$-nearest neighbor graph
     a similarity function
     Unnormalized Laplacian matrix: $L = D - W$
     Normalized Laplacian matrix: $L_{sym} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} W D^{-1/2}$
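
The two Laplacians can be computed directly from $W$. The sketch below is an illustration under stated assumptions, not the presenter's implementation: it assumes NumPy, a symmetric $W$ with no isolated vertices (so that $D^{-1/2}$ exists), and a hypothetical helper name `laplacians`. It also checks that the two expressions for $L_{sym}$ agree.

```python
import numpy as np

def laplacians(W):
    """Return (L, L_sym) for a symmetric adjacency matrix W.

    Assumes every vertex has positive degree so that D^{-1/2} exists.
    """
    d = W.sum(axis=1)
    L = np.diag(d) - W
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # D^{-1/2} L D^{-1/2}: scale row i by d_i^{-1/2} and column j by d_j^{-1/2}.
    L_sym = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]
    return L, L_sym

# Test graph: a triangle plus a disjoint complete graph on four vertices.
W = np.zeros((7, 7))
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
np.fill_diagonal(W, 0.0)

L, L_sym = laplacians(W)

# Second form of the normalized Laplacian: I - D^{-1/2} W D^{-1/2}.
d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
alt = np.eye(7) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
forms_agree = np.allclose(L_sym, alt)

# The multiplicity of eigenvalue 0 equals the number of connected components.
zero_count = int(np.sum(np.abs(np.linalg.eigvalsh(L_sym)) < 1e-10))
print(forms_agree, zero_count)  # True 2
```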

  7. Why this works?
     Spectral clustering is motivated by approximating the RatioCut or NCut on a given graph.
     Given a similarity graph, the simplest way to construct a partition is to solve the min cut problem, that is, to minimize
     $$\operatorname{cut}(A_1, \dots, A_k) := \frac{1}{2} \sum_{i=1}^{k} W(A_i, \bar{A}_i).$$
     To insist that each part of the partition is reasonably large, use RatioCut or NCut instead. These measure the size of a part $A_i$ by its number of vertices $|A_i|$ or by the total weight of its edges $\operatorname{vol}(A_i)$, respectively.

  8. Why this works?
     Thus
     $$\operatorname{RatioCut}(A_1, \dots, A_k) := \frac{1}{2} \sum_{i=1}^{k} \frac{W(A_i, \bar{A}_i)}{|A_i|}$$
     $$\operatorname{NCut}(A_1, \dots, A_k) := \frac{1}{2} \sum_{i=1}^{k} \frac{W(A_i, \bar{A}_i)}{\operatorname{vol}(A_i)}$$
     Adding these balancing terms makes the problem NP-hard. Spectral clustering solves relaxed versions of these problems. [2]
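
To make the two objectives concrete, here is a small sketch that evaluates them for a given partition. It assumes NumPy; the function names and the six-vertex test graph (two triangles joined by one bridge edge) are hypothetical choices for illustration.

```python
import numpy as np

def cut_weight(W, A, B):
    """W(A, B): total weight of edges from vertex set A to vertex set B."""
    return W[np.ix_(A, B)].sum()

def ratio_cut(W, parts):
    """RatioCut = 1/2 * sum_i W(A_i, complement of A_i) / |A_i|."""
    n = W.shape[0]
    total = 0.0
    for A in parts:
        A_bar = [v for v in range(n) if v not in A]
        total += cut_weight(W, A, A_bar) / len(A)
    return 0.5 * total

def ncut(W, parts):
    """NCut = 1/2 * sum_i W(A_i, complement of A_i) / vol(A_i)."""
    n = W.shape[0]
    d = W.sum(axis=1)
    total = 0.0
    for A in parts:
        A_bar = [v for v in range(n) if v not in A]
        total += cut_weight(W, A, A_bar) / d[A].sum()
    return 0.5 * total

# Two triangles {0, 1, 2} and {3, 4, 5} joined by the bridge edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
W = np.zeros((6, 6))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0

balanced = [[0, 1, 2], [3, 4, 5]]    # cuts only the bridge
lopsided = [[0], [1, 2, 3, 4, 5]]    # splits off a single vertex

print(ratio_cut(W, balanced), ncut(W, balanced))
print(ratio_cut(W, lopsided), ncut(W, lopsided))
```

On this graph both objectives are smaller for the balanced partition, which is the behavior the size penalties are meant to produce.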

  9. Why this works?
     Case $k = 2$. Given a subset $A \subset V$,
     $$\operatorname{NCut}(A, \bar{A}) = \frac{W(A, \bar{A})}{\operatorname{vol}(A)} + \frac{W(A, \bar{A})}{\operatorname{vol}(\bar{A})}.$$
     Define the cluster indicator vector $f$ by
     $$f(v_i) = f_i = \begin{cases} \dfrac{1}{\operatorname{vol}(A)}, & \text{if } v_i \in A \\ -\dfrac{1}{\operatorname{vol}(\bar{A})}, & \text{if } v_i \in \bar{A}. \end{cases}$$
     Then
     $$f^T L f = \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} (f_i - f_j)^2 = W(A, \bar{A}) \left( \frac{1}{\operatorname{vol}(A)} + \frac{1}{\operatorname{vol}(\bar{A})} \right)^2$$
     $$f^T D f = \sum_{i=1}^{n} d_i f_i^2 = \frac{1}{\operatorname{vol}(A)} + \frac{1}{\operatorname{vol}(\bar{A})}$$
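
These identities can be checked numerically for a concrete subset. The sketch below assumes NumPy and uses a hypothetical six-vertex graph (two triangles joined by a bridge) with $A = \{0, 1\}$; it also confirms that the indicator vector satisfies $f^T D \mathbf{1} = 0$.

```python
import numpy as np

# Two triangles {0, 1, 2} and {3, 4, 5} joined by the bridge edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
W = np.zeros((6, 6))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0

d = W.sum(axis=1)
D = np.diag(d)
L = D - W

A = [0, 1]
A_bar = [2, 3, 4, 5]
vol_A = d[A].sum()                 # 4
vol_A_bar = d[A_bar].sum()         # 10
cut = W[np.ix_(A, A_bar)].sum()    # 2: edges (0, 2) and (1, 2)

# Cluster indicator vector: 1/vol(A) on A, -1/vol(A_bar) on the complement.
f = np.empty(6)
f[A] = 1.0 / vol_A
f[A_bar] = -1.0 / vol_A_bar

fLf = f @ L @ f
fDf = f @ D @ f
ncut_value = cut / vol_A + cut / vol_A_bar

print(fLf, cut * (1 / vol_A + 1 / vol_A_bar) ** 2)  # both sides of the first identity
print(fDf, 1 / vol_A + 1 / vol_A_bar)               # both sides of the second identity
print(fLf / fDf, ncut_value)                        # Rayleigh-type quotient vs. NCut
```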

  10. Why this works?
      Thus minimizing NCut over subsets $A$, with $B = \bar{A}$, is equivalent to
      $$\min_{A \subset V} \operatorname{NCut}(A, B) = \min_{A \subset V} \frac{f^T L f}{f^T D f},$$
      where $f$ is the cluster indicator vector of $A$.
      The relaxed problem is given by
      $$\min_{f \in \mathbb{R}^n} \frac{f^T L f}{f^T D f} \quad \text{subject to} \quad f^T D \mathbf{1} = 0.$$

  11. Why this works?
      It can be shown that the relaxed problem is a form of the Rayleigh-Ritz quotient.
      The Rayleigh-Ritz theorem states: given a Hermitian matrix $A$,
      $$\lambda_{\min} = \min_{x \neq 0} \frac{x^* A x}{x^* x},$$
      where $x^* = x^T$ for real vectors.
      Substituting $g = D^{1/2} f$ turns the relaxed objective into $g^T L_{sym} g / g^T g$ with $g$ orthogonal to $D^{1/2} \mathbf{1}$, the eigenvector of $L_{sym}$ for eigenvalue 0. Thus the solution $f$ of the relaxed problem is the eigenvector for the second-smallest eigenvalue of the generalized eigenproblem $L f = \lambda D f$.
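
As an illustration of the relaxed solution, the sketch below computes the second eigenvector for a small graph and splits the vertices by its sign. It assumes NumPy and uses a library eigensolver (`np.linalg.eigh`) rather than any method from the slides; the test graph and the sign-based thresholding are illustrative assumptions.

```python
import numpy as np

# Two triangles {0, 1, 2} and {3, 4, 5} joined by the bridge edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
W = np.zeros((6, 6))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0

d = W.sum(axis=1)
D = np.diag(d)
L = D - W
d_inv_sqrt = 1.0 / np.sqrt(d)
L_sym = np.eye(6) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

# eigh returns the eigenvalues of a symmetric matrix in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(L_sym)
u = eigenvectors[:, 1]   # eigenvector of the second-smallest eigenvalue
f = d_inv_sqrt * u       # f = D^{-1/2} u solves L f = lambda D f

# Turn the real-valued relaxed solution into two clusters by its sign.
labels = (f > 0).astype(int)
print(eigenvalues[1])
print(labels)
```

On this graph the sign pattern separates the two triangles, which is also the partition cutting only the bridge edge. The overall sign of an eigenvector is arbitrary, so which triangle receives label 1 can vary.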

  12. Procedure
      Database → Similarity Graph → Normalized Laplacian → Compute the Eigenvectors → Put the eigenvectors in a matrix and Normalize → Perform dimension reduction → Cluster the points

  13. Databases
      The database I will be using is the MNIST handwritten digits database.
      It has 1000 images of each digit 0-9.
      Each image is of size 28 x 28 pixels.
      Figure: Test images. Simon A. J. Winder

  14. Similarity Graph
      Gaussian similarity function:
      $$s(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|_2^2}{2\sigma^2} \right),$$
      where $\sigma$ is a parameter.
      If $s(x_i, x_j) > \epsilon$, that is, if the two images are sufficiently similar, connect an edge between $x_i$ and $x_j$.
      Each $x_i \in \mathbb{R}^{28 \times 28}$ corresponds to an image. Thus
      $$\|x_i - x_j\|_2^2 = \sum_{k=1}^{28} \sum_{l=1}^{28} \left( x^i_{kl} - x^j_{kl} \right)^2$$
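
A possible way to compute the similarity matrix and the resulting graph is sketched below. It assumes NumPy, uses random arrays as stand-ins for 28 x 28 images, and assumes that an edge is meant to join images whose similarity is above the threshold $\epsilon$. The function names and the values of `sigma` and `eps` are hypothetical.

```python
import numpy as np

def gaussian_similarity(images, sigma):
    """Pairwise s(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2)).

    images has shape (n, 28, 28); each image is flattened so the squared
    norm is the double sum over pixel indices k and l.
    """
    n = images.shape[0]
    flat = images.reshape(n, -1).astype(float)
    # Broadcasting builds all pairwise differences; fine for small n,
    # but memory-hungry for thousands of images.
    diff = flat[:, None, :] - flat[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def adjacency_from_similarity(S, eps):
    """w_ij = 1 when s(x_i, x_j) > eps (assumed rule), with no self-loops."""
    W = (S > eps).astype(float)
    np.fill_diagonal(W, 0.0)
    return W

rng = np.random.default_rng(0)
images = rng.random((5, 28, 28))   # stand-in data, not MNIST
S = gaussian_similarity(images, sigma=8.0)
W = adjacency_from_similarity(S, eps=0.3)
print(S.shape, W.shape)  # (5, 5) (5, 5)
```

The parameter $\sigma$ controls how quickly similarity decays with distance, so in practice $\sigma$ and $\epsilon$ need to be tuned together.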

  15. Laplacian Matrix
      $W$, the adjacency matrix:
      $$w_{ij} = \begin{cases} 1, & \text{if } s(x_i, x_j) > \epsilon \\ 0, & \text{otherwise} \end{cases}$$
      $D$, the degree matrix
      $L = D - W$
      $L_{sym} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} W D^{-1/2}$

  16. Computing Eigenvectors
      Use an iterative method called the Power Method to find the first $k$ eigenvectors of $L_{sym} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} W D^{-1/2}$.
      Start with an initial nonzero vector, $v_0$, for the eigenvector.
      Let $B = D^{-1/2} W D^{-1/2}$.
      Form the sequence given by:
      for $i = 1, \dots, l$
          $x_i = B v_{i-1}$
          $v_i = x_i / \|x_i\|$
      end
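
The iteration above can be sketched in a few lines. This is a minimal illustration assuming NumPy; the function name `power_method`, the iteration count, and the six-vertex test graph are hypothetical choices. For a connected graph the dominant eigenvector of $B$ is proportional to $D^{1/2} \mathbf{1}$ with eigenvalue 1, which gives a simple check.

```python
import numpy as np

def power_method(B, v0, num_iters):
    """Repeat x_i = B v_{i-1}, v_i = x_i / ||x_i|| for num_iters steps."""
    v = np.asarray(v0, dtype=float)
    v = v / np.linalg.norm(v)
    for _ in range(num_iters):
        x = B @ v
        v = x / np.linalg.norm(x)
    return v

# Two triangles {0, 1, 2} and {3, 4, 5} joined by the bridge edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
W = np.zeros((6, 6))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0

d = W.sum(axis=1)
d_inv_sqrt = 1.0 / np.sqrt(d)
B = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]   # B = D^{-1/2} W D^{-1/2}

v = power_method(B, np.ones(6), num_iters=200)
rayleigh = v @ B @ v   # eigenvalue estimate for the unit vector v
print(rayleigh)
```

The convergence rate depends on the gap between the largest and second-largest eigenvalue magnitudes of $B$, so the number of iterations needed varies from graph to graph.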

  17. Computing Eigenvectors (Cont'd)
      For large values of $l$ we obtain a good approximation of the dominant eigenvector of $B$.
      This gives the eigenvector corresponding to the largest eigenvalue of $B$, which, since $L_{sym} = I - B$, corresponds to the smallest eigenvalue of $L_{sym}$.
      To find the next eigenvector, after selecting the random initial vector $v_0$, subtract the component of $v_0$ that is parallel to the eigenvector of the largest eigenvalue.
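
One way to realize this deflation step is sketched below, assuming NumPy. The `orthogonal_to` argument and the test graph are hypothetical; as an added safeguard not stated in the slides, the sketch removes the known component at every iteration rather than only from $v_0$, since rounding errors would otherwise let it creep back in.

```python
import numpy as np

def power_method(B, v0, num_iters, orthogonal_to=()):
    """Power iteration kept orthogonal to already-found unit eigenvectors."""
    v = np.array(v0, dtype=float)
    for u in orthogonal_to:
        v -= (u @ v) * u            # subtract the component parallel to u
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        x = B @ v
        for u in orthogonal_to:
            x -= (u @ x) * u        # re-project to counter rounding drift
        v = x / np.linalg.norm(x)
    return v

# Two triangles {0, 1, 2} and {3, 4, 5} joined by the bridge edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
W = np.zeros((6, 6))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0

d = W.sum(axis=1)
d_inv_sqrt = 1.0 / np.sqrt(d)
B = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

rng = np.random.default_rng(0)
v1 = power_method(B, np.ones(6), num_iters=300)
v2 = power_method(B, rng.standard_normal(6), num_iters=300, orthogonal_to=[v1])

mu2 = v2 @ B @ v2   # second eigenvalue of B; 1 - mu2 is an eigenvalue of L_sym
print(mu2)
```

The power method converges to the remaining eigenvalue of largest magnitude, which on this graph is the second-largest eigenvalue of $B$. On graphs where $B$ has a negative eigenvalue of larger magnitude, iterating with $B + I$ instead (eigenvalues in $[0, 2]$, same eigenvectors) is one possible workaround.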

  18. Computing Eigenvectors (Cont'd)
      We put the first $k$ eigenvectors into the columns of a matrix $V \in \mathbb{R}^{n \times k}$ and normalize its rows. Let $T \in \mathbb{R}^{n \times k}$ be the eigenvector matrix with rows of norm 1. Set
      $$t_{ij} = \frac{v_{ij}}{\left( \sum_{m=1}^{k} v_{im}^2 \right)^{1/2}}$$
      $$\begin{pmatrix}
      v_{11} & v_{12} & v_{13} & \cdots & v_{1k} \\
      \vdots & \vdots & \vdots & \ddots & \vdots \\
      v_{i1} & v_{i2} & v_{i3} & \cdots & v_{ik} \\
      \vdots & \vdots & \vdots & \ddots & \vdots \\
      v_{n1} & v_{n2} & v_{n3} & \cdots & v_{nk}
      \end{pmatrix}
      \Rightarrow
      \begin{pmatrix}
      t_{11} & t_{12} & t_{13} & \cdots & t_{1k} \\
      \vdots & \vdots & \vdots & \ddots & \vdots \\
      t_{i1} & t_{i2} & t_{i3} & \cdots & t_{ik} \\
      \vdots & \vdots & \vdots & \ddots & \vdots \\
      t_{n1} & t_{n2} & t_{n3} & \cdots & t_{nk}
      \end{pmatrix}$$
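
The row normalization amounts to dividing each row of the eigenvector matrix by its Euclidean norm. A minimal sketch, assuming NumPy, a hypothetical helper name `normalize_rows`, and that no row is entirely zero:

```python
import numpy as np

def normalize_rows(V):
    """Return T with t_ij = v_ij / sqrt(sum_m v_im^2), so each row has norm 1.

    Assumes no row of V is all zeros (that would divide by zero).
    """
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    return V / norms

# Small stand-in for an n x k eigenvector matrix (n = 3, k = 2).
V = np.array([[3.0, 4.0],
              [1.0, 0.0],
              [1.0, 1.0]])
T = normalize_rows(V)
print(T[0])  # [0.6 0.8]
```

After this step each data point is represented by a unit vector in $\mathbb{R}^k$, which is the reduced representation that the clustering step operates on.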
