
Co-clustering for large datasets (Mohamed Nadif, LIPADE, Université Paris Descartes)



  1. Co-clustering for large datasets. Mohamed Nadif, LIPADE, Université Paris Descartes, France. Joint work with G. Govaert and L. Lazhar. Nadif (LIPADE) AAFD’14, April 29-30, 2014 Co-clustering 1 / 35

  2. Introduction: Outline
     1. Introduction: co-clustering methods; binary data; continuous data
     2. Latent block model and CML approach: Bernoulli latent block models; Gaussian latent block models; asymmetric Gaussian model
     3. Factorization: nonnegative matrix factorization; nonnegative matrix tri-factorization
     4. Conclusion

  3. Introduction, Co-clustering methods: Simultaneous clustering on both dimensions
     - Co-clustering methods have attracted much attention in recent years.
     - Block clustering has influenced applied mathematics since the sixties (Jennings, 1968).
     - First works: J.A. Hartigan, Direct Clustering of a Data Matrix (1972); works of Govaert (1983).
     - Referred to in the literature as bi-clustering, co-clustering, double clustering, direct clustering or coupled clustering.
     - Different approaches exist (see for instance chapter 1 of Govaert and Nadif, 2013); they can differ in the pattern they seek and in the types of data they apply to.
     - Organization of the data matrix into homogeneous blocks, or extraction of co-clusters; non-overlapping co-clustering versus overlapping co-clustering.
     - Aim: to cluster the sets of rows and columns simultaneously in order to obtain homogeneous blocks.

  4. Introduction, Co-clustering methods: Example of co-clustering
     [Figure: two panels, "data3" and "Reordered data: co-clustering result"; axis ticks run from 100 to 1000 and from 100 to 500]
     Why co-clustering?
     - (1) Utilizing the duality of clustering
     - (2) Reducing running time
     - (3) Discovering hidden latent patterns and generating a compact representation
     - (4) Reducing dimensionality implicitly
     - (5) High dimension

  5. Introduction, Co-clustering methods: Applications and approaches
     - Fields: text mining (clustering of documents and words simultaneously); bioinformatics (clustering of genes and tissues simultaneously); collaborative filtering; social network analysis.
     - Approaches: spectral; factorization; latent block models; etc.
     - Software: package {biclust} in R, Bicat, etc.; R {blockcluster}.

  6. Introduction, Co-clustering methods: Notations
     Let $x = (x_{ij})$ be a data matrix of size $n \times d$, with $i \in I$ (set of $n$ rows) and $j \in J$ (set of $d$ columns).

     Partition $z$ of $I$ into $g$ clusters: $z = (z_1, \ldots, z_n) \rightarrow (z_{ik})$, where $z_i$ is the cluster indicator of row $i$, $z_{ik} = 1$ if $i$ belongs to the $k$th cluster and $z_{ik} = 0$ otherwise. $z_{.k}$ is the cardinality of the $k$th cluster, $k \in \{1, \ldots, g\}$.

     Example with $n = 5$ and $g = 3$:

     | $z_i$ | $z_{i1}$ | $z_{i2}$ | $z_{i3}$ |
     |---|---|---|---|
     | 3 | 0 | 0 | 1 |
     | 2 | 0 | 1 | 0 |
     | 3 | 0 | 0 | 1 |
     | 2 | 0 | 1 | 0 |
     | 1 | 1 | 0 | 0 |

     Partition $w$ of $J$ into $m$ clusters: $w = (w_1, \ldots, w_d) \rightarrow (w_{j\ell})$, where $w_j$ is the cluster indicator of column $j$, $w_{j\ell} = 1$ if $j$ belongs to the $\ell$th cluster and $w_{j\ell} = 0$ otherwise. $w_{.\ell}$ is the cardinality of the $\ell$th cluster, $\ell \in \{1, \ldots, m\}$.

     From $z$ and $w$: the block formed by the $k$th row cluster and the $\ell$th column cluster is defined by the $x_{ij}$ with $z_{ik} w_{j\ell} = 1$.
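As a minimal illustration of this notation, the Python sketch below builds the indicator matrix $(z_{ik})$ and the cardinalities $z_{.k}$ from the label vector $z = (3, 2, 3, 2, 1)$ of the example. The function names `indicator_matrix` and `cardinalities` are hypothetical, chosen only for this illustration.

```python
def indicator_matrix(labels, n_clusters):
    """Return the binary matrix (z_ik) for 1-based cluster labels."""
    return [[1 if label == k else 0 for k in range(1, n_clusters + 1)]
            for label in labels]


def cardinalities(indicator):
    """Return z_.k, the column sums of the indicator matrix."""
    return [sum(column) for column in zip(*indicator)]


z = [3, 2, 3, 2, 1]              # label vector from the slide (1-based)
z_ind = indicator_matrix(z, 3)   # one row per object, one column per cluster
z_card = cardinalities(z_ind)    # cluster sizes

for label, row in zip(z, z_ind):
    print(label, row)
print("cardinalities:", z_card)
```

The same two functions apply unchanged to the column partition $w$.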

  7. Introduction, Co-clustering methods: General principle
     [Figure: three panels, binary data (mode), contingency table (sum), continuous data (mean)]

     Criteria:

     | Data | $a_{k\ell}$ | Criterion |
     |---|---|---|
     | Binary | Mode | $\sum_{i,j,k,\ell} z_{ik} w_{j\ell} \lvert x_{ij} - a_{k\ell} \rvert$ |
     | Contingency | Sum | $I(z, w) = \sum_{k,\ell} p_{k\ell} \log \frac{p_{k\ell}}{p_{k.}\, p_{.\ell}}$ or $\chi^2(z, w)$ |
     | Continuous | Mean | $\sum_{i,j,k,\ell} z_{ik} w_{j\ell} (x_{ij} - a_{k\ell})^2 = \lVert x - z a w^T \rVert^2$ |
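The continuous-data row of the criteria table can be checked numerically. The sketch below is an illustration under stated assumptions, not the author's implementation: the 3 x 3 matrix is made up, labels are 0-based, the helper names are hypothetical, and an empty block is arbitrarily given mean 0. It computes the block means $a_{k\ell}$ and the squared-error criterion, using the fact that entry $(i, j)$ of $z a w^T$ is simply the mean of the block containing $x_{ij}$.

```python
def block_means(x, z, w, g, m):
    """Mean of x over each block (k, l); z and w hold 0-based cluster labels."""
    total = [[0.0] * m for _ in range(g)]
    count = [[0] * m for _ in range(g)]
    for i, row in enumerate(x):
        for j, value in enumerate(row):
            total[z[i]][w[j]] += value
            count[z[i]][w[j]] += 1
    # Assumption: an empty block gets mean 0.0
    return [[total[k][l] / count[k][l] if count[k][l] else 0.0
             for l in range(m)] for k in range(g)]


def squared_criterion(x, z, w, a):
    """Sum of (x_ij - a_kl)^2; a[z[i]][w[j]] is entry (i, j) of z a w^T."""
    return sum((value - a[z[i]][w[j]]) ** 2
               for i, row in enumerate(x) for j, value in enumerate(row))


# Made-up toy data: 3 rows, 3 columns, 2 row clusters, 2 column clusters
x = [[1.0, 1.2, 5.0],
     [0.8, 1.0, 5.2],
     [3.0, 3.1, 0.0]]
z = [0, 0, 1]
w = [0, 0, 1]

a = block_means(x, z, w, g=2, m=2)
crit = squared_criterion(x, z, w, a)
print(a, crit)
```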

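  8. Introduction, Binary data: Notations and example

     Binary data $x$:

     | | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
     |---|---|---|---|---|---|---|---|---|---|---|
     | a | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
     | b | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 |
     | c | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
     | d | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
     | e | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 |
     | f | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 |
     | g | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
     | h | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |
     | i | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
     | j | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |

     Reorganized data matrix $x$ (column cluster 1 = columns 1, 3, 5, 8, 10; column cluster 2 = columns 2, 4, 6, 7, 9):

     | Row cluster | Row | 1 | 3 | 5 | 8 | 10 | 2 | 4 | 6 | 7 | 9 |
     |---|---|---|---|---|---|---|---|---|---|---|---|
     | A | a | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
     | A | d | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
     | A | h | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 |
     | B | b | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |
     | B | e | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |
     | B | f | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 |
     | B | j | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 |
     | C | c | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
     | C | g | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
     | C | i | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |

     Summary matrix $a$:

     | | 1 | 2 |
     |---|---|---|
     | A | 1 | 0 |
     | B | 0 | 1 |
     | C | 0 | 0 |

     Reduced matrices, sizes and definitions of $x^z$, $x^w$ and $x^{zw}$:

     | Matrix | Size | Definition |
     |---|---|---|
     | $x^z = (x^z_{kj})$ | $g \times d$ | $x^z_{kj} = \sum_i z_{ik} x_{ij}$ |
     | $x^w = (x^w_{i\ell})$ | $n \times m$ | $x^w_{i\ell} = \sum_j w_{j\ell} x_{ij}$ |
     | $x^{zw} = (x^{zw}_{k\ell})$ | $g \times m$ | $x^{zw}_{k\ell} = \sum_{i,j} z_{ik} w_{j\ell} x_{ij}$ |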

  9. Introduction, Binary data: Intermediate data matrices $x^z$, $x^w$ and $x^{zw}$
     The matrices below are computed on the reorganized data matrix, with rows ordered a, d, h (cluster A), b, e, f, j (cluster B), c, g, i (cluster C) and columns ordered 1, 3, 5, 8, 10 (cluster 1), 2, 4, 6, 7, 9 (cluster 2).

     $$x^w = \begin{pmatrix} 5 & 0 \\ 3 & 0 \\ 5 & 2 \\ 0 & 5 \\ 0 & 5 \\ 0 & 4 \\ 0 & 3 \\ 2 & 1 \\ 2 & 1 \\ 2 & 1 \end{pmatrix}, \qquad x^z = \begin{pmatrix} 3 & 3 & 2 & 3 & 2 & 0 & 0 & 1 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 4 & 3 & 3 & 4 & 3 \\ 2 & 0 & 0 & 2 & 2 & 1 & 1 & 0 & 0 & 1 \end{pmatrix}, \qquad x^{zw} = \begin{pmatrix} 13 & 2 \\ 0 & 17 \\ 6 & 3 \end{pmatrix}$$

     Minimization of the following criterion:

     $$C(z, w, a) = \sum_{i,j,k,\ell} z_{ik} w_{j\ell} \lvert x_{ij} - a_{k\ell} \rvert, \quad \text{where } a_{k\ell} \in \{0, 1\}$$
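The reduced matrices and the criterion value can be recomputed from the binary example. The sketch below assumes the 10 x 10 matrix, the partitions (A, B, C for rows; 1, 2 for columns) and the summary matrix $a$ of the binary data example, recoded with 0-based labels; rows and columns are kept in their original order (a to j, 1 to 10), so $x^w$ and $x^z$ appear in that order rather than the reorganized one. Variable names are illustrative only.

```python
# Binary data matrix x: rows a..j, columns 1..10 (original order)
x = [
    [1, 0, 1, 0, 1, 0, 0, 1, 0, 1],  # a
    [0, 1, 0, 1, 0, 1, 1, 0, 1, 0],  # b
    [1, 0, 0, 0, 0, 0, 0, 1, 1, 0],  # c
    [1, 0, 1, 0, 0, 0, 0, 1, 0, 0],  # d
    [0, 1, 0, 1, 0, 1, 1, 0, 1, 0],  # e
    [0, 1, 0, 0, 0, 1, 1, 0, 1, 0],  # f
    [0, 1, 0, 0, 0, 0, 0, 1, 0, 1],  # g
    [1, 0, 1, 0, 1, 1, 0, 1, 1, 1],  # h
    [1, 0, 0, 1, 0, 0, 0, 0, 0, 1],  # i
    [0, 1, 0, 1, 0, 0, 1, 0, 0, 0],  # j
]
z = [0, 1, 2, 0, 1, 1, 2, 0, 2, 1]   # row clusters: A=0, B=1, C=2
w = [0, 1, 0, 1, 0, 1, 1, 0, 1, 0]   # column clusters: 1 -> 0, 2 -> 1
g, m = 3, 2
n, d = len(x), len(x[0])

# x^z (g x d): column sums inside each row cluster
x_z = [[sum(x[i][j] for i in range(n) if z[i] == k) for j in range(d)]
       for k in range(g)]
# x^w (n x m): row sums inside each column cluster
x_w = [[sum(x[i][j] for j in range(d) if w[j] == l) for l in range(m)]
       for i in range(n)]
# x^zw (g x m): number of ones in each block
x_zw = [[sum(x_w[i][l] for i in range(n) if z[i] == k) for l in range(m)]
        for k in range(g)]

# Criterion C(z, w, a): number of cells that differ from their block value
a = [[1, 0], [0, 1], [0, 0]]
criterion = sum(abs(x[i][j] - a[z[i]][w[j]])
                for i in range(n) for j in range(d))

print(x_zw, criterion)
```

With these inputs the block counts agree with $x^{zw}$ above, and the criterion counts the cells that disagree with the summary matrix.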

  10. Introduction, Binary data: Algorithm
     Minimization of $C(z, w, a)$ by alternated minimization of $C(z, a \mid w)$ and $C(w, a \mid z)$.

     Crobin (here $\lfloor x \rceil$ is the nearest integer function):
     - input: $x$, $g$, $m$
     - initialization: $z$, $w$, $a_{k\ell} = \left\lfloor \frac{x^{zw}_{k\ell}}{z_{.k} w_{.\ell}} \right\rceil$
     - repeat
       - $x^w_{i\ell} = \sum_j w_{j\ell} x_{ij}$
       - repeat
         - step 1. $z_i = \operatorname{argmin}_k \sum_\ell \lvert x^w_{i\ell} - w_{.\ell} a_{k\ell} \rvert$
         - step 2. $a_{k\ell} = \left\lfloor \frac{\sum_i z_{ik} x^w_{i\ell}}{z_{.k} w_{.\ell}} \right\rceil$
       - until convergence
       - $x^z_{kj} = \sum_i z_{ik} x_{ij}$
       - repeat
         - step 3. $w_j = \operatorname{argmin}_\ell \sum_k \lvert x^z_{kj} - z_{.k} a_{k\ell} \rvert$
         - step 4. $a_{k\ell} = \left\lfloor \frac{\sum_j w_{j\ell} x^z_{kj}}{z_{.k} w_{.\ell}} \right\rceil$
       - until convergence
     - until convergence
     - return $z$, $w$, $a$
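The alternating scheme above can be sketched in plain Python. This is a hedged illustration rather than a reference implementation: the function name `crobin`, the 0-based labels, the `max_iter` cap on every loop, and the conventions that ties and empty blocks round to 0 and that argmin ties go to the lowest cluster index are all assumptions made here. The initial partitions are passed in explicitly so that the run is deterministic.

```python
def crobin(x, z, w, g, m, max_iter=50):
    """Alternate row and column reassignments to reduce C(z, w, a) on binary x."""
    n, d = len(x), len(x[0])
    z, w = list(z), list(w)

    def sizes(labels, k):
        return [labels.count(c) for c in range(k)]

    def modes(z, w):
        # a_kl = nearest integer to x^zw_kl / (z.k * w.l)
        # Assumption: ties and empty blocks give 0
        zs, ws = sizes(z, g), sizes(w, m)
        ones = [[0] * m for _ in range(g)]
        for i in range(n):
            for j in range(d):
                ones[z[i]][w[j]] += x[i][j]
        return [[1 if 2 * ones[k][l] > zs[k] * ws[l] else 0
                 for l in range(m)] for k in range(g)]

    a = modes(z, w)
    for _ in range(max_iter):
        before = (list(z), list(w))

        # Row phase (w fixed): work on x^w, steps 1 and 2
        ws = sizes(w, m)
        xw = [[sum(x[i][j] for j in range(d) if w[j] == l) for l in range(m)]
              for i in range(n)]
        for _ in range(max_iter):
            new_z = [min(range(g), key=lambda k: sum(
                abs(xw[i][l] - ws[l] * a[k][l]) for l in range(m)))
                for i in range(n)]
            if new_z == z:
                break
            z = new_z
            a = modes(z, w)

        # Column phase (z fixed): work on x^z, steps 3 and 4
        zs = sizes(z, g)
        xz = [[sum(x[i][j] for i in range(n) if z[i] == k) for j in range(d)]
              for k in range(g)]
        for _ in range(max_iter):
            new_w = [min(range(m), key=lambda l: sum(
                abs(xz[k][j] - zs[k] * a[k][l]) for k in range(g)))
                for j in range(d)]
            if new_w == w:
                break
            w = new_w
            a = modes(z, w)

        if (z, w) == before:
            break
    return z, w, a


# Binary example from the slides: rows a..j, columns 1..10
x = [
    [1, 0, 1, 0, 1, 0, 0, 1, 0, 1],
    [0, 1, 0, 1, 0, 1, 1, 0, 1, 0],
    [1, 0, 0, 0, 0, 0, 0, 1, 1, 0],
    [1, 0, 1, 0, 0, 0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0, 1, 1, 0, 1, 0],
    [0, 1, 0, 0, 0, 1, 1, 0, 1, 0],
    [0, 1, 0, 0, 0, 0, 0, 1, 0, 1],
    [1, 0, 1, 0, 1, 1, 0, 1, 1, 1],
    [1, 0, 0, 1, 0, 0, 0, 0, 0, 1],
    [0, 1, 0, 1, 0, 0, 1, 0, 0, 0],
]
z0 = [0, 1, 2, 0, 1, 1, 2, 0, 2, 1]  # slide partition A=0, B=1, C=2
w0 = [0, 1, 0, 1, 0, 1, 1, 0, 1, 0]  # slide partition 1 -> 0, 2 -> 1

z_hat, w_hat, a_hat = crobin(x, z0, w0, g=3, m=2)
print(z_hat, w_hat, a_hat)
```

Started from the partition shown in the binary example, the sketch leaves it unchanged, which suggests that partition is a fixed point of the alternating updates under the tie-breaking conventions assumed here. Working on the reduced matrices $x^w$ and $x^z$ instead of $x$ is what keeps each reassignment step cheap.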
