Contributions to Large Scale Data Clustering and Streaming with Affinity Propagation. Application to Autonomic Grids.


  1. Contributions to Large Scale Data Clustering and Streaming with Affinity Propagation. Application to Autonomic Grids. Xiangliang Zhang. Direction de thèse : Michèle Sebag et Cécile Germain-Renaud. TAO, LRI, INRIA, CNRS, Université de Paris-Sud. July 28, 2010

  2. Motivations: Autonomic Computing. The major part of the cost of computing systems is management.

  3. Goals of Autonomic Computing
     AUTONOMIC VISION & MANIFESTO, http://www.research.ibm.com/autonomic/manifesto/
     Self-managing systems with the ability of:
     ◮ Self-healing: detect, diagnose and repair problems
     ◮ Self-configuring: automatically incorporate and configure components
     ◮ Self-optimizing: ensure optimal functioning w.r.t. defined requirements
     ◮ Self-protecting: anticipate and defend against security breaches
     How:
     ◮ the pre-requisite is to have a model of the system behavior
     ◮ there is no model based on first principles
     Machine Learning and Data Mining for Autonomic Computing [Rish et al., 2005]

  4. Autonomic Grid Computing System. EGEE: Enabling Grids for E-sciencE, http://www.eu-egee.org. Infrastructure projects: DataGrid (2002-2004), EGEE-I (2004-2006), EGEE-II (2006-2008), EGEE-III (2008-2010) and EGI (2010-2013).

  5. Summarizing a dataset
     ◮ Clustering: grouping similar points in the same group (cluster)
     ◮ Extracting exemplars: actual objects from the dataset, better suited to complex application domains (e.g., molecules, structured items)
     [Figure: 2D point cloud; ∗ is the averaged center, o is the exemplar]

  6. Position of the problem
     Given:
     ◮ Data: E = {x_1, x_2, ..., x_N}
     ◮ Distance: d(x_i, x_j)
     Define:
     ◮ Exemplars: {e_i}, a subset of E
     ◮ Distortion: D({e_i}) = Σ_{i=1}^{N} min_{e ∈ {e_i}} d²(x_i, e)
     Goal: find a mapping σ, x_i → σ(x_i) ∈ {e_i}, minimizing the distortion.
     NB: a combinatorial (NP-hard) optimization problem.
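
To make the objective concrete, here is a small NumPy sketch that assigns each point to its closest exemplar and evaluates the resulting distortion. The helper name and the toy data are illustrative choices, not material from the thesis.

```python
import numpy as np

def assign_and_distortion(X, exemplar_idx):
    """Map each point to its closest exemplar (the sigma of the slide)
    and return the distortion sum_i d^2(x_i, sigma(x_i))."""
    d2 = np.sum((X[:, None, :] - X[exemplar_idx][None, :, :]) ** 2, axis=-1)
    sigma = np.array(exemplar_idx)[np.argmin(d2, axis=1)]  # chosen exemplar per point
    return sigma, d2.min(axis=1).sum()

# Toy data: two small groups in 2D, exemplar candidates at indices 0 and 3.
X = np.array([[0.0, 0.0], [0.1, 0.2], [-0.1, 0.1],
              [3.0, 3.0], [3.2, 2.9], [2.8, 3.1]])
sigma, D = assign_and_distortion(X, [0, 3])
print(sigma, D)
```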

  7. Streaming: extracting exemplars in real-time
     Job stream: jobs submitted by the grid users 24/7, more than 200 jobs/min.
     How to build a summary of the job stream?
     Feature                      | Requirement
     streaming of jobs            | actual jobs as exemplars, for traceability
     arriving fast                | real-time processing
     user-visible                 | model available at any time
     non-stationary distribution  | change detection

  8. Contents
     ◮ Motivations
     ◮ Clustering: The State of the Art; Large-scale Data Clustering
     ◮ Streaming: Data Stream Clustering
     ◮ Application to Autonomic Computing: A Multi-scale Real-time Grid Monitoring System
     ◮ Conclusions and Perspectives

  9. Clustering: The State of the Art
     [Figure: 2D point cloud with cluster centers]
     ◮ Averaged centers [Bradley et al., 1997]:
       k-means, minimizing the sum of squared distances from a point to its center
       k-medians, minimizing the sum of distances from a point to its center
       k-centers, minimizing the maximum distance from a point to its center
     ◮ Exemplars [Kaufman and Rousseeuw, 1987], minimizing the sum of squared distances from a point to its exemplar:
       k-medoids [Kaufman and Rousseeuw, 1990, Ng and Han, 1994]
       Affinity Propagation [Frey and Dueck, 2007]

  10. Main clustering algorithms
     ◮ Partitioning methods: k-means, k-medians, k-centers, k-medoids
     ◮ Hierarchical methods: linkage-based clustering (AHC)
       BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies [Zhang et al., 1996]
       CURE: Clustering Using REpresentatives [Guha et al., 1998]
       ROCK: RObust Clustering using linKs [Guha et al., 1999]
       CHAMELEON: dynamic model to measure similarity of clusters [Karypis et al., 1999]
     ◮ Arbitrarily shaped clusters:
       DBSCAN: density-based clustering [Ester, 1996]
       OPTICS: Ordering Points To Identify the Clustering Structure [Ankerst et al., 1999]
     ◮ Model-based methods:
       Naive Bayes model [Meila and Heckerman, 2001]
       Mixture of Gaussian models [Banfield and Raftery, 1993]
       Neural networks (SOM, Self-Organizing Map) [Kohonen, 1981]
     ◮ Spectral clustering [Ng et al., 2001]: a more recent approach based on the eigen-decomposition of the pairwise similarity matrix

  11. Clustering vs Classification
     NIPS 2005, 2009 workshops on Theoretical Foundations of Clustering (Shai Ben-David, Ulrike von Luxburg, John Shawe-Taylor, Naftali Tishby)
                  | Classification        | Clustering
     Groups       | K classes (given)     | clusters (number unknown)
     Quality      | generalization error  | many cost functions
     Focus on     | test set              | training set
     Goal         | prediction            | interpretation
     Analysis     | discriminant          | exploratory
     Field        | mature                | new

  12-13. Open questions of clustering
     ◮ The number of clusters:
       k-means, k-medians, k-centers, k-medoids: set by the user
       model-based methods: determined by the user
       Affinity Propagation: indirectly set by the user
     ◮ Optimality w.r.t. distortion
     ◮ Generalization property: stability w.r.t. the data sample/distribution
     → Affinity Propagation (AP) [Frey and Dueck, 2007]

  14-21. Iterations of message passing in AP (the same view shown over successive iterations)

  22. The AP framework
     Input:
     ◮ Data: x_1, x_2, ..., x_N
     ◮ Distance: d(x_i, x_j)
     Find: σ : x_i → σ(x_i), the exemplar representing x_i, maximizing Σ_{i=1}^{N} S(x_i, σ(x_i)),
     where S(x_i, x_j) = −d²(x_i, x_j) if i ≠ j, and S(x_i, x_i) = −s*, with s* ≥ 0 a user-defined parameter (the preference):
     ◮ s* = ∞: only one exemplar (one cluster)
     ◮ s* = 0: every point is an exemplar (N clusters)
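
As a concrete illustration of this setup, the snippet below builds the similarity matrix S from a data matrix and the preference s*. It is a sketch under the slide's definitions; the function name is an illustrative choice.

```python
import numpy as np

def ap_similarity(X, s_star):
    """S(x_i, x_j) = -d^2(x_i, x_j) off the diagonal, S(x_i, x_i) = -s_star."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # pairwise squared distances
    S = -d2
    np.fill_diagonal(S, -s_star)  # large s_star -> few exemplars; s_star = 0 -> N exemplars
    return S
```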

  23. AP: a message passing algorithm

  24. Messages passed
     Responsibilities:
     r(i, k) = S(x_i, x_k) − max_{k' ≠ k} { a(i, k') + S(x_i, x_{k'}) }
     r(k, k) = S(x_k, x_k) − max_{k' ≠ k} { S(x_k, x_{k'}) }
     Availabilities:
     a(i, k) = min{ 0, r(k, k) + Σ_{i' ∉ {i, k}} max{ 0, r(i', k) } }
     a(k, k) = Σ_{i' ≠ k} max{ 0, r(i', k) }
     The exemplar σ(x_i) associated to x_i is finally defined by:
     σ(x_i) = argmax_k { r(i, k) + a(i, k), k = 1...N }
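
The update rules above translate almost line by line into NumPy. The sketch below is a minimal vectorized implementation (with the customary damping factor, which the slide does not show); it illustrates the algorithm rather than reproducing the thesis code, and it reuses the ap_similarity helper from the previous sketch.

```python
import numpy as np

def affinity_propagation(S, max_iter=200, damping=0.5):
    """Run the AP message-passing updates on a similarity matrix S
    (off-diagonal: -d^2(x_i, x_j); diagonal: the preference -s*)."""
    N = S.shape[0]
    R = np.zeros((N, N))   # responsibilities r(i, k)
    A = np.zeros((N, N))   # availabilities  a(i, k)
    rows = np.arange(N)
    for _ in range(max_iter):
        # r(i,k) = S(i,k) - max_{k' != k} [ a(i,k') + S(i,k') ]
        AS = A + S
        k_max = np.argmax(AS, axis=1)
        first = AS[rows, k_max]
        AS[rows, k_max] = -np.inf
        second = np.max(AS, axis=1)
        max_excl = np.full((N, N), first[:, None])
        max_excl[rows, k_max] = second            # per-entry max excluding column k itself
        R = damping * R + (1 - damping) * (S - max_excl)
        # a(i,k) = min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        # a(k,k) = sum_{i' != k} max(0, r(i',k))
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, np.diag(R))          # keep r(k,k) untruncated in the column sums
        col = Rp.sum(axis=0)
        A_new = col[None, :] - Rp                 # sum over i' != i of each column
        diag = np.diag(A_new).copy()              # a(k,k) is not clipped at 0
        A_new = np.minimum(0, A_new)
        np.fill_diagonal(A_new, diag)
        A = damping * A + (1 - damping) * A_new
    # sigma(x_i) = argmax_k [ r(i,k) + a(i,k) ]
    return np.argmax(R + A, axis=1)
```

Calling affinity_propagation(ap_similarity(X, s_star)) returns, for each point, the index of its exemplar.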

  25. Summary of AP
     Affinity Propagation (AP):
     ◮ an exemplar-based clustering method
     ◮ a message passing (belief propagation) algorithm
     ◮ parameterized by s* (not by the number of clusters K)
     Computational complexity:
     ◮ similarity computation: O(N²)
     ◮ message passing: O(N² log N)
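
For comparison, off-the-shelf implementations expose the same single parameter. A minimal usage example with scikit-learn's AffinityPropagation, whose preference argument plays the role of −s* (the data and parameter values here are arbitrary; random_state is available in recent scikit-learn versions):

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

X = np.random.RandomState(0).randn(200, 2)
ap = AffinityPropagation(preference=-50, damping=0.9, random_state=0).fit(X)
print(ap.cluster_centers_indices_)   # indices of the exemplars in X
print(ap.labels_)                    # exemplar assignment of each point
```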

  26. Contents
     ◮ Motivations
     ◮ Clustering: The State of the Art; Large-scale Data Clustering
     ◮ Streaming: Data Stream Clustering
     ◮ Application to Autonomic Computing: A Multi-scale Real-time Grid Monitoring System
     ◮ Conclusions and Perspectives

  27-28. Hierarchical AP: divide-and-conquer (inspired by [Nittel et al., 2004])

  29. Weighted AP (WAP)
     From AP to WAP: each point x_i carries a weight n_i, the number of points it stands for.
     ◮ S(x_i, x_j) → n_i × S(x_i, x_j): price for x_i to select x_j as its exemplar
     ◮ S(x_i, x_i) → S(x_i, x_i) + (n_i − 1) × ε_i: price for x_i to be selected as an exemplar, where ε_i is the variance of the n_i points carried by x_i
     Theorem: AP(x_1, ..., x_1 (n_1 copies), x_2, ..., x_2 (n_2 copies), ...) == WAP((x_1, n_1), (x_2, n_2), ...)
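
A sketch of the corresponding similarity construction, following the two substitutions on this slide. The function name is illustrative, and the per-point weights n_i and variances ε_i are assumed to come from a preceding aggregation step.

```python
import numpy as np

def wap_similarity(X, n, eps, s_star):
    """WAP similarities for weighted points (x_i, n_i):
    off-diagonal S'(x_i, x_j) = n_i * (-d^2(x_i, x_j)),
    diagonal     S'(x_i, x_i) = -s_star + (n_i - 1) * eps_i,
    with eps_i the variance of the n_i points aggregated into x_i (as on the slide)."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    S = -d2 * np.asarray(n, dtype=float)[:, None]     # row i scaled by its weight n_i
    np.fill_diagonal(S, -s_star + (np.asarray(n) - 1) * np.asarray(eps))
    return S
```

The resulting matrix can be passed to the same message-passing loop sketched earlier.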

  30. Hi-AP: Hierarchical AP
     ◮ The complexity of Hi-AP is O(N^{3/2}) [Zhang et al., 2008]

  31. Hi-AP: Hierarchical AP
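
To fix ideas, here is a one-level toy sketch of the divide-and-conquer scheme: the data are split into subsets, AP is run on each, and the collected exemplars are clustered again. In the thesis the upper level uses WAP to carry the subset cluster sizes; this sketch uses plain AP for brevity and reuses the illustrative ap_similarity and affinity_propagation helpers defined earlier.

```python
import numpy as np

def hi_ap_one_level(X, s_star, n_subsets, seed=0):
    """Hierarchical AP, single split level: cluster each random subset with AP,
    then cluster the union of the exemplars found in every subset."""
    rng = np.random.default_rng(seed)
    subsets = np.array_split(rng.permutation(len(X)), n_subsets)
    collected = []
    for idx in subsets:
        labels = affinity_propagation(ap_similarity(X[idx], s_star))
        collected.extend(idx[np.unique(labels)])      # global indices of subset exemplars
    collected = np.array(collected)
    labels = affinity_propagation(ap_similarity(X[collected], s_star))
    return collected[np.unique(labels)]               # final exemplars (global indices)
```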

  32. Complexity of Hi-AP
     Theorem: Hi-AP reduces the complexity to O(N^{(h+2)/(h+1)}) [Zhang et al., 2009]
     ◮ K: number of exemplars to be clustered on average
     ◮ b = (N/K)^{1/(h+1)}: branching factor
     ◮ K² (N/K)^{2/(h+1)}: complexity of each branching
     ◮ Σ_{i=0}^{h} b^i = (b^{h+1} − 1)/(b − 1): total number of branchings
     Therefore, the total computational complexity is
     C(h) = K² (N/K)^{2/(h+1)} × ((N/K) − 1) / ((N/K)^{1/(h+1)} − 1) ≈ K² (N/K)^{(h+2)/(h+1)}  for N ≫ K
     Particular cases: C(0) = N² and C(1) ∝ N^{3/2}
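
A quick numeric sanity check of this formula (the values of N and K below are arbitrary illustrations): C(0) falls back to N², and C(1) grows as N^{3/2}, as stated on the slide.

```python
def hi_ap_cost(N, K, h):
    """C(h) = K^2 (N/K)^{2/(h+1)} * ((N/K) - 1) / ((N/K)^{1/(h+1)} - 1)."""
    r = N / K
    return K**2 * r**(2 / (h + 1)) * (r - 1) / (r**(1 / (h + 1)) - 1)

N, K = 10**6, 10
for h in range(4):
    print(h, f"{hi_ap_cost(N, K, h):.3e}")
# h = 0 gives 1.0e+12 = N^2; h = 1 gives ~3.2e+09, i.e. sqrt(K) * N^(3/2)
```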

  33. Study of the distortion loss
     ◮ true center of the data distribution N(μ, σ²): μ
     ◮ empirical center of n data samples: μ̂_n
     ◮ distance distribution: x_i − μ̂_n ∼ N(0, σ² + σ²/n)
     ◮ selected center (exemplar): μ̄_n, the sample closest to the empirical center μ̂_n
     ◮ distance distribution: |μ̄_n − μ̂_n| = min_i |x_i − μ̂_n| ∼ Weibull distribution (Type III extreme value distribution)
     [Figure: distributions of the averaged center and of the selected center (exemplar)]
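
The gap between the exemplar and the empirical center can be checked by simulation: the sketch below draws n Gaussian samples, takes the sample closest to the empirical mean as the exemplar, and records its distance. The dimension, sample size and seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, sigma = 100, 10_000, 1.0
gaps = np.empty(trials)
for t in range(trials):
    x = rng.normal(0.0, sigma, size=(n, 2))              # samples from N(mu, sigma^2), mu = 0
    mu_hat = x.mean(axis=0)                               # empirical center mu_hat_n
    gaps[t] = np.linalg.norm(x - mu_hat, axis=1).min()    # |mu_bar_n - mu_hat_n|
# `gaps` is a minimum of distances, hence follows an extreme-value law,
# which the slide identifies as a Weibull distribution.
print(gaps.mean(), gaps.std())
```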

  34. Weibull distribution (Type III extreme value distribution), where k is the shape parameter.
     [Figure: Weibull density curves for shape parameters k = −1.5, −1.2, −0.9, −0.6, −0.3]
