Be certain of how-to before mining uncertain data


  1. Be certain of how-to before mining uncertain data
     F. Gullo (Yahoo Labs, Barcelona, Spain), G. Ponti (ENEA Research Center, Portici (NA), Italy), A. Tagarelli (University of Calabria, Cosenza, Italy)
     7th European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2014), September 15-19, 2014, Nancy (France)
     Giovanni Ponti, Be certain of how-to before mining uncertain data

  2. Uncertainty
     Uncertainty inherently affects data from a wide range of emerging application domains:
     - sensor data
     - location-based services (e.g., moving objects data)
     - biomedical and biometric data (e.g., gene expression data)
     - distributed applications
     - RFID data
     It is generally due to noise factors, such as signal noise, instrumental errors, and wireless transmission.

  3. Uncertainty
     [Figure: panels (a), (b), (c)]

  4. Uncertainty representation
     Different granularities:
     - table
     - tuple
     - attribute
     Different models:
     - fuzzy
     - evidence-oriented
     - probabilistic
     Attribute-level uncertainty modeled according to a probabilistic model (i.e., a probability distribution) ⇒ uncertain object

  5. Uncertain object
     Modeling by regions (domains) of definition and probability density functions (pdfs)
     [Figure borrowed from [Kriegel and Pfeifle, ICDM 2005]]

  6. Uncertain object
     - m-dimensional region
     - multivariate pdf defined over the region
     Definition (uncertain object). An uncertain object $o$ is a pair $(R, f)$:
     - $R \subseteq \Re^m$ is the m-dimensional domain region in which $o$ is defined
     - $f : \Re^m \to \Re_0^+$ is the probability density function of $o$ at each point $x \in \Re^m$, such that $f(x) > 0$, $\forall x \in R$, and $f(x) = 0$, $\forall x \in \Re^m \setminus R$
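
To make the definition concrete, here is a minimal Python sketch of a one-dimensional uncertain object (m = 1). The class name `UncertainObject`, the helper `uniform_object`, and the choice of a uniform pdf are illustrative assumptions, not something taken from the slides.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class UncertainObject:
    """Hypothetical 1-D uncertain object o = (R, f), with R = [low, high]."""
    region: Tuple[float, float]
    density: Callable[[float], float]  # assumed to be positive inside the region

    def pdf(self, x: float) -> float:
        low, high = self.region
        # f(x) > 0 inside R and f(x) = 0 outside R, as in the definition.
        return self.density(x) if low <= x <= high else 0.0


def uniform_object(low: float, high: float) -> UncertainObject:
    """Uncertain object whose pdf is uniform over [low, high] (an assumption)."""
    return UncertainObject((low, high), lambda x: 1.0 / (high - low))


o = uniform_object(2.0, 4.0)
print(o.pdf(3.0), o.pdf(5.0))
```

The region check lives in `pdf` rather than in the density itself, so any positive function can be plugged in without re-implementing the "zero outside R" rule.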

  7. Dealing with uncertainty
     Two main general tasks:
     1. Defining a proximity measure between uncertain objects: needed in almost all major data-management and data-mining tasks (e.g., visualization, classification, clustering)
     2. Defining a model to summarize a set of uncertain objects: required for tasks like data compression or clustering, and to speed up complex data-analysis/management tasks

  8. Similarity detection in uncertain data

  9. Distance between uncertain objects
     Traditional approaches:
     1. Difference between expected values
     2. Expected Distance (ED):
        $ED(o_1, o_2) = \int_{x \in R_1} \int_{y \in R_2} \lVert x - y \rVert_2^2 \, f_1(x) \, f_2(y) \, dx \, dy$
     Main drawbacks:
     1. Difference between expected values is inaccurate: it considers only very little of the information stored in the pdfs
     2. Expected distance is slow: it has quadratic complexity in the number of statistical samples used to represent/approximate the pdfs
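
The two traditional measures and their drawbacks can be illustrated with a small sample-based sketch. It assumes each pdf is approximated by equally weighted samples and that the squared Euclidean norm is used inside the integral; the function names are hypothetical.

```python
def squared_distance(x, y):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def expected_distance(samples_1, samples_2):
    """Sample-based estimate of ED(o1, o2): average over all sample pairs."""
    total = 0.0
    for x in samples_1:        # |S1| iterations
        for y in samples_2:    # times |S2| iterations: quadratic cost
            total += squared_distance(x, y)
    return total / (len(samples_1) * len(samples_2))


def mean(samples):
    """Component-wise mean of the samples (estimate of the expected value)."""
    return [sum(column) / len(samples) for column in zip(*samples)]


def expected_value_distance(samples_1, samples_2):
    """Distance between expected values only: linear cost, ignores spread."""
    return squared_distance(mean(samples_1), mean(samples_2))


# Two objects with the same expected value but a different spread.
o1 = [(0.0,), (2.0,)]
o2 = [(1.0,), (1.0,)]
print(expected_value_distance(o1, o2), expected_distance(o1, o2))
```

In this example the expected values coincide, so the first measure cannot tell the objects apart, while the expected distance can, at the price of visiting every pair of samples.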

  10. Distance between uncertain objects
      Need for a novel distance measure that trades off between accuracy and efficiency
      - Idea: resort to Information Theory
      - Information Theory alone is not enough

  11. Distance measures for pdfs
      Information-theoretic (IT) measures: Kullback-Leibler (KL), Chernoff, Hellinger, ...
      IT measures are accurate, but they only work for pdfs whose probability values overlap over a reasonably large area
      [Figure: panels (a) and (b)]

  12. Compound distance for uncertain objects
      $\Delta(o_i, o_j) = f(\Delta_{IT}(o_i, o_j), \Delta_{EV}(o_i, o_j))$
      - $\Delta_{IT}$ involves a comparison by means of a certain IT measure
      - $\Delta_{EV}$ measures the distance proportionally to the difference of the expected values
      Two critical choices for defining $\Delta$:
      1. IT measure used for $\Delta_{IT}$ ⇒ Hellinger distance ($H$):
         $H(f, f') = \sqrt{1 - \rho(f, f')}$, where $\rho(f, f') = \int_{x \in \Re^m} \sqrt{f(x) \, f'(x)} \, dx$
      2. Way of combining $\Delta_{IT}$ and $\Delta_{EV}$ ⇒ $\Delta_{IT}$ should prevail over $\Delta_{EV}$ as long as discriminating among different cases by means of IT measures is possible
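
A numerical sketch of the Hellinger distance for one-dimensional pdfs follows. The Gaussian pdfs, the integration bounds, and the midpoint-rule integration are assumptions made for illustration; $\rho$ is the quantity usually called the Bhattacharyya coefficient.

```python
import math


def gaussian_pdf(mu, sigma):
    """1-D Gaussian density, used here only as an example pdf."""
    def f(x):
        z = (x - mu) / sigma
        return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    return f


def bhattacharyya_coefficient(f, g, low, high, steps=10_000):
    """Midpoint-rule approximation of rho(f, g) = integral of sqrt(f(x) g(x))."""
    width = (high - low) / steps
    total = 0.0
    for i in range(steps):
        x = low + (i + 0.5) * width
        total += math.sqrt(f(x) * g(x))
    return total * width


def hellinger(f, g, low, high):
    """H(f, g) = sqrt(1 - rho(f, g)); the clamp guards against rounding."""
    rho = bhattacharyya_coefficient(f, g, low, high)
    return math.sqrt(max(0.0, 1.0 - rho))


base = gaussian_pdf(0.0, 1.0)
print(hellinger(base, gaussian_pdf(2.0, 1.0), -10.0, 12.0))
print(hellinger(base, gaussian_pdf(50.0, 1.0), -10.0, 60.0))
print(hellinger(base, gaussian_pdf(100.0, 1.0), -10.0, 110.0))
```

The last two calls both return a value of about 1: once the pdfs stop overlapping, the IT measure saturates and can no longer say which pair is farther apart, which is the case the $\Delta_{EV}$ component is meant to cover.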

  13. Compound distance for uncertain objects
      Definition (uncertain distance). The uncertain distance $\Delta(o, o')$ between two uncertain objects $o = (R, f)$ and $o' = (R', f')$ is defined as an expression over $H(f, f')$, $\rho(f, f')$, and $e^{-ED_2(\tilde{f}, \tilde{f}')}$, whose parts are labeled as the $\Delta_{EV}$ term, the $\Delta_{IT}$ term, and the combination between $\Delta_{IT}$ and $\Delta_{EV}$.
      $ED_2(\tilde{f}, \tilde{f}')$ is the expected distance between the uniform approximations of $f$ and $f'$.

  14. Centroid-based agglomerative hierarchical clustering
      F. Gullo, G. Ponti, A. Tagarelli, S. Greco [ICDM’08]
      Application: hierarchical clustering of uncertain objects
      Motivations:
      - Hierarchical clustering is computationally expensive: need for a fast (yet accurate) proximity measure
      - The way of combining $\Delta_{IT}$ and $\Delta_{EV}$ theoretically guarantees high accuracy in an agglomerative hierarchical clustering scheme
      The U-AHC Algorithm
      Input: a set of uncertain objects $\mathcal{D} = \{o_1, \ldots, o_n\}$
      Output: a set of partitions $\mathbf{D}$
         $\mathcal{C} \leftarrow \{\{o_1\}, \ldots, \{o_n\}\}$
         $\mathbf{D} \leftarrow \{\mathcal{C}\}$
         repeat
            let $C_i$, $C_j$ be the pair of clusters in $\mathcal{C}$ such that $\Delta(P_{C_i}, P_{C_j})$ is minimum
            $\mathcal{C} \leftarrow \mathcal{C} \setminus \{C_i, C_j\} \cup \{C_i \cup C_j\}$
            $\mathbf{D} \leftarrow \mathbf{D} \cup \{\mathcal{C}\}$
         until $|\mathcal{C}| = 1$
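
The agglomerative loop of U-AHC can be sketched in Python as follows. This is only an outline of the merge scheme: the `prototype_distance` argument stands in for $\Delta(P_{C_i}, P_{C_j})$, and `mean_gap` is a deliberately simple placeholder on plain numbers, not the uncertain distance of the slides.

```python
from itertools import combinations


def u_ahc(objects, prototype_distance):
    """Return the list of partitions produced by successive merges."""
    partition = [frozenset([i]) for i in range(len(objects))]
    dendrogram = [list(partition)]
    while len(partition) > 1:
        # Pick the pair of clusters whose prototypes are closest.
        c_i, c_j = min(
            combinations(partition, 2),
            key=lambda pair: prototype_distance(objects, pair[0], pair[1]),
        )
        # Replace the two clusters with their union.
        partition = [c for c in partition if c not in (c_i, c_j)]
        partition.append(c_i | c_j)
        dendrogram.append(list(partition))
    return dendrogram


def mean_gap(objects, c_i, c_j):
    """Placeholder distance: gap between the cluster means of plain numbers."""
    def mean(cluster):
        return sum(objects[i] for i in cluster) / len(cluster)
    return abs(mean(c_i) - mean(c_j))


levels = u_ahc([0.0, 0.1, 5.0, 5.2], mean_gap)
for level in levels:
    print([sorted(cluster) for cluster in level])
```

Every iteration scans all cluster pairs, which is why the cost of a single distance evaluation dominates the running time of the whole scheme.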

  15. Uncertain data summarization

  16. Summarization of a set of uncertain objects
      Traditional approaches (e.g., Chau et al., UK-means, PAKDD’06) ⇒ uncertain prototype defined as the average of the expected values of the objects to be summarized
      Main drawbacks:
      - Deterministic representation ⇒ a lot of information is discarded
      - Only central tendency is expressed ⇒ variance is completely ignored

  17. Summarization of a set of uncertain objects
      [Figure] Uncertain objects with the same central tendency: lower-variance, more-compact cluster (left) and higher-variance, less-compact cluster (right)
      [Figure] Uncertain objects with different central tendency: lower-variance, less-compact cluster (left) and higher-variance, more-compact cluster (right)

  18. Summarization of a set of uncertain objects
      Solutions:
      1. Mixture-model-based uncertain data summarization
      2. Random-variable-based uncertain data summarization

  19. Mixture-model-based uncertain data summarization
      Idea: compute a prototype of a set of uncertain objects as a mixture model:
      - set of uncertain objects $S = \{o_i\}_{i=1}^{k}$
      - uncertain prototype $P_S = (R_S, f_S)$, where
        $R_S = \bigcup_{o = (R, f) \in S} R$, $f_S(x) = |S|^{-1} \sum_{o = (R, f) \in S} f(x)$
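
A small sketch of the mixture-model prototype for one-dimensional objects. Representing each object as an `((low, high), pdf)` pair and $R_S$ as a list of intervals are assumptions made for brevity; the `uniform` helper is hypothetical.

```python
def uniform(low, high):
    """Example uncertain object: uniform pdf over [low, high], zero elsewhere."""
    def pdf(x):
        return 1.0 / (high - low) if low <= x <= high else 0.0
    return (low, high), pdf


def mixture_prototype(objects):
    """Return (R_S, f_S) for a list of ((low, high), pdf) pairs."""
    regions = [region for region, _ in objects]  # R_S kept as a union of intervals
    pdfs = [pdf for _, pdf in objects]

    def f_s(x):
        # f_S(x) is the equal-weight average of the member pdfs.
        return sum(pdf(x) for pdf in pdfs) / len(pdfs)

    return regions, f_s


regions, f_s = mixture_prototype([uniform(0.0, 2.0), uniform(1.0, 3.0)])
print(regions, f_s(0.5), f_s(1.5), f_s(4.0))
```

Unlike an average of expected values, the prototype is itself a pdf, so it retains the spread of the objects it summarizes.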

  20. Mixture-model-based uncertain data summarization
      Despite its simplicity, the mixture-model-based prototype plays a key role in the task of clustering uncertain objects: it enables a novel clustering criterion that does not require any distance measure between uncertain objects ⇒ minimizing the variance of cluster prototypes
      [Figure: panels (a), (b), (c), (d). (a)–(c): sets of uncertain objects; (b)–(d): the corresponding mixture models]

  21. Minimizing the variance of cluster mixture models for clustering uncertain objects
      F. Gullo, G. Ponti, A. Tagarelli [ICDM’10, SAM’13]
      A novel criterion for clustering uncertain objects: minimizing the variance of cluster mixture models
         $J(\mathcal{C}) = \sum_{C \in \mathcal{C}} \sigma^2(P_C)$
      - accuracy: the lower the variance, the higher the cluster compactness
      - efficiency: capability of exploiting interesting analytical properties
      Computing objective function $J$:
      - Moving object $o$ from $C \in \mathcal{C}$ to $\widehat{C} \in \mathcal{C}$ leads to a new clustering $\mathcal{C}' = \mathcal{C} \setminus \{C, \widehat{C}\} \cup \{C', \widehat{C}'\}$, where $C' = C \setminus \{o\}$ and $\widehat{C}' = \widehat{C} \cup \{o\}$
      - $J(\mathcal{C}')$ can be efficiently computed in $O(m)$ as:
        $J(\mathcal{C}') = J(\mathcal{C}) - (\sigma^2(P_C) + \sigma^2(P_{\widehat{C}})) + (\sigma^2(P_{C'}) + \sigma^2(P_{\widehat{C}'}))$
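
The incremental update of $J$ can be sketched as below. The sketch assumes that each object is summarized by its first and second moment vectors, that each cluster keeps the sums of those vectors plus its size, and that $\sigma^2$ of a prototype is the sum of its per-dimension variances. It relies on the fact that the moments of an equal-weight mixture are the averages of the members' moments. All names are hypothetical.

```python
def variance(cluster):
    """sigma^2 of an equal-weight mixture, from (sum_mu, sum_mu2, size)."""
    sum_mu, sum_mu2, size = cluster
    if size == 0:
        return 0.0
    return sum(s2 / size - (s1 / size) ** 2 for s1, s2 in zip(sum_mu, sum_mu2))


def shifted(cluster, obj, sign):
    """Cluster summary after adding (sign=+1) or removing (sign=-1) obj: O(m)."""
    sum_mu, sum_mu2, size = cluster
    mu, mu2 = obj
    return ([s + sign * v for s, v in zip(sum_mu, mu)],
            [s + sign * v for s, v in zip(sum_mu2, mu2)],
            size + sign)


def j_after_move(j, src, dst, obj):
    """J(C') = J(C) - (var(src) + var(dst)) + (var(src') + var(dst'))."""
    new_src, new_dst = shifted(src, obj, -1), shifted(dst, obj, +1)
    return (j - (variance(src) + variance(dst))
            + (variance(new_src) + variance(new_dst)))


# 1-D objects with means 0, 2, 10 and unit variance, so mu2 = mean^2 + 1.
a, b, c = ([0.0], [1.0]), ([2.0], [5.0]), ([10.0], [101.0])
src = ([10.0], [102.0], 2)   # cluster {a, c}
dst = ([2.0], [5.0], 1)      # cluster {b}
j = variance(src) + variance(dst)
print(j, j_after_move(j, src, dst, a))
```

Only the two clusters touched by the move are re-evaluated, and each evaluation reads m-dimensional sums, which is where the $O(m)$ cost comes from.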

  22. The MMVar algorithm
      Input: a set $\mathcal{D}$ of uncertain objects; the number $k$ of output clusters
      Output: a partition $\mathcal{C}$ of $\mathcal{D}$
         compute $\mu(o)$, $\mu_2(o)$, $\forall o \in \mathcal{D}$
         $\mathcal{C} \leftarrow$ randomPartition($\mathcal{D}$, $k$)
         compute $\mu(P_C)$, $\mu_2(P_C)$, $\forall C \in \mathcal{C}$
         $v \leftarrow J(\mathcal{C})$
         repeat
            for all $o \in \mathcal{D}$ do
               let $C \in \mathcal{C}$ be the cluster s.t. $o \in C$
               $C^* \leftarrow \arg\min_{\widehat{C}} J_{\mathcal{C}}(C, o, \widehat{C})$
               if $C^* \neq C$ then
                  $v \leftarrow J_{\mathcal{C}}(C, o, C^*)$
                  recompute $\mathcal{C}$ by moving $o$ from $C$ to $C^*$
                  recompute $\mu(P_C)$, $\mu_2(P_C)$, $\mu(P_{C^*})$, $\mu_2(P_{C^*})$
         until no object in $\mathcal{D}$ is relocated
      Properties:
      - MMVar converges to a local optimum of function $J$ in a finite number $I$ of iterations
      - MMVar works in $O(I \, k \, |\mathcal{D}| \, m)$
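
The relocation loop can be sketched in Python as follows. This is a simplified reading of the pseudocode under several assumptions: objects are given as `(mu, mu2)` moment vectors, $\sigma^2$ is the sum of per-dimension variances, an object is relocated only when $J$ strictly decreases, and a cluster is never emptied. The function names are hypothetical.

```python
import random


def variance(stat):
    """sigma^2 of an equal-weight mixture, from (sum_mu, sum_mu2, size)."""
    sum_mu, sum_mu2, size = stat
    if size == 0:
        return 0.0
    return sum(s2 / size - (s1 / size) ** 2 for s1, s2 in zip(sum_mu, sum_mu2))


def shifted(stat, obj, sign):
    """Cluster summary after adding (sign=+1) or removing (sign=-1) obj."""
    mu, mu2 = obj
    return ([s + sign * v for s, v in zip(stat[0], mu)],
            [s + sign * v for s, v in zip(stat[1], mu2)],
            stat[2] + sign)


def mmvar(objects, k, seed=0):
    """objects: list of (mu, mu2) moment vectors. Returns one label per object."""
    m = len(objects[0][0])
    labels = [i % k for i in range(len(objects))]  # every cluster starts non-empty
    random.Random(seed).shuffle(labels)            # random initial partition
    stats = [([0.0] * m, [0.0] * m, 0) for _ in range(k)]
    for obj, label in zip(objects, labels):
        stats[label] = shifted(stats[label], obj, +1)
    relocated = True
    while relocated:                               # until no object is relocated
        relocated = False
        for idx, obj in enumerate(objects):
            src = labels[idx]
            if stats[src][2] == 1:
                continue                           # assumption: never empty a cluster
            best, best_gain = src, 1e-12           # require a strict decrease of J
            for dst in range(k):
                if dst == src:
                    continue
                before = variance(stats[src]) + variance(stats[dst])
                after = (variance(shifted(stats[src], obj, -1))
                         + variance(shifted(stats[dst], obj, +1)))
                if before - after > best_gain:
                    best, best_gain = dst, before - after
            if best != src:
                stats[src] = shifted(stats[src], obj, -1)
                stats[best] = shifted(stats[best], obj, +1)
                labels[idx] = best
                relocated = True
    return labels


# Six 1-D objects: three around 0 and three around 10, each with variance 0.01.
data = [([x], [x * x + 0.01]) for x in (0.0, 0.1, 0.2, 10.0, 10.1, 10.2)]
print(mmvar(data, 2))
```

Each pass tries every object against every other cluster and each trial costs $O(m)$, which matches the $O(I \, k \, |\mathcal{D}| \, m)$ bound stated on the slide.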
