
Redescription Mining. Pauli Miettinen, 17 November 2010



  1. Redescription Mining Pauli Miettinen 17 November 2010

  2. An Example: VLDB ∧ ICDM ∧ SDM ∧ SIGMOD ⇔ (J. Han ∧ P. S. Yu) ∨ C.-R. Lin ∨ S. Lonardi

  3. VLDB ∧ ICDM ∧ SDM ∧ SIGMOD ⇔ (J. Han ∧ P. S. Yu) ∨ C.-R. Lin ∨ S. Lonardi • Labels: left-hand query, Conferences; right-hand query, Co-Authors; rows, Authors

  4. VLDB ∧ ICDM ∧ SDM ∧ SIGMOD ⇔ (J. Han ∧ P. S. Yu) ∨ C.-R. Lin ∨ S. Lonardi • Labels: left-hand query, Conferences; right-hand query, Co-Authors • Authors: Dimitrios Gunopulos, Charu C. Aggarwal, Philip S. Yu, Eamonn J. Keogh, ...

  5. Definitions

  6. The Definitions • Redescription: given two data sets with a bijection between their rows, a redescription is a pair of queries (Q_1, Q_2) over the columns such that (Q_1, Q_2) satisfies certain constraints and supp(Q_1) ≈ supp(Q_2) • Redescription mining: given the data sets as above, find the (k) best redescriptions

  7. More Concrete Definition • Data sets: Boolean • Queries: arbitrary Boolean formulae • Similarity function: Jaccard, |supp(Q_1) ∩ supp(Q_2)| / |supp(Q_1) ∪ supp(Q_2)| • Constraints: minimum support σ_min, maximum support σ_max, minimum similarity J_min, maximum formula size k, maximum p-value p_max (more on this later)
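
The Jaccard similarity and the support constraints can be illustrated with a minimal Python sketch. The toy data, the column names, and the helper names `support`, `jaccard`, and `satisfies_constraints` are all made up for illustration, and applying the support bounds to both queries is an assumption, since the slide does not say which support they constrain.

```python
from typing import Callable, Dict, List, Set

Row = Dict[str, bool]


def support(query: Callable[[Row], bool], rows: List[Row]) -> Set[int]:
    """Return the indices of the rows on which the query evaluates to True."""
    return {i for i, row in enumerate(rows) if query(row)}


def jaccard(s1: Set[int], s2: Set[int]) -> float:
    """|s1 ∩ s2| / |s1 ∪ s2|; taken to be 0.0 when both supports are empty."""
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0


def satisfies_constraints(s1: Set[int], s2: Set[int],
                          sigma_min: int, sigma_max: int, j_min: float) -> bool:
    """Assumption: the support bounds apply to each query's support separately."""
    sizes_ok = all(sigma_min <= len(s) <= sigma_max for s in (s1, s2))
    return sizes_ok and jaccard(s1, s2) >= j_min


# Toy data (made up): row i of `left` and row i of `right` describe the same entity.
left = [
    {"a": True, "b": True},
    {"a": True, "b": False},
    {"a": False, "b": True},
    {"a": False, "b": False},
]
right = [
    {"x": True, "y": False},
    {"x": False, "y": False},
    {"x": False, "y": True},
    {"x": False, "y": False},
]

q1 = lambda r: r["a"] and r["b"]  # query over the left columns
q2 = lambda r: r["x"] or r["y"]   # query over the right columns

s1, s2 = support(q1, left), support(q2, right)
similarity = jaccard(s1, s2)
print(sorted(s1), sorted(s2), similarity)
```

Here the supports are row-index sets, so the bijection between the rows of the two data sets is simply the shared index.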

  8. Special Cases • Only conjunctive queries: "bi-directional" association rule mining, Q_1 ⇒ Q_2 and Q_2 ⇒ Q_1 • One query given: a classification task

  9. p-values • On-line: assuming independence, what is the probability of the observed size of the support intersection, given the support sizes? Uses the binomial distribution • Off-line: what is the (empirical) probability of finding equally good redescriptions in data with the same column and row margins? Uses swap randomization
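
One way to read the on-line p-value is sketched below in Python. The exact model is an assumption, not something the slide spells out: each row is taken to fall in both supports independently with probability (|supp(Q_1)|/n)·(|supp(Q_2)|/n), and the p-value is the binomial upper tail at the observed intersection size. The function names are hypothetical.

```python
from math import comb


def binomial_tail(n: int, p: float, k: int) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def online_p_value(n_rows: int, supp1: int, supp2: int, overlap: int) -> float:
    """Probability of an intersection at least as large as observed,
    assuming the two supports are independent (assumed model, see lead-in)."""
    p_both = (supp1 / n_rows) * (supp2 / n_rows)
    return binomial_tail(n_rows, p_both, overlap)


# Made-up numbers: 10 rows, both supports of size 5.
print(online_p_value(10, 5, 5, 5))  # large overlap, smaller p-value
print(online_p_value(10, 5, 5, 2))  # small overlap, larger p-value
```

A redescription would then be kept only if this value is at most p_max.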

  10. Algorithms

  11. Some Algorithms • CARTwheels [Ramakrishnan et al. 2004, Kumar 2007] • Greedy [Gallo, M. & Mannila 2008] • Both are for Boolean data • Neither finds arbitrary Boolean formulae

  12. Generalizations

  13. Non-Boolean Data • Queries of type x_1 ∈ [−0.2, 1.3] ∨ x_2 ∈ (−∞, 20] • "Standard" way: binarize the data via bucketing; this allows using existing algorithms but has many problems • Bucketing can be done on the fly [Galbrun & M., submitted]
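
The "standard" bucketing route can be sketched in a few lines of Python. Equal-width buckets are only one possible choice and are an assumption here, as are the function names and the made-up temperature values; the on-the-fly method cited on the slide is not reproduced.

```python
from bisect import bisect_right
from typing import List


def equal_width_edges(values: List[float], n_buckets: int) -> List[float]:
    """Bucket boundaries that split [min, max] into n_buckets equal parts."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets
    return [lo + i * width for i in range(n_buckets + 1)]


def binarize(values: List[float], edges: List[float]) -> List[List[bool]]:
    """One Boolean column per bucket; buckets are [e_i, e_{i+1}), the last one closed."""
    n_buckets = len(edges) - 1
    columns = [[False] * len(values) for _ in range(n_buckets)]
    for row, v in enumerate(values):
        bucket = min(max(bisect_right(edges, v) - 1, 0), n_buckets - 1)
        columns[bucket][row] = True
    return columns


def in_interval(v: float, lo: float, hi: float) -> bool:
    """Direct interval query on the numerical value, with no binarization."""
    return lo <= v <= hi


temps = [0.0, 2.0, 4.0, 6.0, 8.0]  # made-up values of one numerical column
edges = equal_width_edges(temps, 4)
columns = binarize(temps, edges)
print(edges)
print(columns)
print([in_interval(v, 3.0, 8.0) for v in temps])
```

One of the problems shows up directly: after binarization, queries can only use intervals whose endpoints are bucket boundaries, so the interval [3.0, 8.0] in the last line cannot be written as a union of these four buckets.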

  14. Example: Bioclimatic Niche Finding • Data: (1) Presence/absence data for mammals in Europe; (2) climatic data (temperature and rainfall) • Question: Find a description over climatic variables that describes the area inhabited by (a group of) mammals (and vice versa)

  15. Bioclimatic Niche Finding: Background • A.k.a. bioclimatic envelope finding • Has been done for a long time by biologists • Only single, hand-selected species • Methods used include regression, neural networks, and genetic algorithms • Niche: realized niche in Grinnellian sense

  16. Niche Finding: Our Contributions • Automate niche finding • Easy-to-understand method (contra genetic algorithms and neural networks) • Allow for more complex sets of species • Can be generalized from species to traits • Traits are more stable on palaeontological scale

  17. Niche Finding: Example Results • European Elk ⇔ ([−9.80 ≤ t_max(Feb) ≤ 0.40] ∧ [12.20 ≤ t_max(Jul) ≤ 24.60] ∧ [56.852 ≤ p_avg(Aug) ≤ 136.46]) ∨ [183.27 ≤ p_avg(Sep) ≤ 238.78]; Jaccard = 0.814, support = 582 • Wood Mouse ∧ Natterer’s Bat ∧ Eurasian Pygmy Shrew ⇔ ([3.20 ≤ t_max(Mar) ≤ 14.50] ∧ [17.30 ≤ t_max(Aug) ≤ 25.20] ∧ [14.90 ≤ t_max(Sep) ≤ 22.80]) ∨ [19.60 ≤ t_avg(Jul) ≤ 19.956]; Jaccard = 0.623, support = 681

  18. European Elk ⇔ ([−9.80 ≤ t_max(Feb) ≤ 0.40] ∧ [12.20 ≤ t_max(Jul) ≤ 24.60] ∧ [56.852 ≤ p_avg(Aug) ≤ 136.46]) ∨ [183.27 ≤ p_avg(Sep) ≤ 238.78]

  20. Wood Mouse ∧ Natterer’s Bat ∧ Eurasian Pygmy Shrew ⇔ ([3.20 ≤ t_max(Mar) ≤ 14.50] ∧ [17.30 ≤ t_max(Aug) ≤ 25.20] ∧ [14.90 ≤ t_max(Sep) ≤ 22.80]) ∨ [19.60 ≤ t_avg(Jul) ≤ 19.956]

  21. Discussion

  22. Pattern Mining or Subgroup Discovery?
      PM               SD
      Binary data      Numerical data
      Unsupervised     Supervised
      Frequency        Interestingness
      Exhaustive       Heuristic
      Reconstructive   Descriptive

  24. Conclusions • Redescription mining is a promising research direction • (SD ∩ PM) ∩ RDM ≠ ∅ • Still a new direction • There are nails for this hammer

  25. CARTwheels • Grows two classification and regression trees (CARTs) • Fixes one tree and grows the other to match it, then alternates • Leaves are matched, and the paths to them are the descriptions: (ICDM) ∨ (¬ICDM ∧ ¬STOC) ⇔ (C. Olston ∧ ¬C. Chekuri) ∨ (¬C. Olston ∧ ¬A. Wigderson)

  26. (ICDM) ∨ (¬ICDM ∧ ¬STOC) ⇔ (C. Olston ∧ ¬C. Chekuri) ∨ (¬C. Olston ∧ ¬A. Wigderson) [Figure: the two matched trees with Yes/No branches. Left tree nodes: ICDM, STOC. Right tree nodes: C. Olston, C. Chekuri, A. Wigderson.]
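
How tree paths become descriptions can be sketched as follows. This covers only the path-reading step, not the alternating tree growing of CARTwheels, and the nested-tuple tree encoding and function names are assumptions made for illustration. The two trees are encoded so that their matched leaves reproduce the formula on the slide.

```python
from typing import List, Tuple, Union

# A tree is either a leaf (True = matched leaf) or (variable, yes_subtree, no_subtree).
Tree = Union[bool, Tuple[str, "Tree", "Tree"]]
Literal = Tuple[str, bool]  # (variable, polarity)


def paths_to_dnf(tree: Tree, prefix: Tuple[Literal, ...] = ()) -> List[Tuple[Literal, ...]]:
    """Collect one conjunction per root-to-leaf path that ends in a True leaf."""
    if isinstance(tree, bool):
        return [prefix] if tree else []
    var, yes_branch, no_branch = tree
    return (paths_to_dnf(yes_branch, prefix + ((var, True),))
            + paths_to_dnf(no_branch, prefix + ((var, False),)))


def format_dnf(dnf: List[Tuple[Literal, ...]]) -> str:
    """Render the disjunction of conjunctions with the slide's symbols."""
    def lit(literal: Literal) -> str:
        var, positive = literal
        return var if positive else "¬" + var
    return " ∨ ".join("(" + " ∧ ".join(lit(l) for l in conj) + ")" for conj in dnf)


left_tree: Tree = ("ICDM", True, ("STOC", False, True))
right_tree: Tree = ("C. Olston",
                    ("C. Chekuri", False, True),
                    ("A. Wigderson", False, True))

left_query = format_dnf(paths_to_dnf(left_tree))
right_query = format_dnf(paths_to_dnf(right_tree))
print(left_query, "⇔", right_query)
```

Each conjunction is one path from the root to a matched leaf, which is why the resulting queries are disjunctions of conjunctions rather than arbitrary Boolean formulae.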

  27. Greedy • Grows formulae greedily using beam search • Prunes the search space as if monotonicity held: if adding a variable does not help now, it is assumed not to help later either • This assumption is false in general
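
The greedy idea can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the published Greedy algorithm: formulae are grown one variable at a time with ∧ or ∨, the beam keeps the best pairs by Jaccard similarity, and a variable that fails to improve a candidate is dropped from that candidate's pool for good. All names and the toy data are made up.

```python
from itertools import product
from typing import Dict, List, Set, Tuple

Columns = Dict[str, Set[int]]  # column name -> set of rows where the column is True


def jaccard(a: Set[int], b: Set[int]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def improving_extensions(expr, supp, pool, columns, other_supp, score):
    """Extensions of one query by one pooled variable that raise the similarity."""
    found = []
    for var in sorted(pool):
        for op, new_supp in (("∧", supp & columns[var]), ("∨", supp | columns[var])):
            new_score = jaccard(new_supp, other_supp)
            if new_score > score:
                found.append((new_score, f"({expr} {op} {var})", new_supp, var))
    return found


def top_k(cands: dict, k: int) -> list:
    return sorted(cands.values(), key=lambda c: c[0], reverse=True)[:k]


def greedy_beam(left: Columns, right: Columns, beam: int = 2,
                rounds: int = 2) -> List[Tuple[float, str, str]]:
    # Start from all pairs of single columns.
    cands = {(a, b): (jaccard(left[a], right[b]), a, left[a], b, right[b],
                      set(left) - {a}, set(right) - {b})
             for a, b in product(left, right)}
    kept = top_k(cands, beam)
    for _ in range(rounds):
        cands = {(c[1], c[3]): c for c in kept}
        for score, lexp, lsupp, rexp, rsupp, lpool, rpool in kept:
            ext = improving_extensions(lexp, lsupp, lpool, left, rsupp, score)
            helpful = {var for _, _, _, var in ext}  # the rest is pruned for good
            for s, e, supp, var in ext:
                cands.setdefault((e, rexp), (s, e, supp, rexp, rsupp, helpful - {var}, rpool))
            ext = improving_extensions(rexp, rsupp, rpool, right, lsupp, score)
            helpful = {var for _, _, _, var in ext}
            for s, e, supp, var in ext:
                cands.setdefault((lexp, e), (s, lexp, lsupp, e, supp, lpool, helpful - {var}))
        kept = top_k(cands, beam)
    return [(score, lexp, rexp) for score, lexp, _, rexp, _, _, _ in kept]


# Toy data (made up) over rows 0..5.
left_cols = {"a": {0, 1, 2, 3}, "b": {2, 3, 4}, "c": {0, 5}}
right_cols = {"x": {2, 3}, "y": {1, 2, 3, 5}, "z": {0, 4}}
best = greedy_beam(left_cols, right_cols)
print(best[0])
```

The permanent pruning of unhelpful variables is what makes the search fast, and it is also where good redescriptions can be lost, because a variable that does not help at one step may help after the formula has changed.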
