Redescription Mining Pauli Miettinen 17 November 2010
An Example
VLDB ∧ ICDM ∧ SDM ∧ SIGMOD ⇔ (J. Han ∧ P. S. Yu) ∨ C.-R. Lin ∨ S. Lonardi
• Left-hand query: Conferences
• Right-hand query: Co-Authors
• Authors: Dimitrios Gunopulos, Charu C. Aggarwal, Philip S. Yu, Eamonn J. Keogh, ...
Definitions
The Definitions
Redescription. Given two data sets with a bijection between their rows, a redescription is a pair of queries (Q1, Q2) over the columns such that (Q1, Q2) satisfies certain constraints and supp(Q1) ≈ supp(Q2).
Redescription mining. Given the data sets as above, find the (k) best redescriptions.
More Concrete Definition
• Data sets: Boolean
• Queries: Arbitrary Boolean formulae
• Similarity function: Jaccard, |supp(Q1) ∩ supp(Q2)| / |supp(Q1) ∪ supp(Q2)|
• Constraints: minimum support σ_min, maximum support σ_max, minimum similarity J_min, maximum size of formula k, maximum p-value p_max (more on this later)
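The Jaccard similarity between two supports can be illustrated with a minimal sketch. The toy rows, the column names, and the helper names `support` and `jaccard` are assumptions made for illustration only; they are not taken from the talk.

```python
def support(query, rows):
    """Return the set of row indices on which the Boolean query holds."""
    return {i for i, row in enumerate(rows) if query(row)}


def jaccard(supp1, supp2):
    """Jaccard similarity |A ∩ B| / |A ∪ B|; defined here as 0.0 when both are empty."""
    union = supp1 | supp2
    if not union:
        return 0.0
    return len(supp1 & supp2) / len(union)


# Two toy Boolean data sets; row i of `left` corresponds to row i of `right`
# (the bijection between rows). All values are made up.
left = [
    {"VLDB": True, "ICDM": True},
    {"VLDB": True, "ICDM": False},
    {"VLDB": False, "ICDM": True},
    {"VLDB": True, "ICDM": True},
]
right = [
    {"J. Han": True, "C.-R. Lin": False},
    {"J. Han": False, "C.-R. Lin": False},
    {"J. Han": False, "C.-R. Lin": True},
    {"J. Han": True, "C.-R. Lin": False},
]

q1 = lambda r: r["VLDB"] and r["ICDM"]       # conjunctive query on the left data
q2 = lambda r: r["J. Han"] or r["C.-R. Lin"]  # disjunctive query on the right data

s1 = support(q1, left)
s2 = support(q2, right)
similarity = jaccard(s1, s2)
```

A pair (q1, q2) would count as a redescription only if `similarity` reaches J_min and the remaining constraints (support bounds, formula size, p-value) also hold.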
Special Cases
• Only conjunctive queries
    • "bi-directional" association rule mining: Q1 ⇒ Q2 and Q2 ⇒ Q1
• One query given
    • classification task
p-values
• On-line: assuming independence, what is the probability of the observed size of the support intersection, given the support sizes?
    • Binomial distribution
• Off-line: what is the (empirical) probability of finding redescriptions that are as good, given the column and row margins?
    • swap randomization
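One possible reading of the on-line p-value is sketched below. It assumes that each row falls into both supports independently with probability (|supp(Q1)|/n) · (|supp(Q2)|/n), so the intersection size is binomially distributed, and it reports the upper tail (an overlap at least as large as the observed one). The upper-tail choice and the function name `binomial_p_value` are assumptions, not details stated on the slide.

```python
from math import comb


def binomial_p_value(n_rows, supp1_size, supp2_size, overlap):
    """P(X >= overlap) for X ~ Binomial(n_rows, p), with p the probability
    that a row lies in both supports if the two queries were independent."""
    p = (supp1_size / n_rows) * (supp2_size / n_rows)
    return sum(
        comb(n_rows, k) * p**k * (1 - p) ** (n_rows - k)
        for k in range(overlap, n_rows + 1)
    )


# Example: 100 rows, both supports of size 20, observed overlap of 15 rows.
p_val = binomial_p_value(100, 20, 20, 15)
```

A small value suggests that the overlap is unlikely to arise by chance from the support sizes alone; it would then be compared against p_max.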
Algorithms
Some Algorithms
• CARTwheels [Ramakrishnan et al. 2004, Kumar 2007]
• Greedy [Gallo, M. & Mannila 2008]
• Both are for Boolean data
• Neither finds arbitrary Boolean formulae
Generalizations
Non-Boolean Data
• Queries of type x1 ∈ [−0.2, 1.3] ∨ x2 ∈ (−∞, 20]
• "Standard" way: binarize data via bucketing
    • Allows using existing algorithms
    • Has many problems
• Bucketing can be done on the fly [Galbrun & M., submitted]
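The two ideas on this slide can be sketched as follows: evaluating an interval query directly on numerical rows, and the "standard" up-front binarization. Equal-width buckets are an assumption chosen for simplicity (the slide does not say which bucketing is meant), and the data values and function names are hypothetical.

```python
NEG_INF = float("-inf")


def interval_query(row):
    """The slide's example query: x1 in [-0.2, 1.3] or x2 in (-inf, 20]."""
    x1, x2 = row
    return (-0.2 <= x1 <= 1.3) or (NEG_INF < x2 <= 20)


def binarize_equal_width(values, n_buckets):
    """Turn one numerical column into n_buckets Boolean columns (one-hot),
    using equal-width buckets between the column's minimum and maximum."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets or 1.0  # avoid zero width for constant columns
    result = []
    for v in values:
        idx = min(int((v - lo) / width), n_buckets - 1)  # clamp the maximum value
        result.append([b == idx for b in range(n_buckets)])
    return result


rows = [(0.5, 30.0), (2.0, 10.0), (2.0, 25.0)]  # made-up (x1, x2) pairs
answers = [interval_query(r) for r in rows]

buckets = binarize_equal_width([0.0, 1.0, 2.0, 3.0, 4.0], 2)
```

Fixed buckets can only express intervals whose endpoints coincide with bucket boundaries, which is one of the problems that choosing interval bounds during the search avoids.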
Example: Bioclimatic Niche Finding
• Data: (1) presence/absence data for mammals in Europe; (2) climatic data (temperature and rainfall)
• Question: find a description over climatic variables that describes the area inhabited by (a group of) mammals (and vice versa)
Bioclimatic Niche Finding: Background
• A.k.a. bioclimatic envelope finding
• Has been done for a long time by biologists
    • Only single, hand-selected species
    • Methods used include regression, neural networks, and genetic algorithms
• Niche: realized niche in the Grinnellian sense
Niche Finding: Our Contributions
• Automate niche finding
• Easy-to-understand method (in contrast to genetic algorithms and neural networks)
• Allow for more complex sets of species
• Can be generalized from species to traits
    • Traits are more stable on the palaeontological scale
Niche Finding: Example Results
European Elk ⇔ ([−9.80 ≤ t_max(Feb) ≤ 0.40] ∧ [12.20 ≤ t_max(Jul) ≤ 24.60] ∧ [56.852 ≤ p_avg(Aug) ≤ 136.46]) ∨ [183.27 ≤ p_avg(Sep) ≤ 238.78]
Jaccard = 0.814; support = 582
Wood Mouse ∧ Natterer's Bat ∧ Eurasian Pygmy Shrew ⇔ ([3.20 ≤ t_max(Mar) ≤ 14.50] ∧ [17.30 ≤ t_max(Aug) ≤ 25.20] ∧ [14.90 ≤ t_max(Sep) ≤ 22.80]) ∨ [19.60 ≤ t_avg(Jul) ≤ 19.956]
Jaccard = 0.623; support = 681
Discussion
Pattern Mining or Subgroup Discovery?
PM                SD
Binary data       Numerical data
Unsupervised      Supervised
Frequency         Interestingness
Exhaustive        Heuristic
Reconstructive    Descriptive
Conclusions
• Redescription mining is a promising research direction
    • (SD ∩ PM) ∩ RDM ≠ ∅
• Still a new direction
• There are nails for this hammer
CARTwheels
• Grows two classification and regression trees (CARTs)
• Fix one tree and grow the other to match; alternate
• Leaves are matched and paths are the descriptions:
(ICDM) ∨ (¬ICDM ∧ ¬STOC) ⇔ (C. Olston ∧ ¬C. Chekuri) ∨ (¬C. Olston ∧ ¬A. Wigderson)
[Figure: two matched decision trees; one splits on ICDM and STOC, the other on C. Olston, C. Chekuri, and A. Wigderson]
Greedy
• Grows formulae in a greedy fashion using beam search
• Prunes the search space as if monotonicity held
    • If adding a variable does not help now, it will not help later, either
    • False in general
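The general idea of greedy growth with a beam and the monotonicity-style pruning can be illustrated with a simplified sketch. This is not the Greedy algorithm of the cited paper: it assumes that one side's support is already fixed (the "one query given" special case), grows only conjunctions, and uses hypothetical names and toy data throughout.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets of row ids (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def beam_search_conjunction(columns, target, beam_width=2, max_size=3):
    """Grow a conjunction of columns whose support best matches `target`.

    columns: dict mapping column name -> set of row ids where it is true.
    Returns (tuple of column names, Jaccard score).
    """
    if not columns:
        raise ValueError("need at least one column")
    order = lambda cand: (-cand[0], cand[1])  # best score first, ties by name
    singles = [(jaccard(rows, target), (name,), rows) for name, rows in columns.items()]
    beam = sorted(singles, key=order)[:beam_width]
    best = beam[0]
    for _ in range(max_size - 1):
        candidates = {}
        for score, names, rows in beam:
            for name, col_rows in columns.items():
                if name in names:
                    continue
                new_names = tuple(sorted(names + (name,)))
                new_rows = rows & col_rows
                new_score = jaccard(new_rows, target)
                # Heuristic pruning: drop extensions that do not help right now,
                # as if they could never help later (false in general).
                if new_score > score:
                    candidates[new_names] = (new_score, new_names, new_rows)
        if not candidates:
            break
        beam = sorted(candidates.values(), key=order)[:beam_width]
        if beam[0][0] > best[0]:
            best = beam[0]
    return best[1], best[0]


# Toy data: columns a and b together describe the target rows exactly.
toy_columns = {"a": {0, 1, 2, 3}, "b": {1, 2, 3, 4}, "c": {0, 5}}
query, score = beam_search_conjunction(toy_columns, target={1, 2, 3})
```

Because the pruning treats a currently unhelpful variable as permanently unhelpful, a search like this can miss queries in which a variable only pays off in combination with others.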