Pseudodimension for Data Analytics Mateo Riondato Amherst College ICERM — May 17, 2019
Takeaway message
High-quality approximations of data mining tasks can be obtained very quickly from small random samples of the dataset. Pseudodimension, a concept from statistical learning theory, can be used to analyze the trade-off between sample size and approximation quality. Although originally developed for supervised learning, we use it to analyze algorithms for unsupervised, combinatorial problems on graphs, transactional datasets, databases, ...
Outline
1 Random sampling for data analytics
2 Sample size vs. error trade-off: how Statistical Learning Theory saves the day
3 MiSoSouP: approximating interesting subgroups with pseudodimension
4 What else to do with pseudodimension
Approximations from a random sample
Dataset D, |D| = n; random sample S, |S| = ℓ ≪ n.
Data mining task: for each color c ∈ C, compute the fraction r_c(D) of c-colored jelly beans in D:
r_c(D) = (1/n) Σ_{j ∈ D} f_c(j), where f_c(j) = 1 if j is c-colored, 0 otherwise. Too expensive.
Sampling-based data analytics task: estimate r_c(D) with r_c(S) = (1/ℓ) Σ_{j ∈ S} f_c(j), for each c ∈ C. Fast. Acceptable if max_{c ∈ C} |r_c(S) − r_c(D)| is small.
Key challenge: tell how small this error is.
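A minimal sketch of the two tasks on the jelly-bean example, assuming the dataset is just a Python list of color labels and sampling is uniform with replacement (the representation and names are illustrative, not from the talk):

```python
import random
from collections import Counter

def color_fractions(beans):
    """Exact fractions r_c(D) for every color c: one full pass over the data."""
    counts = Counter(beans)
    return {c: counts[c] / len(beans) for c in counts}

def sampled_fractions(beans, ell, seed=0):
    """Estimates r_c(S) from a uniform sample of size ell, drawn with replacement."""
    rng = random.Random(seed)
    sample = [rng.choice(beans) for _ in range(ell)]
    counts = Counter(sample)
    return {c: counts[c] / ell for c in counts}

# Toy dataset: 100k beans, three colors with skewed frequencies.
dataset = ["red"] * 60_000 + ["blue"] * 30_000 + ["green"] * 10_000
exact = color_fractions(dataset)
approx = sampled_fractions(dataset, ell=1_000)
max_err = max(abs(approx.get(c, 0.0) - exact[c]) for c in exact)
print(exact, approx, max_err)
```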
Error bounds
max_{c ∈ C} |r_c(S) − r_c(D)| is not computable from S. Let's get an upper bound ε to it.
Probabilistic upper bound to the max. error: fix a failure probability δ ∈ (0, 1). A value ε ∈ (0, 1) is a probabilistic upper bound to max_{c ∈ C} |r_c(S) − r_c(D)| if
Pr( max_{c ∈ C} |r_c(S) − r_c(D)| < ε ) ≥ 1 − δ.
The probability is over the samples of size ℓ.
Ingredients to compute ε: δ, C, D or S, and |S| = ℓ: ε = g(δ, C, D or S, ℓ).
How do we find such a function g?
Outline
✔ 1 Random sampling for data analytics
2 Sample size vs. error trade-off: how Statistical Learning Theory saves the day
3 MiSoSouP: approximating interesting subgroups with pseudodimension
4 What else to do with pseudodimension
A classic probabilistic upper bound to the error
Theorem (Chernoff bound + union bound). Let
ε = √( 3 (ln|C| + ln(2/δ)) / ℓ ).
Then
Pr( max_{c ∈ C} |r_c(S) − r_c(D)| < ε ) ≥ 1 − δ.
Not a function of D or S! We want ε = g(δ, C, D or S, ℓ); D or S give information on the complexity of approximation through sampling; "ln|C|" is a rough measure of the sample complexity of the task, as it ignores the data.
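Plugging numbers into the bound (as reconstructed above; the exact constants on the original slide may differ) shows the 1/√ℓ decay and the mild logarithmic dependence on |C|:

```python
import math

def chernoff_union_eps(num_colors, delta, ell):
    """Probabilistic upper bound on the max estimation error over |C| colors:
    a Chernoff bound per color, plus a union bound over C."""
    return math.sqrt(3 * (math.log(num_colors) + math.log(2 / delta)) / ell)

# The bound shrinks as 1/sqrt(ell) and grows only logarithmically in |C|.
for ell in (1_000, 10_000, 100_000):
    print(ell, chernoff_union_eps(num_colors=50, delta=0.1, ell=ell))
```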
Are there better measures of sample complexity?
Measures from Statistical Learning Theory replace "ln|C|" with h(C, D) or h(C, S): VC-dimension, pseudodimension, covering numbers, Rademacher averages, ...
They were developed for supervised learning and had a reputation of being only of theoretical interest; we showed they can be used for efficient practical algorithms for data mining problems.
Example: betweenness centrality estimation on a graph G = (V, E):
Union bound: ε = O( √( (ln|V| + ln(1/δ)) / ℓ ) );
VC-dimension [ACM WSDM'14, DMKD'16]: ε = O( √( (ln diam(G) + ln(1/δ)) / ℓ ) ).
Exponential reduction on important classes of graphs (small-world, power-law, ...); a toy comparison follows below.
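To see the gap, a toy comparison that simply drops the constants hidden in the O(·) (so the numbers are only indicative, not the published guarantees):

```python
import math

def eps_union(n_vertices, delta, ell):
    """Union-bound error, up to hidden constants: depends on ln|V|."""
    return math.sqrt((math.log(n_vertices) + math.log(1 / delta)) / ell)

def eps_vc(diameter, delta, ell):
    """VC-dimension error, up to hidden constants: depends on ln diam(G)."""
    return math.sqrt((math.log(diameter) + math.log(1 / delta)) / ell)

# A small-world graph: a million vertices, but diameter around 20.
print(eps_union(1_000_000, delta=0.1, ell=10_000))  # ~0.040
print(eps_vc(20, delta=0.1, ell=10_000))            # ~0.023
```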
VC-dimension
Vapnik-Chervonenkis dimension VC(F) of a family F of subsets of X (or of 0–1 functions from X).
Combinatorial measure of the richness of F. Originally developed to study generalization error bounds for classification [VC71]. Also picked up by the computational geometry community.
A set X = {x_1, ..., x_ℓ} ⊆ X is shattered by F if {X ∩ A : A ∈ F} = 2^X.
VC-dimension of F: size of the largest set that can be shattered by F.
VC-dimension example
X = ℝ², F = axis-aligned rectangles. Shattering a set of four points is easy, but shattering five points is impossible.
[figure: four points shattered by axis-aligned rectangles vs. five points that cannot be]
Proving an upper bound of k on the VC-dimension requires showing that no set of size k + 1 can be shattered.
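For a finite point set, shattering by axis-aligned rectangles can be checked by brute force, using the fact that a subset A is cut out by some rectangle iff the bounding box of A contains no point outside A. A small sketch (names are illustrative):

```python
from itertools import combinations

def rect_shatters(points):
    """Check whether axis-aligned rectangles shatter `points`: every subset A
    must be realizable, i.e. the bounding box of A contains no point outside A."""
    pts = list(points)
    for r in range(1, len(pts) + 1):  # the empty subset is always realizable
        for A in combinations(pts, r):
            xs = [p[0] for p in A]
            ys = [p[1] for p in A]
            others = [p for p in pts if p not in A]
            if any(min(xs) <= p[0] <= max(xs) and min(ys) <= p[1] <= max(ys)
                   for p in others):
                return False
    return True

# Four points in a "diamond" are shattered; adding the center breaks it.
print(rect_shatters([(0, 1), (1, 0), (0, -1), (-1, 0)]))          # True
print(rect_shatters([(0, 1), (1, 0), (0, -1), (-1, 0), (0, 0)]))  # False
```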
Pseudodimension
Pseudodimension PD(F) of a family F of real-valued functions from a domain X to [a, b].
Combinatorial measure of the richness of F. Originally developed to study generalization error bounds for regression [Pollard84].
Intuition: if the graphs of the f's in F cross many times, the pseudodimension is high.
Pseudodimension
A set X = {x_1, ..., x_ℓ} ⊆ X is (pseudo-)shattered by F if there exist t_1, ..., t_ℓ ∈ ℝ s.t.
|{ (sgn(f(x_1) − t_1), ..., sgn(f(x_ℓ) − t_ℓ)) : f ∈ F }| = 2^ℓ,
i.e., the vectors in {−1, 1}^ℓ obtained in this way take all 2^ℓ possible values.
PD(F): size of the largest pseudo-shattered set.
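For a finite family F on a small point set, pseudo-shattering can also be tested by brute force: it suffices to try thresholds at midpoints between consecutive function values at each point (for a finite family this loses no generality, and strict inequalities avoid ties). A sketch under those assumptions:

```python
from itertools import product

def pseudo_shatters(functions, points):
    """Brute-force pseudo-shattering test for a finite family of functions on a
    small point set: search over candidate thresholds at each point."""
    def candidates(x):
        vals = sorted({f(x) for f in functions})
        return [(a + b) / 2 for a, b in zip(vals, vals[1:])] or [vals[0]]

    ell = len(points)
    for ts in product(*(candidates(x) for x in points)):
        patterns = {tuple(f(x) > t for x, t in zip(points, ts)) for f in functions}
        if len(patterns) == 2 ** ell:
            return True
    return False

# Lines f_a(x) = a*x pseudo-shatter one point, but not two points x1, x2 > 0:
# the graphs never cross between them, so one sign pattern is unreachable.
lines = [lambda x, a=a: a * x for a in (-2.0, -1.0, 1.0, 2.0)]
print(pseudo_shatters(lines, [1.0]))       # True
print(pseudo_shatters(lines, [1.0, 2.0]))  # False
```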
Pseudodimension as VC-dimension
For each f ∈ F, let R_f = {(x, t) : t ≤ f(x)} ⊆ X × [a, b] (the region below the graph of f).
Define the family of sets F⁺ = {R_f : f ∈ F}. Then PD(F) = VC(F⁺).
Proving upper bounds to the pseudodimension
The game is always about restricting the class of sets that may be shattered. Two useful general restrictions [R.-Upfal '18 (someone must have known these before)]: if B ⊆ X × [a, b] is shattered by F⁺, then
1) B may contain at most one element (x, t) for each x ∈ X;
2) B cannot contain any element (x, a), for any x ∈ X.
Pseudodimension and sampling
Theorem [Li et al. '01]. Let PD(F) ≤ d and
ε = O( √( (d + ln(1/δ)) / ℓ ) ).
Then
Pr( max_{f ∈ F} | (1/ℓ) Σ_{x ∈ S} f(x) − E[(1/ℓ) Σ_{x ∈ S} f(x)] | < ε ) ≥ 1 − δ.
If F is finite and d ≪ ln|F|, this ε is much smaller than the one derived with Hoeffding + union bound. The theorem holds even if F is infinite.
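A small calculator for this bound and its inverse, with the universal constant that the slide hides inside the O(·) exposed as a parameter c whose default value here is only a placeholder:

```python
import math

def eps_from_pd(d, delta, ell, c=0.5):
    """Error guarantee from a Li-et-al.-style bound; c stands in for the
    hidden universal constant (its true value is not on the slide)."""
    return math.sqrt(c * (d + math.log(1 / delta)) / ell)

def sample_size(d, delta, eps, c=0.5):
    """Invert the bound: how many samples suffice for error at most eps.
    Note the answer is independent of the dataset size |D|."""
    return math.ceil(c * (d + math.log(1 / delta)) / eps ** 2)

print(sample_size(d=5, delta=0.05, eps=0.01))
```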
Outline
✔ 1 Random sampling for data analytics
✔ 2 Sample size vs. error trade-off: how Statistical Learning Theory saves the day
3 MiSoSouP: approximating interesting subgroups with pseudodimension
4 What else to do with pseudodimension
Making miso soup
Ingredients (for 4 people):
• 2 teaspoons dashi granules
• 4 cups water
• 3 tablespoons miso paste
• 1 (8 ounce) package silken tofu, diced
• 2 green onions, sliced diagonally into 1/2 inch pieces
"Miso makes a soup loaded with flavour that saves you the hassle of making stock." — Y. Ottolenghi (world-class chef)
(Really: [R.-Vandin '18]: Mining Interesting Subgroups with Sampling and Pseudodimension)
Section outline
1 Settings: datasets, subgroups, interestingness measures
2 Approximating subgroups: a sufficient condition
3 The pseudodimension of subgroups
Subgroups
D = {t_1, ..., t_n}; each transaction t = (t.A_1, t.A_2, ..., t.A_z, t.T) ∈ Y_1 × ... × Y_z × {0, 1}: z description attributes and a binary target.
E.g., t_3 = (blue, 4, false, 1), t_4 = (red, 3, true, 1).
Subgroup: B = (cond_{1,1} ∨ ... ∨ cond_{1,r_1}) ∧ ... ∧ (cond_{q,1} ∨ ... ∨ cond_{q,r_q}).
E.g., B = (A_1 = blue ∨ A_1 = red) ∧ (A_2 = 4): t_3 supports B, t_4 does not.
Language L: candidate subgroups of potential interest to the analyst. E.g., L = "conjunctions of two equality conditions": B ∉ L, but ((A_1 = blue) ∧ (A_2 = 4)) ∈ L.
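One possible encoding of a subgroup as a conjunction of disjunctions of equality conditions, with a support check; the representation is an illustrative choice, not the paper's:

```python
# Each inner list is a disjunct of (attribute_index, value) pairs;
# the outer list is the conjunction over the disjuncts.
def supports(t, subgroup):
    """Does transaction t (a tuple of attribute values) support the subgroup?"""
    return all(any(t[i] == v for i, v in disjunct) for disjunct in subgroup)

t3 = ("blue", 4, False)
t4 = ("red", 3, True)
B = [[(0, "blue"), (0, "red")], [(1, 4)]]  # (A1=blue ∨ A1=red) ∧ (A2=4)
print(supports(t3, B), supports(t4, B))    # True False
```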
Mining Interesting Subgroups
Interesting subgroup: a subgroup associated with the target value (e.g., 1). Examples:
• social networks: attributes = user features, target = interest in a topic
• biomedicine: attributes = mutations, target = response to therapy
• classification: attributes = features, target = test label XOR prediction
Inherently interpretable!
Subgroup quality measures
p-quality of B in D: q_D^(p)(B) = g_D(B)^p × u_D(B).
Cover of B in D: C_D(B) = {t ∈ D : t supports B}.
Generality of a subgroup B in D: g_D(B) = |{t ∈ D : t supports B}| / |D|.
Unusualness of B in D: u_D(B) = (1/|C_D(B)|) Σ_{t ∈ C_D(B)} t.T − (1/|D|) Σ_{t ∈ D} t.T, i.e., the target mean of C_D(B) minus the target mean μ of D.
p weights generality vs. unusualness (usually p ∈ {1/2, 1, 2}); p = 1/2 ⇒ quality of B ∼ z-score of B.
Rest of the talk: p = 1 ⇒ quality q_D(B).
Example of subgroup quality measures
Dataset:
A1 A2 A3 T
1  1  0  1
3  1  1  0
1  1  0  1
2  0  1  1
Target mean of D: μ = (1/|D|) Σ_{t ∈ D} t.T = 3/4 = 0.75.
Subgroup B = "A1 ≥ 2 ∧ A3 = 1":
Generality: g_D(B) = |{t ∈ D : t supports B}| / |D| = 2/4 = 0.5.
Unusualness: u_D(B) = (1/|C_D(B)|) Σ_{t ∈ C_D(B)} t.T − μ = 1/2 − 0.75 = −0.25.
1-quality: q_D(B) = g_D(B) · u_D(B) = 0.5 × (−0.25) = −0.125.
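The same computation as a function, reproducing the worked example; the predicate-based interface is an illustrative sketch, not the paper's API:

```python
def quality(dataset, predicate, p=1):
    """p-quality q^(p)_D(B) = g_D(B)^p * u_D(B), for a subgroup given as a
    boolean predicate over transactions; each item is (t, target), target in {0,1}."""
    n = len(dataset)
    cover = [target for t, target in dataset if predicate(t)]
    if not cover:
        return 0.0  # empty cover: unusualness undefined, treat quality as 0
    g = len(cover) / n                             # generality g_D(B)
    mu = sum(target for _, target in dataset) / n  # target mean of D
    u = sum(cover) / len(cover) - mu               # unusualness u_D(B)
    return g ** p * u

# The dataset and subgroup from the example above.
D = [((1, 1, 0), 1), ((3, 1, 1), 0), ((1, 1, 0), 1), ((2, 0, 1), 1)]
B = lambda t: t[0] >= 2 and t[2] == 1
print(quality(D, B))  # -0.125
```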
The top-k subgroup mining task
Input: D, L, k ≥ 1.
r_D(k): the k-th highest quality in D of a subgroup from L.
Output: TOP(D, L, k) = {B ∈ L : q_D(B) ≥ r_D(k)}.
[figure: subgroup qualities on the axis from −1 to 1, with r_D(k) marked]
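Note that, because of ties at the k-th quality, the output can contain more than k subgroups. A direct implementation of the definition (for an already-computed quality table; the dict interface is illustrative):

```python
def top_k(qualities, k):
    """TOP(D, L, k): every subgroup whose quality is at least the k-th highest.
    `qualities` maps subgroup -> q_D(B); ties can enlarge the output beyond k."""
    ranked = sorted(qualities.values(), reverse=True)
    r_k = ranked[min(k, len(ranked)) - 1]
    return {B for B, q in qualities.items() if q >= r_k}

print(top_k({"B1": 0.4, "B2": 0.4, "B3": 0.1, "B4": -0.125}, k=2))
# {'B1', 'B2'} — a tie exactly at rank k would return more than k subgroups
```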