

  1. Influence measures for CART
  Jean-Michel Poggi (Orsay, Paris Sud & Paris Descartes)
  Joint work with Avner Bar-Hen and Servane Gey (MAP5, Paris Descartes)

  2. CART: Classification And Regression Trees, Breiman et al. (1984)
  ▶ Learning set $\mathcal{L}_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$: n i.i.d. observations of a random vector $(X, Y)$
  ▶ Vector $X = (X^1, \ldots, X^p)$ of explanatory variables, $X \in \mathbb{R}^p$, and $Y \in \mathcal{Y}$, where $\mathcal{Y}$ is either a class label or a numerical response
  ▶ For classification problems, a classifier t is a mapping $t : \mathbb{R}^p \to \mathcal{Y}$ and the Bayes classifier is the target to estimate
  ▶ For regression problems, we assume $Y = f(X) + \varepsilon$, and f is the regression function to estimate

  3. CART tree
  A CART tree viewed as a piecewise constant function (figure)

  4. Growing step, stopping rule:
  ▶ recursive partitioning by maximizing the local decrease of heterogeneity
  ▶ do not split a pure node or a node containing too few data
  Pruning step:
  ▶ the maximal tree overfits the data
  ▶ an optimal tree is a pruned subtree selected by penalizing the prediction error by the model complexity
  Penalized criterion (an evaluation sketch follows this slide):
  $\mathrm{crit}_\alpha(T) = R_n(f, \hat{f}_{|T}, \mathcal{L}_n) + \alpha \frac{|\tilde{T}|}{n}$
  with $R_n(f, \hat{f}_{|T}, \mathcal{L}_n)$ the error term (MSE for regression or misclassification rate) and $|\tilde{T}|$ the number of leaves of T
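
  The following minimal R sketch illustrates the penalized criterion above for a fitted rpart tree; the function name crit_alpha and the use of the iris data are illustrative assumptions, not part of the slides.

    # Evaluate crit_alpha(T) = R_n(T) + alpha * |leaves(T)| / n for an rpart tree
    library(rpart)

    crit_alpha <- function(tree, data, response, alpha) {
      pred <- predict(tree, newdata = data, type = "class")
      Rn <- mean(pred != data[[response]])          # error term: misclassification rate
      n_leaves <- sum(tree$frame$var == "<leaf>")   # |T~|: number of leaves
      Rn + alpha * n_leaves / nrow(data)
    }

    fit <- rpart(Species ~ ., data = iris, method = "class")
    crit_alpha(fit, iris, "Species", alpha = 1)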

  5. CART: Classification And Regression Trees, Breiman et al. (1984)
  ▶ nonparametric model + data partitioning
  ▶ numerical + categorical predictors
  ▶ easy-to-interpret models
  ▶ nonlinear modelling
  ▶ base rule for: bagging, boosting, random forests
  ▶ single framework for: regression, binary or multiclass classification
  ▶ see Zhang, Singer (2010) and Hastie, Tibshirani, Friedman (2009)
  ▶ In the sequel, CART trees are obtained using
  ▶ the R package rpart
  ▶ the default parameters (Gini heterogeneity function to grow the maximal tree and pruning with 10-fold CV); a fitting sketch follows this slide
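
  A minimal R sketch of that default setting, assuming a generic data frame df with a factor response y (both names are placeholders):

    # Grow with rpart defaults (Gini, xval = 10) and prune using the CV error in cptable
    library(rpart)

    fit <- rpart(y ~ ., data = df, method = "class")
    cp_best <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
    fit_pruned <- prune(fit, cp = cp_best)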

  6. CART and stability
  ▶ CART instability:
  ▶ Cheze, Poggi (2006): outliers detected using boosting
  ▶ Briand et al. (2009): sensitivity analysis using a similarity measure between trees
  ▶ Bousquet, Elisseeff (2002): stability through the jackknife
  ▶ Classically, robustness deals with model stability, considered globally
  ▶ Here the focus is on diagnosing individual observations rather than on model properties or variable selection problems
  ▶ We use decision trees to perform diagnosis on observations
  ▶ We use the influence function, a classical diagnostic method, to measure the perturbation induced by a single observation: a stability issue addressed through the jackknife

  7. Influence measures for CART
  ▶ Quantifying the differences between
  ▶ the reference tree T obtained from the complete sample $\mathcal{L}_n$
  ▶ the jackknife trees $(T^{(-i)})_{1 \leq i \leq n}$ obtained from $(\mathcal{L}_n \setminus \{(X_i, Y_i)\})_{1 \leq i \leq n}$ (a construction sketch follows this slide)
  Three kinds of IF for CART
  ▶ we derive three kinds of influence functions (IF) based on jackknife trees:
  ▶ influence on predictions, focusing on predictive performance
  ▶ influence on partitions, highlighting the tree structure
  following a classical distinction, see Miglio and Soffritti (2004)
  ▶ plus a CART-specific influence derived from the pruned sequences of trees
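
  A minimal R sketch of the reference and jackknife trees, under the same assumptions as before (data frame df, factor response y); build_tree is a placeholder standing for "grow with rpart, then prune by cross-validation":

    library(rpart)

    build_tree <- function(data) {
      fit <- rpart(y ~ ., data = data, method = "class")
      cp_best <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
      prune(fit, cp = cp_best)
    }

    T_ref  <- build_tree(df)                                               # reference tree T
    T_jack <- lapply(seq_len(nrow(df)), function(i) build_tree(df[-i, ]))  # T^(-i), i = 1..n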

  8. Influence on predictions
  $I_1$ and $I_2$ are based only on the predictions.
  Definition ($I_1$ and $I_2$); a computation sketch follows this slide
  ▶ $I_1$, closely related to the resubstitution estimate of the prediction error, evaluates the impact of the removal of a single observation on all the predictions:
  $I_1(x_i) = \sum_{k=1}^{n} \mathbb{1}_{T(x_k) \neq T^{(-i)}(x_k)}$
  ▶ $I_2$, closely related to the leave-one-out estimate of the prediction error:
  $I_2(x_i) = \mathbb{1}_{T(x_i) \neq T^{(-i)}(x_i)}$
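
  A minimal R sketch of I1 and I2, reusing T_ref, T_jack and df from the previous sketch:

    pred_ref <- predict(T_ref, newdata = df, type = "class")

    # I1: number of observations whose predicted label changes when x_i is removed
    I1 <- sapply(seq_len(nrow(df)), function(i) {
      pred_jack <- predict(T_jack[[i]], newdata = df, type = "class")
      sum(pred_jack != pred_ref)
    })

    # I2: does the prediction for x_i itself change?
    I2 <- sapply(seq_len(nrow(df)), function(i) {
      pred_i <- predict(T_jack[[i]], newdata = df[i, , drop = FALSE], type = "class")
      as.numeric(pred_i != pred_ref[i])
    })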

  9. Influence on predictions
  $I_3$ is based on the distribution of the labels in each leaf.
  Definition ($I_3$)
  ▶ $I_3$ measures the distance between the distributions of the label in the nodes where $x_i$ falls (a computation sketch follows this slide):
  $I_3(x_i) = d\left(p_{x_i, T},\, p_{x_i, T^{(-i)}}\right)$
  where d is the total variation distance
  $d(p, q) = \max_{A \subset \{1, \ldots, J\}} |p(A) - q(A)| = \frac{1}{2} \sum_{j=1}^{J} |p(j) - q(j)|$
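
  A minimal R sketch of I3 under the same assumptions; it relies on the fact that, for a classification tree, predict(..., type = "prob") in rpart returns the class proportions of the leaf reached by the observation:

    leaf_distribution <- function(tree, x) {
      as.numeric(predict(tree, newdata = x, type = "prob"))
    }

    I3 <- sapply(seq_len(nrow(df)), function(i) {
      xi <- df[i, , drop = FALSE]
      p <- leaf_distribution(T_ref, xi)
      q <- leaf_distribution(T_jack[[i]], xi)
      0.5 * sum(abs(p - q))     # total variation distance between the two leaf distributions
    })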

  10. Influence on partitions
  Definition (a computation sketch follows this slide)
  ▶ $I_4$ measures the variation in the number of clusters between the two partitions:
  $I_4(x_i) = |\tilde{T}^{(-i)}| - |\tilde{T}|$
  ▶ $I_5$ is based on the dissimilarity between the two partitions:
  $I_5(x_i) = 1 - J\left(\tilde{T}, \tilde{T}^{(-i)}\right)$
  where J is the Jaccard coefficient between the partitions of $\mathcal{L}$ defined by $\tilde{T}$ and $\tilde{T}^{(-i)}$ (the sets of the leaves of the trees)
  ▶ Jaccard coefficient: $J(C_1, C_2) = \frac{a}{a + b + c}$
  a = number of pairs of points of $\mathcal{L}$ in the same cluster in both partitions $C_1$ and $C_2$
  b (resp. c) = number of pairs of points in the same cluster in $C_1$ but not in $C_2$ (resp. in $C_2$ but not in $C_1$)
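
  A minimal R sketch of I4 and I5 under the same assumptions; using partykit to recover, for every observation of the learning set, the terminal node it falls into is an implementation choice, not part of the slides:

    library(partykit)

    n_leaves    <- function(tree) sum(tree$frame$var == "<leaf>")
    leaf_labels <- function(tree, data) predict(as.party(tree), newdata = data, type = "node")

    jaccard <- function(c1, c2) {
      # pairwise co-membership: a = pairs together in both partitions,
      # b / c = pairs together in only one of them
      ut <- upper.tri(diag(length(c1)))
      same1 <- outer(c1, c1, "==")[ut]
      same2 <- outer(c2, c2, "==")[ut]
      a <- sum(same1 & same2); b <- sum(same1 & !same2); c <- sum(!same1 & same2)
      a / (a + b + c)
    }

    ref_leaves <- leaf_labels(T_ref, df)
    I4 <- sapply(T_jack, function(t) n_leaves(t) - n_leaves(T_ref))
    I5 <- sapply(T_jack, function(t) 1 - jaccard(ref_leaves, leaf_labels(t, df)))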

  11. CART-specific influence
  Focus on the cp complexity cost constant
  ▶ consider the $N_{cp} \leq K_T + \sum_{1 \leq i \leq n} K_{T^{(-i)}}$ distinct values $\{cp_1; \ldots; cp_{N_{cp}}\}$, where $K_T$ is the length of the pruned sequence leading to tree T
  ▶ usually $N_{cp} \ll K_T + \sum_{1 \leq i \leq n} K_{T^{(-i)}}$, since the jackknife sequences are the same for many observations
  Definition ($I_6$); a computation sketch follows this slide
  ▶ $I_6$ is the number of complexities for which the predicted labels differ:
  $I_6(x_i) = \sum_{j=1}^{N_{cp}} \mathbb{1}_{T_{cp_j}(x_i) \neq T^{(-i)}_{cp_j}(x_i)}$
  where $\mathbb{1}_{T_{cp_j}(x_i) \neq T^{(-i)}_{cp_j}(x_i)}$ indicates whether the reference and jackknife subtrees corresponding to the same complexity $cp_j$ provide different predicted labels for $x_i$
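
  A rough R sketch of I6. It assumes that the unpruned reference and jackknife trees are available as maxfit and maxfit_jack[[i]] (hypothetical names), and that pruning each of them at a common cp value approximates the subtrees T_cpj and T^(-i)_cpj of the slide:

    cp_values <- sort(unique(c(maxfit$cptable[, "CP"],
                               unlist(lapply(maxfit_jack, function(f) f$cptable[, "CP"])))))

    I6 <- sapply(seq_len(nrow(df)), function(i) {
      xi <- df[i, , drop = FALSE]
      sum(sapply(cp_values, function(cp) {
        lab_ref  <- predict(prune(maxfit, cp = cp), newdata = xi, type = "class")
        lab_jack <- predict(prune(maxfit_jack[[i]], cp = cp), newdata = xi, type = "class")
        lab_ref != lab_jack
      }))
    })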

  12. CART tree: pruning sequence
  Penalized criterion:
  $\mathrm{crit}_\alpha(T) = R_n(f, \hat{f}_{|T}, \mathcal{L}_n) + \alpha \frac{|\tilde{T}|}{n}$
  with $R_n(f, \hat{f}_{|T}, \mathcal{L}_n)$ the error term and $|\tilde{T}|$ the number of leaves
  Pruning procedure: how to find $T_\alpha$ minimizing $\mathrm{crit}_\alpha(T)$ for any given $\alpha$
  ▶ a finite decreasing (nested) sequence of subtrees pruned from $T_{\max}$,
  $T_K = \{t_1\} \prec T_{K-1} \prec \ldots \prec T_1,$
  corresponding to critical complexities $0 = \alpha_1 < \alpha_2 < \ldots < \alpha_{K-1} < \alpha_K$ such that if $\alpha_k \leq \beta < \alpha_{k+1}$ then $T_\beta = T_{\alpha_k} = T_k$
  ▶ Remark: this sequence is a subsequence of the sequence of best trees with m leaves
  (A sketch of how this sequence is read off rpart's cptable follows this slide.)
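
  A minimal R sketch, using the iris data purely for illustration: the critical complexities and the nested pruned sequence can be read off rpart's cptable (rpart's CP column is a rescaled version of the alpha_k above):

    library(rpart)

    fit_max <- rpart(Species ~ ., data = iris, method = "class",
                     control = rpart.control(cp = 0, minsplit = 2))

    alphas   <- fit_max$cptable[, "CP"]                               # critical complexities
    sequence <- lapply(alphas, function(a) prune(fit_max, cp = a))    # nested subtrees T_k
    sapply(sequence, function(t) sum(t$frame$var == "<leaf>"))        # number of leaves along the sequence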

  13. PATARE dataset
  ▶ Tax revenues of households in 2007 from the 143 cities surrounding Paris
  ▶ Cities are grouped into four counties ("département" in French):
  ▶ Paris: 20 "arrondissements" (districts)
  ▶ Seine-Saint-Denis (north of Paris): 40 cities
  ▶ Hauts-de-Seine (west of Paris): 36 cities
  ▶ Val-de-Marne (south of Paris): 48 cities
  ▶ Variables = characteristics of the distribution of the tax revenues per city; for each city:
  ▶ first and 9th deciles (D1, D9)
  ▶ quartiles (Q1, Q2 and Q3)
  ▶ mean, and % of the tax revenues coming from salaries and treatments (PtSal)
  ▶ Data freely available on http://www.data-publica.com/data
  (Figure: map of the cities)

  14. PATARE dataset: the classification problem
  ▶ supervised classification problem (quaternary response variable): predict the county of a city from the characteristics of its tax revenues distribution (a setup sketch follows this slide)
  ▶ the county cannot be easily retrieved from the explanatory variables considered without the county information: poor recovery of the counties through clusters, as shown by a map of the cities drawn according to a k-means (k = 4) clustering superimposed with the borders of the counties
  (Figure: map of the k-means clusters with the county borders)
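
  A minimal R sketch of this setting, assuming a data frame patare with a factor column county and the distribution characteristics (D1, Q1, Q2, Q3, D9, mean, PtSal) as numeric columns; all names are placeholders:

    library(rpart)

    # Supervised view: predict the county from the tax revenue characteristics
    fit <- rpart(county ~ ., data = patare, method = "class")
    cp_best <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
    tree <- prune(fit, cp = cp_best)

    # Unsupervised view: a k-means (k = 4) clustering of the explanatory variables,
    # cross-tabulated with the true counties to show the poor recovery
    km <- kmeans(scale(patare[, setdiff(names(patare), "county")]), centers = 4)
    table(cluster = km$cluster, county = patare$county)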
