A Decision Tree for Interval-valued Data with Modal Dependent Variable
Djamal Seck (1), Lynne Billard (2), Edwin Diday (3) and Filipe Afonso (4)
(1) Département de Mathématiques et Informatique, Université Cheikh Anta Diop de Dakar, Senegal, djamal.seck@ucad.edu.sn
(2) Department of Statistics, University of Georgia, Athens GA 30605 USA, lynne@stat.uga.edu
(3) CEREMADE, University of Paris Dauphine, 75775 Paris Cedex 16, France, edwin.diday@ceremade.dauphine.fr
(4) Syrokko, Aéropôle de Roissy, Bat. Aéronef, 5 rue de Copenhague, 95731 Roissy Charles de Gaulle Cedex, France, afonso@syrokko.com
COMPSTAT - August 2010
Seck Symbolic Decision Tree
The Future
Schweizer (1985): "Distributions are the numbers of the future"
Types of Data
Classical Data Value X:
- A single point in p-dimensional space
  E.g., X = 17, X = 2.1, X = blue
Symbolic Data Value Y:
- Hypercube or Cartesian product of distributions in p-dimensional space
  I.e., Y = list, interval, modal in structure
Modal data: Histogram, empirical distribution function, probability distribution, model, ...
Weights: Relative frequencies, capacities, credibilities, necessities, possibilities, ...
Symbolic Data
How do symbolic data arise?
1. Aggregated data by classes or groups.
   Research interest: classes or groups
2. Natural symbolic data.
   Pulse rate: 64 ± 2 = [62, 66].
   Daily temperature: [55, 67].
3. Published data: census data.
4. Symbolic data: range, list, distribution, etc.
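As an illustration of the first route (aggregation by classes or groups), the sketch below turns a few classical records into one symbolic description per group: an interval for a numeric variable and relative frequencies for a categorical one. The records, the variable names and the `aggregate` helper are hypothetical and only serve the example; they are not taken from the talk.

```python
from collections import Counter, defaultdict

# Hypothetical classical records: (group, pulse rate, eye colour).
records = [
    ("A", 62, "blue"),
    ("A", 66, "brown"),
    ("A", 64, "blue"),
    ("B", 70, "brown"),
    ("B", 74, "brown"),
]


def aggregate(rows):
    """Aggregate classical rows into one symbolic description per group."""
    by_group = defaultdict(list)
    for group, pulse, colour in rows:
        by_group[group].append((pulse, colour))

    symbolic = {}
    for group, members in by_group.items():
        pulses = [p for p, _ in members]
        counts = Counter(c for _, c in members)
        symbolic[group] = {
            # Interval-valued variable: range of the observed values.
            "pulse": (min(pulses), max(pulses)),
            # Modal categorical variable: relative frequency per category.
            "colour": {c: k / len(members) for c, k in counts.items()},
        }
    return symbolic


symbolic = aggregate(records)
print(symbolic["A"]["pulse"])   # interval for group A
print(symbolic["B"]["colour"])  # modal value for group B
```

Taking the range as the interval is one simple aggregation choice; other summaries (quantile ranges, histograms) would give different symbolic values from the same records.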
Literature Review
1. Clustering for classical data - CART, Breiman et al. (1984)
2. Clustering for symbolic data.
   - Agglomerative algorithm and dissimilarity measures for non-modal categorical and interval-valued data: Gowda and Diday (1991)
   - Pyramid clustering: Brito (1991, 1994), Brito and Diday (1990)
   - Spatial pyramids: Raoul Mohamed (2009)
   - Divisive monothetic algorithm for intervals: Chavent (1998, 2000)
   - Divisive algorithms for histograms: Kim (2009)
   - Decision trees for non-modal dependent variables: Périnel (1996, 1999), Limam (2005), Winsberg et al. (2006), ...
   - ...
   - Decision tree for interval data and modal dependent variable (STREE): Seck (2010) (a CART methodology for symbolic data)
The Data
We have observations $\Omega = \{\omega_1, \ldots, \omega_n\}$, where $\omega_i$ has realization $Y_i = (Y_{i1}, \ldots, Y_{ip})$, $i = 1, \ldots, n$.
- Modal multinominal (modal categorical): $Y_{ij} = \{m_{ijk}, p_{ijk};\ k = 1, \ldots, s_i\}$, $\sum_{k=1}^{s_i} p_{ijk} = 1$, with $m_{ijk} \in \mathcal{O}_j = \{m_{j1}, \ldots, m_{js}\}$, $j = 1, \ldots, p$, $i = 1, \ldots, n$. (Take $s_i = s$, wlg.)
- Multi-valued (non-modal): $Y_{ij} = \{m_{ijk},\ k = 1, \ldots, s_i\}$, i.e., $p_{ijk} = 1/s$ or $0$, with $m_{ijk} \in \mathcal{O}_j$, $j = 1, \ldots, p$, $i = 1, \ldots, n$.
- Intervals: $Y_i = ([a_{i1}, b_{i1}], \ldots, [a_{ip}, b_{ip}])$, with $a_{ij}, b_{ij} \in \mathbb{R}_j$, $j = 1, \ldots, p$, $i = 1, \ldots, n$.
- Nominal (classical categorical): special case of modal multinominal with $s_i = 1$, $p_{ij1} = 1$; write $Y_{ij} \equiv m_{ij1} = \delta_{ij}$, $\delta_{ij} \in \mathcal{O}_j$.
- Classical continuous variable: special case of interval with $Y_{ij} = [a_{ij}, a_{ij}]$ for $a_{ij} \in \mathbb{R}_j$.
STREE Algorithm
Have at the $r$th stage the partition $P_r = (C_1, \ldots, C_r)$.
- Discrimination criterion: $D(N)$ - explains partition of node $N$ as in CART analysis
- Homogeneity criterion: $H(N)$ - inertia associated with explanatory variables as in pure hierarchy tree analysis
We take the mixture, for $\alpha > 0$, $\beta > 0$,
$$I = \alpha D(N) + \beta H(N), \quad \alpha + \beta = 1.$$
The $D(N)$ is taken as the Gini measure (as in CART)
$$D(N) = \sum_{i \neq f} p_i p_f = 1 - \sum_{i=1,\ldots,r} p_i^2$$
with $p_i = n_i / n$, $n_i = \mathrm{card}(N \cap C_i)$, $n = \mathrm{card}(N)$; the $H(N)$ is
$$H(N) = \sum_{\omega_{i_1} \in \Omega} \sum_{\omega_{i_2} \in \Omega} \frac{p_{i_1} p_{i_2}}{2\mu}\, d^2(\omega_{i_1}, \omega_{i_2})$$
where $d(\omega_{i_1}, \omega_{i_2})$ is a distance measure between $\omega_{i_1}$ and $\omega_{i_2}$, $p_i$ is the weight associated with $\omega_i$ and $\mu = \sum_{i=1}^{N} p_i$.
Select the partition $C = \{C_1, C_2\}$ for which the reduction in $I$ is greatest; i.e., maximize $\Delta I = I(C) - I(C_1, C_2)$.
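The mixed criterion can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the STREE implementation: the function names (`gini`, `inertia`, `mixed_criterion`) are hypothetical, class proportions and a pairwise distance matrix are assumed to be given, and the sums in H(N) are taken over the observations passed in. The slide does not say how I(C1, C2) combines the two child nodes, so the reduction ΔI is deliberately left out.

```python
def gini(class_props):
    """D(N) = 1 - sum_i p_i^2 for the class proportions p_i at a node."""
    return 1.0 - sum(p * p for p in class_props)


def inertia(weights, dist):
    """H(N) = sum_{i1} sum_{i2} w_i1 * w_i2 / (2 * mu) * d(i1, i2)^2.

    weights: list of observation weights, mu is their sum.
    dist:    square matrix (list of lists) of pairwise distances.
    """
    mu = sum(weights)
    n = len(weights)
    total = 0.0
    for a in range(n):
        for b in range(n):
            total += weights[a] * weights[b] * dist[a][b] ** 2
    return total / (2.0 * mu)


def mixed_criterion(class_props, weights, dist, alpha=0.5):
    """I = alpha * D(N) + beta * H(N) with beta = 1 - alpha, 0 < alpha < 1."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    beta = 1.0 - alpha
    return alpha * gini(class_props) + beta * inertia(weights, dist)


# Toy node: two equally likely classes, two unit-weight observations
# at distance 2 from each other (all values are made up).
d_value = gini([0.5, 0.5])
h_value = inertia([1.0, 1.0], [[0.0, 2.0], [2.0, 0.0]])
i_value = mixed_criterion([0.5, 0.5], [1.0, 1.0], [[0.0, 2.0], [2.0, 0.0]])
print(d_value, h_value, i_value)
```

With alpha close to 1 the criterion behaves like the CART Gini measure; with alpha close to 0 it behaves like a pure inertia-based clustering criterion.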
Decision Tree - Distance Measures
The homogeneity criterion $H(N)$:
$$H(N) = \sum_{\omega_{i_1} \in \Omega} \sum_{\omega_{i_2} \in \Omega} \frac{p_{i_1} p_{i_2}}{2\mu}\, d^2(\omega_{i_1}, \omega_{i_2})$$
where $d(\omega_{i_1}, \omega_{i_2})$ is a distance measure between $\omega_{i_1}$ and $\omega_{i_2}$, $p_i$ is the weight associated with $\omega_i$ and $\mu = \sum_{i=1}^{N} p_i$.
The STREE algorithm uses
- Modal categorical variables - $L_1$ distance:
  $$d_j(\omega_{i_1}, \omega_{i_2}) = \sum_{k \in \mathcal{O}_j} |p_{i_1 jk} - p_{i_2 jk}|;$$
  or, $L_2$ distance:
  $$d_j(\omega_{i_1}, \omega_{i_2}) = \sum_{k \in \mathcal{O}_j} (p_{i_1 jk} - p_{i_2 jk})^2$$
- Interval variables - Hausdorff distance:
  $$d_j(\omega_{i_1}, \omega_{i_2}) = \max(|a_{i_1 j} - a_{i_2 j}|,\ |b_{i_1 j} - b_{i_2 j}|)$$
- Classical categorical variables - (0, 1) distance:
  $$d_j(\omega_{i_1}, \omega_{i_2}) = \begin{cases} 0, & \text{if } m_{i_1 j} = m_{i_2 j} \\ 1, & \text{if } m_{i_1 j} \neq m_{i_2 j} \end{cases}$$
- Classical continuous variables - Euclidean distance:
  $$d_j(\omega_{i_1}, \omega_{i_2}) = (a_{i_1 j} - a_{i_2 j})^2$$
Hence, $d(\omega_{i_1}, \omega_{i_2}) = \sum_{j=1}^{p} d_j(\omega_{i_1}, \omega_{i_2})$.
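The per-variable distances listed above, and their sum over the p variables, could be coded along the following lines. This is a sketch only: the data layout (tuples for intervals, dicts of probabilities for modal values), the type labels in `DISTANCES` and all function names are assumptions made for the example, not part of STREE.

```python
def l1_modal(p1, p2):
    """L1 distance between two modal values given as {category: probability}."""
    keys = set(p1) | set(p2)
    return sum(abs(p1.get(k, 0.0) - p2.get(k, 0.0)) for k in keys)


def l2_modal(p1, p2):
    """L2 distance as written on the slide: sum of squared differences."""
    keys = set(p1) | set(p2)
    return sum((p1.get(k, 0.0) - p2.get(k, 0.0)) ** 2 for k in keys)


def hausdorff(int1, int2):
    """Hausdorff distance between intervals (a1, b1) and (a2, b2)."""
    (a1, b1), (a2, b2) = int1, int2
    return max(abs(a1 - a2), abs(b1 - b2))


def zero_one(m1, m2):
    """(0, 1) distance between two classical categorical values."""
    return 0.0 if m1 == m2 else 1.0


def squared_difference(x1, x2):
    """Distance for classical continuous values as written on the slide."""
    return (x1 - x2) ** 2


# Hypothetical type labels mapping each variable to its distance.
DISTANCES = {
    "modal_l1": l1_modal,
    "modal_l2": l2_modal,
    "interval": hausdorff,
    "nominal": zero_one,
    "continuous": squared_difference,
}


def total_distance(obs1, obs2, kinds):
    """d(w1, w2) = sum over variables j of d_j(w1, w2)."""
    return sum(DISTANCES[kind](y1, y2)
               for kind, y1, y2 in zip(kinds, obs1, obs2))


# Two made-up observations with an interval, a modal and a nominal variable.
kinds = ["interval", "modal_l1", "nominal"]
w1 = [(62, 66), {"blue": 0.5, "brown": 0.5}, "urban"]
w2 = [(55, 67), {"blue": 1.0}, "rural"]
d12 = total_distance(w1, w2, kinds)
print(d12)
```

Categories missing from a modal value are treated as having probability 0, which matches taking a common category set for all observations.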
Decision Tree - Cut Points
Cut points: take the modal categorical case. Recall
$$Y_{ij} = \{m_{ijk}, p_{ijk};\ k = 1, \ldots, s_i\}, \quad \sum_{k=1}^{s_i} p_{ijk} = 1,$$
with $m_{ijk} \in \mathcal{O}_j = \{m_{j1}, \ldots, m_{js}\}$, $j = 1, \ldots, p$, $i = 1, \ldots, n$. (Take $s_i = s$, wlg.)
First: for each $k$ in turn, order $p_{ijk}$ from smallest to largest. There are $L_k \leq n$ distinct values of $p_{jkr}$, $r = 1, \ldots, L_k$.
Then, the cut point for this modality ($m_{jk}$) is the probability
$$c_{jkr} = (p_{jkr} + p_{jk,r+1})/2, \quad r = 1, \ldots, L_k - 1, \quad k = 1, \ldots, s.$$
There are $\sum_{k=1}^{s} (L_k - 1)$ possible partitions for each $j$.
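The cut-point construction can be illustrated with a short sketch. It assumes each observation's modal value for variable j is stored as a dict of category probabilities; the helper names `cut_points` and `all_cut_points` and the toy data are hypothetical.

```python
def cut_points(probs):
    """Midpoints between consecutive distinct sorted probabilities.

    For L distinct values this returns L - 1 candidate cut points.
    """
    distinct = sorted(set(probs))
    return [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]


def all_cut_points(column, categories):
    """Candidate cut points for every modality k of one modal variable j.

    column:     list of {category: probability} dicts, one per observation.
    categories: the common category set O_j.
    """
    return {k: cut_points([y.get(k, 0.0) for y in column])
            for k in categories}


# Made-up modal values of one variable for four observations.
column = [
    {"a": 0.25, "b": 0.75},
    {"a": 0.5, "b": 0.5},
    {"a": 0.25, "b": 0.75},
    {"a": 1.0, "b": 0.0},
]
cuts = all_cut_points(column, ["a", "b"])
n_partitions = sum(len(c) for c in cuts.values())
print(cuts, n_partitions)
```

Each cut point c for modality k defines a candidate binary split of the node: observations with probability at most c go to one side and the rest to the other, so `n_partitions` matches the count of possible partitions given on the slide.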