Multivariate characterization of differences between groups
Ricco RAKOTOMALALA
Tutoriels Tanagra - http://tutoriels-data-mining.blogspot.fr/
Outline
1. Problem statement
2. Determination of the latent variables (dimensions)
3. Reading the results
4. A case study
5. Classification of a new instance
6. Statistical tools (Tanagra, lda of R, proc candisc of SAS)
7. Conclusion
8. References
Descriptive Discriminant Analysis (DDA) - Goal

A population is subdivided into K groups (identified by a categorical variable, the label); the instances are described by J continuous descriptors.

E.g. Bordeaux wine (Tenenhaus, 2006; page 353). The rows of the dataset correspond to the years of production (1924 to 1957). First rows:

Annee   Temperature   Soleil (Sun)   Chaleur (Heat)   Pluie (Rain)   Qualite (Quality)
1924    3064          1201           10               361            medium
1925    3000          1053           11               338            bad
1926    3155          1133           19               393            medium
1927    3085          970            4                467            bad
1928    3245          1258           36               294            good
1929    3267          1386           35               225            good

Temperature, Soleil, Chaleur and Pluie are the descriptors; Qualite indicates the group membership.

Goal(s):
(1) Descriptive (explanation): highlighting the characteristics which make it possible to explain the differences between the groups → the main objective in our context.
(2) Predictive (classification): assigning a group to an unseen instance → a secondary objective in our context (but it is the main objective in the predictive discriminant analysis [PDA] setting).
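To make the running example concrete, here is a minimal R sketch (R is one of the tools discussed later) encoding the six rows shown above. The full table covers 34 years, so results computed on this excerpt will differ from the figures quoted in these slides; the variable names are our own choice.

```r
# Toy excerpt of the Bordeaux wine data (first 6 of the 34 years, 1924-1957);
# the complete table is in Tenenhaus (2006), page 353.
wine <- data.frame(
  annee       = c(1924, 1925, 1926, 1927, 1928, 1929),
  temperature = c(3064, 3000, 3155, 3085, 3245, 3267),
  soleil      = c(1201, 1053, 1133,  970, 1258, 1386),  # sun
  chaleur     = c(  10,   11,   19,    4,   36,   35),  # heat
  pluie       = c( 361,  338,  393,  467,  294,  225),  # rain
  qualite     = factor(c("medium", "bad", "medium", "bad", "good", "good"))
)
```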
Descriptive Discriminant Analysis - Approach

Aim: determining the most parsimonious way to explain the differences between the groups by computing a set of orthogonal linear combinations (canonical variables, factors) from the original descriptors. This is also known as Canonical Discriminant Analysis.

With two descriptors, a factor is written

$z_i = a_1 (x_{i1} - \bar{x}_1) + a_2 (x_{i2} - \bar{x}_2)$

The conditional centroids must be as widely separated as possible on the factors.

Huyghens' decomposition of the variation of z:

$\sum_i (z_i - \bar{z})^2 = \sum_k n_k (\bar{z}_k - \bar{z})^2 + \sum_k \sum_{i : y_i = k} (z_i - \bar{z}_k)^2$

$v = b + w$

Total (variation) = Between-class (variation) + Within-class (variation)

[Figure: first DDA axis ("1er axe AFD") computed on the variables Temperature and Soleil; scatter plot of Soleil vs. Temperature, groups bad / medium / good.]
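The decomposition can be checked numerically; a minimal sketch on the toy excerpt above, using as candidate axis the first-axis coefficients (0.0075, 0.0075) quoted later in these slides:

```r
# total = between + within on a candidate axis z (Huyghens decomposition)
a  <- c(0.0075, 0.0075)
Xc <- scale(wine[, c("temperature", "soleil")], scale = FALSE)  # centered descriptors
z  <- as.vector(Xc %*% a)
y  <- wine$qualite
v  <- sum((z - mean(z))^2)                                             # total
b  <- sum(tapply(z, y, function(g) length(g) * (mean(g) - mean(z))^2)) # between
w  <- sum(tapply(z, y, function(g) sum((g - mean(g))^2)))              # within
all.equal(v, b + w)   # TRUE: v = b + w
```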
Descriptive Discriminant Analysis - Approach (continued)

Determining the coefficients (canonical coefficients) $(a_1, a_2)$ amounts to maximizing a measure of the class separability: the correlation ratio

$\eta^2_{z,y} = \dfrac{b}{v}$, with $0 \le \eta^2_{z,y} \le 1$

$\eta^2 = 1$: perfect discrimination. All the points of a group coincide with the corresponding centroid (W = 0).
$\eta^2 = 0$: impossible discrimination. All the centroids coincide (B = 0).

The correlation ratio measures the class separability.

Maximum number of "dimensions" (factors): $M = \min(J, K - 1)$.

The factors are uncorrelated: each factor accounts for the differences not explained by the preceding factors.

[Figure: the same Soleil vs. Temperature scatter plot with the first DDA axis; on the Bordeaux data, $\eta^2_{z_1, y} = 0.726$ and $\eta^2_{z_2, y} = 0.051$.]
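The correlation ratio is straightforward to compute directly; a small sketch reusing z and y from the previous block (eta2 is our own helper name):

```r
# correlation ratio: between-class variation of z over its total variation;
# this is the quantity the canonical coefficients maximize
eta2 <- function(z, y) {
  b <- sum(tapply(z, y, function(g) length(g) * (mean(g) - mean(z))^2))
  b / sum((z - mean(z))^2)
}
eta2(z, y)   # eta^2 of the candidate axis built above
```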
Descriptive Discriminant Analysis - Mathematical formulation

$a = (a_1, \ldots, a_J)'$ is the vector of coefficients which defines the canonical variable Z, i.e.

$z_i = a_1 (x_{i1} - \bar{x}_1) + \cdots + a_J (x_{iJ} - \bar{x}_J)$

Huyghens' theorem: V = B + W. The sums of squares below ignore the multiplication factor (1/n).

Total covariance matrix: $v_{lc} = \dfrac{1}{n} \sum_i (x_{il} - \bar{x}_l)(x_{ic} - \bar{x}_c)$ ; $TSS = a'Va$

Within-groups covariance matrix: $w_{lc} = \dfrac{1}{n} \sum_k \sum_{i : y_i = k} (x_{il} - \bar{x}_{l,k})(x_{ic} - \bar{x}_{c,k})$ ; $RSS = a'Wa$

Between-groups covariance matrix: $b_{lc} = \sum_k \dfrac{n_k}{n} (\bar{x}_{l,k} - \bar{x}_l)(\bar{x}_{c,k} - \bar{x}_c)$ ; $ESS = a'Ba$

The aim of DDA is to compute the coefficients of the canonical variable which maximize the correlation ratio:

$\max_a \eta^2_{z,y} = \max_a \dfrac{a'Ba}{a'Va}$
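A sketch of the three matrices with the 1/n convention of this slide, continuing from the variables defined in the previous blocks; V = B + W can then be checked numerically:

```r
# total (V), within (W) and between (B) covariance matrices
n <- nrow(wine)
V <- crossprod(Xc) / n                         # total covariance matrix
W <- Reduce(`+`, lapply(split(as.data.frame(Xc), y), function(g) {
  gc <- scale(as.matrix(g), scale = FALSE)     # center on the group means
  crossprod(gc)
})) / n                                        # within-groups covariance matrix
B <- V - W                                     # Huyghens: V = B + W
c(TSS = drop(t(a) %*% V %*% a),
  RSS = drop(t(a) %*% W %*% a),
  ESS = drop(t(a) %*% B %*% a))                # TSS = RSS + ESS; ESS/TSS = eta^2
```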
Descriptive Discriminant Analysis - Solution

$\max_a \dfrac{a'Ba}{a'Va}$ is equivalent to $\max_a a'Ba$ under the constraint $a'Va = 1$ (normalization of "a").

Solution: using the Lagrange function ($\lambda$ is the Lagrange multiplier)

$L(a) = a'Ba - \lambda (a'Va - 1)$

$\dfrac{\partial L(a)}{\partial a} = 0 \Rightarrow Ba = \lambda Va \Rightarrow V^{-1} B a = \lambda a$

$\lambda$ is the first (largest) eigenvalue of $V^{-1}B$; "a" is the corresponding eigenvector.

The successive canonical variables are obtained from the eigenvalues and the eigenvectors of $V^{-1}B$. The number of non-zero eigenvalues is $M = \min(K-1, J)$, i.e. M canonical variables.

Each eigenvalue is equal to the squared correlation ratio: $\lambda = \eta^2$ $(0 \le \lambda \le 1)$; $\sqrt{\lambda}$ is the canonical correlation.
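The whole analysis thus reduces to one eigen-decomposition; a sketch using base R's eigen (real parts are taken because eigen may return complex values with a zero imaginary part for a non-symmetric matrix):

```r
# canonical coefficients = eigenvectors of V^{-1} B;
# eigenvalues = squared correlation ratios (eta^2) of the factors
eig    <- eigen(solve(V) %*% B)
M      <- min(ncol(Xc), nlevels(y) - 1)   # number of non-zero eigenvalues
lambda <- Re(eig$values)[seq_len(M)]      # eta^2 of each factor
A      <- Re(eig$vectors)[, seq_len(M)]   # one column of coefficients per factor
lambda
```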
Descriptive Discriminant Analysis - Number of factors

Bordeaux wine (X1: Temperature and X2: Sun): M = min(J = 2, K - 1 = 2) = 2.

First factor: $z_{i1} = 0.0075 (x_{i1} - \bar{x}_1) + 0.0075 (x_{i2} - \bar{x}_2)$, with $\eta^2_1 = 0.726$ (canonical correlation 0.852). The differences between the centroids are high on this factor.

Second factor: $z_{i2} = 0.0092 (x_{i1} - \bar{x}_1) - 0.0105 (x_{i2} - \bar{x}_2)$, with $\eta^2_2 = 0.051$ (canonical correlation 0.225). The differences between the centroids are lesser on this factor.

The coordinates of the individuals in the new representation space, e.g. (2.91; -2.22), are called "factor scores" (SAS, SPSS, R...).

[Figure: scatter plot of the individuals on the two factors (z1, z2), groups bad / medium / good; the groups are mainly separated along the first axis.]
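Factor scores can be obtained by projecting the centered descriptors on the eigenvectors; a sketch only: SAS, SPSS or R's lda apply their own normalization of the coefficients, so signs and scales may differ from their output (and from the slides' values, which use the full 34-year dataset).

```r
# factor scores: coordinates of the individuals on the canonical axes
scores <- Xc %*% A
head(scores)
```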
Descriptive Discriminant Analysis - Alternative solution (English-speaking tools and references)

Since V = B + W, we can formulate the problem in another way:

$\max_a \dfrac{a'Ba}{a'Wa}$ is equivalent to $\max_a a'Ba$ w.r.t. $a'Wa = 1$ (normalization of "a").

The factors are obtained from the eigenvalues and eigenvectors of $W^{-1}B$. The eigenvectors of $W^{-1}B$ are the same as those of $V^{-1}B$: the factors are identical. The eigenvalues are related by

$\rho_m = \dfrac{\lambda_m}{1 - \lambda_m} = \dfrac{ESS_m}{RSS_m}$

E.g. Bordeaux wine, with only the variables "temperature" and "sun":

$\rho_1 = \dfrac{0.8518^2}{1 - 0.8518^2} = \dfrac{0.7255}{1 - 0.7255} = 2.6432$

Root   Eigenvalue   Proportion   Canonical R
1      2.6432       0.9802       0.8518
2      0.0534       1.0000       0.2251

The first factor explains 98% of the global between-class variation: 0.9802 = 2.6432 / (2.6432 + 0.0534); the Proportion column is cumulative. The two factors explain 100% of this variation [M = min(2, 3-1) = 2]. The first factor is enough here! We can also state the explained variation in percentage.
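The equivalence between the two eigenproblems can be checked numerically; a sketch continuing from the previous blocks (the proportions printed on the toy excerpt will differ from the slides' 0.9802 / 0.0198, which come from the full data):

```r
# alternative eigenproblem: same axes, eigenvalues rho = lambda / (1 - lambda)
rho <- Re(eigen(solve(W) %*% B)$values)[seq_len(M)]
all.equal(rho, lambda / (1 - lambda))   # TRUE
rho / sum(rho)                          # proportion of between-class variation per factor
```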
Descriptive Discriminant Analysis - Determining the right number of factors

We want to check H0: the correlation ratios of the "q" last factors are zero, i.e.

H0: $\eta^2_{K-q} = \eta^2_{K-q+1} = \cdots = \eta^2_{K-1} = 0$

i.e. under H0 we can ignore the "q" remaining factors.

N.B. Checking a factor individually is not appropriate, because the relevance of a factor depends on the variation explained by the preceding factors.

Test statistic (Wilks' Lambda):

$\Lambda_q = \prod_{m=K-q}^{K-1} (1 - \eta^2_m)$

The lower the value of LAMBDA, the more interesting the factors.

If the data follow a multidimensional normal distribution in each group, we can use the Bartlett (chi-squared) or the Rao (Fisher) transformation.

Root   Eigenvalue   Proportion   Canonical R   Wilks Lambda   CHI-2     d.f.   p-value
1      2.6432       0.9802       0.8518        0.260568       41.0191   4      0.000000
2      0.0534       1.0000       0.2251        0.949308       1.5867    1      0.207802

The first two factors are jointly significant at the 5% level; but the last factor alone is not significant.
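A sketch of the Bartlett chi-squared approximation (wilks_bartlett is our own helper name); fed with the slides' values ($\eta^2$ = 0.7255 and 0.0507, n = 34 years, J = 2 descriptors, K = 3 groups), it reproduces the table above up to rounding:

```r
# Bartlett chi-squared approximation for Wilks' Lambda:
# row m tests H0 "the correlation ratios of factors m..M are all zero"
wilks_bartlett <- function(eta2, n, J, K) {
  M <- length(eta2)
  t(sapply(seq_len(M), function(m) {
    L   <- prod(1 - eta2[m:M])              # Wilks' Lambda for factors m..M
    chi <- -(n - 1 - (J + K) / 2) * log(L)  # Bartlett transformation
    df  <- (J - m + 1) * (K - m)            # degrees of freedom
    c(Wilks = L, CHI2 = chi, df = df,
      p.value = pchisq(chi, df, lower.tail = FALSE))
  }))
}
wilks_bartlett(c(0.7255, 0.0507), n = 34, J = 2, K = 3)
# -> CHI-2 = 41.02 (d.f. 4, p ~ 0) and 1.587 (d.f. 1, p ~ 0.208), as in the table
```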