Correspondence Analysis. P. CAZES CEREMADE, University Paris - PowerPoint PPT Presentation


  1. Some Comments on Correspondence Analysis. P. CAZES CEREMADE, University Paris Dauphine

  2. Overview
     • Data analysis as an experimental science
     • The Laboratory of Statistics of University Paris 6 in the seventies
     • Coding
     • Correspondence analysis as a particular case of other methods
     • Correspondence analysis and modelling techniques
     • Correspondence analysis and data analysis since 2000
     • Bibliography

  3. Data analysis as an experimental science
     • Theoretical results were discovered and proved after first being observed on computer listings, as with experiments in physics.
     • Indices (inertia rates, contributions, test values, etc.) were set up to validate the results, like error computations in physics.
     • Coding techniques define the ad hoc table to be analyzed and the sequence of analyses (descriptive, explanatory or decision-oriented) needed to treat the data. This problem is analogous to setting up an experiment in physics.

  4. Examples of results discovered and proved after being seen on listings
     • In the CA of a contingency table crossing two sets I and J, the moments of inertia of the factorial axes of the clouds $N_I$ and $N_J$, associated with I and J respectively, are equal. This result is now standard (B. Escofier PhD, 1965).
     • The CA of a doubled table of 0s and 1s has a total inertia equal to 1.

  5. • The CA of a doubled table of 0s and 1s is equivalent to the normed PCA (NPCA) of the non-doubled (initial) table (Benzecri, J.P. Pagès, Bara: PhD, Serums data, 1971).
       – Furthermore, $\lambda_{CA} = \lambda_{NPCA} / p$, where p is the number of variables (columns) of the initial table.
       – We thus recover: inertia in CA = inertia in NPCA / p = p / p = 1.
       – The rows have the same representation on the factorial axes (with the factor $1/\sqrt{p}$ to pass from NPCA to CA).
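The equivalence stated on slides 4 and 5 can be checked numerically. The sketch below is only an illustration: the helper `ca_eigenvalues` and the small 0/1 table `X` are assumptions introduced here, not part of the presentation. It doubles `X` into `[X, 1 - X]`, runs a plain SVD-based CA, and compares the eigenvalues with those of the correlation matrix divided by p.

```python
import numpy as np

def ca_eigenvalues(table):
    """Eigenvalues of the CA of a two-way table (hypothetical helper, SVD-based)."""
    P = np.asarray(table, dtype=float)
    P = P / P.sum()                      # relative frequencies
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # standardized residuals
    return np.linalg.svd(S, compute_uv=False) ** 2      # sorted in decreasing order

# Made-up 0/1 table: 6 individuals, 3 variables (no constant column).
X = np.array([[1, 0, 1],
              [0, 0, 1],
              [1, 1, 0],
              [0, 1, 0],
              [1, 1, 1],
              [0, 0, 0]], dtype=float)
n, p = X.shape

Y = np.hstack([X, 1 - X])                # doubled table
lam_ca = ca_eigenvalues(Y)

R = np.corrcoef(X, rowvar=False)         # normed PCA = eigen-analysis of the correlation matrix
lam_npca = np.sort(np.linalg.eigvalsh(R))[::-1]

total_inertia = lam_ca.sum()
same_spectrum = np.allclose(lam_ca[:p], lam_npca / p)
print(total_inertia, same_spectrum)      # total inertia should be 1 up to rounding
```

Only the first p eigenvalues of the CA are compared, because the doubled table has 2p columns but rank at most p after centering.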

  6. The Laboratory of Statistics of University Paris 6 in the seventies
     • Pr. Benzecri was director of the laboratory and also head of the Master 2 (M2) of statistics (research master), with 150 to 200 students (the "largest" M2 in France)
     • 40 PhDs defended each year from 1974 onwards (3 or 4 defenses each Monday in May and June)
     • Very numerous and diverse examples of applications:
       – Biology
       – Medicine
       – Ecology
       – Physics
       – Economics
       – Psychology
       – Geology
       – Sociology
       – Linguistics

  7. • Diversity of the students' origins:
       – French, of course, but also African, Argentinean, Greek, Egyptian, Iranian, Irish, Lebanese, Syrian, Turkish, Vietnamese, etc.
     • Consequences
       – Great discussions
       – Numerous ideas
       – Exceptional impact of the laboratory: publications, colloquiums, etc.

  8. Publications of Professor Benzecri
     • Creation, in 1976, of the Cahiers de l'Analyse des Données (CAD), which were digitized by NUMDAM at the end of 2010 but are not yet online
     • Publication, in 1973, of the two famous books on data analysis, L'analyse des données:
       – Tome 1 : la taxinomie
       – Tome 2 : l'analyse des correspondances
     • Publication, in 1982, of the book Histoire et préhistoire de l'analyse des données

  9. • Publication of the 5 books of the collection "Pratique de l'Analyse des Données"
       – Tome 1 : Analyse des correspondances. Exposé élémentaire, 1980 (translated into English by Gopalan in 1992)
       – Tome 2 : Abrégé théorique. Etude de cas modèles, 1980
       – Tome 3 : Linguistique & lexicologie, 1981
       – Tome 4 : Médecine, pharmacologie, physiologie clinique, 1992
       – Tome 5 : Economie, 1986

  10. Colloquiums
      • Very friendly and productive
      • Held in numerous French universities starting in 1970:
        – Besançon
        – Marseille
        – Nice
        – Rennes
        – l'Arbresle near Lyon
        – etc.

  11. Coding
      I Usual coding
      • Doubling of a data table (marks, ranks, 0 and 1, etc.)
      • Complete disjunctive coding (0, 1)
      • Fuzzy coding:
        – barycentric coding of a quantitative variable into 3 or r modalities
        – coding that removes the subject's personal equation when the subjects give a certain number of marks (the coding differs from one individual to another)

  12. • Case of an exchange table $k_{IJ}$ with I = J (Leontief table, import-export table). Example:
        – k(i, j): total imports from country i to country j
        – Do the CA of the table $(k_{IJ}, k_{JI})$, the juxtaposition of the table $k_{IJ}$ and its transpose $k_{JI}$
        – This puts on a single row i all the exchanges of country i with the countries j (imports and exports).

  13.   – Yagolnitzer [CAD, 1977] suggested doing the CA of the following block table:
          $\begin{pmatrix} k_{IJ} & k_{JI} \\ k_{JI} & k_{IJ} \end{pmatrix}$
        – Yagolnitzer's analysis is equivalent to doing
          • the CA of the mean exchange table $(k_{IJ} + k_{JI})/2$, and
          • the factorial analysis of the flow table $(k_{IJ} - k_{JI})/2$ with the weightings (weights and metric) given by the CA of the table $(k_{IJ} + k_{JI})/2$.
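A small numerical check can make this decomposition concrete. In the sketch below, the exchange table `K` is made up and the helper `ca_eigenvalues` is a hypothetical SVD-based CA, both introduced here for illustration. It verifies that the eigenvalues of the CA of the block table are the union of the eigenvalues of the CA of the mean exchange table and those of the flow table analyzed with the same weights.

```python
import numpy as np

def ca_eigenvalues(table):
    """Eigenvalues of the CA of a two-way table (hypothetical helper, SVD-based)."""
    P = np.asarray(table, dtype=float)
    P = P / P.sum()
    r = P.sum(axis=1)
    c = P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    return np.linalg.svd(S, compute_uv=False) ** 2

# Made-up square exchange table between 4 countries (I = J).
K = np.array([[10.0, 4.0, 7.0, 2.0],
              [3.0, 12.0, 5.0, 6.0],
              [8.0, 1.0, 9.0, 4.0],
              [2.0, 7.0, 3.0, 11.0]])

# Block table suggested by Yagolnitzer.
T = np.block([[K, K.T],
              [K.T, K]])
lam_block = np.sort(ca_eigenvalues(T))[::-1]

# Symmetric part: ordinary CA of the mean exchange table.
mean_table = (K + K.T) / 2
lam_sym = ca_eigenvalues(mean_table)

# Antisymmetric part: flow table with the masses of the mean-table CA
# used both as weights and as metric.
N = K.sum()
w = mean_table.sum(axis=1) / N
flow = (K - K.T) / (2 * N)
lam_flow = np.linalg.svd(flow / np.sqrt(np.outer(w, w)), compute_uv=False) ** 2

lam_union = np.sort(np.concatenate([lam_sym, lam_flow]))[::-1]
print(np.allclose(lam_block, lam_union))
```

The eigenvalues of the flow part come in equal pairs, since the weighted flow matrix is antisymmetric.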

  14. • Other coding techniques:
        – use of supplementary (passive or illustrative) elements
          • to refine the interpretation that appears in the ternary table
          • in certain procedures such as discriminant analysis or scoring
          • etc.
        – etc.

  15. II Coding giving equivalence with other analyses
      II.1 Case of Principal Component Analysis (PCA)
      $X = \{ x_{ij} \mid 1 \le i \le n,\ 1 \le j \le p \}$ crosses a set I of n individuals with p variables.
      The PCA of the centered X is equivalent to the CA of the doubled table Y, with
      $Y = \{ [(A + x_{ij})/2,\ (A - x_{ij})/2] \mid 1 \le i \le n,\ 1 \le j \le p \}$
      where A is any positive real value.
      $\lambda_{CA} = \lambda_{PCA} / (p A^2)$
      The rows have the same representation on the factorial axes (with the factor $1/(A\sqrt{p})$ to pass from PCA to CA).

  16. The terms of Y can be negative, so the eigenvalues in the CA of Y can be greater than 1.
      $\bar{\lambda} \le A^2 \iff \sum \lambda_{CA} \le 1$
      where $\bar{\lambda}$ is the mean of the eigenvalues in the PCA of X.
      Particular case (B. Escofier, CAD, 1979): if the variables of X are also reduced (variances equal to 1) and A = 1, the PCA of X, i.e. the NPCA, is equivalent to the CA of Y (and here $A = \bar{\lambda} = 1$).
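The PCA/CA equivalence of slides 15 and 16 can be illustrated on simulated data. This is a sketch under stated assumptions: the data `X`, the value `A = 2` and the helper `ca_eigenvalues` are choices made here, and the PCA is taken as the eigen-analysis of the covariance matrix with divisor n.

```python
import numpy as np

def ca_eigenvalues(table):
    """Eigenvalues of the CA of a two-way table (hypothetical helper, SVD-based).

    Entries may be negative as long as all row and column sums are positive."""
    P = np.asarray(table, dtype=float)
    P = P / P.sum()
    r = P.sum(axis=1)
    c = P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    return np.linalg.svd(S, compute_uv=False) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))             # simulated table: 20 individuals, 3 variables
X = X - X.mean(axis=0)                   # centered X
n, p = X.shape
A = 2.0                                  # any positive real value

Y = np.hstack([(A + X) / 2, (A - X) / 2])  # doubled table, may contain negative terms
lam_ca = ca_eigenvalues(Y)

cov = X.T @ X / n                        # covariance matrix with divisor n
lam_pca = np.sort(np.linalg.eigvalsh(cov))[::-1]

print(np.allclose(lam_ca[:p], lam_pca / (p * A**2)))
print(np.isclose(lam_ca.sum(), lam_pca.mean() / A**2))
```

The second check shows why the total inertia of the CA is at most 1 exactly when the mean PCA eigenvalue is at most A squared.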

  17. II.2 Analysis with respect to a model (Escofier, RSA 1984)
      • Comparison of a frequency table $f_{IJ}$ with a reference table $m_{IJ}$:
        – Analyze the difference $f_{IJ} - m_{IJ}$ with the weightings given by the CA of $f_{IJ}$.
        – If $m_{IJ}$ has the same margins $f_I$ and $f_J$ as $f_{IJ}$, the preceding comparison is equivalent to the CA of $f_{IJ} - m_{IJ} + f_I f_J$ (where $f_I f_J$ is the table of the products $f_{i.} f_{.j}$ of the margins).
      • Particular cases:
        – Intraclass analysis: I (or J) is partitioned into subsets, $I = \bigcup \{ I_k \mid k = 1, \dots, r \}$
        – Double intraclass analysis or internal analysis: I and J are partitioned into subsets
        – Generalizations when partitions are replaced by graphs (Benali - Escofier, RSA, 1990; Cazes - Moreau, 1991)
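The equivalence for a model with the same margins can be sketched as follows. The frequency table and the reference model `m` (a half-and-half mixture of the data and the independence table, chosen only because it has the right margins) are assumptions made for this illustration, as is the helper `ca_eigenvalues`.

```python
import numpy as np

def ca_eigenvalues(table):
    """Eigenvalues of the CA of a two-way table (hypothetical helper, SVD-based)."""
    P = np.asarray(table, dtype=float)
    P = P / P.sum()
    r = P.sum(axis=1)
    c = P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    return np.linalg.svd(S, compute_uv=False) ** 2

# Made-up contingency table turned into a frequency table f.
counts = np.array([[20.0, 5.0, 10.0],
                   [6.0, 18.0, 4.0],
                   [9.0, 7.0, 21.0]])
f = counts / counts.sum()
f_I = f.sum(axis=1)
f_J = f.sum(axis=0)
indep = np.outer(f_I, f_J)               # table of the products of the margins

# Hypothetical reference model with the same margins as f.
m = 0.5 * f + 0.5 * indep

# Analysis of f - m with the weights and metric of the CA of f.
lam_diff = np.linalg.svd((f - m) / np.sqrt(indep), compute_uv=False) ** 2

# Ordinary CA of the table f - m + f_I f_J.
lam_model = ca_eigenvalues(f - m + indep)

print(np.allclose(lam_diff, lam_model))
```

With this particular model the difference f - m is half of the deviation from independence, so the eigenvalues are a quarter of those of the CA of f.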

  18. Correspondence analysis as a particular case of other methods
      • $k_{IJ}$: contingency table crossing 2 qualitative variables X and Y.
        – I and J: sets of modalities of X and Y respectively.
      • The CA of $k_{IJ}$ is a double factorial analysis:
        – factorial analysis of the row profiles and of the column profiles of $k_{IJ}$
      • CA is the canonical analysis of the two subspaces $W_X$ and $W_Y$ spanned by the indicator variables of the modalities of X and Y respectively.
        – Indeed, this point of view corresponds to the search for the optimal centered and reduced codings of X and Y (in fact, the factors).
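The link with canonical analysis can be checked on a toy example: the squared canonical correlations between the two sets of indicator variables should equal the eigenvalues of the CA of the contingency table. The labels `x` and `y` and both helper functions are assumptions introduced for this sketch. One indicator column per variable is dropped before centering, since the centered indicators of a variable sum to zero.

```python
import numpy as np

def ca_eigenvalues(table):
    """Eigenvalues of the CA of a two-way table (hypothetical helper, SVD-based)."""
    P = np.asarray(table, dtype=float)
    P = P / P.sum()
    r = P.sum(axis=1)
    c = P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    return np.linalg.svd(S, compute_uv=False) ** 2

def canonical_correlations(Zx, Zy):
    """Canonical correlations between the centered indicator subspaces."""
    A = Zx[:, :-1] - Zx[:, :-1].mean(axis=0)   # drop one column, then center
    B = Zy[:, :-1] - Zy[:, :-1].mean(axis=0)
    Qa, _ = np.linalg.qr(A)                    # orthonormal basis of each subspace
    Qb, _ = np.linalg.qr(B)
    return np.linalg.svd(Qa.T @ Qb, compute_uv=False)

# Made-up observations of two qualitative variables with 3 modalities each.
x = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 0, 1, 2])
y = np.array([0, 1, 0, 1, 1, 0, 2, 2, 1, 2, 0, 2])
Zx = np.eye(3)[x]                              # indicator (dummy) variables of X
Zy = np.eye(3)[y]                              # indicator (dummy) variables of Y

k_IJ = Zx.T @ Zy                               # contingency table crossing X and Y
lam = ca_eigenvalues(k_IJ)
rho = canonical_correlations(Zx, Zy)

print(np.allclose(rho**2, lam[:len(rho)]))
```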

  19. • CA, as underlined by L. Lebart, can be considered as a double discriminant analysis:
        – In the first analysis, the variable to be explained is the qualitative variable Y and the explanatory variables are the indicator variables of X.
        – In the second analysis, the roles of X and Y are exchanged.
      • CA also corresponds to the interbattery analysis of Tucker (1958) of the tables $T_X$ and $T_Y$ associated with the indicator variables of X and Y respectively, with the diagonal weight metrics given by the margins of $k_{IJ}$ (or the column margins of $T_X$ and $T_Y$).
      • Multiple correspondence analysis or MCA (analysis of the complete disjunctive table associated with q qualitative variables $X_1, \dots, X_q$) is a particular case of the Generalized Canonical Analysis of Carroll, where the associated subspaces are spanned by the indicator variables of $X_1, \dots, X_q$ respectively.

  20. • MCA is equivalent to the Multiple Factor Analysis (MFA, Escofier – Pagès, 1998) of the complete disjunctive table, each sub-table corresponding to the modalities of one of the variables $X_k$ ($1 \le k \le q$). Indeed, the CA of each sub-table has all its eigenvalues equal to 1, so weighting each sub-table by the inverse square root of its largest eigenvalue (here 1) does not change anything.
      • The CA of a sub-table of the Burt table crossing two subsets of questions can be viewed in many different ways, such as multiple co-inertia analysis (Chessel, 1993).
      • Etc.
      It is this ability of CA to be a particular case of numerous methods that gives it its great importance, from both a theoretical and a practical point of view.
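The claim that the CA of each sub-table of the complete disjunctive table has all its eigenvalues equal to 1 is easy to verify. The sketch below uses a made-up qualitative variable with 4 modalities and a hypothetical SVD-based helper `ca_eigenvalues`; both are assumptions for illustration only.

```python
import numpy as np

def ca_eigenvalues(table):
    """Eigenvalues of the CA of a two-way table (hypothetical helper, SVD-based)."""
    P = np.asarray(table, dtype=float)
    P = P / P.sum()
    r = P.sum(axis=1)
    c = P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    return np.linalg.svd(S, compute_uv=False) ** 2

# Made-up qualitative variable observed on 10 individuals, 4 modalities.
labels = np.array([0, 2, 1, 1, 0, 2, 2, 1, 0, 3])
n_modalities = 4
Z = np.eye(n_modalities)[labels]         # sub-table: complete disjunctive coding

lam = ca_eigenvalues(Z)
print(lam)  # the first (number of modalities - 1) values should be 1, the last one 0

# Weighting the sub-table by 1 / sqrt(largest eigenvalue) therefore leaves it unchanged.
Z_weighted = Z / np.sqrt(lam[0])
print(np.allclose(Z_weighted, Z))
```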

  21. Correspondence analysis and modelling techniques
      • I The reconstitution formula considered as a modelling technique
      The exact data reconstitution formula of a frequency table $f_{IJ}$ from the margins $f_I$ and $f_J$ and the factors $\varphi^I_\alpha$, $\psi^J_\alpha$ associated with the t non-null eigenvalues $\lambda_\alpha$ coming from the CA is given by
      $f_{ij} = f_{i.} f_{.j} \left( 1 + \sum_{\alpha=1}^{t} \lambda_\alpha^{1/2} \varphi_{i\alpha} \psi_{j\alpha} \right)$
      If we keep the first r factors (r < t), we have the least squares approximation $f^*_{IJ}$ of order r of $f_{IJ}$:
      $f^*_{ij} = f_{i.} f_{.j} \left( 1 + \sum_{\alpha=1}^{r} \lambda_\alpha^{1/2} \varphi_{i\alpha} \psi_{j\alpha} \right)$
      $f_{IJ} = f^*_{IJ} + e_{IJ}$, with $\| e_{IJ} \|^2 = \sum \{ e_{ij}^2 / (f_{i.} f_{.j}) \mid i \in I,\ j \in J \}$ minimal.
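The reconstitution formula can be tried out on a small table. The sketch below assumes that the factors are normalized to unit variance with respect to the margins, which is the convention under which the formula carries the factor $\lambda_\alpha^{1/2}$. The table `K` and the functions `ca_factors` and `reconstitute` are hypothetical names introduced here.

```python
import numpy as np

def ca_factors(table):
    """Margins, eigenvalues and unit-variance factors of a CA (hypothetical helper)."""
    f = np.asarray(table, dtype=float)
    f = f / f.sum()
    f_i = f.sum(axis=1)
    f_j = f.sum(axis=0)
    S = (f - np.outer(f_i, f_j)) / np.sqrt(np.outer(f_i, f_j))
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    phi = U / np.sqrt(f_i)[:, None]      # row factors, variance 1 for the weights f_i
    psi = Vt.T / np.sqrt(f_j)[:, None]   # column factors, variance 1 for the weights f_j
    return f, f_i, f_j, s**2, phi, psi

def reconstitute(f_i, f_j, lam, phi, psi, r):
    """Order-r reconstitution f_i f_j (1 + sum over the first r factors)."""
    inner = (phi[:, :r] * np.sqrt(lam[:r])) @ psi[:, :r].T
    return np.outer(f_i, f_j) * (1 + inner)

# Made-up 3 x 4 contingency table: at most t = 2 non-null eigenvalues.
K = np.array([[20.0, 5.0, 10.0, 3.0],
              [6.0, 18.0, 4.0, 9.0],
              [9.0, 7.0, 21.0, 5.0]])
f, f_i, f_j, lam, phi, psi = ca_factors(K)
t = min(K.shape) - 1

f_exact = reconstitute(f_i, f_j, lam, phi, psi, t)   # all factors: exact formula
f_star = reconstitute(f_i, f_j, lam, phi, psi, 1)    # approximation of order r = 1

e = f - f_star
err = (e**2 / np.outer(f_i, f_j)).sum()              # weighted squared norm of the residual
print(np.allclose(f_exact, f), np.isclose(err, lam[1:].sum()))
```

The residual norm equals the sum of the discarded eigenvalues, which is the sense in which the order-r reconstitution is a least squares approximation.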
