
Multiblock Method for Categorical Variables: Application to the study of antibiotic resistance



  1. 1. Position of the problem 2. Methods 3. Case study 4. Conclusions & perspectives
     Multiblock Method for Categorical Variables: Application to the study of antibiotic resistance
     S. Bougeard (1), E.M. Qannari (2) & C. Chauvin (1)
     (1) French agency for food, environmental and occupational health safety (Anses), Department of Epidemiology, Ploufragan, France
     (2) Nantes-Atlantic National College of Veterinary Medicine, Food Science and Engineering (Oniris), Department of Chemometrics and Sensometrics, Nantes, France
     19th International Conference on Computational Statistics, Paris, August 22-27, 2010
     1 / 16

  2. Table of contents
     1. Position of the problem
     2. Methods
        - Categorical multiblock Redundancy Analysis (Cat-mbRA)
        - Alternative methods
     3. Case study
        - Study of antibiotic resistance
        - Relationships between variables
        - Risk factors for antibiotic resistance
        - Method comparison
     4. Conclusions & perspectives

  3. Statistical issues for epidemiological surveys
     1. Advantages & limits of usual procedures
        - Generalized linear models: well-adapted for categorical variables; limited number of explanatory variables; constraints when y consists of more than 2 categories.
        - Decision trees, Random Forest: small misclassification errors; variables sorted in order of magnitude; no regression coefficients.
        - Boosting, bagging, SVM: small misclassification errors; no link with explanatory variables.
     2. Expectations
        - Global optimization criterion with eigensolution.
        - Assessment of the risk factors.
        - Factorial representation of data.
     → Multiblock modelling extended to categorical data.



  7. Categorical multiblock Redundancy Analysis
     $P_{X_k}$ is the projector onto the subspace spanned by the dummy variables associated with $x_k$.
     Criterion to maximize:
        $\sum_k \mathrm{cov}^2\big(u^{(1)}, t_k^{(1)}\big)$, with $\|t_k^{(1)}\| = \|v^{(1)}\| = 1$
        $\sum_k \|P_{X_k} u^{(1)}\|^2 = v^{(1)\prime}\, \tilde{Y}' \sum_k P_{X_k}\, \tilde{Y} v^{(1)}$, with $\|v^{(1)}\| = 1$
     First order solution:
        $v^{(1)}$ is the eigenvector of $\sum_k \tilde{Y}' P_{X_k} \tilde{Y}$ associated with the largest eigenvalue $\lambda^{(1)} = \sum_k \|P_{X_k} u^{(1)}\|^2$.
     The latent variables represent the categorical variable coding:
        $t_k^{(1)} = X_k w_k^{(1)}$, $u^{(1)} = \tilde{Y} v^{(1)}$
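The first order solution on this slide can be illustrated numerically. The following is a minimal NumPy sketch, not the authors' implementation. It assumes that $\tilde{Y}$ and each $X_k$ are column-centered indicator (dummy) matrices, which the slides do not spell out; the helper names `dummy_matrix`, `projector` and `first_order_solution`, and the simulated data, are hypothetical.

```python
import numpy as np

def dummy_matrix(labels):
    """Column-centered indicator coding of one categorical variable (assumed coding)."""
    labels = np.asarray(labels)
    categories = np.unique(labels)
    Z = (labels[:, None] == categories[None, :]).astype(float)
    return Z - Z.mean(axis=0)

def projector(X, tol=1e-10):
    """Orthogonal projector onto the column space of X, built from its SVD."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    U = U[:, s > tol * s.max()]          # drop numerically null directions
    return U @ U.T

def first_order_solution(Y_tilde, X_blocks):
    """Return (v1, u1, lambda1) maximizing sum_k ||P_{X_k} Y_tilde v||^2 with ||v|| = 1."""
    P_sum = sum(projector(X) for X in X_blocks)
    M = Y_tilde.T @ P_sum @ Y_tilde      # symmetric, positive semi-definite
    eigvals, eigvecs = np.linalg.eigh(M) # eigenvalues in ascending order
    v1 = eigvecs[:, -1]                  # eigenvector of the largest eigenvalue
    u1 = Y_tilde @ v1
    return v1, u1, eigvals[-1]

# Simulated (hypothetical) data: one response y and K = 2 explanatory variables.
rng = np.random.default_rng(0)
n = 60
y = rng.integers(0, 3, size=n)
x1 = (y + rng.integers(0, 2, size=n)) % 3   # loosely related to y
x2 = rng.integers(0, 4, size=n)             # unrelated to y
Y_tilde = dummy_matrix(y)
X_blocks = [dummy_matrix(x1), dummy_matrix(x2)]
v1, u1, lam1 = first_order_solution(Y_tilde, X_blocks)
```

Because each projector is symmetric and idempotent, `lam1` should coincide with the sum of the squared norms of the projections of `u1`, which is the identity stated for $\lambda^{(1)}$.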


  12. Categorical multiblock Redundancy Analysis (Cat-mbRA)
     $P_{X_k}$ is the projector onto the subspace spanned by the dummy variables associated with $x_k$.
     Partial components $(t_1, \dots, t_K)$: projection of $u^{(1)}$ onto each subspace spanned by $X_k$
        $t_k^{(1)} = \dfrac{P_{X_k} u^{(1)}}{\|P_{X_k} u^{(1)}\|}$
     Synthesis with a global component $t$: $t^{(1)}$ sums up all the partial codings
        $t^{(1)} = \sum_k a_k^{(1)} t_k^{(1)}$, with $\sum_k a_k^{(1)2} = 1$
        $t^{(1)} = \sum_k \dfrac{\|P_{X_k} u^{(1)}\|}{\sqrt{\sum_l \|P_{X_l} u^{(1)}\|^2}}\, t_k^{(1)} = \dfrac{\sum_k P_{X_k} u^{(1)}}{\sqrt{\sum_l \|P_{X_l} u^{(1)}\|^2}}$
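The partial and global components can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, the blocks are simulated centered indicator matrices, and `u` is a random stand-in for the latent variable $u^{(1)}$ rather than the eigen-solution.

```python
import numpy as np

def projector(X, tol=1e-10):
    """Orthogonal projector onto the column space of X, built from its SVD."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    U = U[:, s > tol * s.max()]
    return U @ U.T

def partial_and_global(u, X_blocks):
    """Partial components t_k, weights a_k and global component t for a given u."""
    projections = [projector(X) @ u for X in X_blocks]
    norms = np.array([np.linalg.norm(p) for p in projections])
    # Assumes every projection is non-zero, otherwise t_k is undefined.
    t_partial = [p / nk for p, nk in zip(projections, norms)]
    a = norms / np.sqrt(np.sum(norms ** 2))       # weights with sum of squares 1
    t_global = sum(ak * tk for ak, tk in zip(a, t_partial))
    return t_partial, a, t_global

# Simulated (hypothetical) data: K = 2 categorical variables, dummy-coded and centered.
rng = np.random.default_rng(1)
n = 40
X_blocks = []
for n_cat in (3, 4):
    labels = rng.integers(0, n_cat, size=n)
    Z = (labels[:, None] == np.arange(n_cat)[None, :]).astype(float)
    X_blocks.append(Z - Z.mean(axis=0))
u = rng.standard_normal(n)
u -= u.mean()                                     # stand-in for u^(1)

t_partial, a, t_global = partial_and_global(u, X_blocks)

# Closed form from the slide: sum of projections over the root of the summed squared norms.
proj_sum = sum(projector(X) @ u for X in X_blocks)
denom = np.sqrt(sum(np.linalg.norm(projector(X) @ u) ** 2 for X in X_blocks))
t_direct = proj_sum / denom
```

Both routes give the same vector: the weights $a_k^{(1)}$ cancel the normalization of each partial component, which is why the global component reduces to the rescaled sum of projections.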


  14. Higher order solutions and optimal Cat-mbRA model
     Higher order solutions
        Aim: orthogonalised regressions which take into account all the explanatory variables, i.e. orthogonal components $(t^{(1)}, \dots, t^{(H)})$.
        → Consider the residuals of the orthogonal projections of $(X_1, \dots, X_K)$ onto the subspaces spanned by $t^{(1)}$, $(t^{(1)}, t^{(2)})$, ...
     Selection of the optimal model
        Additional information: confusion matrix, ROC (Receiver Operating Characteristic) curve.
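The deflation step for higher order solutions can be sketched as follows. This is one possible reading of the slide, under stated assumptions: after each global component is extracted, every block $X_k$ is replaced by its residual on that component and the first-order problem is solved again. The slides do not say whether $\tilde{Y}$ is also deflated, so it is left unchanged here. All function names and the simulated data are hypothetical.

```python
import numpy as np

def dummy_matrix(labels):
    """Column-centered indicator coding of one categorical variable (assumed coding)."""
    labels = np.asarray(labels)
    Z = (labels[:, None] == np.unique(labels)[None, :]).astype(float)
    return Z - Z.mean(axis=0)

def projector(X, tol=1e-10):
    """Orthogonal projector onto the column space of X, built from its SVD."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    U = U[:, s > tol * s.max()]
    return U @ U.T

def global_components(Y_tilde, X_blocks, n_components):
    """Extract successive global components, deflating the X blocks each time."""
    blocks = [X.copy() for X in X_blocks]
    T = []
    for _ in range(n_components):
        P_sum = sum(projector(X) for X in blocks)
        eigvals, eigvecs = np.linalg.eigh(Y_tilde.T @ P_sum @ Y_tilde)
        u = Y_tilde @ eigvecs[:, -1]
        # Global component; assumes the largest eigenvalue is strictly positive.
        t = P_sum @ u / np.sqrt(eigvals[-1])
        T.append(t)
        # Residuals of the projection of each block onto t.
        blocks = [X - np.outer(t, t @ X) / (t @ t) for X in blocks]
    return np.column_stack(T)

# Simulated (hypothetical) data.
rng = np.random.default_rng(2)
n = 50
Y_tilde = dummy_matrix(rng.integers(0, 3, size=n))
X_blocks = [dummy_matrix(rng.integers(0, c, size=n)) for c in (3, 4)]
T = global_components(Y_tilde, X_blocks, n_components=2)
```

Since the deflated blocks are orthogonal to every component already extracted, each new component lies in a subspace orthogonal to the previous ones, which gives the orthogonality of $(t^{(1)}, \dots, t^{(H)})$ that the slide asks for.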
