LOGISTIC BIPLOTS FOR BINARY, NOMINAL AND ORDINAL DATA José Luis Vicente Villardón Universidad de Salamanca. Spain. villardon@usal.es http://biplot.usal.es
SUMMARY Classical Biplot methods allow for the simultaneous representation of individuals and continuous variables in a given data matrix. When variables are binary, nominal or ordinal, a classical linear biplot representation is not suitable. We propose a linear biplot representation based on logistic response models. The coordinates of individuals and variables are computed to have logistic responses along the biplot dimensions. The method is related to logistic regression in the same way that Classical Biplot Analysis (CBA) is related to linear regression; thus we refer to the method as Logistic Biplot (LB). In the same way as Linear Biplots are related to Principal Components Analysis, Logistic Biplots are related to Latent Trait Analysis or Item Response Theory. The geometry of these kinds of biplots is studied: for nominal data, the linear biplot results in a partition of the representation that divides the space into a prediction region for each category; for ordinal data, we obtain a prediction direction with points separating each category. The usefulness of the proposal is illustrated using data on SNPs (Single Nucleotide Polymorphisms) from the HAPMAP project.
1.- INTRODUCTION
2.- CLASSICAL BIPLOT
2.1 Linear biplot based on alternating regressions/interpolations
2.2 Geometry of linear regression biplots
3.- LOGISTIC BIPLOT FOR BINARY DATA
3.1.- Formulation
3.2.- Parameter Estimation
3.3.- Geometry of logistic biplots
4.- LOGISTIC BIPLOT FOR CATEGORICAL & ORDINAL DATA
4.1.- Formulation
4.2.- Parameter Estimation
4.3.- Geometry of logistic biplots
5.- APPLICATION: HAPMAP
1.- INTRODUCTION
CONTINUOUS DATA
- Biplot (GABRIEL, 1971)
- Alternating regressions (GABRIEL & ZAMIR, 1979)
- Regression/calibration - Interpolation (GOWER & HAND, 1996)
CATEGORICAL DATA
LOGISTIC BIPLOT
- Multiple Correspondence Analysis (MCA)
- Prediction Regions (GOWER & HAND, 1996)
- Generalised bilinear models (FALGUEROLLES et al., 1995)
- Segmented Bilinear Models (GABRIEL, 1999)
- Item Response Theory (BAKER, 1992)
2.- CLASSICAL BIPLOTS
2.2 Geometry of linear regression biplots
• Let L be the space spanned by the columns of A, usually two-dimensional. We complete L with a third dimension for the j-th variable and fit the regression plane.
• A linear response surface in the three-dimensional space is obtained. Let us call it H.
• The set of points in H predicting a fixed value is given by the intersection between H and the plane normal to the third axis passing through the fixed value. That intersection is a straight line.
• The points in L predicting different values of the variable are also on parallel straight lines.
• The biplot axis can be completed with scales.
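The geometry above can be sketched numerically. The following is a minimal illustration, assuming a fitted regression of the form value = b0 + b1*a1 + b2*a2 for one variable; the function names and the coefficients are hypothetical and are not taken from the slides. It places scale marks on the biplot axis and checks that every point on the line perpendicular to the axis through a mark predicts the same value.

```python
def linear_axis_marker(b0, b1, b2, value):
    """Point on the biplot axis (direction (b1, b2)) that predicts `value`."""
    norm2 = b1 * b1 + b2 * b2      # squared length of the direction vector
    t = (value - b0) / norm2       # signed position along the direction
    return (t * b1, t * b2)


def linear_prediction(b0, b1, b2, point):
    """Predicted value for an individual located at `point` = (a1, a2) in L."""
    a1, a2 = point
    return b0 + b1 * a1 + b2 * a2


# Hypothetical regression coefficients for a single variable (assumption).
b0, b1, b2 = 1.0, 3.0, 4.0

# Scale marks for three equally spaced values of the variable.
markers = [linear_axis_marker(b0, b1, b2, v) for v in (1.0, 26.0, 51.0)]

# A point displaced from the second mark along the direction (4, -3),
# which is perpendicular to (3, 4), predicts the same value as the mark.
same_line_point = (markers[1][0] + 4.0, markers[1][1] - 3.0)
```

In the linear case, equally spaced values give equally spaced marks along the axis, which is what allows the axis to carry an ordinary graded scale.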
The squared R for the regressions is interpreted as a measure of the “quality of the representation” in the sense commonly used in Correspondence Analysis.
ALGORITHM
Interpolation
• If we fix the markers in A and fit a logistic model for a two-dimensional representation, we obtain a logistic response surface H. In this case the third axis shows a scale for the expected probabilities.
• Although the response surface is nonlinear, the intersections of the planes normal to the probability axis and H are also straight lines on H.
• The points in L predicting different probabilities are also on parallel straight lines.
• The direction of b_j is given by (b_j1, b_j2), the parameter estimates of the logistic biplot.
• The biplot axis b_j is completed with marks for projection points predicting probabilities; the main difference with the linear biplot is that equally spaced marks do not correspond with equally spaced probabilities.
• To simplify the graphical representation we propose to add marks for fixed values of the predictions, for example, .25, .50 and .75. This will look like a symmetrical box-plot and no labels are necessary.
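The placement of the probability marks can be sketched as follows. This is a minimal illustration, assuming a binary logistic model of the form logit(p) = b0 + b1*a1 + b2*a2 with an intercept b0; the function names and coefficient values are hypothetical. The mark for a probability p is the point on the direction (b1, b2) whose linear predictor equals logit(p).

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def logistic_axis_marker(b0, b1, b2, prob):
    """Point on the biplot axis (direction (b1, b2)) that predicts `prob`."""
    norm2 = b1 * b1 + b2 * b2
    t = (logit(prob) - b0) / norm2
    return (t * b1, t * b2)


def logistic_prediction(b0, b1, b2, point):
    """Expected probability for an individual located at `point` in L."""
    a1, a2 = point
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * a1 + b2 * a2)))


# Hypothetical parameter estimates for one binary variable (assumption).
b0, b1, b2 = 0.5, 1.0, 2.0

# Marks for the three probabilities suggested in the text.
marks = {p: logistic_axis_marker(b0, b1, b2, p) for p in (0.25, 0.50, 0.75)}
```

Because logit(.25) = -logit(.75), the .25 and .75 marks sit at equal distances on either side of the .50 mark, which is why the three marks resemble a symmetrical box-plot.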
1. LOGISTIC BIPLOT FOR NOMINAL DATA
Let X (I × J) be a data matrix containing the values of J categorical variables, each with K_j (j = 1, …, J) categories, for I individuals, and let G (I × L) be the corresponding indicator matrix. The last category of each variable will be used as a baseline. Let p_{i(jk)} be the expected probability that category k of variable j is present for individual i. In the multinomial logistic latent trait model we assume that the log-odds of each response (relative to the last category) follow a linear model

$$\log\frac{p_{i(jk)}}{p_{i(jK_j)}} = b_{(jk)0} + \sum_{s=1}^{S} a_{is}\, b_{(jk)s}$$

where a_{is} and b_{(jk)s} (i = 1, …, I; j = 1, …, J; k = 1, …, K_j − 1; s = 1, …, S) are the model parameters. In matrix form

$$\log(\mathbf{O}) = \mathbf{1}\,\mathbf{b}_0' + \mathbf{A}\mathbf{B}'$$

where O (I × L) is the matrix containing the expected odds. This defines a biplot for the odds. Although the biplot for the odds may be useful, it would be more interpretable in terms of predicted probabilities and categories.
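To move from log-odds to probabilities, the baseline-category model can be inverted. The sketch below is only illustrative: it assumes a linear predictor with an intercept for each non-baseline category and a log-odds of zero for the baseline (last) category, and the function name and parameter values are hypothetical.

```python
import math


def multinomial_probabilities(a, intercepts, slopes):
    """Category probabilities for one variable at individual coordinates `a`.

    a          : S coordinates of the individual
    intercepts : K-1 constants, one per non-baseline category
    slopes     : K-1 lists of S slopes, one list per non-baseline category
    Returns K probabilities; the baseline category is the last entry.
    """
    log_odds = [
        b0 + sum(b * x for b, x in zip(row, a))
        for b0, row in zip(intercepts, slopes)
    ]
    odds = [math.exp(v) for v in log_odds] + [1.0]   # baseline odds are 1
    total = sum(odds)
    return [o / total for o in odds]


# Hypothetical parameters for a three-category variable, S = 2 (assumption).
a_i = (0.5, -1.0)
intercepts = [0.2, -0.3]
slopes = [[1.0, 0.5], [-0.5, 2.0]]

probs = multinomial_probabilities(a_i, intercepts, slopes)
```

The ratio of each probability to the baseline probability recovers the odds of the model, so the biplot for the odds and the probabilities carry the same information on different scales.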
The points in L predicting different probabilities are no longer on parallel straight lines (see the figure with the response surfaces). Predictions on the logistic biplot are therefore not made in the same way as in linear biplots: the surfaces now define a prediction region for each category, as shown in the contour graph.
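A prediction region can be illustrated by assigning each point of the plane to its most probable category. The sketch below assumes the baseline-category model with the baseline log-odds fixed at zero; the parameters, function names and grid are hypothetical. Since the category with the largest probability is also the one with the largest log-odds, the probabilities themselves need not be computed.

```python
def predicted_category(point, intercepts, slopes):
    """Index of the most probable category at `point` (baseline is last)."""
    scores = [
        b0 + sum(b * x for b, x in zip(row, point))
        for b0, row in zip(intercepts, slopes)
    ]
    scores.append(0.0)                      # log-odds of the baseline category
    return max(range(len(scores)), key=scores.__getitem__)


def region_map(intercepts, slopes, low=-2, high=2):
    """Coarse text map of the prediction regions on an integer grid."""
    rows = []
    for y in range(high, low - 1, -1):      # top row has the largest y
        row = [
            str(predicted_category((x, y), intercepts, slopes))
            for x in range(low, high + 1)
        ]
        rows.append("".join(row))
    return rows


# Hypothetical parameters for a three-category variable (assumption).
intercepts = [0.0, 0.0]
slopes = [[2.0, 0.0], [0.0, 2.0]]

for line in region_map(intercepts, slopes):
    print(line)
```

Under this model the boundary between any two categories is the set of points where their log-odds coincide, which is a straight line, so each region is bounded by line segments.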
3. LOGISTIC BIPLOT FOR ORDINAL DATA
Let X (I × J) be a data matrix containing the values of J ordinal variables, each with K_j (j = 1, …, J) categories, for I individuals, and let G (I × L) be the cumulative indicator matrix. The last category of each variable will be used as a baseline. Let p_{i(j≤k)} = P(x_ij ≤ k). The ordinal logistic latent trait model for the cumulative probabilities is

$$\log\frac{p_{i(j \le k)}}{1 - p_{i(j \le k)}} = d_{jk} + \sum_{s=1}^{S} a_{is}\, b_{js}, \qquad k = 1, \dots, K_j - 1$$

where d_{jk} is the constant for category k of variable j and b_{js} are the slopes. The equations define a biplot in the logit scale that shares the geometry of the binary case for each category. Observe that each category has a different constant but the same slopes; this means that the prediction direction is common to all categories and only the prediction markers differ. The parameters b define the direction of the projection; the representation subspace can be divided into prediction regions, one for each category, delimited by parallel straight lines.
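The step from cumulative to category probabilities can be sketched as below. This is a minimal illustration, assuming a cumulative logit model in which the logit of P(x ≤ k) is a category constant plus a linear term shared by all categories; the sign convention, function names and parameter values are assumptions and are not taken from the slides.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def ordinal_probabilities(a, constants, slopes):
    """Cumulative and category probabilities for one ordinal variable.

    a         : S coordinates of the individual
    constants : K-1 increasing constants, one per cumulative logit
    slopes    : S slopes shared by all categories
    """
    linear = sum(b * x for b, x in zip(slopes, a))
    cumulative = [sigmoid(d + linear) for d in constants] + [1.0]
    probs = [cumulative[0]] + [
        cumulative[k] - cumulative[k - 1] for k in range(1, len(cumulative))
    ]
    return cumulative, probs


def most_probable(probs):
    """Index of the category with the largest expected probability."""
    return max(range(len(probs)), key=probs.__getitem__)


# Hypothetical parameters for a three-category variable, S = 2 (assumption).
constants = [-1.0, 1.0]
slopes = [1.0, 0.0]

cum_origin, probs_origin = ordinal_probabilities((0.0, 0.0), constants, slopes)
```

Because the linear term depends on the individual only through its projection onto the slope vector, all points on a line perpendicular to that vector share the same category probabilities, which is consistent with regions delimited by parallel straight lines.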
Ordinal data: Cumulative probabilities
Ordinal Data: Expected probabilities
Ordinal Data: Back Projection onto the biplot
4. PARAMETER ESTIMATION
- Alternating generalized regressions and interpolations (Maximum Likelihood).
- Marginal Maximum Likelihood (as in Item Response Theory).
- Separation problem (Maximum Likelihood does not converge): Penalized Maximum Likelihood.
- Iterative Majorization.
- Other methods.
- Heuristic approach for big data matrices: External Logistic Biplots (logistic fits on the Principal Coordinates).
5. APPLICATIONS
- HAPMAP data.
- Sugar cane germplasm.
- Innovation profiles in Portugal.
- Irregular workforce in Spain.
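The separation problem listed under parameter estimation can be illustrated with a toy example. The sketch below assumes a ridge (quadratic) penalty on the slope and plain gradient ascent; the slides do not say which penalty or optimizer is used, so both are assumptions chosen for brevity, and the data and function name are hypothetical. With perfectly separated data the unpenalized likelihood has no finite maximum, whereas the penalized fit stays finite.

```python
import math


def fit_ridge_logistic(xs, ys, penalty, step=0.1, iterations=5000):
    """Gradient ascent on the log-likelihood minus (penalty / 2) * slope**2.

    The ridge penalty is an illustrative assumption, not necessarily the
    penalty used by the author.
    """
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += y - p                # gradient with respect to the intercept
            g1 += (y - p) * x          # gradient with respect to the slope
        g1 -= penalty * b1             # contribution of the ridge penalty
        b0 += step * g0
        b1 += step * g1
    return b0, b1


# Perfectly separated toy data: every negative x has y = 0, every positive x has y = 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

weak_b0, weak_b1 = fit_ridge_logistic(xs, ys, penalty=0.1)
strong_b0, strong_b1 = fit_ridge_logistic(xs, ys, penalty=1.0)
```

A stronger penalty shrinks the slope further towards zero; in a biplot this keeps the variable markers at a finite distance even when a variable is perfectly predicted by the configuration.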
APPLICATION TO HAPMAP DATA
[Figure panels: R² Selection; Whole Set; Bonferroni Selection]
Interpretation of the biplot
Characterization of the groups
MULTBIPLOT (MULTivariate analysis using BIPLOTs) http://biplot.usal.es/ClassicalBiplot/index.html