Analysis of variance and regression November 27, 2007
Other types of regression models
• Counts (Poisson models)
• Ordinal data
  – proportional odds models
  – model control
  – model interpretation
• Survival analysis
Lene Theil Skovgaard, Dept. of Biostatistics, Institute of Public Health, University of Copenhagen e-mail: L.T.Skovgaard@biostat.ku.dk http://staff.pubhealth.ku.dk/~lts/regression07_2
Other types of regression, November 2007

Until now, we have been looking at
• regression for normally distributed data, where parameters describe
  – differences between groups
  – effect of a one unit increase in an explanatory variable
• regression for binary data, logistic regression, where parameters describe
  – odds ratios for a one unit increase in an explanatory variable
What about something 'in between'?
• counts (Poisson distribution)
  – number of cancer cases in each municipality per year
  – number of positive pneumococcus swabs
• categorical variable with more than 2 categories, e.g.
  – degree of pain (none/mild/moderate/serious)
  – degree of liver fibrosis
• non-normal quantitative measurements
  – censored data, survival analysis
Generalised linear models:
Multiple regression models, on a scale suitable for the data:
Mean: $\mu$
Link function: $g(\mu)$, linear in covariates, i.e.
$$g(\mu) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$$
An important class of distributions for these models: exponential families, including
• Normal distribution (link = identity): the general linear model
• Binomial distribution (link = logit): logistic regression
• Poisson distribution (link = log)
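The three link functions listed above can be illustrated with a small Python sketch (Python is used here only for illustration; the slides themselves use SAS). The helper name `mean_from_covariates` and the coefficient values are hypothetical, chosen only to show how the linear predictor is mapped back to the mean.

```python
import math

# Each entry holds (link g, inverse link g^{-1}).
links = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "logit": (lambda mu: math.log(mu / (1 - mu)),
              lambda eta: 1 / (1 + math.exp(-eta))),
    "log": (math.log, math.exp),
}

def mean_from_covariates(link_name, betas, xs):
    """Return mu = g^{-1}(beta_0 + beta_1*x_1 + ... + beta_k*x_k).

    betas[0] is the intercept; betas[1:] pair up with xs.
    """
    eta = betas[0] + sum(b * x for b, x in zip(betas[1:], xs))
    _, inverse_link = links[link_name]
    return inverse_link(eta)

# Hypothetical coefficients, for illustration only.
betas = [0.5, 0.2]
xs = [3.0]
for name in links:
    print(name, mean_from_covariates(name, betas, xs))
```

The same linear predictor gives a mean on the whole real line (identity), in (0, 1) (logit) or on the positive half-line (log), which is why each link suits its distribution.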
Poisson distribution:
• distribution on the numbers 0, 1, 2, 3, ...
• limit of the Binomial distribution for N large, p small, mean: $\mu = Np$
  – e.g. cancer events in a certain region
• probability of k events:
$$P(Y = k) = e^{-\mu} \frac{\mu^k}{k!}$$
Example: positive swabs for 90 individuals from 18 families
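The statement that the Poisson distribution is the limit of the Binomial for large N and small p can be checked numerically. The following Python sketch is an illustration under assumed values: the mean `mu = 2.0` and the choices of N are hypothetical and not taken from the swab data.

```python
import math

def poisson_pmf(k, mu):
    """P(Y = k) for Y ~ Poisson(mu)."""
    return math.exp(-mu) * mu**k / math.factorial(k)

def binomial_pmf(k, n, p):
    """P(Y = k) for Y ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

mu = 2.0  # hypothetical mean, held fixed while N grows
max_diffs = {}
for n in (10, 100, 10000):
    p = mu / n  # so that N * p = mu
    max_diffs[n] = max(
        abs(binomial_pmf(k, n, p) - poisson_pmf(k, mu)) for k in range(11)
    )
    print(n, max_diffs[n])
```

The largest difference between the two sets of probabilities shrinks as N grows with Np held fixed.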
Illustration of family profiles (we ignore the grouping of families here)
[Figure not reproduced: family profiles, plotted with the symbols O, U and C]
We observe counts $y_{fn} \sim \text{Poisson}(\mu_{fn})$
Additive model, corresponding to two-way ANOVA in family and name:
$$\log(\mu_{fn}) = \mu + \alpha_f + \beta_n$$

    proc genmod;
      class family name;
      model swabs=family name / dist=poisson link=log cl;
    run;
The GENMOD Procedure

Model Information
Data Set              WORK.A0
Distribution          Poisson
Link Function         Log
Dependent Variable    swabs
Observations Used     90
Missing Values        1

Class Level Information
Class     Levels   Values
family    18       1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
name      5        child1 child2 child3 father mother
Analysis Of Parameter Estimates

                                 Standard   Wald 95% Confidence
Parameter        DF   Estimate   Error      Limits                Chi-Square   Pr > ChiSq
Intercept         1    1.5263    0.1845      1.1647    1.8879       68.43       <.0001
family 1          1    0.4636    0.2044      0.0630    0.8641        5.14       0.0233
family 2          1    0.9214    0.1893      0.5503    1.2925       23.68       <.0001
family 3          1    0.4473    0.2050      0.0455    0.8492        4.76       0.0291
...
family 16         1    0.2283    0.2146     -0.1923    0.6488        1.13       0.2875
family 17         1   -0.5725    0.2666     -1.0951   -0.0499        4.61       0.0318
family 18         0    0.0000    0.0000      0.0000    0.0000         .          .
name child1       1    0.3228    0.1281      0.0716    0.5739        6.34       0.0118
name child2       1    0.8990    0.1158      0.6721    1.1259       60.31       <.0001
name child3       1    0.9664    0.1147      0.7417    1.1912       71.04       <.0001
name father       1    0.0095    0.1377     -0.2604    0.2793        0.00       0.9451
name mother       0    0.0000    0.0000      0.0000    0.0000         .          .
Scale             0    1.0000    0.0000      1.0000    1.0000

NOTE: The scale parameter was held fixed.
Interpretation of Poisson analysis:
• The family-parameters are uninteresting
• The name-parameters are interesting
• The mothers serve as a reference group
• The model is additive on a logarithmic scale, i.e. multiplicative on the original scale
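The multiplicative structure can be made concrete with a short Python sketch (for illustration only; the analysis itself is done in SAS). The numbers are copied from the "Analysis Of Parameter Estimates" output; only the families printed there are included, and the function name `expected_swabs` is a hypothetical helper.

```python
import math

# Estimates copied from the GENMOD output (subset of families shown there).
intercept = 1.5263
family_effect = {1: 0.4636, 2: 0.9214, 3: 0.4473, 18: 0.0}
name_effect = {
    "child1": 0.3228,
    "child2": 0.8990,
    "child3": 0.9664,
    "father": 0.0095,
    "mother": 0.0,  # reference group
}

def expected_swabs(family, name):
    """Fitted mean: exp(intercept + family effect + name effect)."""
    return math.exp(intercept + family_effect[family] + name_effect[name])

# The child3/mother ratio is the same in every family,
# because the family effect cancels on the original scale.
for fam in family_effect:
    ratio = expected_swabs(fam, "child3") / expected_swabs(fam, "mother")
    print(fam, round(expected_swabs(fam, "mother"), 2), round(ratio, 2))
```

Because the effects add on the log scale, the ratio between two family members does not depend on the family, which is why the family-parameters can be treated as nuisance parameters.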
Parameter estimates:

name     estimate (CI)                ratio (CI)
child1   0.3228 (0.0716, 0.5739)      1.38 (1.07, 1.78)
child2   0.8990 (0.6721, 1.1259)      2.46 (1.96, 3.08)
child3   0.9664 (0.7417, 1.1912)      2.63 (2.10, 3.29)
father   0.0095 (-0.2604, 0.2793)     1.01 (0.77, 1.32)
mother   -                            -

Interpretation: The youngest children (child2 and child3) have a 2-3 fold higher expected number of positive swabs than their mother
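The ratio column is obtained by exponentiating the estimate and both confidence limits. A minimal Python sketch of that back-transformation, using the estimates and limits from the table (the variable names are illustrative):

```python
import math

# (estimate, lower limit, upper limit) on the log scale, from the table.
log_scale = {
    "child1": (0.3228, 0.0716, 0.5739),
    "child2": (0.8990, 0.6721, 1.1259),
    "child3": (0.9664, 0.7417, 1.1912),
    "father": (0.0095, -0.2604, 0.2793),
}

# Exponentiate each value to get the ratio relative to the mother.
ratios = {
    name: tuple(round(math.exp(v), 2) for v in values)
    for name, values in log_scale.items()
}

for name, (ratio, lower, upper) in ratios.items():
    print(f"{name}: {ratio:.2f} ({lower:.2f}, {upper:.2f})")
```

A confidence interval for the ratio that excludes 1 corresponds to an interval on the log scale that excludes 0, as for the three children but not for the father.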
Ordinal data, e.g. level of pain
• data on a rank scale
• distance between response categories is not known / is undefined
• often an imaginary underlying quantitative scale
Covariates must describe the probability for each single response category.
We are faced with a dilemma:
• We may reduce to a binary outcome and use logistic regression
  – but there are several possible 'cuts'/thresholds
• We can 'pretend' that we are dealing with normally distributed data
  – this is most reasonable when there are many response categories
Example on liver fibrosis (degree 0, 1, 2 or 3), (Julia Johansen, KKHH)
3 blood markers related to fibrosis:
• HA
• YKL40
• PIIINP
Problem: What can we say about the degree of fibrosis from the knowledge of these 3 blood markers?
The MEANS Procedure

Variable        N     Mean           Std Dev        Minimum       Maximum
--------------------------------------------------------------------------
degree_fibr     129   1.4263566      0.9903850      0             3.0000000
ykl40           129   533.5116279    602.2934049    50.0000000    4850.00
piiinp          127   13.4149606     12.4887192     1.7000000     70.0000000
ha              128   318.4531250    658.9499624    21.0000000    4730.00
--------------------------------------------------------------------------
We start out simple, with one single blood marker $x_p$ for the $p$'th patient (here: $p = 1, \cdots, 126$).
$Y_p$: the observed degree of fibrosis for the $p$'th patient.
We wish to specify the probabilities
$$\pi_{pk} = P(Y_p = k), \quad k = 0, 1, 2, 3$$
and their dependence on certain covariates.
Since $\pi_{p0} + \pi_{p1} + \pi_{p2} + \pi_{p3} = 1$, we have a total of 3 parameters for each individual.
We start by defining the cumulative probabilities 'from the top':
• divide between 2 and 3: model for $\gamma_{p3} = \pi_{p3}$
• divide between 1 and 2: model for $\gamma_{p2} = \pi_{p2} + \pi_{p3}$
• divide between 0 and 1: model for $\gamma_{p1} = \pi_{p1} + \pi_{p2} + \pi_{p3}$
Logistic regression for each threshold.
Proportional odds model, model for 'cumulative logits':
$$\text{logit}(\gamma_{pk}) = \log\left(\frac{\gamma_{pk}}{1 - \gamma_{pk}}\right) = \alpha_k + \beta \times x_p,$$
or, on the original probability scale:
$$\gamma_{pk} = \gamma_k(x_p) = \frac{\exp(\alpha_k + \beta x_p)}{1 + \exp(\alpha_k + \beta x_p)}, \quad k = 1, 2, 3$$
Properties of the proportional odds model:
• odds ratios do not depend on cutpoint, only on the covariates
$$\log\left(\frac{\gamma_k(x_1) / (1 - \gamma_k(x_1))}{\gamma_k(x_2) / (1 - \gamma_k(x_2))}\right) = \beta \times (x_1 - x_2)$$
• changing the ordering of the categories only implies a change of sign for the parameters
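The first property follows directly from the model for the cumulative logits: the threshold-specific intercept $\alpha_k$ cancels when two covariate values are compared at the same cutpoint. Written out as a short derivation:

```latex
\begin{align*}
\log\left(\frac{\gamma_k(x_1) / (1 - \gamma_k(x_1))}{\gamma_k(x_2) / (1 - \gamma_k(x_2))}\right)
  &= \operatorname{logit}(\gamma_k(x_1)) - \operatorname{logit}(\gamma_k(x_2)) \\
  &= (\alpha_k + \beta x_1) - (\alpha_k + \beta x_2) \\
  &= \beta \times (x_1 - x_2)
\end{align*}
```

The result contains no $k$, so the same odds ratio $\exp(\beta (x_1 - x_2))$ applies at all three thresholds.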
Probabilities for each degree of fibrosis ($k$) can be calculated as successive differences:
$$\pi_3(x) = \gamma_3(x) = \frac{\exp(\alpha_3 + \beta x)}{1 + \exp(\alpha_3 + \beta x)}$$
$$\pi_k(x) = \gamma_k(x) - \gamma_{k+1}(x), \quad k = 0, 1, 2 \quad (\text{with } \gamma_0(x) = 1)$$
The cumulative probabilities $\gamma_k(x)$, $k = 1, 2, 3$, are logistic curves
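The successive differences can be sketched in Python as follows (for illustration only). The intercepts `alphas` and the slope `beta` below are hypothetical values, not estimates from the fibrosis data; the intercepts must be decreasing in $k$ so that the cumulative probabilities are ordered.

```python
import math

def cumulative_probs(x, alphas, beta):
    """gamma_k(x) = P(Y >= k) for k = 1, ..., K under proportional odds."""
    return [1 / (1 + math.exp(-(a + beta * x))) for a in alphas]

def category_probs(x, alphas, beta):
    """pi_k(x) = gamma_k(x) - gamma_{k+1}(x) for k = 0, ..., K.

    Uses gamma_0 = 1 and gamma_{K+1} = 0 as boundary values.
    """
    gammas = [1.0] + cumulative_probs(x, alphas, beta) + [0.0]
    return [gammas[k] - gammas[k + 1] for k in range(len(alphas) + 1)]

# Hypothetical parameters: alpha_1 > alpha_2 > alpha_3, one common slope.
alphas = [1.0, -0.5, -2.0]
beta = 0.002

for x in (50, 300, 1500):
    print(x, [round(p, 3) for p in category_probs(x, alphas, beta)])
```

With a positive slope, larger values of the marker shift probability mass from the low degrees towards the high degrees, while the four probabilities always sum to 1.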
[Figure not reproduced: cumulative probabilities]
We start out using only the marker HA.
The distributions are very skewed, but the model places no requirements on the distribution of the covariate.
Proportional odds model in SAS:

    data fibrosis;
      /* skip the first line of the file */
      infile 'julia.tal' firstobs=2;
      input id degree_fibr ykl40 piiinp ha;
      /* drop records with a negative fibrosis code */
      if degree_fibr<0 then delete;
    run;

    /* descending: cumulate probabilities from the highest degree */
    proc logistic data=fibrosis descending;
      model degree_fibr=ha / link=logit clodds=pl;
    run;
The LOGISTIC Procedure

Model Information
Data Set                     WORK.FIBROSIS
Response Variable            degree_fibr
Number of Response Levels    4
Number of Observations       128
Model                        cumulative logit
Optimization Technique       Fisher's scoring

Response Profile
Ordered                    Total
Value      degree_fibr     Frequency
1          3               20
2          2               42
3          1               40
4          0               26

Probabilities modeled are cumulated over the lower Ordered Values.
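What "cumulated over the lower Ordered Values" means can be illustrated with the frequencies in the response profile. The Python sketch below (for illustration only) computes the observed cumulative proportions 'from the top' and their logits, ignoring the covariate HA; the variable names are illustrative.

```python
import math

# Frequencies from the response profile: degree of fibrosis -> count.
freq = {3: 20, 2: 42, 1: 40, 0: 26}
total = sum(freq.values())  # 128 observations

# gamma_k = observed proportion with degree >= k, for k = 3, 2, 1.
gamma = {}
running = 0
for k in (3, 2, 1):
    running += freq[k]
    gamma[k] = running / total

# Empirical cumulative logits, one per threshold.
logit = {k: math.log(g / (1 - g)) for k, g in gamma.items()}

for k in (3, 2, 1):
    print(k, round(gamma[k], 4), round(logit[k], 4))
```

These three logits correspond to the three thresholds of the cumulative logit model, one intercept per threshold, before any covariate is taken into account.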