Lecture 18: Review Lecture
Ani Manichaikul
amanicha@jhsph.edu
15 May 2007
Types of Biostatistics
- 1) Descriptive Statistics
  - Exploratory Data Analysis: often not in the literature
  - Summaries: "Table 1" in a paper
  - Goal: visualize relationships, generate hypotheses
Types of Biostatistics
- 2) Inferential Statistics
  - Confirmatory Data Analysis: Methods section of a paper
  - Goal: quantify relationships, test hypotheses
Approach to Modeling
A general approach for most statistical modeling is to:
- Define the Population of Interest
- State the Scientific Questions & Underlying Theories
- Describe and Explore the Observed Data
- Define the Model
  - Probability part (models the randomness / noise)
  - Systematic part (models the expectation / signal)
Approach to Modeling (continued)
- Estimate the Parameters in the Model
  - Fit the Model to the Observed Data
- Make Inferences about Covariates
- Check the Validity of the Model
  - Verify the Model Assumptions
  - Re-define, Re-fit, and Re-check the Model if necessary
- Interpret the results of the Analysis in terms of the Scientific Questions of Interest
Stem-and-Leaf Plots
Age in years (10 observations): 25, 26, 29, 32, 35, 36, 38, 44, 49, 51

Age Interval | Observations
20-29        | 5 6 9
30-39        | 2 5 6 8
40-49        | 4 9
50-59        | 1
Grouping: Frequency Distribution Tables
- Shows the number of observations in each range of the data
- Intervals can be chosen in ways similar to stem-and-leaf displays

Age Interval | Frequency
20-29        | 3
30-39        | 4
40-49        | 2
50-59        | 1
Histograms
- Pictures of the frequency or relative frequency distribution
[Figure: "Histogram of Age" -- frequency (0 to 4) by age category]
Box-and-Whisker Plots
[Figure: "Box Plot of Age" -- age in years, roughly 25 to 50]
- IQR = 44 - 29 = 15
- Upper Fence = 44 + 15 * 1.5 = 66.5
- Lower Fence = 29 - 15 * 1.5 = 6.5
(A small sketch of this computation follows.)
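A minimal Python sketch of the fence computation above, using the ten ages from the stem-and-leaf slide. Quartile conventions vary between software packages; `method="nearest"` (recent numpy) is used here so the result matches the slide's hand-computed Q1 = 29 and Q3 = 44:

```python
import numpy as np

ages = np.array([25, 26, 29, 32, 35, 36, 38, 44, 49, 51])

# "nearest" matches the slide's quartiles; numpy's default linear
# interpolation would give slightly different values
q1, q3 = np.percentile(ages, [25, 75], method="nearest")
iqr = q3 - q1                    # 44 - 29 = 15
upper_fence = q3 + 1.5 * iqr     # 66.5
lower_fence = q1 - 1.5 * iqr     # 6.5
print(iqr, lower_fence, upper_fence)
```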
2 Continuous Variables
- Scatterplots visually display the relationship between two continuous variables
[Figure: scatterplot of "Age by Height in cm" -- height in centimeters vs. age in years]
Why is the power of a test important?
- Power indicates the chance of finding a "significant" difference when there really is one
- Low power: likely to obtain non-significant results even when real differences exist
- High power is desirable!
- Low power is usually caused by small sample size
We’re not always right
Errors in Hypothesis Testing: α
- Aim: keep the Type I error small by specifying a small rejection region
- α is set before performing a test, usually at 0.05
Errors in Hypothesis Testing: β
- Aim: keep the Type II error small and thus power high
β: Probability of Type II Error
- The value of β is usually unknown, since it depends on a specified alternative value
- β depends on sample size and α
- Before data collection, scientists decide:
  - the test they will perform
  - α
  - the desired β
- They use this information to choose the sample size (see the sketch below)
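As a sketch of that planning step, statsmodels can solve for the sample size once the test, α, and desired power (1 - β) are fixed. The effect size below is an assumed value for illustration, not a number from the lecture:

```python
from statsmodels.stats.power import TTestIndPower

# hypothetical planning inputs: standardized effect size 0.5,
# alpha = 0.05, desired power = 0.80 (i.e., beta = 0.20)
n_per_group = TTestIndPower().solve_power(
    effect_size=0.5, alpha=0.05, power=0.80, alternative="two-sided"
)
print(round(n_per_group))  # about 64 subjects per group
```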
P-Values
- Definition: The p-value for a hypothesis test is the probability of obtaining, by chance alone when H0 is true, a value of the test statistic as extreme as or more extreme than (in the appropriate direction) the one actually observed.
Steps of Hypothesis Testing
- Define the null hypothesis, H0.
- Define the alternative hypothesis, Ha, where Ha is usually of the form "not H0".
- Define the Type I error, α, usually 0.05.
- Calculate the test statistic.
- Calculate the p-value.
- If the p-value is less than α, reject H0; otherwise, fail to reject H0.
(A minimal worked example follows.)
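A minimal worked example of these steps in Python; the two groups below are made-up numbers, not data from the lecture:

```python
from scipy import stats

# H0: the two group means are equal; Ha: they differ (two-sided)
group1 = [25, 26, 29, 32, 35]
group2 = [36, 38, 44, 49, 51]
alpha = 0.05                                       # Type I error, set in advance

t_stat, p_value = stats.ttest_ind(group1, group2)  # test statistic and p-value
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject H0")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject H0")
```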
Why use linear regression?
Linear regression is very powerful. It can be used for many things:
- Binary X
- Continuous X
- Categorical X
- Adjustment for confounding
- Interaction
- Curved relationships between X and Y
SLR: Y = β0 + β1·X1
- Linear regression is used for continuous outcome variables
- β0: mean outcome when X = 0 (center X so this is meaningful!)
- Binary X ("dummy variable" for group)
  - β1: mean difference in outcome between the groups
- Continuous X
  - β1: mean difference in outcome corresponding to a 1-unit increase in X
  - Center X to give meaning to β0
- Test β1 = 0 in the population
Assumptions of Linear Regression
- L: Linear relationship
- I: Independent observations
- N: Normally distributed around the line
- E: Equal variance across X's
In Simple Linear Regression
- Simple linear regression (SLR):
  - One predictor / covariate / explanatory variable: X
- Multiple linear regression (MLR):
  - Same assumptions as SLR (i.e., L.I.N.E.), but more than one covariate: X1, X2, X3, ..., Xp
- Model: Y ~ N(µ, σ²), where
  µ = E(Y | X) = β0 + β1·X1 + β2·X2 + β3·X3 + ... + βp·Xp
(see the sketch below)
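One way to fit such a model in Python, sketched with simulated data; the covariates, coefficients, and noise level below are all made up for illustration:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)              # a continuous covariate (e.g., centered age)
x2 = rng.binomial(1, 0.5, size=n)    # a binary covariate (e.g., group indicator)
y = 2 + 1.5 * x1 - 0.8 * x2 + rng.normal(size=n)  # systematic part + noise

X = sm.add_constant(np.column_stack([x1, x2]))    # adds the intercept column
fit = sm.OLS(y, X).fit()
print(fit.params)                    # estimates of beta0, beta1, beta2
```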
Regression Methods
Nested models
- One model is nested within another if the parent model contains one set of variables and the extended model contains all of the original variables plus one or more additional variables.
Difference in assessing variables: "nested models"
- Other predictor(s):
  - assess with a t test if a single variable defines the predictor
  - assess with an F test (today) if two or more variables are needed to define the predictor
- Potential confounder(s):
  - compare the estimate (and CI) of the primary predictor with and without the new variable to see whether it changes meaningfully
The F test
H0: all new β's = 0 in the population
HA: at least one new β is not 0 in the population

F_obs = [ (RSS_parent - RSS_nested) / (# of new variables added) ] / [ RSS_nested / (residual df of nested model) ]

F_obs = [ (69.6 - 49.8) / 2 ] / [ 49.8 / 22 ] = 4.4

What is F_crit? (see the sketch below)
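A sketch of the same computation, including the critical value the slide asks about, using scipy's F distribution at α = 0.05 with 2 and 22 degrees of freedom:

```python
from scipy.stats import f

# RSS values from the slide ("nested" here is the larger, extended model)
rss_parent, rss_nested = 69.6, 49.8
k, df_resid = 2, 22                   # new variables added; residual df of larger model

f_obs = ((rss_parent - rss_nested) / k) / (rss_nested / df_resid)
f_crit = f.ppf(0.95, dfn=k, dfd=df_resid)   # critical value at alpha = 0.05
print(f_obs, f_crit)                  # about 4.4 vs. about 3.4 -> reject H0
```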
The F test: notes
- The F test can be used to compare any two nested models
- If only one variable is added, it's easier to compare the models using the t test for that variable
  - t² = F if one variable is added
- For any regression, the estimated variance of the residuals is RSS / (residual df)
Nested Models
- Comparing nested models:
  - 1 new variable: use the t test for that variable
  - 2+ new variables: use the F test
- Categorical predictor:
  - set one group as the reference
  - create dummy variables for the other groups (see the sketch below)
  - include/exclude all dummy variables together
  - evaluate the categorical predictor with an F test
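A minimal sketch of the dummy-variable step with pandas; the group labels are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"group": ["A", "B", "C", "A", "C", "B"]})

# drop_first=True makes "A" the reference group: only group_B and group_C
# columns are created, and both would be tested together with one F test
dummies = pd.get_dummies(df["group"], prefix="group", drop_first=True)
print(dummies)
```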
Effect Modification
- In linear regression, effect modification is a way of allowing the association between the primary predictor and the outcome to change with the level of another predictor.
- If the 3rd predictor is binary, this results in a graph in which the two lines (for the two groups) are no longer parallel.
Splines and Quadratic Terms
- Splines are used to allow the regression line to bend
  - the breakpoint is arbitrary and decided graphically or by hypothesis
  - the actual slopes above and below the breakpoint are usually of more interest than the coefficient for the spline term (i.e., the change in slope)
- A quadratic term allows for curvature in the model
(see the sketch below)
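A sketch of how a linear-spline term is typically constructed; the knot at age 35 is an arbitrary choice for illustration:

```python
import numpy as np

age = np.array([25, 26, 29, 32, 35, 36, 38, 44, 49, 51], dtype=float)
knot = 35.0                              # assumed breakpoint
spline = np.maximum(age - knot, 0.0)     # (age - 35)+ : zero at or below the knot

# In a regression Y = b0 + b1*age + b2*spline:
#   slope below the knot = b1
#   slope above the knot = b1 + b2   (b2 is the *change* in slope at the knot)
quadratic = age ** 2                     # a quadratic term instead allows smooth curvature
```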
Logistic regression
- For binary outcomes
- Model the log odds of the outcome probability, also called the logit
- The intercept (baseline term) is interpreted as a log odds
- Other coefficients are log odds ratios
Logistic regression model
log odds(Relief | Tx) = log[ P(relief | Tx) / P(no relief | Tx) ] = β0 + β1·Tx

where: Tx = 0 if Placebo, 1 if Drug
Then...
- log( odds(Relief | Drug) ) = β0 + β1
- log( odds(Relief | Placebo) ) = β0
- log( odds(R|D) ) - log( odds(R|P) ) = β1
And...
- Thus: log[ odds(R | D) / odds(R | P) ] = β1
- And: OR = exp(β1) = e^β1 !!
- So: exp(β1) = odds ratio of relief for patients taking the Drug vs. patients taking the Placebo.
Logistic Regression

Logit estimates                                   Number of obs   =         70
                                                  LR chi2(1)      =       2.83
                                                  Prob > chi2     =     0.0926
Log likelihood = -46.99169                        Pseudo R2       =     0.0292
------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        drug |   .8137752   .4889211     1.66   0.096    -.1444926    1.772043
       _cons |  -.2876821    .341565    -0.84   0.400    -.9571372    .3817731
------------------------------------------------------------------------------

Estimates: log( odds(relief) ) = β̂0 + β̂1·Drug = -0.288 + 0.814(Drug)
Therefore: OR = exp(0.814) = 2.26
(A sketch of the same fit in Python follows.)
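A sketch of the equivalent fit with statsmodels; the 0/1 data below are simulated to roughly match the Stata output above (70 subjects, relief probabilities near 0.43 and 0.63), so the exact estimates will differ:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
drug = np.repeat([0, 1], 35)                 # 0 = Placebo, 1 = Drug (35 each)
p_relief = np.where(drug == 1, 0.63, 0.43)   # probabilities implied by the slide
relief = rng.binomial(1, p_relief)

X = sm.add_constant(drug)
fit = sm.Logit(relief, X).fit()
print(fit.params)                # [beta0_hat, beta1_hat] on the log-odds scale
print(np.exp(fit.params[1]))     # the odds ratio, analogous to exp(0.814) = 2.26
```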
Adding other variables
- What if Pr(relief) is a function of Drug or Placebo AND Age?
- We could easily include age in a model such as:
  log( odds(relief) ) = β0 + β1·Drug + β2·Age
Logistic Regression
- As in MLR, we can include many additional covariates.
- For a logistic regression model with p predictors:
  log( odds(Y=1) ) = β0 + β1·X1 + ... + βp·Xp
  where: odds(Y=1) = Pr(Y=1) / (1 - Pr(Y=1)) = Pr(Y=1) / Pr(Y=0)
Types of interpretation
- β0 + β1 = ln(odds) for X = 1
- β1 = difference in log odds
- e^(β0 + β1) = odds for X = 1
- e^β1 = odds ratio
- But we started with P(Y=1). Can we find that?
More useful math
- odds = probability / (1 - probability)
- probability = odds / (1 + odds)
- so, for X = 1:
  probability = e^(β0 + β1) / (1 + e^(β0 + β1))
(see the sketch below)
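Applying that conversion to the fitted model from the Stata output above (a small sketch; 0.43 and 0.63 are simply the back-transformed estimates):

```python
import numpy as np

b0, b1 = -0.288, 0.814            # estimates from the fitted logistic model

def inv_logit(log_odds):
    """Convert a log odds to a probability: p = e^x / (1 + e^x)."""
    return np.exp(log_odds) / (1 + np.exp(log_odds))

print(inv_logit(b0))              # Pr(relief | placebo) ~ 0.43
print(inv_logit(b0 + b1))         # Pr(relief | drug)    ~ 0.63
```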