PS 405 – Week 4 Section: Difference of means, ANOVA, and Matrix Algebra D.J. Flynn February 4, 2014
t-tests
◮ test for equality of two means
◮ hypotheses:
  H_0: no difference between the two population means
  H_A: the two population means differ
◮ calculating the t-stat:
$$ t = \frac{\text{statistic} - \text{hypothesized difference}}{\text{SE of estimate}} $$
Gender/partisanship example
Question: are men and women equally likely to be Democrats?
t-stat for difference in proportions:
$$ t = \frac{P_m - P_f}{\sqrt{\dfrac{P_m(1 - P_m)}{n_m} + \dfrac{P_f(1 - P_f)}{n_f}}} $$
The p-value that R estimates is for the null of no difference; the confidence interval is for the difference between the two sample means (here, proportions).
Interpretation of p-value: “if the null hypothesis is true, how often would we observe a difference at least this large under repeated sampling?” NOT “there is a p% chance that the true difference is equal to X.”
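The calculation above can be sketched in a few lines of R. The counts below are hypothetical values invented for illustration (they are not the section's data), and the p-value uses the normal approximation, which is an assumption that is reasonable only for large samples.

```r
# Hypothetical counts, invented purely for illustration
n_m <- 500; n_f <- 520        # sample sizes: men, women
dem_m <- 210; dem_f <- 265    # number of Democrats in each group

p_m <- dem_m / n_m            # sample proportion, men
p_f <- dem_f / n_f            # sample proportion, women

# Standard error of the difference (unpooled, matching the formula above)
se_diff <- sqrt(p_m * (1 - p_m) / n_m + p_f * (1 - p_f) / n_f)

# Test statistic for the null of no difference
t_stat <- (p_m - p_f) / se_diff

# Two-sided p-value under the normal approximation (assumption)
p_value <- 2 * pnorm(-abs(t_stat))

c(t = t_stat, p = p_value)
```

Base R's prop.test() runs a closely related test (by default with a pooled variance and a continuity correction), so its p-value will not match this hand calculation exactly.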
Logic of ANOVA and the F test
◮ running theme: experiments with > 2 groups
◮ does assignment to a particular group (X) affect some continuous outcome (Y)?
◮ this question can be answered with one-way ANOVA (AKA F-test)
◮ two sources of variation in DV:
  ◮ intended: independent variable/factor
  ◮ unintended: error/residual
◮ goal of ANOVA: determine share of variance explained by X
ANOVA table
◮ Go through table quickly
◮ F statistic (sometimes called F-act):
$$ F = \frac{\text{explained variance}}{\text{unexplained variance}} = \frac{MS_A}{MS_E} $$
◮ look up critical F-stat based on numerator df, denominator df, and significance level
◮ if F-act > F-critical, then we reject the null of independence (all group means equal)
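For reference, the mean squares in the F statistic are sums of squares divided by their degrees of freedom. The symbols $k$ (number of groups) and $N$ (total number of observations) are notation assumed here; they do not appear on the slide.

```latex
MS_A = \frac{SS_A}{k - 1}, \qquad
MS_E = \frac{SS_E}{N - k}, \qquad
F = \frac{MS_A}{MS_E}
  = \frac{SS_A / (k - 1)}{SS_E / (N - k)}
```

The numerator df is therefore $k - 1$ and the denominator df is $N - k$, which are the two values needed to look up the critical F.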
ANOVA in R 1. identify independent and dependent variables 2. determine variable structures (and change if necessary) 3. estimate ANOVA and call up results
Determining variable structure
◮ str(variable) returns the structure of a variable: integer, factor, character, numeric, logical
◮ important because ANOVAs are used for categorical IVs
◮ practice (the datasets package ships with R, so it does not need to be installed):
library(datasets)
names(chickwts)
str(chickwts$weight)
str(chickwts$feed)
levels(chickwts$feed)
Estimating ANOVAs in R
anova<-aov(weight ~ feed,data=chickwts)
summary(anova)
            Df Sum Sq Mean Sq F value   Pr(>F)
feed         5 231129   46226   15.37 5.94e-10 ***
Residuals   65 195556    3009
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
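As a check on the table above, the F value can be reproduced by hand from the sums of squares and degrees of freedom, then compared with a critical value. This is a minimal sketch; the 5% significance level is an assumption chosen for illustration.

```r
# Mean square = sum of squares / degrees of freedom
# (values copied from the chickwts ANOVA table above)
ms_feed  <- 231129 / 5     # explained (between-group) variance
ms_resid <- 195556 / 65    # unexplained (residual) variance

# F-act: ratio of explained to unexplained variance
f_act <- ms_feed / ms_resid

# F-critical at the 5% significance level (assumed level)
f_crit <- qf(0.95, df1 = 5, df2 = 65)

# p-value: probability of an F at least this large under the null
p_val <- pf(f_act, df1 = 5, df2 = 65, lower.tail = FALSE)

f_act > f_crit   # TRUE means we reject the null
```

qf() replaces the printed F table: it takes the cumulative probability, the numerator df, and the denominator df.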
What happens if we instead estimate aov(feed ~ weight)?
wrong.model<-aov(feed ~ weight,data=chickwts)
Warning messages:
1: In model.response(mf, "numeric") :
  using type = "numeric" with a factor response will be ignored
2: In Ops.factor(y, z$residuals) : - not meaningful for factors
The DV in an ANOVA must be continuous; feed is a factor, so it cannot go on the left-hand side.
Another example
We have data on which undergraduate institution people attended and mid-life satisfaction (0-100):
names(my.data)
[1] "school"       "satisfaction"
table(my.data$school)
school
fsu  uf  um
  5   5   5
my.anova<-aov(satisfaction ~ school,data=my.data)
summary(my.anova)
            Df Sum Sq Mean Sq F value  Pr(>F)
school       2   7216    3608   11.85 0.00144 **
Residuals   12   3655     305
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fsu<-subset(my.data,school=="fsu")
uf<-subset(my.data,school=="uf")
um<-subset(my.data,school=="um")
mean(fsu$satisfaction)
[1] 92.6
mean(uf$satisfaction)
[1] 39.2
mean(um$satisfaction)
[1] 60.8
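Group means like these can also be computed in one call rather than one subset per group. Because my.data is not distributed with these notes, the sketch below uses the built-in chickwts data from earlier instead; the object name group.means is made up for the example.

```r
# Mean of the outcome within each level of the factor, in one call
group.means <- tapply(chickwts$weight, chickwts$feed, mean)
round(group.means, 1)

# aggregate() gives the same means, returned as a data frame
aggregate(weight ~ feed, data = chickwts, FUN = mean)
```

With the satisfaction data, the equivalent call would be tapply(my.data$satisfaction, my.data$school, mean).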
Changing variable structure
◮ Current structure: is.factor, is.numeric, is.character, is.vector, ... will return TRUE or FALSE
◮ New structure: as.factor, as.numeric, as.character, as.vector, ... will return the object converted to the desired structure (assign the result to keep the change)
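A short sketch of checking and converting structure, including one common pitfall: calling as.numeric() directly on a factor returns the internal level codes, not the original values. The vectors below are hypothetical examples.

```r
x <- c("10", "20", "30")       # hypothetical values stored as text
is.character(x)                # TRUE
x.num <- as.numeric(x)         # convert and assign the result
is.numeric(x.num)              # TRUE

f <- factor(c("10", "20", "30", "20"))
is.factor(f)                   # TRUE
as.numeric(f)                  # 1 2 3 2   (level codes, not the values)
as.numeric(as.character(f))    # 10 20 30 20 (the original values)
```

Converting through as.character() first is the usual way to recover numbers that were read in as a factor.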
Generalizations of the one-way ANOVA 1. two-way ANOVA: if we have more than 1 explanatory factor (e.g., soil type + type of potato = potato yield) 2. ANCOVA: ANOVA with a continuous covariate (e.g., soil type + type of potato + weather = potato yield)
Example of two-way ANOVA in R
Does income depend on type of profession and education?
library(car)
names(Prestige)
[1] "education" "income"    "women"     "prestige"  "census"    "type"
str(Prestige$education)
 num [1:102] 13.1 12.3 12.8 11.4 14.6 ...
str(Prestige$type)
 Factor w/ 3 levels "bc","prof","wc": 2 2 2 2 2 2 2 ...
summary(Prestige$education)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  6.380   8.445  10.540  10.740  12.650  15.970
Prestige$education.recoded<-recode(Prestige$education,
  "6.38:8.445=1;8.446:10.54=2;10.55:10.74=3;10.75:12.65=4;12.66:15.97=5;else=NA")
table(Prestige$education.recoded)
 1  2  3  4  5
26 25  2 23 26
as.factor(Prestige$education.recoded)
[1] 5 4 5 4 5 5 5 5 5 5 4 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5 4
[37] 4 3 4 2 4 4 2 2 2 4 4 4 4 2 4 2 2 2 4 4 4 2 4 1 2 3 2
[73] 1 1 1 2 2 1 1 1 2 2 2 1 1 2 1 2 2 1 1 1 1 1 1 4 2 1 1
Levels: 1 2 3 4 5
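my.two.way<-aov(income ~ type+education.recoded, data=Prestige)
summary(my.two.way)
                  Df    Sum Sq   Mean Sq F value   Pr(>F)
type               2 5.960e+08 297978078  25.266 1.65e-09 ***
education.recoded  1 2.952e+07  29520188   2.503    0.117
Residuals         94 1.109e+09  11793647
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
4 observations deleted due to missingness
Note: education.recoded uses only 1 Df here, so it entered the model as a numeric variable. as.factor() returns a converted copy and changes the data only when the result is assigned back, e.g. Prestige$education.recoded<-as.factor(Prestige$education.recoded). As a five-level factor it would use 4 Df.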
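The effect of variable structure on the degrees of freedom can be seen with a self-contained sketch. It uses the built-in ToothGrowth data rather than Prestige (an assumption made so the example runs without extra packages); the names tg, dose.f, fit.numeric, and fit.factor are made up for the example.

```r
# ToothGrowth ships with R: len (numeric), supp (factor),
# dose (numeric, taking the values 0.5, 1, 2)
tg <- ToothGrowth

# dose left as numeric: it enters as a single slope and uses 1 Df
fit.numeric <- aov(len ~ supp + dose, data = tg)

# dose converted to a factor AND assigned back: 3 levels, so 2 Df
tg$dose.f <- as.factor(tg$dose)
fit.factor <- aov(len ~ supp + dose.f, data = tg)

summary(fit.numeric)
summary(fit.factor)
```

Comparing the Df column of the two summaries is a quick way to confirm that a variable entered the model with the intended structure.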
Matrix algebra terms ◮ scalar ◮ vector ◮ matrix
Matrix algebra operations ◮ addition ◮ subtraction ◮ multiplication ◮ inverse ◮ transpose
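Each of the operations listed above has a direct counterpart in base R. The sketch below uses two small hypothetical 2 x 2 matrices, A and B, invented for illustration.

```r
# Two hypothetical 2 x 2 matrices (matrix() fills column by column)
A <- matrix(c(2, 1, 1, 3), nrow = 2)
B <- matrix(c(1, 0, 2, 1), nrow = 2)

A + B          # addition (element by element)
A - B          # subtraction (element by element)
A %*% B        # matrix multiplication
A * B          # element-wise product, NOT matrix multiplication
t(B)           # transpose: rows become columns

A.inv <- solve(A)   # inverse (A must be square and non-singular)
A %*% A.inv         # recovers the identity matrix, up to rounding
```

The distinction between %*% and * is the most common source of errors: * multiplies matching entries, while %*% takes row-by-column products.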
Why we care: the linear model
◮ Scalar form: $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_K X_{Ki} + \epsilon_i$
◮ Matrix form: $Y_i = X_i \beta + \epsilon_i$
Benefits of matrix form:
1. more parsimonious expression of models with lots of covariates
2. understand what's going on behind the scenes. For example, the parameter $\beta$ is estimated by calculating $(X^T X)^{-1} X^T y$
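The estimator $(X^T X)^{-1} X^T y$ can be computed directly and compared with lm(). This is a minimal sketch using the built-in mtcars data; the choice of mpg as the outcome and wt and hp as covariates is an assumption made only for illustration.

```r
# Design matrix: a column of 1s for the intercept plus two covariates
X <- cbind(1, mtcars$wt, mtcars$hp)
y <- mtcars$mpg

# OLS by matrix algebra: (X'X)^{-1} X'y
beta.hat <- solve(t(X) %*% X) %*% t(X) %*% y

# Same model estimated with lm()
fit <- lm(mpg ~ wt + hp, data = mtcars)

# The two columns should agree up to rounding error
cbind(by.hand = as.vector(beta.hat), lm = unname(coef(fit)))
```

lm() does not invert $X^T X$ explicitly (it uses a QR decomposition, which is numerically more stable), so the explicit formula is best treated as a way to see the algebra rather than as the recommended way to fit models.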
This is the linear model in matrix form: $Y_i = X_i \beta + \epsilon_i$
For each term in this equation...
◮ scalar, vector, or matrix?
◮ size?