PS 405 – Week 1 Section Intro to R and Summary Statistics D.J. Flynn January 14, 2014
Today’s plan Preliminaries Intro to R Basic univariate and bivariate stats Plots
Preliminaries ◮ Section: Tuesday, 5:00-6:00, Scott 212 ◮ Office Hours: Thursday, 12:30-2:00, Scott 230 ◮ Problem Sets: ◮ hard copies ◮ include code (annotated) ◮ neat tables (cleaned up in Word or LaTeX) ◮ grades: number correct (meaningless) ◮ Questions: substantive questions to office hours, please ◮ Website: my overheads/code will be posted at www.djflynn.org/teaching
Caveats ◮ this presentation: intro to the basics ◮ a lot of helpful R guides out there (see Thomas Leeper’s: thomasleeper.com/Rcourse/Intro2R/Intro2R.pdf ) ◮ 90% of R skills come from trial-and-error ◮ Google error messages ◮ pro tip: always know what you’re asking R to do (not just the code). Next quarter Jay will show you what’s going on behind the scenes.
R looks like this...
RStudio I highly recommend writing your code in a script editor, such as the one built into RStudio:
About R ◮ Almost entirely command-based (no point-and-click) ◮ Core functionalities already loaded; if you need anything else, load a package (we’ll do this) ◮ Advantages : FREE, extremely flexible, great graphics, increasingly the norm ◮ Disadvantages : steep learning curve, tedious code, very sensitive, unhelpful error messages
Practical tips 1 ◮ R is case-sensitive: x ≠ X, Data ≠ data ◮ scroll through previous commands in the console using the up and down arrows ◮ putting a question mark before a command will bring up the relevant help file: ?summary ◮ use pound signs (#) to annotate code as you go along ◮ ALWAYS save your code in a separate file (RStudio makes this easy) ◮ when R asks if you want to save the workspace image, say yes! 1 Most of these tips came from Salma Al-Shami’s slides from previous years (thanks, Salma!)
Basic commands ◮ R works like a calculator:
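A minimal sketch of calculator-style use at the console. The specific expressions and the object name total are illustrative, not taken from the original slide.

```r
# Arithmetic typed at the console is evaluated immediately
2 + 3          # addition
10 / 4         # division
2^3            # exponentiation
sqrt(16)       # built-in functions work the same way
(1 + 2) * 3    # parentheses control the order of operations

# Results can be stored in an object and reused
total <- 2 + 3
total * 10
```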
◮ Creating objects in R:
  ◮ constants:
      x <- 5
      constant = 1
  ◮ vectors:
      myvec <- c(1,2,3,4,5)
      myothervec <- c(6,7,8,9,10)
      colors <- c("blue","green","red","purple")
  ◮ matrices:
      mymatrix <- cbind(myvec,myothervec)
      my.other.matrix <- matrix(seq(1,100),10,10)
  ◮ data frames:
      mydataframe <- cbind.data.frame(myvec,myothervec)
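Once objects exist, it helps to check what R thinks they are. The sketch below rebuilds a few of the objects above and inspects them; the choice of class(), dim(), and str() as inspection tools is mine, not the slide's.

```r
# Recreate some of the objects from the slide
x <- 5
myvec <- c(1, 2, 3, 4, 5)
myothervec <- c(6, 7, 8, 9, 10)
mymatrix <- cbind(myvec, myothervec)
mydataframe <- cbind.data.frame(myvec, myothervec)

class(myvec)                 # type of object
length(myvec)                # number of elements in a vector
dim(mymatrix)                # rows, then columns
str(mydataframe)             # compact overview of a data frame

# Pull out single values: row 2 of the second column
mymatrix[2, "myothervec"]    # matrices use [row, column]
mydataframe$myothervec[2]    # data frames also allow $variable[row]
```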
Looking at data ◮ you have to tell R where to find variables: dataset$variable ◮ you can use attach() and detach() to skip the dataset$ prefix, but always know what dataset you’re referring to ◮ to look at an object, just type its name ◮ descriptives: mean median max min var sd range (careful: mode() reports how an object is stored, not its most frequent value; use table() for that) ◮ distributions: table() summary() head() ◮ variables: names(dataset) dataset$variable dataset$variable[obs1:obs2]
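Here is a short sketch of these commands applied to faithful, which ships with R. The object names counts and stat_mode are illustrative, and the table-based approach to the most frequent value is one common workaround rather than a built-in function.

```r
# faithful: eruption durations and waiting times at Old Faithful
names(faithful)                 # variable names
head(faithful)                  # first six rows
summary(faithful$waiting)       # quartiles, min, max, and mean

mean(faithful$waiting)
median(faithful$waiting)
sd(faithful$waiting)
range(faithful$waiting)         # returns the min and the max, not max - min
faithful$waiting[1:5]           # observations 1 through 5

# mode() is NOT the statistical mode: it reports the storage type
mode(faithful$waiting)

# Most frequent value: tabulate, then take the largest count
counts <- table(faithful$waiting)
stat_mode <- names(counts)[which.max(counts)]
stat_mode
```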
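Practice looking at variables in the pre-loaded dataset faithful. It ships with R, so you can access it without loading anything:
  names(faithful)
We will need the car package later for re-coding, so install and load it now:
  install.packages("car")
  library(car)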
Loading packages install.packages("nameofpackage") library(nameofpackage)
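You only need install.packages() once per computer, but library() in every new session. One way to avoid re-installing each time is sketched below; load_or_install is a hypothetical helper name of my own, not a built-in function.

```r
# Hypothetical helper: install a package only if it is missing, then load it
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)   # one-time download; needs an internet connection
  }
  # character.only = TRUE lets library() accept the name as a string
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

load_or_install("stats")    # already ships with R, so nothing is downloaded
# load_or_install("car")    # would download car the first time it is called
```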
Loading data in R ◮ code depends on the type of file you’re attempting to load: read.table read.csv read.dta read.spss, etc. ◮ two options: (1) tell R exactly where to find the dataset you want, or (2) set a working directory and then just tell it the file name ◮ I highly recommend the latter because typing long file paths can be a nightmare (e.g., typos, slashes, quotation marks) ◮ read.table and read.csv are built in; to load files saved by other statistical software (e.g., Stata with read.dta, SPSS with read.spss), load the foreign package ◮ MUCH easier in RStudio (and on Macs)
Example using pilot.data.csv
Option 1: Load from file path
  # read.csv is built in, so the foreign package is not needed for .csv files
  pilot <- read.csv("~/Documents/TAing/winter 2014/section/week1/pilot.data.csv")
  names(pilot)
Option 2: Set wd, then call up file
  setwd("~/Documents/TAing/winter 2014/section/week1")
  pilot <- read.csv("pilot.data.csv")
  names(pilot)
Option 3: Point-and-click open in RStudio
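If you want to try both options without the course files, the sketch below writes a tiny made-up CSV to a temporary folder and reads it back each way. The file name toy.data.csv and its values are hypothetical stand-ins, not the real pilot data.

```r
# Build a tiny stand-in dataset (hypothetical values)
toy <- data.frame(id = 1:3, gmf = c(7, 4, 1))

# Write it to a temporary folder
folder <- tempdir()
write.csv(toy, file.path(folder, "toy.data.csv"), row.names = FALSE)

# Option 1: give read.csv the full path
pilot1 <- read.csv(file.path(folder, "toy.data.csv"))

# Option 2: set the working directory, then use just the file name
old_wd <- setwd(folder)          # setwd() returns the previous directory
pilot2 <- read.csv("toy.data.csv")
setwd(old_wd)                    # go back to where we started

names(pilot1)
identical(pilot1, pilot2)        # both options load the same data
```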
Types of variables and why we care ◮ nominal/categorical: can’t be ordered; distance not meaningful ◮ ordinal: can be ordered; distance may/may not be meaningful ◮ continuous: can be ordered; distance meaningful Model selection depends on type of DV. This class: continuous and quasi-continuous DVs Next class: categorical/limited DVs
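R stores these three types differently, which affects what you can do with them. A minimal sketch, using made-up values and illustrative variable names (party, interest, age):

```r
# Nominal: unordered categories -> factor
party <- factor(c("Democrat", "Republican", "Independent", "Democrat"))

# Ordinal: ordered categories -> ordered factor (levels listed low to high)
interest <- factor(c("low", "high", "medium", "low"),
                   levels = c("low", "medium", "high"), ordered = TRUE)

# Continuous: numeric vector
age <- c(34, 61, 47, 29)

table(party)            # counts make sense for categories
interest < "high"       # comparisons are allowed for ordered factors
mean(age)               # means make sense for continuous variables
```

Asking for mean(party) returns NA with a warning, which is R's way of saying the operation is not meaningful for a nominal variable.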
Re-coding
Raw data (especially secondary data, e.g., ANES) are often coded awkwardly, so we want to re-code:
  load("/Users/DJF/Documents/TAing/winter 2014/section/week1/nes2008.RData")
  practice <- nes08
  summary(practice$partyid) # notice how responses are non-numeric
Here I code Dems as 1, Reps as 2, Inds as 3, and others as missing:
  library(car)
  practice$newpartyid <- recode(practice$partyid, "'1. Democrat'=1; '2. Republican'=2; '3. Independent'=3; else=NA")
It’s always a good idea to compare the distributions before and after re-coding to make sure everything was done correctly:
  table(practice$partyid)
  table(practice$newpartyid)
Another recoding example (this time changing already numeric responses):
  library(car)
  pilot$gmf.new <- recode(pilot$gmf, "7=1;6=2;5=3;4=4;3=5;2=6;1=7;else=NA")
  table(pilot$gmf)
  table(pilot$gmf.new)
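Recodes like these can also be done without car. The sketch below is a base-R alternative, not the approach used on the slides; the toy vectors stand in for the ANES and pilot variables, and the name lookup is illustrative.

```r
# Toy stand-ins for the real variables (hypothetical values)
partyid <- factor(c("1. Democrat", "2. Republican", "3. Independent",
                    "4. Other party", "1. Democrat"))
gmf <- c(7, 6, 4, 1, NA, 9)     # 9 is an out-of-range code

# Label -> number via a named lookup vector; unmatched labels become NA
lookup <- c("1. Democrat" = 1, "2. Republican" = 2, "3. Independent" = 3)
newpartyid <- unname(lookup[as.character(partyid)])

# Reverse a 1-7 scale: 8 - x maps 7->1, 6->2, ..., 1->7; anything else -> NA
gmf.new <- ifelse(gmf %in% 1:7, 8 - gmf, NA)

# Compare before and after, keeping missing values visible
table(partyid, newpartyid, useNA = "ifany")
table(gmf, gmf.new, useNA = "ifany")
```

The useNA = "ifany" argument matters here: by default table() silently drops missing values, which can hide recoding mistakes.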
Sub-setting
We often want to subset data based on values of one or more variables (e.g., look only at Democrats, or voters over 50, etc.):
  older <- subset(practice, V081104 >= 60)
Does partyid vary by age?
  table(practice$partyid)
  table(older$partyid)
  CrossTable(practice$age, practice$partyid) # CrossTable() requires library(gmodels)
Subsetting on older GOP voters:
  olderGOP <- subset(older, newpartyid == 2)
We could now run analyses on our subsets...
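The same subset() pattern works on any data frame. Here is a sketch using faithful so it runs without the ANES file; the cutoffs (70 minutes, 4 minutes) and the object names are arbitrary choices for illustration.

```r
# Keep only eruptions that followed a wait of 70 minutes or more
long_wait <- subset(faithful, waiting >= 70)

# Combine conditions with & (and) or | (or)
long_and_big <- subset(faithful, waiting >= 70 & eruptions > 4)

nrow(faithful)        # rows before subsetting
nrow(long_wait)       # rows that meet the first condition
nrow(long_and_big)    # rows that meet both conditions

# Compare a summary across the full data and the subset
summary(faithful$eruptions)
summary(long_wait$eruptions)
```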
Basic bivariate stats
◮ Correlation (numeric variables)
    duration <- faithful$eruptions
    waiting <- faithful$waiting
    cor(duration, waiting)
    cor.test(duration, waiting)
◮ Crosstabulation (categorical variables)
    install.packages("gmodels")
    library(gmodels)
    CrossTable(nes08$partyid, nes08$marriage)
    CrossTable(nes08$partyid, nes08$bibleview)
◮ down the road: regression models
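A sketch of both ideas using only datasets that ship with R. The crosstab uses base R's table() on mtcars as a stand-in for CrossTable() on the ANES data, so the output is plainer than what gmodels produces.

```r
# Correlation between two numeric variables
r <- cor(faithful$eruptions, faithful$waiting)
r                                   # Pearson correlation coefficient
test <- cor.test(faithful$eruptions, faithful$waiting)
test$p.value                        # p-value for H0: correlation = 0
test$conf.int                       # 95% confidence interval for r

# Crosstab of two categorical variables (cylinders by gears in mtcars)
tab <- table(cylinders = mtcars$cyl, gears = mtcars$gear)
tab                                 # cell counts
prop.table(tab, margin = 1)         # row proportions (each row sums to 1)
```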
Sample plots
  hist(faithful$eruptions)
[Figure: default histogram titled "Histogram of faithful$eruptions"; x-axis "faithful$eruptions", y-axis "Frequency"]
  hist(faithful$eruptions, breaks=20, col="lightblue2", main="Histogram of 'eruptions' variable", xlab="x", ylab="freq(x)")
[Figure: histogram titled "Histogram of 'eruptions' variable"; x-axis "x", y-axis "freq(x)"]
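To see what breaks=20 actually changes, you can ask hist() for its binning without drawing anything. A minimal sketch; the object names h_default and h_fine are illustrative.

```r
# plot = FALSE returns the binning without drawing anything
h_default <- hist(faithful$eruptions, plot = FALSE)
h_fine    <- hist(faithful$eruptions, breaks = 20, plot = FALSE)

length(h_default$counts)   # number of bins R chose by default
length(h_fine$counts)      # number of bins when about 20 are requested
h_fine$breaks              # bin edges actually used
sum(h_fine$counts)         # every observation lands in exactly one bin
```

Note that a single number passed to breaks is treated as a suggestion: R rounds to "pretty" bin edges, so you may not get exactly 20 bins.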
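  # eruptions on its own only works after attach(faithful); faithful$eruptions always works
  hist(faithful$eruptions, breaks=20, col="lightblue2", main="Histogram of 'eruptions' Variable", xlab="x", ylab="freq(x)", prob=TRUE)
  curve(dnorm(x, mean=mean(faithful$eruptions), sd=sd(faithful$eruptions)), add=TRUE)
[Figure: density-scaled histogram titled "Histogram of 'eruptions' Variable" with a normal curve overlaid; x-axis "x", y-axis "freq(x)"]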
  my.density <- density(faithful$eruptions)
  plot(my.density)
[Figure: kernel density plot titled "density.default(x = faithful$eruptions)"; x-axis "N = 272 Bandwidth = 0.3348", y-axis "Density"]
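The N and Bandwidth values printed under the plot are stored in the density object, and the bandwidth can be changed. A sketch; the alternative bandwidths 0.6 and 0.1 are arbitrary illustrative values.

```r
my.density <- density(faithful$eruptions)

my.density$n       # number of observations used (the "N" under the plot)
my.density$bw      # bandwidth chosen by default (the "Bandwidth" under the plot)

# A larger bandwidth gives a smoother curve; a smaller one gives a bumpier curve
smooth <- density(faithful$eruptions, bw = 0.6)
bumpy  <- density(faithful$eruptions, bw = 0.1)
c(smooth$bw, bumpy$bw)
```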
  plot(my.density, col="seagreen3", main="PDF of 'eruptions' variable", xlab="x", ylab="Pr(X=x)", lty=6, lwd=4)
  plot(faithful$eruptions, faithful$waiting)
[Figure: default scatterplot; x-axis "faithful$eruptions", y-axis "faithful$waiting"]
  plot(faithful$eruptions, faithful$waiting, main="Scatterplot of faithful Data", xlab="Eruptions", ylab="Waiting", pch=19)
[Figure: scatterplot titled "Scatterplot of faithful Data"; x-axis "Eruptions", y-axis "Waiting"]
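  # in a formula, y ~ x: the variable left of ~ goes on the y-axis
  plot(waiting ~ eruptions, data=faithful, main="Scatterplot with Regression Line", xlab="Eruptions", ylab="Waiting")
  abline(lm(waiting ~ eruptions, data=faithful), col="blue", lwd=3)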
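It is worth knowing what line abline() is drawing. The sketch below fits a regression of waiting time on eruption duration and pulls out its intercept and slope; the 3-minute eruption used for the prediction is an arbitrary illustrative value.

```r
# Fit the regression: waiting time on eruption duration
fit <- lm(waiting ~ eruptions, data = faithful)

coef(fit)                            # intercept and slope of the fitted line
slope <- coef(fit)[["eruptions"]]
slope                                # change in waiting time per extra minute of eruption

# Predicted waiting time after a 3-minute eruption
predict(fit, newdata = data.frame(eruptions = 3))
```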
  plot(faithful$eruptions, faithful$waiting, main="Scatterplot with Smoothed Regression Line", xlab="Eruptions", ylab="Waiting", pch=20)
  lines(lowess(faithful$eruptions, faithful$waiting), col="red", lwd=3)