
Exploratory Data Analysis, Paul Cohen, ISTA 370, Spring 2012



  1. Exploratory Data Analysis. Paul Cohen, ISTA 370, Spring 2012 (slide 1 / 46).

  2. Outline. Data, revisited; The purpose of exploratory data analysis; Learning to see.

  3. Data: A Review. Things and Data.

  4. Data: A Review. Things and Data.

  5. Data: A Review. Where Data Come From. Data are measurements of individuals (people, trees, countries, ecosystems...). Figure labels: "An Individual"; "Data" (56 years old, 70" tall, 180 lbs, brown eyes, moderately presbyopic, good health, married, one child); "A Data Table ..."

  6. Exploratory Data Analysis. What is Exploratory Data Analysis (EDA)? In terms of the Fundamental Model of Data, y = f(x, ε): EDA infers which factors strongly and weakly influence y, and the functions that combine these factors. EDA examines ε to see whether it contains evidence of other important but unmeasured (“hidden”) factors. Confirmatory studies test whether x really is a causal factor that influences y. Exploratory studies are to confirmatory studies as test kitchens are to cookbook recipes. EDA generally doesn’t test hypotheses but, rather, “helps the data tell its story.” EDA helps you understand phenomena, and suggests hypotheses to test in confirmatory studies.
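
The idea of examining ε for hidden factors can be sketched with simulated data. This is only an illustration under stated assumptions: the linear form of f, the coefficients, and the two-level `hidden` factor are all made up here and do not come from the course data.

```r
# Hypothetical simulation of y = f(x, eps); every name and number is illustrative.
set.seed(1)
n <- 200
x <- runif(n, 0, 10)                          # the measured factor
hidden <- rep(c("a", "b"), length.out = n)    # an unmeasured ("hidden") factor
noise <- rnorm(n, sd = 0.5)
y <- 2 * x + ifelse(hidden == "b", 3, 0) + noise

# Fit y on the measured factor only; the residuals play the role of eps.
fit <- lm(y ~ x)
eps <- resid(fit)

# Look at eps: structure here (e.g. two bumps) hints at an unmeasured factor.
hist(eps, breaks = 20, col = "grey", main = NULL)

# Splitting the residuals by the hidden factor exposes that structure.
group_means <- tapply(eps, hidden, mean)
print(group_means)
```

In a real study the hidden factor is not available to split on; the histogram of the residuals is the part you would actually get to look at.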

  7. Exploratory Data Analysis. What is Exploratory Data Analysis? Learning to See. Data have structure that is evidence of causal influences. EDA uncovers, exposes, and clarifies this structure. EDA is like hunting for fossils: it’s a skill, and you must “learn to see” not only what’s in front of you, but what lies within data. EDA asks, “what do I see, and what does it mean?” Like any other skill, EDA takes a lot of practice.

  8. Exploratory Data Analysis. Load Some Data.

         > read.table.ISTA370<-function(filename){
             dataURL<-"http://www.sista.arizona.edu/~cohen/ISTA%20370/D…
             # Reads a data frame from a URL path rooted at ISTA370 dat…
             read.table(paste(dataURL,filename,sep=""))
           }
         >
         > # taheri<-read.table.ISTA370("taheri1.txt")
         > # iris<-read.table.ISTA370("iris.txt")
         > # heightC<-read.table.ISTA370("heightC.txt")
         > # treering<-read.table.ISTA370("treering1.txt")
         > # blast<-read.table.ISTA370("blastSummary.txt")
         > # kinect<-read.table.ISTA370("onemovie.txt")
         > # readability<-read.table.ISTA370("readability.txt")

  9. Learning to See. What Do You See? What Does it Mean?

         > hist(iris$Petal.Length,col="grey",main=NULL)

     [Figure: histogram of iris$Petal.Length; x-axis iris$Petal.Length (1 to 7), y-axis Frequency (0 to 30).]

  10. Learning to See. What Do You See? What Does it Mean?

          > ipl<-iris$Petal.Length
          > hist(ipl,prob=TRUE,ylim=c(0,1),main=NULL)
          > lines(density(ipl[iris$Species=="setosa"]),col="red")
          > lines(density(ipl[iris$Species=="versicolor"]),col="green")
          > lines(density(ipl[iris$Species=="virginica"]),col="blue")

      Looking at density curves for each species, we see that the histogram did indeed indicate two or more separate populations (species).

      [Figure: histogram of ipl with one density curve per species; x-axis ipl (1 to 7), y-axis Density (0.0 to 1.0).]

  11. Learning to See. What Do You See? What Does it Mean?

          > boxplot(iris$Petal.Length~iris$Species,
                    ylab="Petal.Length",xlab="Species")

      A boxplot by species confirms, and summarizes the petal length statistics for each species.

      [Figure: boxplots of Petal.Length (1 to 7) for setosa, versicolor, and virginica; x-axis Species.]

  12. Learning to See. Boxplots. [Figure: annotated boxplot; labels from top to bottom: outliers, whisker (various interpretations), upper quartile (75% quantile), interquartile range, median, lower quartile (25% quantile), whisker, outliers.]
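
Since whiskers have "various interpretations", it may help to spell out one of them. The sketch below reproduces by hand the rule behind R's default `boxplot()`: whiskers extend to the most extreme points within 1.5 times the box height. It uses R's built-in `iris` data as a stand-in for the course file iris.txt, which is an assumption.

```r
# Built-in iris is assumed to match the course's iris.txt.
x <- iris$Petal.Length[iris$Species == "setosa"]

# fivenum() returns min, lower hinge, median, upper hinge, max.
hinges <- fivenum(x)
lower <- hinges[2]
upper <- hinges[4]
spread <- upper - lower          # box height (hinge spread, close to the IQR)

# Default rule (coef = 1.5): points beyond 1.5 * spread from the box are outliers,
# and each whisker stops at the most extreme point that is not an outlier.
is_out <- x < lower - 1.5 * spread | x > upper + 1.5 * spread
outliers <- x[is_out]
whiskers <- range(x[!is_out])

# boxplot.stats() is the helper that boxplot() itself relies on.
bs <- boxplot.stats(x)
print(bs$stats)   # lower whisker, lower hinge, median, upper hinge, upper whisker
print(bs$out)     # points drawn individually as outliers
```

Other conventions exist (for example, whiskers drawn to the minimum and maximum), which `boxplot()` can approximate with `range = 0`.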

  13. Learning to See. Median, Quartiles, Interquartile Range. If you sort the values in a sample from lowest to highest, the median is the middle value, or the average of the two middle values when the sample contains an even number of points. The median is the 50% quantile: the value above which 50% of the values are found. The lower quartile is the 25% quantile, above which 75% of the values are found. The upper quartile is the 75% quantile, above which 25% of the values are found. The interquartile range is a measure of variability and is the difference between the upper and lower quartiles.
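
These definitions can be checked in R on a small made-up sample. The sample `x` below is purely illustrative; note that `quantile()` interpolates between data points, and R offers several slightly different quantile definitions through its `type` argument, so other software may give marginally different quartiles.

```r
x <- 100:1                      # made-up sample with an even number of points

med <- median(x)                # average of the two middle values, 50 and 51
q <- quantile(x, probs = c(0.25, 0.75))
lower_q <- unname(q[1])
upper_q <- unname(q[2])
iqr <- IQR(x)                   # same as upper_q - lower_q

# Fraction of the sample above each cut point.
above_lower <- mean(x > lower_q)
above_median <- mean(x > med)
above_upper <- mean(x > upper_q)

print(c(median = med, lower = lower_q, upper = upper_q, IQR = iqr))
print(c(above_lower, above_median, above_upper))
```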

  14. Learning to See. Median, Quartiles, Interquartile Range. The median is robust against outliers; the mean is not. Suppose 100 families in a neighborhood each make $40,000/year. When a millionaire moves in, the mean jumps from $40,000 to about $49,505/year (taking the millionaire's income to be $1,000,000/year). What happens to the median? Before the millionaire arrived, the variance in income was zero. Afterwards the variance is over nine billion! What happens to the interquartile range? Suppose you have a sorted sample of 9 elements; the median is the fifth element. If you add another element, what will the median be? By how many locations in the sorted distribution can the median shift?
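
The neighborhood example is easy to try out. The sketch below assumes the millionaire's income is exactly $1,000,000/year, which is the figure that reproduces the mean quoted on the slide; the variable names are made up.

```r
before <- rep(40000, 100)        # 100 families at $40,000/year
after <- c(before, 1000000)      # assumption: the newcomer makes $1,000,000/year

# Mean and variance react strongly to the single outlier.
print(mean(before))
print(mean(after))
print(var(before))
print(var(after))

# Median and interquartile range can be compared the same way.
print(median(before))
print(median(after))
print(IQR(before))
print(IQR(after))
```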

  15. Learning to See. Symmetry and Skew.

          > with(blast, hist(Test0,breaks=20,col="grey",main=NULL))
          > with(treering, hist(width,breaks=20,col="grey",main=NULL))

      [Figure: two histograms, y-axis Frequency. Left: Test0 (about 0.1 to 0.7). Right: width (about 40 to 120).]

      Test0 is skewed to the right, meaning it has a long tail of higher values, while treering width is nearly symmetric.
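
Skew can also be put into numbers. The sketch below uses simulated samples in place of the blast and treering data, which are not reproduced here; the distributions, sizes, and the `skewness` helper are all assumptions for illustration. The helper is written out by hand because base R does not ship a skewness function.

```r
set.seed(42)
right_skewed <- rexp(1000, rate = 1)             # long tail of higher values
symmetric <- rnorm(1000, mean = 80, sd = 15)     # roughly symmetric

# Moment-based sample skewness: positive means a longer right tail.
skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

print(skewness(right_skewed))
print(skewness(symmetric))

# In a right-skewed sample the long tail pulls the mean above the median.
print(c(mean = mean(right_skewed), median = median(right_skewed)))
```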

  16. Learning to See. Transformations.

          > attach(blast)
          > hist(Train0,breaks=20,col="grey",main=NULL)
          > Train0Squared<-Train0^2 #square the Train0 data
          > hist(Train0Squared,breaks=20,col="grey",main=NULL)

      [Figure: two histograms, y-axis Frequency. Left: Train0 (about 0.4 to 1.0). Right: Train0Squared (about 0.2 to 1.0).]

      A simple transformation amplifies an otherwise hidden feature.
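
One possible reason squaring helps, offered as a sketch rather than as the slide's own explanation: for values between 0 and 1, squaring stretches the upper end of the scale and compresses the lower end, so structure near the top gets spread over more histogram bins. The four scores below are made up to show the effect.

```r
# Made-up scores in the range covered by Train0; not the course data.
scores <- c(0.40, 0.50, 0.90, 1.00)
squared <- scores^2

# Two gaps of equal size on the raw scale...
gaps_raw <- c(scores[2] - scores[1], scores[4] - scores[3])

# ...become unequal after squaring: the gap near 1 roughly doubles relative to the gap near 0.4.
gaps_squared <- c(squared[2] - squared[1], squared[4] - squared[3])

print(gaps_raw)
print(gaps_squared)
```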

  17. Learning to See. Transformations.

          > Train0Squared<-with(blast,Train0^2)
          > with(blast,plot(density(Train0Squared)))
          > with(blast,lines(density(Train0Squared[gender=="female"]),c…
          > with(blast,lines(density(Train0Squared[gender=="male"]),col…

      [Figure: density plot titled density.default(x = Train0Squared); x-axis 0.0 to 1.0, labeled N = 187, Bandwidth = 0.04378; y-axis Density (0.0 to 2.5).]

      Does gender explain the bump?
