styles of data analysis



  1. Styles of data analysis DAAG Chapter 2

  2. Objectives
     • Learn the common tools of Exploratory Data Analysis
       – Histograms, density plots, boxplots
       – Scatterplots and scatterplot matrices
       – Data summaries
     • Learn about what to look for and what can go wrong
       – Outliers, skewness, clustering
       – Non-linearity, heteroscedasticity
     • Be mindful of good statistical practice, overreaching, overfitting, …

  3. What is the first rule of data analysis? Plot your data!

  4. Exploratory data analysis
     • Formalized by John Tukey
       – Guiding principle: let the data speak for themselves
     • Why do EDA?
       – Suggest new ideas or understandings
       – Reveal problematic assumptions made before data collection
       – Check on assumptions to be made in subsequent analysis
       – Suggest future research questions or directions

  5. Plots for a single variable
     [Figure: two histograms of total length (cm), frequency scale. Panel A: breaks at 72.5, 77.5, ...; Panel B: breaks at 75, 80, ...]

  6. Plots for a single variable
     [Figure: the same two histograms of total length (cm), density scale. Panel A: breaks at 72.5, 77.5, ...; Panel B: breaks at 75, 80, ...]
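The effect of break placement on slides 5 and 6 can be reproduced with a short sketch. The data below are simulated stand-ins (the slides' own measurements are not part of this transcript), and the sample size and distribution are arbitrary assumptions; only the two break sets follow the panel titles.

```r
# Simulated lengths in cm: an assumption for illustration, NOT the slide data.
set.seed(1)
total_length <- pmin(pmax(rnorm(43, mean = 87, sd = 4), 76), 97)

# The two break sets named in the panel titles.
break_sets <- list(
  "A: Breaks at 72.5, 77.5, ..." = seq(72.5, 97.5, by = 5),
  "B: Breaks at 75, 80, ..."     = seq(75, 100, by = 5)
)

# Top row: frequency scale. Bottom row: density scale with a density curve.
old_par <- par(mfrow = c(2, 2))
for (use_freq in c(TRUE, FALSE)) {
  for (panel in names(break_sets)) {
    hist(total_length, breaks = break_sets[[panel]], freq = use_freq,
         main = panel, xlab = "Total length (cm)")
    if (!use_freq) lines(density(total_length))
  }
}
par(old_par)

# Bin counts without plotting, to compare the two binnings numerically.
counts_a <- hist(total_length, breaks = break_sets[[1]], plot = FALSE)$counts
counts_b <- hist(total_length, breaks = break_sets[[2]], plot = FALSE)$counts
```

Comparing `counts_a` with `counts_b` shows how the same observations are distributed differently across bins when only the break positions shift.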

  7. Plots for a single variable
     [Figure: single-variable plot of total length (cm); axis from 75 to 95]

  8. Plots for bivariate data
     • Experiment with 17 tasters
       – Milk sample with 1 unit of sweetener
       – Milk sample with 4 units of sweetener
     • Each person rated the sweetness of the two samples

  9. Plots for bivariate data
     • 1:1 plot ratio
     • Rug shows where points lie on the axis
     • Most people think “four” is sweeter
     • There is a positive relationship between ratings
     [Figure: scatterplot of the rating for “four” against the rating for “one”, both axes from 2 to 7]
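A plot with the features listed on slide 9 (equal axis scales, rugs, and a reference line) might be drawn as follows. The ratings are invented purely for illustration; the names `one` and `four` follow the axis labels on the slide, and everything else is an assumption.

```r
# Hypothetical ratings for 17 tasters: invented values, NOT the study data.
set.seed(2)
one  <- round(runif(17, min = 2, max = 6), 1)
four <- pmin(one + round(runif(17, min = -0.5, max = 2), 1), 7)

# Shared limits plus asp = 1 give the 1:1 plot ratio.
lims <- range(c(one, four))
plot(four ~ one, xlim = lims, ylim = lims, asp = 1,
     xlab = "one", ylab = "four")

# Rugs mark where the points lie along each axis.
rug(one, side = 1)
rug(four, side = 2)

# Points above the line y = x are tasters who rated "four" as sweeter.
abline(a = 0, b = 1, lty = 2)
n_four_sweeter <- sum(four > one)
```

With equal scales on both axes, the line y = x becomes a direct visual test of which sample each taster found sweeter.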

  10. Plots for bivariate data
      [Figure: scatterplot of resistance (ohms) against apparent juice content (%)]

  11. Plots for bivariate data
      [Figure: scatterplot of resistance (ohms) against apparent juice content (%)]

  12. Plots for bivariate data
      [Figure: scatterplot of resistance (ohms) against apparent juice content (%)]

  13. Plots for bivariate data
      [Figure: two scatterplots. One shows brain against body on the original scales; the other shows log(brain) against log(body)]
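The contrast between the raw and log-scale panels on slide 13 can be sketched with simulated data. The power-law relationship, its coefficients, and the variable names `body_wt` and `brain_wt` are assumptions chosen for illustration; they are not the values behind the slide.

```r
# Simulated, strongly right-skewed weights: an assumption, NOT the slide data.
set.seed(3)
log_body  <- rnorm(30, mean = 3, sd = 3)
log_brain <- 2 + 0.75 * log_body + rnorm(30, sd = 0.5)  # assumed power law
body_wt   <- exp(log_body)
brain_wt  <- exp(log_brain)

old_par <- par(mfrow = c(1, 2))
# Raw scale: a few very large values squash the rest into a corner.
plot(body_wt, brain_wt, xlab = "body", ylab = "brain")
# Log scale: the points spread out and the relationship looks linear.
plot(log(body_wt), log(brain_wt), xlab = "log(body)", ylab = "log(brain)")
par(old_par)

# Correlation on the log scale, where a straight line is a sensible summary.
cor_log <- cor(log(body_wt), log(brain_wt))
```

Taking logs of both variables is a common choice when values span several orders of magnitude, since it spreads out the crowded small values and pulls in the extreme ones.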

  14. Clustering
      • Heights in inches of the singers in the New York Choral Society in 1979
      [Figure: density plot of height, axis from 60 to 80]

  15. Clustering
      [Figure: density plots of height (inches), one panel per voice part: Soprano 1, Soprano 2, Alto 1, Alto 2, Tenor 1, Tenor 2, Bass 1, Bass 2]
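Slides 14 and 15 can be approximated with the `singer` data set from the lattice package, whose description matches the slide text; treating it as the same data is an assumption. The sketch first pools all singers and then conditions on voice part, which is what reveals the clusters.

```r
library(lattice)  # provides the `singer` data and densityplot()

# Pooled density: all voice parts together.
print(densityplot(~ height, data = singer, xlab = "height"))

# One panel per voice part, arranged in 2 columns by 4 rows.
print(densityplot(~ height | voice.part, data = singer,
                  layout = c(2, 4), xlab = "Height (inches)"))

# A numeric companion to the panels: median height per voice part.
median_by_part <- tapply(singer$height, singer$voice.part, median)
```

The grouped medians in `median_by_part` give a quick check on what the panels show: the pooled distribution mixes groups with different typical heights.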

  16. Outliers
      • Require special treatment
      • Could be highly influential in subsequent modeling
      • Could suggest new understanding
      [Image credit: xkcd.com]
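How strongly one outlier can influence a later model is easy to demonstrate on simulated data. The sketch below is only an illustration under assumed values: a straight-line relationship, one artificially shifted point, and Cook's distance as the influence measure (one of several possible choices).

```r
# Simulated straight-line data: an assumption for illustration only.
set.seed(4)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20, sd = 0.5)

# Inject a single outlier at the high-leverage end of the x range.
y_out <- y
y_out[20] <- y[20] + 15

fit_clean <- lm(y ~ x)
fit_out   <- lm(y_out ~ x)

# One point is enough to pull the fitted slope upward.
slope_clean <- coef(fit_clean)[["x"]]
slope_out   <- coef(fit_out)[["x"]]

# Cook's distance identifies the observation driving the change.
most_influential <- which.max(cooks.distance(fit_out))
```

Comparing `slope_clean` with `slope_out` shows the shift caused by a single observation, and `most_influential` points back to the row that needs a closer look.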

  17. Conditioning plots
      • Earthquake data from a location near Fiji
      • Depth in km
      • Data since 1964
      [Figure: conditioning plot of lat against long, given depth]
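A conditioning plot like the one on slide 17 can be produced with base R's `coplot()`. The sketch assumes the slide used the built-in `quakes` data set, which matches the description (events near Fiji since 1964, with depth in km).

```r
# `quakes` ships with R: lat, long, depth (km), mag, stations.
# Each panel shows lat against long for one overlapping band of depth.
coplot(lat ~ long | depth, data = quakes)

# The depth bands can also be set explicitly; these settings are arbitrary.
depth_bands <- co.intervals(quakes$depth, number = 4, overlap = 0.1)
coplot(lat ~ long | depth, data = quakes, given.values = depth_bands)
```

Each row of `depth_bands` holds the lower and upper limit of one panel's depth interval, so the amount of overlap between panels is under direct control.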

  18. Scatterplot matrix
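The figure for slide 18 did not survive in this transcript, so the data it showed are unknown. As a stand-in, a scatterplot matrix of four variables from the built-in `quakes` data set can be drawn with `pairs()`; the choice of data set and columns is an assumption.

```r
# Stand-in data: four numeric columns of the built-in `quakes` data set.
quake_vars <- quakes[, c("lat", "long", "depth", "mag")]

# Every variable is plotted against every other in a grid of panels.
pairs(quake_vars)

# The matching correlation matrix gives a numeric view of the same pairs.
cor_mat <- cor(quake_vars)
```

The plot and `cor_mat` complement each other: the matrix summarizes linear association, while the panels reveal non-linearity and clustering that a single number hides.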

  19. Sparklines
      [Figure: sparklines titled “EU daily closing price indices: 1998”, each labeled with its range: DAX 4133-6186; SMI 6045-8412; CAC 2858-4388; FTSE 5014-6179]

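  20. (Sparklines R code)

      # Keep only the observations from 1998 onward
      EU <- window( EuStockMarkets, start = 1998 )
      # Four stacked panels; wide side margins leave room for the labels
      par( mfcol = c(4,1), mar = c(1,5,1,8)+0.1, oma = c(2,0,0,0) )
      for( i in 1:4 ){
        # Draw the series with no axes or axis labels
        plot( EU[,i], axes = FALSE, xlab = "", ylab = "" )
        # Range of the series, written in the right margin
        rr <- range( EU[,i] )
        mtext( paste( round(rr), collapse="-" ), 4, las = 1 )
        # Index name, written in the left margin
        mtext( colnames(EU)[i], 2, las = 1 )
      }
      # Overall title in the bottom outer margin
      mtext("EU daily closing price indices: 1998",1,outer=TRUE, line=0)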

  21. Summary statistics
      • Central tendency: mean, median, mode, …
      • Dispersion: standard deviation, IQR, range, …
      • Counts by group or category
      [Figure: “Survival on the Titanic”, counts broken down by Age (Child, Adult), Sex (Male, Female), and Survived (No, Yes)]
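The three kinds of summary on slide 21 map onto a handful of base R functions. In this sketch the numeric variable (`quakes$mag`) is an arbitrary stand-in, and the built-in `Titanic` table is assumed to be the source of the slide's figure.

```r
# Stand-in numeric variable for the centre and spread summaries.
mag <- quakes$mag

centre <- c(mean = mean(mag), median = median(mag))
spread <- c(sd = sd(mag), IQR = IQR(mag), range = diff(range(mag)))

# Counts by group: `Titanic` is a 4-way table (Class, Sex, Age, Survived).
# Summing over Class leaves counts by Sex, Age, and Survived.
counts <- margin.table(Titanic, c(2, 3, 4))
print(ftable(counts))

# Tile areas in a mosaic plot are proportional to the counts.
mosaicplot(~ Age + Sex + Survived, data = Titanic,
           main = "Survival on the Titanic")
```

`ftable()` flattens the three-way table into a compact layout that is easier to read than the default array printout.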

  22. The data analysis process
      • Moving from EDA into more directed data analysis, we begin to ask questions of the data
        – Questions motivated by scientific understanding
          • Testing hypotheses
          • Mechanism is important
        – Questions motivated by a goal to predict
          • Prediction performance is important
          • Mechanism is not necessarily important

  23. Observational vs Experimental Data
      • Experimental data are the gold standard
        – Randomization allows isolation of effects
        – Caution about generalizing results
      • Observational data are abundant
        – Experiments are not always possible
        – Features and relationships are difficult or impossible to isolate

  24. Data from surveys
      • Are we measuring what we think we are measuring?
        – Large field of research
        – Are we measuring the population of interest?
        – Non-response issues
        – Does the question measure what we are interested in?
          • e.g. Would like to know whether people support handgun ownership.
            – Poll people leaving a sporting goods store.
            – Ask: “Have you considered handgun ownership for self defense?”

  25. Planning ahead
      • The best time to plan data analysis is before the data are collected
        – Preliminary data or data from another study can be used to design the analysis and experiment/survey
      • The reality is that we are often asked to do data analysis after the fact
        – Although EDA can be useful, it is important to ask directed questions of the data to avoid fishing expeditions
        – Sometimes, it is not possible to answer a given question using a given dataset without resorting to unreasonable assumptions

  26. Stat 862 students
      • Reminder to see me this week about project alternative
      • “Proposal” due date is Monday October 6
