STAT 201 Defence against the Dark Arts (for the Life Sciences)
Instructor: Professor Lockhart
Office: 10561
E-mail: lockhart@sfu.ca
Phone: 3264
Web site: http://www.stat.sfu.ca/~lockhart
Text: The Basic Practice of Statistics, 3rd Edition, by David S. Moore, W.H. Freeman Publishers. On 2 hour reserve in library.
Lectures: Tuesday 10:30 – 12:10 without break. Thursday 10:30 – 11:20.
Help: go to Stat Workshop K9516 inside K9510.
Office Hours: Thursday 11:30 – 12:30; Friday 12:30 – 13:30
Assignments: six; first five marked. Due: every second Tuesday by 4:30 PM in boxes outside Stat Workshop. First due on 21 Sept. Late assignments not accepted. Worth 20% of mark, based on best 4 of 5 marked. Returned through Stat Workshop.
Exams: Midterm worth 30% on 21 October 2004. CLOSED BOOK. Final: 15 December 2004. OPEN BOOK, worth 50%.
Make-up exams: medical note required. Missed midterm will be replaced by Final.
Web material: slides posted on the web when possible. Assignment questions posted. Solutions posted evening after assignment due. Midterm solutions to be posted eventually. Extra material: perhaps but probably not.
Computing: some questions to be done using program JMP. Assignment lab accounts created. JMP available on PCs and Macs in assignment lab. JMP also available in Stat Workshop.
Outline
Topic                                    Hours     Chapters
Univariate Descriptive Statistics        4 hours   Chapters 1, 2, 3
Bivariate Descriptive Statistics         6 hours   Chapters 4, 5, 6
Experimental Design                      2 hours   ?
Probability                              3 hours   ?
Binomial, Poisson distributions          3 hours   ?
Hypothesis Tests, confidence intervals   2 hours   ?
Midterm                                  1 hour
Hypothesis Tests, confidence intervals   2 hours   ?
Two Sample tests                         3 hours   ?
Inference in Regression                  3 hours   ?
1 and 2 way ANOVA                        3 hours   ?
Count data                               3 hours   ?
Definition: Defence against the Dark Arts is the science of Data.
How should it be collected?
How should it be summarized?
How should it be displayed?
How should it be interpreted?
Where are the pitfalls?
Jargon
Usual structure of data set.
Individuals, subjects, cases, experimental units: all jargon for the people or animals or plants or things on which measurements are made.
Variables: the things measured.
Example: case by variable presentation. Data on sea urchins:
Urchin ID   Age    Size
3997        6.91   57.5
991         0.91   9.5
2163        2.41   29.5
15          0.49   0.5
2202        2.41   30.5
2862        3.42   44.5
1575        1.41   24.5
293         0.49   2.5
358         0.49   3.5
...         ...    ...
Comment: 9 cases (of 250) shown, 3 variables.
Comment: Notice poor scientific form: no units listed for Age or Size in on-line source.
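The same case by variable layout can be sketched in Python, assuming each case is stored as one record. The field names urchin_id, age and size are illustrative choices, not names from the on-line source.

```python
# Each dict is one case (one urchin); each key is one variable.
# Field names are hypothetical; the source lists no units for age or size.
urchins = [
    {"urchin_id": 3997, "age": 6.91, "size": 57.5},
    {"urchin_id": 991, "age": 0.91, "size": 9.5},
    {"urchin_id": 2163, "age": 2.41, "size": 29.5},
    {"urchin_id": 15, "age": 0.49, "size": 0.5},
    {"urchin_id": 2202, "age": 2.41, "size": 30.5},
    {"urchin_id": 2862, "age": 3.42, "size": 44.5},
    {"urchin_id": 1575, "age": 1.41, "size": 24.5},
    {"urchin_id": 293, "age": 0.49, "size": 2.5},
    {"urchin_id": 358, "age": 0.49, "size": 3.5},
]

n_cases = len(urchins)                  # number of rows (cases)
variables = list(urchins[0].keys())     # column names (variables)
sizes = [u["size"] for u in urchins]    # one variable, read down its column

print(n_cases, variables)
```

Reading across a record gives everything measured on one urchin; reading one key across all records gives one variable.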
Example: weather in Central Park, New York for May
Day   Max Temp   Sunshine   Weather
1     72         5          18
2     75         4          1
3     65         4          NA
4     63         0          NA
...   ...        ...        ...
Comment: 'Weather' is code. '1' means Fog. '18' not listed.
Comment: NA means not available.
Comment: '5' for sunshine means partly cloudy. '0' is clear.
Jargon: variable types.
Nominal: categories with no particular order.
Examples: Variable Sex has 2 "levels": Male and Female. Variable Eye colour has levels like "blue", "hazel", "brown".
Ordinal: categories with an order.
Examples: 5 point scales: "Paul Martin is doing a good job: Strongly Agree, Agree, Neutral, Disagree, Strongly Disagree". Sunshine in NY: 0 is most sunny, 10 is most cloudy.
Categorical: either Nominal or Ordinal. Also called "qualitative".
Quantitative: numerical variable like value in $, age, height, weight.
Interval: quantitative variable for which distance from 1 to 2 is same as from 3 to 4.
Ratio: like Interval but with a natural value for 0.
Discrete: used both for Categorical variables and for variables with only integer values.
Continuous: values between integers are possible (in principle as finely measured as desired).
Examples: Mass is ratio, temperature in degrees Celsius is interval, number of murders in a week in Vancouver is quantitative but discrete, temperature is continuous.
Note: 5 point scales ("Likert" scales in Psychology) are often assigned numbers, say 1-5 or 0-4. But is the difference between "Strongly agree" and "agree" the same as between "agree" and "neutral"?
Why the jargon?
Sometimes helps identify suitable methods of data presentation, summarization and analysis.
WARNING: many different forms of statistical jargon in use in different disciplines.
Social Sciences: use nominal, ordinal, interval and ratio.
Math Stat: use categorical, quantitative, discrete and continuous.
WARNING: all labels are sometimes open to debate. Is money "discrete"? (Integer number of pennies but huge number of possible values.)
Data Collection Exercise: VOLUNTARY
On blank sheet of paper please provide:
1) Height
2) Weight
3) Sex
4) Value of Coins in pocket / purse
5) SFU credits completed.
PLEASE DO NOT PUT YOUR NAME ON THIS.
Give to me at end of class or put in box outside Stat Workshop.
PURPOSE: provide data set to display and summarize
Univariate Descriptive Statistics
Displays: pie charts, bar graphs, box plots, histograms, density estimates, dot plots, stem-leaf plots, tables, lists.
Example: sea urchin sizes
[Figure: four panels of Urchin Size (mm): Boxplot; Histogram (y-axis: Number of Urchins); Dot Plot; Density (y-axis: Density).]
Points:
1) Useful for quantitative variables.
2) Boxplot shows five point summary: minimum, first quartile, median, third quartile, maximum.
3) Dot Plot illegible with 250 data points. (1 dot for each size plotted on line.)
4) Histogram, density plot serve similar purposes.
5) Density goes below 0: bad.
6) Histogram doesn't show the clustering that the density plot shows.
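The five point summary in point 2 can be sketched in Python for the 9 urchin sizes listed earlier. Quartile conventions differ between textbooks and programs; the "inclusive" rule of statistics.quantiles is one choice assumed here, so JMP or the text may report slightly different quartiles.

```python
import statistics

# The 9 urchin sizes shown in the case by variable example (units not given).
sizes = [57.5, 9.5, 29.5, 0.5, 30.5, 44.5, 24.5, 2.5, 3.5]


def five_point_summary(values):
    """Return (min, Q1, median, Q3, max) using the 'inclusive' quartile rule."""
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, med, q3, max(values))


summary = five_point_summary(sizes)
print(summary)  # (0.5, 3.5, 24.5, 30.5, 57.5)
```

These five numbers are what a boxplot draws: the box runs from Q1 to Q3 with a line at the median.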
Example: Categorical: Weather in Central Park
[Figure: Pie Chart and Bar Graph of the categories clear, partly cloudy, cloudy.]
Pie chart harder to read. General summary: Pie Charts are bad. More useful with more categories.
Ordering of categories important for nominal variables. Cloudiness is ordinal.
Pie charts: wedge has area proportional to # of individuals in category.
Bar chart: bar has height equal to # of individuals in category.
Density estimates not discussed in this course.
Histogram:
1) divide range of values into intervals.
2) Count numbers of individuals in each interval.
3) bar AREA is proportional to # of individuals in interval; width is length of interval.
4) equal width bars best; then height proportional to # of individuals.
5) label x-axis; include units.
6) label y-axis.
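The histogram steps can be sketched in Python. The data values and interval edges below are made up for illustration. The point of the sketch is step 3: with unequal widths, the height is taken as (fraction in interval) / (width), so that area, not height, represents the fraction of individuals.

```python
def histogram_bars(values, edges):
    """Return (left, right, count, height) for each interval [left, right).

    Height is chosen so that bar area = width * height = fraction of values.
    """
    n = len(values)
    bars = []
    for left, right in zip(edges[:-1], edges[1:]):
        count = sum(1 for v in values if left <= v < right)   # step 2
        height = (count / n) / (right - left)                 # step 3
        bars.append((left, right, count, height))
    return bars


# Hypothetical measurements and deliberately unequal interval widths.
values = [3, 8, 12, 15, 18, 22, 27, 35, 48, 70]
edges = [0, 10, 20, 40, 100]                                  # step 1

bars = histogram_bars(values, edges)
for left, right, count, height in bars:
    print(f"[{left}, {right}): count={count}, height={height:.4f}")

# The bar areas are the fractions in each interval, so they add up to 1.
total_area = sum((right - left) * height for left, right, _, height in bars)
```

With equal widths the division by width is the same for every bar, which is why height alone is then proportional to the count.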
Example: Personal Income for BC (ages 15+). (For those with income.) Source: 2001 Census.
[Figure: histogram titled Adult Personal Income (BC); x-axis: Income ($000s), 0 to 100; y-axis from 0.00 to 0.03.]
Points
1) Bar widths unequal: census tables given that way.
2) So take width times height to get area = fraction of population in that income group.
3) Last group on right open ended; artificially cut off at $100,000 by me.
4) Plot is "long-tailed to the right" or "skewed to the right".
5) Based on 20% sample of 1,523,720 people aged 15+ in BC on census day, 2001.
Comparison of 1996, 2001.
[Figure: two panels, 1996 Income Density and 2001 Income Density, each on an x-axis from 0 to 100.]
Summarizing the pictures.
Purposes: less space in text than a graph; precise numerical comparison between groups.
Summarizing a histogram:
Where is centre of the x-axis values? Jargon: location or centre.
How far do the x values extend on either side? Jargon: spread, variation, width.
Is the picture symmetric or does it extend farther to right than left?
Location and number of bumps.
Measures of location:
Mean, Arithmetic Mean, Average, Arithmetic Average: total of x-values divided by number of x values.
Histogram balances at mean. (First Moment in physics.) Think of See-Saw: small kid far from centre balances big kid close to centre.
Formula: data X_1, ..., X_n.
\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}
Utility of summation notation in this course: NIL. But \bar{X} is standard notation for average of X.
Median: number such that at least 1/2 of X values are at least that large, and at least 1/2 of X values are at least that small.
Sort list: if n is odd, median is middle of sorted list. If n is even, take average of two middle values.
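The balancing property can be checked numerically: deviations from the mean sum to zero, so one value far from the centre offsets several values closer in on the other side. A small Python sketch with made-up numbers:

```python
xs = [1, 2, 3, 10]                    # hypothetical data, one value far to the right

mean = sum(xs) / len(xs)              # total of x-values divided by number of values
deviations = [x - mean for x in xs]   # signed distance of each value from the mean
balance = sum(deviations)             # zero: the see-saw balances at the mean

print(mean, deviations, balance)      # 4.0 [-3.0, -2.0, -1.0, 6.0] 0.0
```

Here the single value 10 sits 6 units to the right of the mean and exactly offsets the three values sitting 3, 2 and 1 units to the left.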
Numerical examples: ages in my family: 50, 50, 20, 15, 8, 8.
\bar{A} = \frac{50 + 50 + 20 + 15 + 8 + 8}{6} = \frac{151}{6} \approx 25.2
Median age: middle numbers are 15, 20. Halfway between is median = 17.5.
Mode: most common value. Not useful concept in most cases. Location of tallest bar in histogram (affected by definition of classes).
Mode of ages is not unique: 50 or 8. Not useful summary of centre.
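The same three summaries can be reproduced with Python's standard statistics module; this is only a check of the arithmetic above, not part of the course software (the course uses JMP).

```python
import statistics

ages = [50, 50, 20, 15, 8, 8]

mean_age = statistics.mean(ages)      # 151 / 6
median_age = statistics.median(ages)  # average of the two middle values, 15 and 20
modes = statistics.multimode(ages)    # every most-common value, in order first seen

print(round(mean_age, 1), median_age, modes)  # 25.2 17.5 [50, 8]
```

multimode is used rather than mode because the most common age is not unique here.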
Comparison:
Advantages of mean:
1) if your average weekly income is $100 you know how you will do in the long run; not so if median weekly income is $100.
2) Same point: average and sample size tells you total.
3) Has simpler mathematical behaviour than median.
Advantages of median:
Not influenced by extreme members of list.
Median income, for instance, gives more information about typical person.
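A short Python sketch of both points, using made-up weekly incomes: the mean times the number of weeks recovers the total, while replacing one value with an extreme one moves the mean but leaves the median alone.

```python
import statistics

weekly = [80, 90, 100, 110, 120]          # hypothetical weekly incomes ($)
with_extreme = [80, 90, 100, 110, 5000]   # same list with one extreme week

# Advantage of the mean: average and sample size give the total.
total = statistics.mean(weekly) * len(weekly)

# Advantage of the median: the extreme week does not move it.
median_before = statistics.median(weekly)
median_after = statistics.median(with_extreme)
mean_before = statistics.mean(weekly)
mean_after = statistics.mean(with_extreme)

print(total, median_before, median_after, mean_before, mean_after)
# 500 100 100 100 1076
```

The median alone says nothing about the total: both lists have median $100, but their totals are $500 and $5380.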