Statistics Toolbox in R: A Review of Analysis Techniques for Scientific Research
Professional Development Opportunity for the Flow Cytometry Core Facility
December 7, 2018
LKG Consulting | Email: consulting.lkg@gmail.com | Website:


  1. The Lentil datasets (You are now a farmer)
(Diagram: two farms, each divided into plots planted with lentil varieties A, B, and C; individual lentil plants within each plot.)
Example research questions:
• Do yields of the different lentil varieties differ at the 2 farms?
• Do the varieties differ among themselves?
• Does the density of the plants impact their average height?

  2. Datasets Available in R
• Over 100 datasets are available for you to use
• We will use:
  • iris: The famous (Fisher's or Anderson's) iris data set gives the sepal and petal measurements for 50 flowers from each of Iris setosa, versicolor, and virginica.
  • USArrests: Data on US arrests for violent crimes by US state.

  3. Hypothesis Testing
“Statistics are no substitute for judgment.” — Henry Clay (former US Senator)

  4. Formal hypothesis testing
We draw samples from populations A and B, compare the sample mean heights, and ask: “Is this difference due to random chance?”
• H0: x̄_A = x̄_B (null hypothesis)
• H1: x̄_A ≠ x̄_B (alternative hypothesis)
If the actual p < α, reject the null hypothesis (H0) and accept the alternative hypothesis (H1).
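A minimal sketch of this decision rule in R, with two made-up samples of plant heights (the vectors and the means are illustrative, not workshop data):
  # Hypothetical samples from populations A and B
  set.seed(42)
  heights.A <- rnorm(30, mean = 52, sd = 4)
  heights.B <- rnorm(30, mean = 55, sd = 4)
  # Two-sample t-test of H0: mean(A) == mean(B)
  result <- t.test(heights.A, heights.B)
  result$p.value            # probability of a difference this large (or larger) by chance
  result$p.value < 0.05     # TRUE -> reject H0 at alpha = 0.05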

  5. “Is this difference due to random chance?”
In other words: “Is random chance a plausible explanation?”
• P-value – the probability of observing a value this large or larger due to random chance alone
• Theory: we can never really prove that the 2 samples are truly different or the same – we can only ask whether what we observe (or a greater difference) is plausibly due to random chance
How to interpret p-values:
• P-value = 0.05 – “Yes, 1 out of 20 times.”
• P-value = 0.01 – “Yes, 1 out of 100 times.”
The lower the probability that a difference is due to random chance, the more likely the result reflects a real effect (what we test for).

  6. Type I and Type II Errors
                          Null hypothesis is true            Alternative hypothesis is true
Fail to reject the null   Correct decision                   Incorrect decision – Type II Error (false negative)
Reject the null           Incorrect decision – Type I Error  Correct decision
                          (false positive)
• Type I Error – rejecting the null hypothesis (H0) when it is actually true
• Type II Error – failing to reject the null hypothesis (H0) when it is not true
Remember: rejection or acceptance based on a p-value (and therefore the chance you will make an error) depends on the arbitrary α-level you choose.
• Lowering the α-level will lower the probability of making a Type I Error, but this raises the probability of making a Type II Error.
The α-level you choose is completely up to you (typically it is set at 0.05); however, it should be chosen with consideration of the consequences of making a Type I or a Type II Error. Based on your study, would you rather err on the side of false positives or false negatives?

  7. Example: Will current forests adequately protect genetic resources under climate change?
H0: Range of the current climate for the BMW protected area = range of the BMW protected area under climate change
Ha: Range of the current climate for the BMW protected area ≠ range of the BMW protected area under climate change
If we reject H0: climate ranges are different, therefore genetic resources are not adequately protected and new Birch Mountain Wildlands (BMW) protected areas need to be created.
Consequences if I make:
• Type I Error: climates are actually the same and genetic resources are indeed adequately protected in the BMW protected area – we created new parks when we didn’t need to
• Type II Error: climates are different and genetic resources are vulnerable – we didn’t create new protected areas and we should have
From an ecological standpoint it is better to make a Type I Error, but from an economic standpoint it is better to make a Type II Error. Which standpoint should I take?

  8. Statistical Power
Power is your ability to reject the null hypothesis when it is false (i.e. your ability to detect an effect when there is one). There are many ways to increase power:
1. Increase your sample size (sample more of the population). Given you are testing whether what you observed (or greater) is due to random chance, more data gives you a better understanding of what is truly happening within the population; a larger sample size therefore lowers the probability of making a Type II Error.
2. Increase your alpha value (e.g. from 0.01 to 0.05) – watch for Type I Error!
3. Use a one-tailed test (you know the direction of the expected effect).
4. Use a paired test (control and treatment are the same sample).

  9. BREAK 9:45 – 10:00 Go grab a coffee. Next we will cover specific tools in your new tool box.

  10. Statistics Toolbox: Parametric versus Non-Parametric Tests
“He uses statistics as a drunken man uses lamp posts, for support rather than illumination.” — Andrew Lang (Scottish poet)

  11. Univariate Test Options
Parametric tests:
• Characteristics: analysis to test group means; based on raw data; more statistical power than non-parametric tests
• Assumptions: independent samples; normality (of data OR errors); homogeneity of variances
• When to use: parametric assumptions are met; or non-normal data BUT a larger sample size (CLT), as long as equal variances are met
• Examples: t-test; ANOVA (one-way, two-way, paired)
Non-parametric tests:
• Characteristics: analysis to test group medians; based on ranked data; less statistical power
• Assumptions: independent samples
• When to use: parametric assumptions are not met; medians better represent your data (skewed data distribution); small sample size; ordinal data, ranked data, or outliers that you can’t remove
• Examples: Wilcoxon rank-sum test; Kruskal-Wallis test; permutational tests (non-traditional)

  12. Assumption #1: Independence of samples
“Your samples have to come from a randomized or randomly sampled design.”
• Meaning rows in your data do NOT influence one another
• Address this with experimental design (3 main things to consider):
1. Avoid pseudoreplication and potential confounding factors by designing your experiment in a randomized design
2. Avoid systematic arrangements, which are distinct patterns in how treatments are laid out. If your treatments affect one another, the individual treatment effects could be masked or overinflated
3. Maintain temporal independence. If you need to take multiple samples from one individual over time, record and test your data considering the change in time (e.g. paired tests)
NOTE: ANOVA needs at least 1 degree of freedom for error – this means you need at least 2 reps per treatment to execute an ANOVA
Rule of Thumb: you need more rows than columns

  13. The Normal Distribution
The base of parametric statistics.
s² = Σᵢ₌₁ⁿ (xᵢ − x̄)² / (n − 1),  SD = √s²
Based on this curve:
• 68.27% of observations are within 1 SD of x̄
• 95.45% of observations are within 2 SD of x̄
• 99.73% of observations are within 3 SD of x̄
For confidence intervals:
• 95% of observations are within 1.96 SD of x̄

  14. Assumption #2: Data/experimental errors are normally distributed
“If I were to repeat my sample repeatedly and calculate the means, those means would be normally distributed.”
(Diagram: residuals of Farm 1 lentil yields around the variety means x̄_A, x̄_B, x̄_C and the overall mean x̄_ABC.)
Determine if the assumption is met by:
1. Looking at the residuals of your sample
2. Shapiro-Wilk test for normality – if your data are mainly unique values
3. D'Agostino-Pearson normality test – if you have lots of repeated values
4. Lilliefors normality test – mean and variance are unknown
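A quick sketch of check #2 in R, using the built-in iris data as a stand-in for your own measurements or model residuals:
  shapiro.test(iris$Sepal.Width)              # H0: the sample comes from a normal distribution
  fit <- lm(Sepal.Length ~ Species, data = iris)
  shapiro.test(residuals(fit))                # usually applied to the model residuals
  qqnorm(residuals(fit)); qqline(residuals(fit))   # visual check alongside the formal test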

  15. Assumption #2: Data/experimental errors are normally distributed
You may not need to worry about normality.
Central Limit Theorem: “Sample means tend to cluster around the central population value.”
Therefore:
• When the sample size is large, you can assume that x̄ is close to the value of μ
• With a small sample size you have a better chance of getting a mean that is far off the true population mean
(Figure: normal distribution vs. t-distribution (sampling distribution).)

  16. Assumption #2: Data/experimental errors are normally distributed (continued)
What does this mean?
• For large N, the assumption of normality can be relaxed
• You may have decreased power to detect a difference among groups, BUT your test is not really compromised if your residuals are not normal
• The assumption of normality is important when:
1. N is very small
2. Data are highly non-normal
3. Significant outliers are present
4. The effect size is small

  17. Assumption #3: Equal variances between groups/treatments
Example: x̄_A = 12 with s_A = 4, versus x̄_B = 12 with s_B = 6 (same mean, different spread on a 0–24 scale).
Let’s say 5% of the A data fall above some threshold, but more than 5% of the B data fall above the same threshold.
So with larger variances, you can expect a greater number of observations at the extremes of the distributions.
This can have real implications for the inferences we make from comparisons between groups.

  18. Assumption #3: Equal variances between treatments
“Does the known probability of observations hold true for both of my samples?”
(Diagram: residuals of Farm 1 lentil yields around the variety means x̄_A, x̄_B, x̄_C and the overall mean x̄_ABC.)
Determine if the assumption is met by:
1. Looking at the residuals of your sample
2. Bartlett test
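A minimal version of both checks in R, again using iris as a stand-in for your own groups:
  bartlett.test(Sepal.Length ~ Species, data = iris)   # H0: variances are equal across groups
  fit <- lm(Sepal.Length ~ Species, data = iris)
  plot(fit, which = 1)                                  # residuals vs fitted, for a visual check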

  19. Assumption #3: Equal variances between treatments
Testing for equal variances – residual plots (residuals plotted against predicted values, with residual = 0 as the reference line):
• NORMAL distribution (equal number of points along the observed axis) and EQUAL variances (equal spread on either side of the mean predicted value = 0) → good to go!
• NON-NORMAL distribution (unequal number of points along the observed axis) but EQUAL variances (equal spread on either side of the mean predicted value = 0) → optional to fix
• NORMAL or NON-NORMAL (look at a histogram or a test) but UNEQUAL variances (cone shape, widening away from or towards zero) → this needs to be fixed for ANOVA (transformations)
• OUTLIERS (points that deviate from the majority of data points) → this needs to be fixed for ANOVA (transformations or removal)

  20. Analysis of Variance (ANOVA) – Vocabulary
• Treatment – predictor variable (e.g. variety, fertilization, irrigation, etc.)
• Treatment level – groups within treatments (e.g. A, B, C or Control, 1xN, 2xN)
• Covariate – undesired, uncontrolled predictor variable; confounding
• F-value = variation between / variation within = mean square treatment / mean square error
• P-value – probability that the observed difference (or larger) in the treatment means is due to random chance

  21. Analysis of Variance (ANOVA)
(Diagram: Farm 1 lentil yields – the spread of the variety means x̄_A, x̄_B, x̄_C around the overall mean x̄_ABC is the SIGNAL; the residual spread within each variety is the NOISE.)
F-value = variation between / variation within = mean square treatment / mean square error = signal / noise
variance within = (Σᵢ varianceᵢ) / n
variance between = [Σᵢ (x̄ᵢ − x̄_ALL)² / (n − 1)] × r, where n is the number of treatment groups and r the number of observations per treatment

  22. Analysis of Variance (ANOVA)
Think of Pac-Man!
• All of the dots on the board represent the total variation in your study
• Every treatment you use in your analysis is a different Pac-Man player on the board
• The dots each player eats represent the variation between (i.e. the amount of variation each treatment can explain)
• The dots left on the board after all players have died represent the variation within
F = signal / noise = variance between / variance within
• If players have a big effect they will eat more dots, reducing the dots left on the board (lowering the variation within) and increasing the F-value
• A large F-value indicates a significant difference

  23. F Distribution (a family of distributions)
F = signal / noise = variance between / variance within
• signal < noise → F < 1; signal > noise → F > 1
variance between = [Σᵢ (x̄ᵢ − x̄_ALL)² / (n − 1)] × n.obs.per.treatment
variance within = (Σᵢ varianceᵢ) / n
For a chosen α (e.g. 0.05), the p-value is the area of the F distribution beyond the observed F. In R:
• pf(F, df1, df2) – probabilities (percentiles) of the F distribution
• qf(p, df1, df2) – quantiles of the F distribution
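A small sketch of those two functions (the degrees of freedom and the F statistic here are only illustrative):
  # Critical F for alpha = 0.05 with df1 = 2 (treatment) and df2 = 18 (error)
  qf(0.95, df1 = 2, df2 = 18)
  # P-value for an observed F of, say, 4.7 with the same degrees of freedom
  pf(4.7, df1 = 2, df2 = 18, lower.tail = FALSE)   # equivalently: 1 - pf(4.7, 2, 18)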

  24. How to report results from an ANOVA
Source of Variation     df   Sum of Squares   Mean Squares     F-value    P-value
Variety (A)              2             20             10        0.0263     0.9741
Farm (B)                 1         435243         435243     1125.7085      <0.05
Variety x Farm (AxB)     2         253561         126781      327.9046      <0.05
Error                   18           6959            387

  25. How to report results from an ANOVA – where the values come from
Source of Variation     df   Sum of Squares   Mean Squares     F-value    P-value
Variety (A)              2             20             10        0.0263     0.9741
Farm (B)                 1         435243         435243     1125.7085      <0.05
Variety x Farm (AxB)     2         253561         126781      327.9046      <0.05
Error                   18           6959            387
• Sum of Squares = Mean Squares × df for each row (SS_A = MS_A × df_A, …, SS_ERROR = MS_ERROR × df_ERROR)
• F-value = MS of the effect divided by MS_ERROR (MS_A/MS_ERROR, MS_B/MS_ERROR, MS_AxB/MS_ERROR)
• P-value = upper tail of the F distribution, e.g. pf(F_A, df_A, df_ERROR, lower.tail = FALSE)
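A sketch of how R produces such a table, using a made-up balanced lentil design (the data frame and effect sizes are invented; only the structure matches the slide's example):
  # Hypothetical balanced design: 2 farms x 3 varieties x 4 reps = 24 plots
  set.seed(3)
  lentil <- expand.grid(REP = 1:4, VARIETY = c("A", "B", "C"), FARM = c("Farm1", "Farm2"))
  lentil$YIELD <- 100 + 30 * (lentil$FARM == "Farm2") + rnorm(nrow(lentil), sd = 8)
  fit <- aov(YIELD ~ VARIETY * FARM, data = lentil)
  summary(fit)   # df, Sum Sq, Mean Sq, F value, Pr(>F) for each source of variation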

  26. How to report results from an ANOVA
(Same ANOVA table as above.) If the interaction is significant, you should ignore the main effects because the story is not that simple!

  27. Interaction plots – Different story under different conditions
(Four yield plots, each showing varieties A and B across Farm1 and Farm2:)
1. VARIETY is significant (*); FARM is significant (*); FARM2 has better yield than FARM1; no interaction
2. VARIETY is not significant; FARM is significant (*); VARIETY A is better on FARM2 and VARIETY B is better on FARM1; significant interaction
3. VARIETY is significant (*); FARM is significant (*) – small difference; main effects are significant, BUT hard to interpret with overall means; significant interaction
4. VARIETY is not significant; FARM is not significant; cannot distinguish a difference between VARIETY or FARM; no interaction

  28. Interaction plots – Different story under different conditions
• An interaction detects non-parallel lines
• Interaction plots are difficult to interpret for more than a 2-way ANOVA
• If the interaction effect is NOT significant then you can just interpret the main effects
• BUT if you find a significant interaction you don’t want to interpret main effects, because the combination of treatment levels results in different outcomes

  29. Pairwise comparisons – What to do when you have an interaction (a.k.a. pairwise t-tests)
Number of comparisons: C = t(t − 1) / 2, where t = number of treatment levels
Lentil example: 3 VARIETIES (A, B, and C) → C = 3(2)/2 = 3 comparisons (A–B, A–C, B–C)
Probability of making a Type I Error in at least one comparison = 1 − probability of making no Type I Error at all
Experiment-wise Type I Error for α = 0.05:
probability of Type I Error = 1 − 0.95^C = 1 − 0.95³ = 1 − 0.857 ≈ 0.14
A significantly increased probability of making an error! Therefore pairwise comparisons lead to a compromised experiment-wise α-level.
You can correct for multiple comparisons by calculating an adjusted p-value (Bonferroni, Holm, etc.)
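A short sketch of that correction in R, using the built-in PlantGrowth data (one response, three treatment levels) as a stand-in:
  # Pairwise t-tests with Holm-adjusted p-values
  pairwise.t.test(PlantGrowth$weight, PlantGrowth$group, p.adjust.method = "holm")
  # Or adjust an existing vector of raw p-values directly (values are illustrative)
  raw.p <- c(0.012, 0.049, 0.380)
  p.adjust(raw.p, method = "bonferroni")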

  30. Pairwise comparisons – Tukey Honest Significant Differences test (a.k.a. pairwise t-tests with adjusted p-values)
• If we have NO significant interaction effect – we can just look at the main-effect comparisons
• If we have a significant interaction effect – use the interaction-level comparisons
• Only consider the relevant pairwise comparisons – think about it logically
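In R this is one call on a fitted aov object; a sketch with the built-in warpbreaks data (two factors, so both main-effect and interaction comparisons are produced):
  fit <- aov(breaks ~ wool * tension, data = warpbreaks)
  TukeyHSD(fit)                        # all main-effect and interaction comparisons
  TukeyHSD(fit, which = "tension")     # restrict to the comparisons you actually care about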

  31. How to report a significant difference in a graph
Create a matrix of significance and use it to code your graph, e.g. for groups W, X, Y, Z:
      W     X     Y     Z
W     -    NS     *    NS
X           -     *    NS
Y                 -    NS
Z                       -
Same letter = not significant; different letter = significant (here the bars would be labelled W = A, X = A, Y = B, Z = A,B).

  32. Plotting “bars” – What do you want to show in your graph?
Standard error: “How confident are we in our statistic?”
• Standard error – the standard deviation of a statistic
• Standard error of the mean – reflects the overall distribution of the means you would get from repeatedly resampling: SE_x̄ = s / √n
• Small values = the more representative the sample is of the overall population; large values = the less likely the sample adequately represents the overall population
Standard deviation: “How much dispersion from the average exists?”
• Standard deviation – the amount of variation or dispersion within a set of data values: s² = Σᵢ₌₁ⁿ (xᵢ − x̄)² / (n − 1), s = √s²
• Small values = data points are very close to the mean; large values = data points are spread out over a wide range
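Neither quantity needs a package; a minimal sketch for an arbitrary numeric vector x (the values are made up):
  x  <- c(12, 15, 11, 14, 13, 16, 12)
  s  <- sd(x)                          # standard deviation
  se <- sd(x) / sqrt(length(x))        # standard error of the mean
  c(sd = s, se = se)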

  33. Nonparametric Tests – When assumptions fail
• NPTs make no assumptions about normality, equal variances, or outliers
• Because of this lack of assumptions, NPTs are not as powerful as standard parametric tests
• NPTs work with ranked data
• If you were to repeatedly sample from the same non-normal population and repeatedly calculate the difference in rank sums, the distribution of your differences would appear normal with a mean of zero
• The spread of rank-sum data (variance) is a function of your sample size (maximum rank value)
• P-value: “What is the probability that I get a difference as big or bigger in my rank sums by random chance?”
• If there are no treatment effects, the expectation is that the difference among rank sums is zero

  34. Nonparametric Tests – When assumptions fail
T-test equivalents when your data distributions are similarly shaped:
• Wilcoxon Signed-Rank Test – (one-sample t-test) tests a hypothesis about the location (median) of a population distribution
• Wilcoxon Mann-Whitney Test – (two-sample t-test) tests the null hypothesis that two populations have identical distribution functions against the alternative hypothesis that the two distribution functions differ only with respect to location (median), if at all
T-test equivalent when distributions are of different shape:
• Kolmogorov-Smirnov Test (less powerful than the Wilcoxon rank-sum tests) – the one-sample form tests whether the sample of data is consistent with a specified distribution function; the two-sample form tests whether the two samples may reasonably be assumed to come from the same distribution
One-way ANOVA equivalent for non-normal distributions:
• Kruskal-Wallis Test – tests the null hypothesis that all populations have identical distribution functions against the alternative hypothesis that at least two of the samples differ only with respect to location (median), if at all
• Interaction rules apply, so a significant interaction must be followed up by pairwise Wilcoxon tests comparing each of the treatment levels
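The corresponding base-R calls, sketched on the built-in PlantGrowth data (used here purely as a stand-in for skewed or ranked data):
  ctrl <- PlantGrowth$weight[PlantGrowth$group == "ctrl"]
  trt1 <- PlantGrowth$weight[PlantGrowth$group == "trt1"]
  wilcox.test(ctrl, mu = 5)                       # one-sample Wilcoxon signed-rank test
  wilcox.test(ctrl, trt1)                         # two-sample Wilcoxon (Mann-Whitney) test
  ks.test(ctrl, trt1)                             # two-sample Kolmogorov-Smirnov test
  kruskal.test(weight ~ group, data = PlantGrowth)   # Kruskal-Wallis across all groups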

  35. Permutational Non-parametric tests – Calculating D (delta) and its distribution
• PNPTs make NO assumptions, therefore any data can be used
• PNPTs work with absolute differences, a.k.a. distances
• Smaller values indicate similarity
• This makes the calculations equivalent to sums of squares
D = signal / noise = distance between groups / distance within groups
• For our test we can compare D to an expected distribution of D, the same way we do when we calculate an F-value
• Use permutations (iterations) to generate the distribution of D from our raw data
• Therefore the shape of the D distribution is dependent on your data

  36. Permutational Non-parametric tests – Determining the distribution of D
• After you repeat the permutation process 5000 times (your choice), a distribution of D will emerge
• The shape depends on your data – it may be normal or not (it doesn’t matter)
(Figure: histogram of permuted D values, with the observed D = 10 marked in the right tail.)

  37. Permutational Non-parametric tests – Determining the distribution of D
• After you repeat the permutation process 5000 times (your choice), a distribution of D will emerge
• The shape depends on your data – it may be normal or not (it doesn’t matter)
• 79 permuted D values ≥ 10 (the observed D); 4921 permuted D values < 10
• P-value: 79/5000 = 0.0158
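A bare-bones sketch of the idea in base R, using a simple difference in group means as the statistic (the data are invented; the actual analyses on the next slide use the lmPerm package instead):
  set.seed(4)
  yield <- c(rnorm(15, 100, 10), rnorm(15, 112, 10))
  group <- rep(c("A", "B"), each = 15)
  obs.D <- abs(diff(tapply(yield, group, mean)))          # observed statistic
  perm.D <- replicate(5000, {
    shuffled <- sample(group)                             # reshuffle labels to break any real effect
    abs(diff(tapply(yield, shuffled, mean)))
  })
  mean(perm.D >= obs.D)                                   # permutation p-value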

  38. Permutational Non-parametric tests in R
Permutational ANOVA in R:
  library(lmPerm)
  summary(aovp(YIELD ~ FARM*VARIETY, seqs = T))
Pairwise permutational ANOVA in R:
  out1 = aovp(YIELD ~ FARM*VARIETY, seqs = T)
  TukeyHSD(out1)
• The option seqs=T calculates sequential sums of squares (similar to a regular ANOVA) – a good choice for balanced designs
• You can change the maximum number of iterations with the maxIter= option

  39. Permutational Non-parametric tests
• For parametric tests we know what the Normal, t-, and F-distributions look like
• Therefore we can use the standard calculations (e.g. a t-value) to calculate statistics
• When we violate the known distribution we need some other curve to work with
• It is hard to estimate a theoretical distribution that fits your data, so the best solution is to permute your data to generate a distribution
• Permutational non-parametric statistics are just as powerful as parametric tests
• This technique is similar to bootstrapping, but the bootstrap resamples the data rather than reshuffling all observation classes

  40. If permutational techniques are so good, why not always use them?
• You say permutational non-parametric tests are as powerful as parametric statistics – yes, they are!
• But they are still fairly new to statistical practice
• They are still unknown or not well understood among many users
• Best practice is to stick with parametric statistics when you can; but when you can’t, permutational tests are great options!

  41. WORK PERIOD 11:00 – 12:15 Follow the Workbook Examples for the Analyses You are Interested In. Any questions?

  42. LUNCH 12:15 – 1:00 Get some brain food – more statistics coming this afternoon!

  43. Statistics Toolbox: Regression
“If the statistics are boring, then you've got the wrong numbers.” — Edward R. Tufte (Statistics Professor, Yale University)

  44. Correlation coefficients (r) range from -1 to 1
• Positive relationship: r = 1 (perfect) or 1 > r > 0; an increase in X = an increase in Y; r = 1 doesn’t have to be a one-to-one relationship
• Negative relationship: r = -1 (perfect) or -1 < r < 0; an increase in X = a decrease in Y; r = -1 doesn’t have to be a one-to-one relationship
• No relationship: r = 0; an increase in X has no (or no consistent) effect on Y

  45. Correlation Methods – Comparison between methods
• Pearson’s correlation: requires parametric assumptions; both the order (direction) and the magnitude of the relationship in the data values are captured
• Kendall’s and Spearman’s correlation: non-parametric (based on ranks); the order (direction) of the relationship is captured, but the magnitude cannot be taken from the coefficient because it is based on ranks, not raw data
  – Be careful with inferences made with these: the sign (positive vs. negative) is OK, but the magnitude is misleading
• Kendall and Spearman coefficients will likely be larger than Pearson coefficients for the same data because the coefficients are calculated on ranks rather than the raw data
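In R all three come from the same function; a quick sketch on two of the iris measurements:
  cor.test(iris$Sepal.Length, iris$Petal.Length, method = "pearson")
  cor.test(iris$Sepal.Length, iris$Petal.Length, method = "spearman")
  cor.test(iris$Sepal.Length, iris$Petal.Length, method = "kendall")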

  46. Dealing with Multiple Inferences – Making inferences from tables of correlation coefficients and p-values
If we want to use multiple correlation coefficients and p-values to make general conclusions, we need to be cautious about inflating our Type I Error due to the multiple tests/comparisons.
Research question: Does lentil growth depend on climate?
Climate variable   Correlation w/ growth (r²)   p-value
Temp Jan           0.03                         0.4700
Temp Feb           0.24                         0.2631
Temp Mar           0.38                         0.1235
Temp Apr           0.66                         0.0063
Temp May           0.57                         0.0236
Temp Jun           0.46                         0.1465
Temp Jul           0.86                         0.0001
Temp Aug           0.81                         0.0036
Temp Sep           0.62                         0.0669
Temp Oct           0.43                         0.1801
Temp Nov           0.46                         0.1465
Temp Dec           0.07                         0.4282
Answer (based on a cursory examination of this table): Yes, there are significant relationships with temperature in April, May, July, and August at α = 0.05.
But this is not quite right – we need to adjust the p-values for multiple inferences.
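The adjustment itself is one call in R; a sketch using the twelve monthly p-values from the table above:
  p.monthly <- c(0.4700, 0.2631, 0.1235, 0.0063, 0.0236, 0.1465,
                 0.0001, 0.0036, 0.0669, 0.1801, 0.1465, 0.4282)
  names(p.monthly) <- month.abb
  p.adjust(p.monthly, method = "holm")          # Holm correction
  p.adjust(p.monthly, method = "bonferroni")    # more conservative alternative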

  47. Important to Remember
Correlation DOES NOT imply causation! A relationship DOES NOT imply causation!
Both of these values imply a relationship rather than one factor causing another.
Be careful with your interpretations!

  48. Correlation vs Causation
Example: if you look at historic records, there is a highly significant positive correlation between ice cream sales and the number of drowning deaths.
Do you think drowning deaths cause ice cream sales to increase? Of course NOT!
Both occur in the summer months – therefore there is another mechanism responsible for the observed relationship.

  49. Linear Regression – Output from R
• Estimates of the model parameters (intercept and slope)
• Standard errors of the estimates
• Coefficient of determination (R²), a.k.a. “goodness of fit” – a measure of how close the data are to the fitted regression line
• The overall F-test gives the significance of the relationship described by the model
• Each coefficient's p-value tests the null hypothesis that the coefficient is equal to zero (no effect)
  – A predictor with a low p-value is likely to be a meaningful addition to your model, because changes in the predictor's value are related to changes in the response variable
  – A large p-value suggests that changes in the predictor are not associated with changes in the response
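A minimal fit that produces that output, using iris as a stand-in dataset:
  fit <- lm(Petal.Length ~ Sepal.Length, data = iris)
  summary(fit)      # coefficients, standard errors, R-squared, overall F-test
  coef(fit)         # intercept and slope only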

  50. Linear Regression Assumptions
1. For any given value of X, the distribution of Y must be normal
• BUT Y does not have to be normally distributed as a whole
2. For any given value of X, the distribution of Y must have equal variances
You can again check these by using the Shapiro test, Bartlett test, and residual plots on the residuals of your model (see section 4.1).
There are no assumptions for X – but be conscious of your data: the relationship you detect is obviously reflective of the data you include in your study.

  51. Multiple Linear Regression – Relating the model back to the data table
Multiple linear regression: y = β0 + β1·x1 + β2·x2
Here: DENSITY = intercept + β1·AGE + β2·VOL
DENSITY is the response variable (y); AGE is predictor variable 1 (x1); VOL is predictor variable 2 (x2).
ID   DBH    VOL    AGE   DENSITY
1    11.5   1.09   23    0.55
2     5.5   0.52   24    0.74
3    11.0   1.05   27    0.56
4     7.6   0.71   23    0.71
5    10.0   0.95   22    0.63
6     8.4   0.78   29    0.63
β1, β2: what I need to multiply AGE and VOL by (respectively) to get the predicted value of DENSITY.
Remember, the differences between the observed and predicted DENSITY are our regression residuals. Smaller residuals = better model.
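The equivalent call in R, sketched on the built-in trees dataset (Girth, Height, Volume) rather than the slide's hypothetical stand-density table:
  fit <- lm(Volume ~ Girth + Height, data = trees)
  summary(fit)              # beta estimates, standard errors, adjusted R-squared
  head(residuals(fit))      # observed minus predicted values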

  52. Multiple Linear Regression – Output from R
• Estimates of the model parameters (the βi values)
• Standard errors of the estimates
• Coefficient of determination, a.k.a. “goodness of fit” – a measure of how close the data are to the fitted regression line; use the adjusted R², which accounts for the number of predictors
• The overall F-test gives the significance of the relationship described by the model
• Each coefficient's p-value tests the null hypothesis that the coefficient is equal to zero (no effect)
  – A predictor with a low p-value is likely to be a meaningful addition to your model, because changes in the predictor's value are related to changes in the response variable
  – A large p-value suggests that changes in the predictor are not associated with changes in the response

  53. Non-Linear Regression Assumptions
• NLR makes no assumptions about normality, equal variances, or outliers
• However, the assumptions of independence (spatial & temporal) and design considerations (randomization, sufficient replicates, no pseudoreplication) still apply
• We don’t have to worry about statistical power here because we are fitting relationships
• All we care about is whether (and how well) we can model the relationship between our response and predictor variables

  54. Non-Linear Regression Curve Examples There are MANY more examples you could choose from – what makes sense for your data?

  55. Non-Linear Regression – Curve Fitting Procedure
1. Plot your variables to visualize the relationship
  a. What curve does the pattern resemble?
  b. What might alternative options be?
2. Decide on the curves you want to compare and run a non-linear regression curve fitting (see the sketch below)
  a. You will have to estimate your parameters from your curve to have starting values for your curve-fitting function
3. Once you have parameters for your curves, compare the models with AIC
4. Plot the model with the lowest AIC on your point data to visualize the fit
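A small sketch of steps 2–3 with nls(), on synthetic data and with starting values guessed from the plot (the exponential curve and all names here are illustrative only):
  set.seed(5)
  x <- 1:30
  y <- 4 * exp(0.12 * x) + rnorm(30, sd = 2)       # made-up, roughly exponential data
  fit.exp <- nls(y ~ a * exp(b * x), start = list(a = 3, b = 0.1))
  fit.lin <- lm(y ~ x)                             # a simpler alternative model
  summary(fit.exp)                                 # parameter estimates, residual sum-of-squares
  AIC(fit.lin, fit.exp)                            # lower AIC = better balance of fit and complexity
  plot(x, y); lines(x, predict(fit.exp), col = "red")   # visualize the winning curve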

  56. Non-Linear Regression – Output from R
• The non-linear model that we fit (here, a simplified logarithmic curve with slope = 0)
• Estimates of the model parameters
• Residual sum-of-squares for your non-linear model
• Number of iterations needed to estimate the parameters
If you are stuck for starting points in your R code, this website may be able to help: http://www.xuru.org/rt/NLR.asp (copy and paste your data and desired model formula)

  57. Non-Linear Regression – R² for “goodness of fit”
• Calculating an R² is NOT APPROPRIATE for non-linear regression
• Why? For linear models, the sums of squared errors always add up in a specific manner: SS_Regression + SS_Error = SS_Total
• Therefore R² = SS_Regression / SS_Total, which mathematically must produce a value between 0 and 100%
• But in nonlinear regression SS_Regression + SS_Error ≠ SS_Total
• Therefore the ratio used to construct R² is biased in nonlinear regression
• It is best to use the AIC value and the residual sum-of-squares to pick the best model, then plot the curve to visualize the fit

  58. Akaike’s Information Criterion (AIC) – How do we decide which model is best?
In the 1970s, Hirotugu Akaike (1927-2009) used information theory to build a numerical equivalent of Occam's razor.
Occam’s razor: all else being equal, the simplest explanation is the best one.
• For model selection, this means the simplest model is preferred to a more complex one
• Of course, this needs to be weighed against the ability of the model to actually predict anything
• AIC considers both the fit of the model and the model complexity
• Complexity is measured as the number of parameters or the use of higher-order polynomials
• AIC allows us to balance over- and under-fitting in our modelled relationships
  – We want a model that is as simple as possible, but no simpler
  – A reasonable amount of explanatory power is traded off against model complexity; AIC measures this balance for us
• AIC can be calculated for any kind of model, allowing comparisons across different modelling approaches and model-fitting techniques
  – The model with the lowest AIC value is the model that fits your data best (e.g. minimizes your model residuals)

  59. Logistic Regression (a.k.a. logit regression)
Relationship between a binary response variable and predictor variables.
Logit model: y = e^(β0 + β1·x1 + β2·x2 + … + βn·xn) / (1 + e^(β0 + β1·x1 + β2·x2 + … + βn·xn))
• The binary response variable can be considered a class (1 or 0): yes or no, present or absent
• The linear part of the logistic regression equation is used to find the probability of being in a category based on the combination of predictors
• Predictor variables are usually (but not necessarily) continuous, but it is harder to make inferences from regression outputs that use discrete or categorical variables
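A minimal logistic fit in R, using the built-in mtcars data (transmission type am is already coded 0/1) as a stand-in for a yes/no outcome:
  fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
  summary(fit)                                       # coefficients on the log-odds scale, plus AIC
  # Predicted probability of "success" (am = 1) for a hypothetical new observation
  predict(fit, newdata = data.frame(wt = 2.5, hp = 120), type = "response")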

  60. Logistic Regression (a.k.a. logit regression) – Assumptions
• Logistic regression makes no assumptions about normality, equal variances, or outliers
• However, the assumptions of independence (spatial & temporal) and design considerations (randomization, sufficient replicates, no pseudoreplication) still apply
• Logistic regression assumes the response variable is binary (0 & 1)
• We don’t have to worry about statistical power here because we are fitting relationships
• All we care about is whether (and how well) we can model the relationship between our response and predictor variables

  61. Binomial distribution vs Normal distribution
• Key difference: values are continuous (Normal) vs discrete (Binomial)
• As the sample size increases, the binomial distribution comes to resemble the normal distribution
• The binomial distribution is a family of distributions because its shape references both the number of observations and the probability of “getting a success” – a value of 1
• “What is the probability of x successes in n independent and identically distributed Bernoulli trials?”
• Bernoulli trial (or binomial trial) – a random experiment with exactly two possible outcomes, "success" and "failure", in which the probability of success is the same every time the experiment is conducted

  62. Regression Differences
Logistic Regression:
• References the binomial distribution
• Estimates the probability (p) of an event occurring (y = 1) rather than not occurring (y = 0) from knowledge of the relevant independent variables (our data)
• Regression coefficients are estimated using maximum likelihood estimation (an iterative process)
Linear Regression:
• References the Gaussian (normal) distribution
• Uses ordinary least squares to find a best-fitting line; the estimated parameters predict the change in the dependent variable for a change in the independent variable

  63. Maximum likelihood estimation – How coefficients are estimated for logistic regression
• A complex iterative process to find the coefficient values that maximize the likelihood function
• Likelihood function – the probability of the occurrence of an observed set of values X and Y given a function with defined parameters
Process:
1. Begin with a tentative solution for each coefficient
2. Revise it slightly to see if the likelihood function can be improved
3. Repeat this revision until the improvement is minute, at which point the process is said to have converged

  64. Logistic Regression (a.k.a. logit regression) – Output from R
• Estimates of the model parameters (intercept and slope), on the log-odds scale
• Standard errors of the estimates
• AIC value for the model
• Each coefficient's p-value tests the null hypothesis that the coefficient is equal to zero (no effect)
  – A predictor with a low p-value is likely to be a meaningful addition to your model, because changes in the predictor's value are related to changes in the response variable
  – A large p-value suggests that changes in the predictor are not associated with changes in the response

  65. Binomial ANOVA
• Can be used when there are 2 or more predictor variables and the response is binomial
• Uses the output from a generalized linear model referencing logistic regression and the binomial distribution – acts like a chi-squared test
• First, build a logistic model
• Next, input the resulting model into the ANOVA test with a specification to calculate p-values using the chi-squared distribution rather than the F-distribution (which is reserved for parametric statistics)
• Provides a good indication of which predictor variables to include in your logistic model
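Continuing the illustrative mtcars model from above, the two steps look like this (anova() and the test = "Chisq" option are base R; the model itself is still just a stand-in):
  fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
  anova(fit, test = "Chisq")     # deviance table with chi-squared p-values per predictor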

  66. Binomial ANOVA
• Rather than sums of squares and mean sums of squares, the table now shows the deviance and residual deviance for each parameter
• Remember, deviance is a measure of the lack of fit between the model and the data, with larger values indicating poorer fit
• Each p-value tests the null hypothesis that the variable has no effect on achieving a “success” (value of 1)
  – A predictor with a low p-value is likely to be a meaningful addition to your model, because changes in the predictor's value are related to achieving a success in the response variable
  – A large p-value suggests that changes in the predictor are not associated with success in the response

  67. Odds Ratios
• Determine the odds of a “success” based on the predictors and the modelled relationship
• The odds ratio for the model is the increase in odds above the value of the intercept when you add a unit to the predictor(s)
• E.g.: a one-unit increase in NITROGEN increases the odds of survival of a plant by 0.22
• Odds ratios can be converted into an estimated probability of survival for a given value of the predictor(s) using the logit model equation and the coefficient estimates from the logistic model:
  y = e^(β0 + β1·x1 + … + βn·xn) / (1 + e^(β0 + β1·x1 + … + βn·xn))
• E.g.: if 200 units of NITROGEN are applied to a plot there is a 40% chance of plant survival, but if NITROGEN is increased to 500 or 750 units the probability of survival increases to 77% and 93%, respectively.
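In R the odds ratios are just the exponentiated coefficients of the fitted glm; a sketch continuing the same illustrative model (not the slide's nitrogen example):
  fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
  exp(coef(fit))     # odds ratios: multiplicative change in the odds per one-unit increase
  # Convert to predicted probabilities at chosen predictor values
  predict(fit, newdata = data.frame(wt = c(2, 3, 4), hp = 120), type = "response")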

  68. WORK PERIOD 2:00 – 3:00 Follow the Workbook Examples for the Analyses You are Interested In. Any questions?

  69. Extension of the Statistics Toolbox: Multivariate Tests (Rotation-Based)
“Definition of Statistics: The science of producing unreliable facts from reliable figures.” — Evan Esar (Humorist & Writer)

  70. A Reminder from Univariate Statistics …
• Population – class of things (what you want to learn about)
• Sample – group representing a class (what you actually study)
• Experimental unit – individual research subject (e.g. location, entity, etc.)
• Response variable – property of a thing that you believe is the result of predictors (what you actually measure – e.g. lentil height, patient response); a.k.a. dependent variable
• Predictor variable(s) – environment of things which you believe is influencing a response variable (e.g. climate, topography, drug combination, etc.); a.k.a. independent variable
• Error – difference between an observed (or calculated) value and its true (or expected) value

  71. Rotation-based Methods
The data table has experimental units as rows and variables as columns: Data.ID, Type, Variable 1, Variable 2, Variable 3, Variable 4, …
• The Type column holds groupings such as regions, ecosystems, forest types, or treatments
• The variable columns hold measurements such as frequency of species, climate variables, soil characteristics, nutrient concentrations, or drug levels
In multivariate statistics:
• Variables can be either numeric or categorical (depends on the technique)
• Focus is often placed on the graphical representation of results

  72. Rotation-based Methods
• Start from the data table (experimental units × variables) and plot the observations on the original variable axes (e.g. Variable 1 vs. Variable 2)
• Find an equation to rotate the data so that a new axis explains multiple variables rather than just 2
• Repeat the rotation process to achieve the analysis objective
• The final results, based on multiple variables, give different inferences than any 2 variables alone (e.g. one rotated axis may combine Variables 3, 6, 8 and another Variables 1, 2, 4, 9, 10)

  73. Objective of Rotation-based Methods
1. Rotate so that the new axis explains the greatest amount of variation within the data
  • Principal Component Analysis (PCA)
  • Factor Analysis
2. Rotate so that the variation between groups is maximized
  • Discriminant Analysis (DISCRIM)
  • Multivariate Analysis of Variance (MANOVA)
3. Rotate so that one dataset explains the most variation in another dataset
  • Canonical Correspondence Analysis (CCA)

  74. The Math Behind PCA
• PCA objective: find linear combinations of the original variables X1, X2, …, Xn to produce components Z1, Z2, …, Zn that are uncorrelated, in order of their importance, and that describe the variation in the original data
• Principal components are the linear combinations of the original variables
• Principal component 1 is NOT a replacement for variable 1 – all variables are used to calculate each principal component
For each component (the Xi are column vectors of the original variables and the a1j are the coefficients of the linear model):
Z1 = a11·X1 + a12·X2 + … + a1n·Xn
The constraint that a11² + a12² + … + a1n² = 1 ensures that Var(Z1) is as large as possible.

  75. The Math Behind PCA
• Z2 is calculated using the same formula and constraint on the a2n values; however, there is an additional condition that Z1 and Z2 have zero correlation for the data
• The correlation condition continues for all successive principal components, i.e. Z3 is uncorrelated with both Z1 and Z2
• The number of principal components calculated will match the number of predictor variables included in the analysis
• The amount of variation explained decreases with each successive principal component
• Generally you base your inferences on the first two or three components because they explain the most variation in your data
• Typically, when you include a lot of predictor variables, the last couple of principal components explain very little (< 1%) of the variation in your data – not useful variables

  76. PCA in R
PCA in R: princomp(dataMatrix, cor = T/F) (stats package)
• dataMatrix is your data matrix of predictor variables; assign the result to an object once the PCs have been calculated
• cor = defines whether the PCs should be calculated using the correlation or the covariance matrix (derived within the function from the data)
• You tend to use the covariance matrix when the variable scales are similar and the correlation matrix when variables are on different scales
• The function's default is the covariance matrix (cor = FALSE); setting cor = TRUE standardizes the data before calculating the PCs, removing the effect of the different units
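A compact run on USArrests (one of the datasets named at the start of the deck); its variables are on very different scales, so cor = TRUE is the sensible choice here:
  pca <- princomp(USArrests, cor = TRUE)
  summary(pca)        # proportion of variance explained by each component
  loadings(pca)       # relationships between the original variables and the components
  head(pca$scores)    # the rotated values actually plotted
  biplot(pca)         # points (states) plus variable arrows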

  77. PCA in R

  78. PCA in R – Eigenvectors
• Loadings – these are the correlations between the original predictor variables and the principal components
• They identify which of the original variables are driving each principal component
Example (USArrests):
• Comp.1 – is negatively related to Murder, Assault, and Rape
• Comp.2 – is negatively related to UrbanPop

  79. PCA in R – Scores
• Scores – these are the calculated principal components Z1, Z2, …, Zn
• These are the values we plot to make inferences

  80. PCA in R – Variance explained
• Variance – the summary of the output displays the variance explained by each principal component (the eigenvalues divided by the number of PCs)
• This identifies how much weight you should put on each of your principal components
Example (USArrests): Comp.1 – 62%, Comp.2 – 25%, Comp.3 – 9%, Comp.4 – 4%

  81. PCA in R – Biplot
• Data points are plotted by their Comp.1 and Comp.2 scores (row names are displayed)
• The direction of the arrows (+/-) indicates the trend of the points (towards the arrow indicates more of that variable)
• If vector arrows are perpendicular then the variables are not correlated
• If your original variables do not have some level of correlation then PCA will NOT work for your analysis – i.e. you won’t learn anything!

  82. The Math Behind Discriminant Analysis (DISCRIM)
• DISCRIM objective: rotate the data so that the variation between groups is maximized (“reduce complexity”) – “What distinguishes my groups?”
• This is a different question from PCA, which maximizes the variation explained
• Discriminant functions (DFs) are linear combinations of the original variables
• A DF value is calculated for every observation in the dataset (like PCA scores)
• These are NOT average group measurements, but rather measurements on individuals within the pre-determined groups
For each function (the Xi are column vectors of the original variables and a, b, …, z are the coefficients of the linear model):
DF1 = a·X1 + b·X2 + … + z·Xn

  83. How DISCRIM works
1. Find the axis that gives the greatest separation between 2 groups
2. Fix that axis
3. Rotate around the fixed axis to maximize the difference between the first 2 groups and the 3rd group
4. Repeat steps 2 & 3 for all groups included
(Figure: original x, y axes rotated through an angle to new x', y' axes that best separate the groups.)

  84. DISCRIM in R – MASS package output
• The prior probability of belonging to each group (more important for predicting class)
• Mean observation values for the variables in each pre-defined group
• Coefficients of linear discriminants – the solutions to our linear functions
• Proportion of variance explained by each linear discriminant
• MASS will only display solutions for the most significant linear discriminants; discriminants that explain a very small portion of the variance are removed
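A minimal run of the MASS function on the iris data (also named at the start of the deck):
  library(MASS)
  fit <- lda(Species ~ ., data = iris)
  fit                       # prior probabilities, group means, coefficients, proportion of trace
  head(predict(fit)$x)      # discriminant scores for each observation
  plot(fit)                 # observations plotted on the first two discriminant functions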

  85. DISCRIM in R – candisc package output
• Proportion of variance explained by the linear discriminants
• Mean discriminant values for each pre-defined group (standard errors of the means are also given)
• By querying the analysis structure we can see the discriminant loadings, which tell us the relationship between the DF values and the original variables (like PCA)
• Again, candisc will only display solutions for the discriminants that explain the most variation
• Less information is displayed in the candisc output, but you can get the loadings, which are important! candisc also produces a nicer plot
