Statistically-Significant Correlations, 11 Oct 2014


  1. Slide 1: Statistically-Significant Correlations

Milo Schield, Augsburg College
Editor of www.StatLit.org
US Rep: International Statistical Literacy Project
Fall 2014 National Numeracy Network Conference
www.StatLit.org/pdf/2014-Schield-NNN4-Slides.pdf

Slide 2: Exact Solutions

For N random pairs from an uncorrelated bivariate normal distribution, the sampling distribution of the correlation coefficient is not simple. Here are three common analytic approaches:

1. Fisher transformation (using LN and Arctanh),
2. an exact solution (using a Gamma function), or
3. Student-t distribution: t = r Sqrt[(n-2)/(1-r^2)]; df = n-2
   • For large n, the critical value of t (95% confidence) is 1.96.
   • For small n, the critical value of t increases as n decreases.

None of these is simple or memorable.

Slide 3: Sufficient Condition

Approach: Find an equation generating a minimum correlation for statistical significance given N.

1. Given N, find the smallest value of r where the left end of a 95% confidence interval is non-negative. Use the calculator at www.vassarstats.net/rho.html or www.danielsoper.com/statcalc3/calc.aspx?id=44. For the Daniel Soper calculator, use the results for a two-tailed test.
2. Generate the correlation coefficient with the simple model.
3. Calculate the error (the difference between the calculated and exact values), using the exact value as the standard. If all errors are positive, then the model is sufficient.

Slide 4: Simple Model: 2/SQRT(n)

Minimum Correlation for Statistical Significance

N     Exact   2/sqrt(n)   Error
400   0.10    0.10        3.0%
256   0.12    0.13        2.7%
100   0.20    0.20        2.0%
49    0.28    0.29        1.7%
25    0.40    0.40        1.3%
16    0.50    0.50        1.0%
12    0.57    0.58        0.6%
10    0.63    0.63        0.4%
7     0.75    0.76        0.4%
6     0.81    0.82        0.6%
5     0.88    0.89        1.4%
4     0.96    1.00        4.0%

All errors positive means the model is sufficient.
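The sufficiency check described on slides 3 and 4 can be sketched in Python. This is a minimal illustration, not the online calculators' own code: it assumes the "exact" minimum comes from the Fisher-transformation confidence interval, whose lower end reaches zero at r = tanh(z/sqrt(n-3)) with z close to 1.96. The function names are hypothetical, and the printed error percentages may differ somewhat from the slide's table, which the slides attribute to online calculators.

```python
import math
from statistics import NormalDist

# Two-tailed 95% critical value of the standard normal (about 1.96).
Z_95 = NormalDist().inv_cdf(0.975)

# Sample sizes listed in the slide's table.
SLIDE_N = [400, 256, 100, 49, 25, 16, 12, 10, 7, 6, 5, 4]


def fisher_min_r(n):
    """Smallest r whose 95% Fisher-z interval has a lower end of zero (assumed 'exact' value, n > 3)."""
    return math.tanh(Z_95 / math.sqrt(n - 3))


def simple_min_r(n):
    """The simple model from the slides: 2/sqrt(n)."""
    return 2 / math.sqrt(n)


def relative_error(n):
    """Error of the simple model, using the Fisher-based value as the standard."""
    exact = fisher_min_r(n)
    return (simple_min_r(n) - exact) / exact


if __name__ == "__main__":
    print(f"{'N':>4} {'Exact':>6} {'2/sqrt(n)':>10} {'Error':>7}")
    for n in SLIDE_N:
        print(f"{n:>4} {fisher_min_r(n):>6.2f} {simple_min_r(n):>10.2f} {relative_error(n):>7.1%}")
    # The model is "sufficient" when it never falls below the exact minimum.
    print("Sufficient:", all(relative_error(n) > 0 for n in SLIDE_N))
```

A positive error for every N means the simple model never declares a correlation significant that the Fisher-based interval would reject.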
Slide 5: Solution

Minimum statistically-significant r = 2/Sqrt(n)

"n" is the number of pairs being correlated.
The model overstates the exact minimum by less than 5% for n between 5 and 4,000.
Simple and memorable for two variables.

It is similar to the formula for the maximum 95% Margin of Error in samples from a binary variable:

95% ME = 1.96 Sqrt[p*(1-p)/n] < 2 Sqrt[1/(4n)]
95% ME < 1/Sqrt(n)

Simple and memorable for one binary variable.

Slide 6: Time-Series Correlations (www.tylervigen.com)

[Chart: Tangled Bed-Sheet Deaths (US) vs. Skiing Revenues ($M) by year; revenues shown as the blue line. Correlation: 0.969724]
Source: http://tylervigen.com/view_correlation?id=1864

10 pairs; 2/Sqrt(10) = 0.63; Statistically significant
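The margin-of-error inequality holds because p*(1-p) is never larger than 1/4, so 1.96 Sqrt[p*(1-p)/n] is at most 0.98/Sqrt(n). A short Python sketch can confirm this numerically over a grid of p values; the function names and the grid resolution are illustrative assumptions, not part of the slides.

```python
import math


def margin_of_error_95(p, n):
    """95% margin of error for a sample proportion p with sample size n."""
    return 1.96 * math.sqrt(p * (1 - p) / n)


def simple_bound(n):
    """The memorable upper bound from the slides: 1/sqrt(n)."""
    return 1 / math.sqrt(n)


def bound_holds(n, steps=100):
    """Check ME < 1/sqrt(n) for p = 0, 1/steps, ..., 1 (grid size is an arbitrary choice)."""
    return all(
        margin_of_error_95(k / steps, n) < simple_bound(n)
        for k in range(steps + 1)
    )


if __name__ == "__main__":
    for n in (10, 100, 1000):
        worst = margin_of_error_95(0.5, n)  # p = 0.5 maximizes p*(1-p)
        print(n, round(worst, 4), round(simple_bound(n), 4), bound_holds(n))
```

The worst case is p = 0.5, where the margin of error is 98% of the simple bound, so the bound is tight as well as safe.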

  2. Slide 7: Correlation = -0.933: Bee colonies & MJ arrests

[Chart: Honey Bee Colonies (K) vs. Marijuana Arrests of Juveniles, 1990 to 2009. Correlation: -0.933389]
www.tylervigen.com/view_correlation?id=1582

20 pairs; 2/Sqrt(20) = 0.45; Statistically significant

Slide 8: Correlation = 0.664: Drownings & Cage films

[Chart: Pool Drownings vs. Films with Nicolas Cage; drownings shown as the red line, Cage films as the blue line. Correlation: 0.666004]
www.tylervigen.com/view_correlation?id=359

11 pairs; 2/Sqrt(11) = 0.60; Statistically significant!

Slide 9: Correlation = 0.664: Drownings & Cage films

[Chart]

Slide 10: Something Seems Wrong!

1. There is nothing linear about these associations.
2. These correlations seem unbelievably high.

#1: The correlation between two time-series eliminates the common factor: time. The question is whether their mutual association is linear. To see this, an XY-plot is generated.

Slide 11: #2: Very High Correlations. Three Explanations

1. Association is causal. See Tyler Vigen's video: www.youtube.com/watch?feature=player_embedded&v=g-g0ovHjQxs
2. Association is spurious: just random chance. Five percent of random associations will be mistakenly classified as statistically significant.
3. Association is cherry-picked after the fact. According to Tyler, "This server has generated 24,470 correlations." Tyler just picked those with high or interesting correlations.

Slide 12: Conclusions

1. Use 2/Sqrt(n) as the minimum correlation for statistical significance. This criterion is sufficient, fairly accurate (within 5%) and memorable.
2. The correlation between two time-series eliminates time. Correlation determines the degree of linearity in their cross-sectional association.
3. Do not use a test for statistical significance if the data pairs were selected after the fact, via data mining, solely because of their high correlation.
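The "five percent of random associations" point can be illustrated with a small simulation. This is a sketch under strong assumptions: each pair of series is drawn as independent standard-normal values, which is not how the real time series on tylervigen.com behave. The function names, seed, and trial count are arbitrary choices for illustration.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def false_positive_rate(n_pairs=10, trials=5000, seed=2014):
    """Share of uncorrelated random samples with |r| >= 2/sqrt(n_pairs)."""
    rng = random.Random(seed)
    threshold = 2 / math.sqrt(n_pairs)
    hits = 0
    for _ in range(trials):
        xs = [rng.gauss(0, 1) for _ in range(n_pairs)]
        ys = [rng.gauss(0, 1) for _ in range(n_pairs)]
        if abs(pearson_r(xs, ys)) >= threshold:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    print(f"False-positive rate for n = 10: {false_positive_rate():.3f}")
```

If every one of the 24,470 generated correlations were an independent draw of this kind, a rate near 5% would leave on the order of a thousand "significant" pairs by chance alone, which is why a significance test says little about pairs picked for their high correlation.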

  3. Slide 13: Correlation = 0.993: Divorce & Margarine Usage

[Chart: Divorce Rate in Maine vs. Margarine Consumption per capita (US), 2000 to 2009. Correlation: 0.992558]
www.tylervigen.com/view_correlation?id=1703

Correlation: 0.993. N=10, SS_Rho = 1/sqrt(11) =
10 pairs; 2/Sqrt(10) = 0.63. Statistically significant
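The 2/Sqrt(n) rule applied to the chart examples in these slides can be written as a short Python check. The correlations and pair counts are copied from the slides; the helper names are hypothetical. Passing this check only means the rule's threshold is exceeded, and the slides caution against applying the test to pairs selected after the fact.

```python
import math

# (label, correlation, number of pairs) as stated on the slides.
EXAMPLES = [
    ("Bed-sheet deaths vs. skiing revenues", 0.969724, 10),
    ("Bee colonies vs. juvenile marijuana arrests", -0.933389, 20),
    ("Pool drownings vs. Nicolas Cage films", 0.666004, 11),
    ("Divorce rate in Maine vs. margarine consumption", 0.992558, 10),
]


def min_significant_r(n):
    """Minimum |r| for statistical significance under the simple model."""
    return 2 / math.sqrt(n)


def is_significant(r, n):
    """Compare the magnitude of r with the 2/sqrt(n) threshold."""
    return abs(r) >= min_significant_r(n)


if __name__ == "__main__":
    for label, r, n in EXAMPLES:
        print(f"{label}: r = {r:+.3f}, n = {n}, "
              f"threshold = {min_significant_r(n):.2f}, "
              f"significant = {is_significant(r, n)}")
```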

