modelling correlations with python and scipy

Modelling correlations with Python and SciPy Eric Marsden - PowerPoint PPT Presentation


  1. Modelling correlations with Python and SciPy Eric Marsden <eric.marsden@risk-engineering.org>

  2. Context
     ▷ Analysis of causal effects is an important activity in risk analysis
       • Process safety engineer: “To what extent does increased process temperature and pressure increase the level of corrosion of my equipment?”
       • Medical researcher: “What is the mortality impact of smoking 2 packets of cigarettes per day?”
       • Safety regulator: “Do more frequent site inspections lead to a lower accident rate?”
       • Life insurer: “What is the conditional probability when one spouse dies, that the other will die shortly afterwards?”
     ▷ The simplest statistical technique for analyzing causal effects is correlation analysis
     ▷ Correlation analysis measures the extent to which two variables vary together, including the strength and direction of their relationship

  3. Measuring linear correlation
     ▷ Linear correlation coefficient: a measure of the strength and direction of a linear association between two random variables
       • also called the Pearson product-moment correlation coefficient
     ▷ $\rho_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{\mathbb{E}[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y}$
       • $\mathbb{E}$ is the expectation operator
       • cov means covariance
       • $\mu_X$ is the expected value of random variable $X$
       • $\sigma_X$ is the standard deviation of $X$
     ▷ Python: scipy.stats.pearsonr(X, Y)
     ▷ Excel / Google Docs spreadsheet: function CORREL
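
The definition on this slide can be checked numerically. The following is a minimal sketch, not part of the original slides: the helper name `pearson_from_definition`, the fixed seed and the synthetic data are illustrative assumptions. It computes ρ from the covariance and the standard deviations, then compares the result with `scipy.stats.pearsonr`.

```python
import numpy
import scipy.stats


def pearson_from_definition(x, y):
    """cov(X, Y) / (sigma_X * sigma_Y), using population (ddof=0) moments.

    Hypothetical helper written for illustration only.
    """
    x = numpy.asarray(x, dtype=float)
    y = numpy.asarray(y, dtype=float)
    # E[(X - mu_X)(Y - mu_Y)] estimated by the sample mean
    cov = numpy.mean((x - x.mean()) * (y - y.mean()))
    return cov / (x.std() * y.std())


rng = numpy.random.default_rng(42)   # fixed seed: illustrative data only
X = rng.normal(10, 1, 100)
Y = X + rng.normal(0, 0.3, 100)

rho_manual = pearson_from_definition(X, Y)
rho_scipy, p_value = scipy.stats.pearsonr(X, Y)
print(rho_manual, rho_scipy)
```

The two values should agree up to floating-point rounding, because the normalisation factors cancel in the ratio regardless of whether population or sample moments are used, as long as the choice is consistent.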

  4. Measuring linear correlation
     The linear correlation coefficient ρ quantifies the strengths and directions of movements in two random variables:
     ▷ sign of ρ determines the relative directions that the variables move in
     ▷ value determines strength of the relative movements (ranging from -1 to +1)
     ▷ ρ = 0.5: one variable moves in the same direction by half the amount that the other variable moves (on average, with both measured in standard deviations)
     ▷ ρ = 0: variables are uncorrelated
       • does not imply that they are independent!
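
To make the last point concrete, here is a small sketch (synthetic data and variable names are assumptions chosen for illustration) in which Y is completely determined by X, yet the linear correlation coefficient is close to zero because the relationship is symmetric rather than linear.

```python
import numpy
import scipy.stats

rng = numpy.random.default_rng(0)      # fixed seed: illustrative data only
X = rng.uniform(-1, 1, 10_000)         # symmetric around zero
Y = X ** 2                             # fully dependent on X, but not linearly

# Linear correlation between X and Y: close to zero
rho, _ = scipy.stats.pearsonr(X, Y)

# The dependence shows up once the symmetry is removed
rho_abs, _ = scipy.stats.pearsonr(numpy.abs(X), Y)

print(rho, rho_abs)
```

A value of ρ near zero therefore only rules out a linear association; plotting the data remains the simplest way to spot other kinds of dependence.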

  5. Examples of correlations
     [Figure: scatter plots with their correlation coefficients, annotated “correlation ≠ dependency”]
     Image source: Wikipedia

  8. Online visualization: interpreting correlations
     ▷ Try it out online: rpsychologist.com/d3/correlation/

  9. Not all relationships are linear!
     ▷ Example: Yerkes–Dodson law
       • empirical relationship between level of arousal/stress and level of performance
     ▷ Performance initially increases with stress/arousal
     ▷ Beyond a certain level of stress, performance decreases
     Source: wikipedia.org/wiki/Yerkes–Dodson_law

  10. Measuring correlation with NumPy

      In [3]: import numpy
              import matplotlib.pyplot as plt
              import scipy.stats

      In [4]: X = numpy.random.normal(10, 1, 100)
              Y = X + numpy.random.normal(0, 0.3, 100)
              plt.scatter(X, Y)
      Out[4]: <matplotlib.collections.PathCollection at 0x7f7443e3c438>

      In [5]: scipy.stats.pearsonr(X, Y)
      Out[5]: (0.9560266103379802, 5.2241043747083435e-54)

      ▷ Exercise: show that when the error in Y decreases, the correlation coefficient increases
      ▷ Exercise: produce data and a plot with a negative correlation coefficient
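
One possible approach to the two exercises is sketched below. The noise levels, sample size and seed are assumptions chosen for illustration, and the plotting step is left out so that the block runs without a display.

```python
import numpy
import scipy.stats

rng = numpy.random.default_rng(1)      # fixed seed: illustrative data only
X = rng.normal(10, 1, 1000)

# Exercise 1: shrink the error added to Y and watch the coefficient rise
correlations = {}
for noise_sd in (1.0, 0.3, 0.1):
    Y = X + rng.normal(0, noise_sd, X.size)
    rho, _ = scipy.stats.pearsonr(X, Y)
    correlations[noise_sd] = rho
    print(f"noise sd = {noise_sd}: rho = {rho:.3f}")

# Exercise 2: a negative slope gives a negative coefficient
Y_neg = -X + rng.normal(0, 0.3, X.size)
rho_neg, _ = scipy.stats.pearsonr(X, Y_neg)
print(f"negative relationship: rho = {rho_neg:.3f}")
```

For this particular model, where Y is X plus independent noise and X has unit variance, the theoretical coefficient is 1/√(1 + s²) for a noise standard deviation s, so it approaches 1 as the error shrinks.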

  11. Anscombe’s quartet
      Four datasets proposed by Francis Anscombe to illustrate the importance of graphing data rather than relying blindly on summary statistics
      [Figure: scatter plots of datasets I, II, III and IV]
      Each dataset has the same correlation coefficient!
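
The claim about Anscombe’s quartet can be verified directly. In this sketch the data values are typed in from the commonly reproduced table of the quartet (an assumption worth double-checking against the original 1973 paper), and the dictionary layout is just one convenient choice.

```python
import scipy.stats

# x values shared by datasets I, II and III
x_common = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]

quartet = {
    "I": (x_common,
          [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_common,
           [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_common,
            [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

coefficients = {}
for name, (x, y) in quartet.items():
    rho, _ = scipy.stats.pearsonr(x, y)
    coefficients[name] = rho
    print(f"dataset {name}: rho = {rho:.3f}")
```

All four coefficients come out at roughly 0.816, even though a scatter plot of each dataset (for example with plt.scatter) shows four very different shapes.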

  12. Plotting relationships between variables with matplotlib
      ▷ Scatterplot: use function plt.scatter
      ▷ Continuous plot or X-Y: function plt.plot

      import matplotlib.pyplot as plt
      import numpy
      X = numpy.random.uniform(0, 10, 100)
      Y = X + numpy.random.uniform(0, 2, 100)
      plt.scatter(X, Y, alpha=0.5)
      plt.show()

      [Figure: scatter plot of Y against X]

  13. Correlation matrix
      ▷ A correlation matrix is used to investigate the dependence between multiple variables at the same time
        • output: a symmetric matrix where element $m_{ij}$ is the correlation coefficient between variables $i$ and $j$
        • note: diagonal elements are always 1
        • can be visualized graphically using a correlogram
        • allows you to see which variables in your data are informative
      ▷ In Python, can use:
        • dataframe.corr() method from the Pandas library
        • numpy.corrcoef(data) from the NumPy library
        • visualize using imshow from Matplotlib or heatmap from the Seaborn library
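
A short sketch of the NumPy route follows. The three synthetic variables (`a`, `b`, `c`) are assumptions made up for illustration. Note that `numpy.corrcoef` treats each row as a variable by default, so `rowvar=False` is passed here because the variables are stored in columns.

```python
import numpy

rng = numpy.random.default_rng(7)      # fixed seed: illustrative data only
n = 1000
a = rng.normal(0, 1, n)
b = a + rng.normal(0, 0.5, n)          # strongly related to a
c = rng.normal(0, 1, n)                # unrelated to a and b

# One observation per row, one variable per column
data = numpy.column_stack([a, b, c])

# rowvar=False: columns are the variables
cm = numpy.corrcoef(data, rowvar=False)
print(numpy.round(cm, 2))
```

The result is a 3 × 3 symmetric matrix with ones on the diagonal; the entry for (a, b) is large while the entries involving c stay near zero.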

  14. Correlation matrix: example
      Analysis of the correlations between different variables affecting road casualties

      from pandas import read_csv
      import matplotlib.pyplot as plt
      import seaborn as sns
      data = read_csv("casualties.csv")
      cm = data.corr()
      sns.heatmap(cm, square=True)
      plt.yticks(rotation=0)
      plt.xticks(rotation=90)

      [Figure: correlation heatmap, color scale from −0.8 to 0.8, over the variables Vehicle_Reference, Casualty_Reference, Casualty_Class, Sex_of_Casualty, Age_of_Casualty, Age_Band_of_Casualty, Casualty_Severity, Pedestrian_Location, Pedestrian_Movement, Car_Passenger, Bus_or_Coach_Passenger, Pedestrian_Road_Maintenance_Worker, Casualty_Type, Casualty_Home_Area_Type]
      Data source: UK Department for Transport, data.gov.uk/dataset/road-accidents-safety-data

  15. Aside: polio caused by ice cream!
      ▷ Polio: an infectious disease causing paralysis, which primarily affects young children
      ▷ Largely eliminated today, but was once a worldwide concern
      ▷ Late 1940s: public health experts in USA noticed that the incidence of polio increased with the consumption of ice cream
      ▷ Some suspected that ice cream caused polio… sales plummeted
      ▷ Polio incidence increases in hot summer weather
      ▷ Correlation is not causation: there may be a hidden, underlying variable
        • but it sure is a hint! [Edward Tufte]
      More info: Freakonomics, Steven Levitt and Stephen J. Dubner

  16. Aside: fire fighters and fire damage
      ▷ Statistical fact: the larger the number of fire-fighters attending the scene, the worse the damage!
      ▷ More fire fighters are sent to larger fires
      ▷ Larger fires lead to more damage
      ▷ Lurking (underlying) variable = fire size
      ▷ An instance of “Simpson’s paradox”
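
The effect of a lurking variable can be reproduced with a small simulation. Everything in this sketch is hypothetical: the coefficients, noise levels and the `residuals` helper are assumptions chosen for illustration, not data about real fires. Fire size drives both the number of fire fighters and the damage, and the two are otherwise unrelated.

```python
import numpy
import scipy.stats

rng = numpy.random.default_rng(3)      # fixed seed: simulated data only
n = 5000

# Hypothetical model: fire size drives both quantities independently
fire_size = rng.uniform(1, 10, n)
firefighters = 2 * fire_size + rng.normal(0, 2, n)
damage = 3 * fire_size + rng.normal(0, 3, n)

# Raw correlation: strongly positive, although neither causes the other
rho_raw, _ = scipy.stats.pearsonr(firefighters, damage)


def residuals(y, x):
    """Part of y not explained by a straight-line fit on x (illustrative helper)."""
    slope, intercept = numpy.polyfit(x, y, 1)
    return y - (slope * x + intercept)


# Correlation after removing the linear effect of fire size from both
rho_partial, _ = scipy.stats.pearsonr(
    residuals(firefighters, fire_size), residuals(damage, fire_size)
)

print(rho_raw, rho_partial)
```

Once the linear effect of fire size is removed from both variables, the remaining correlation is close to zero, which is what one would expect if the association runs entirely through the lurking variable.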

  17. Aside: low birth weight babies of tobacco smoking mothers
      ▷ Statistical fact: low birth-weight children born to smoking mothers have a lower infant mortality rate than the low birth weight children of non-smokers
      ▷ In a given population, low birth weight babies have a significantly higher mortality rate than others
      ▷ Babies of mothers who smoke are more likely to be of low birth weight than babies of non-smoking mothers
      ▷ Babies underweight because of smoking still have a lower mortality rate than children who have other, more severe, medical reasons why they are born underweight
      ▷ Lurking variable between smoking, birth weight and infant mortality
      Source: Wilcox, A. (2001). On the importance — and the unimportance — of birthweight, International Journal of Epidemiology. 30:1233–1241

  18. Aside: exposure to books leads to higher test scores
      ▷ In early 2004, the governor of the US state of Illinois R. Blagojevich announced a plan to mail one book a month to every child in the state from the time they were born until they entered kindergarten. The plan would cost 26 million USD a year.
      ▷ Data underlying the plan: children in households where there are more books do better on tests in school
      ▷ Later studies showed that children from homes with many books did better even if they never read…
      ▷ Lurking variable: homes where parents buy books have an environment where learning is encouraged and rewarded
      Source: freakonomics.com/2008/12/10/the-blagojevich-upside/

  19. Aside: chocolate consumption produces Nobel prizes
      Source: Chocolate Consumption, Cognitive Function, and Nobel Laureates, N Engl J Med 2012, doi: 10.1056/NEJMon1211064

  20. Aside: cheese causes death by bedsheet strangulation
      Note: real data!
      Source: tylervigen.com, with many more surprising correlations

  21. Beware assumptions of causality
      ▷ 1964: the US Surgeon General issues a report claiming that cigarette smoking causes lung cancer, based mostly on correlation data from medical studies.
      ▷ However, correlation is not sufficient to demonstrate causality. There might be some hidden genetic factor that causes both lung cancer and desire for nicotine.
      [Figure: diagram linking “smoking”, “lung cancer” and “hidden factor?”]
      In logic, this is called the “post hoc ergo propter hoc” fallacy
