  1. Discovering Correlation. Jilles Vreeken, 5 June 2015

  2. Questions of the day: What is correlation, how can we measure it, and how can we discover it?

  3. Correlation: ‘the relationship between things that happen or change together’ (Merriam-Webster)

  4. ρ = 0.947

  5. Correlation: ‘a relation existing between phenomena or things or between mathematical or statistical variables which tend to vary, be associated, or occur together in a way not expected on the basis of chance alone’ (Merriam-Webster)

  6. Correlation: the same definition, with the key phrase highlighted: ‘tend to vary, be associated, or occur together in a way not expected on the basis of chance alone’ (Merriam-Webster)

  7. Good Ol’ Pearson. The Pearson product-moment correlation coefficient is one of the most well-known measures for correlation: ρ(X, Y) = corr(X, Y) = E[(X − μ_X)(Y − μ_Y)] / (σ_X σ_Y). That is, covariance divided by the product of the standard deviations. Pearson detects only linear correlations.
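
As a side note (not on the slides), the formula is a few lines of NumPy; the toy data below is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2 * x + rng.normal(scale=0.5, size=1000)   # noisy linear relation

# rho(X, Y) = cov(X, Y) / (sigma_X * sigma_Y)
rho = np.cov(x, y)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
print(rho)                        # matches np.corrcoef(x, y)[0, 1]

# "Detects only linear correlations": |X| is fully determined by X,
# yet Pearson scores the pair near zero.
print(np.corrcoef(x, np.abs(x))[0, 1])
```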

  8. Pearson in action (Wikipedia, yes really)

  9. ρ = 0.998

  10. Chance alone… Last week we discussed Shannon entropy and mutual information. Can we use these to measure correlation? Yes, we can! Shannon entropy works very well for discrete data, e.g. low-entropy sets. For continuous-valued data: …

  11. Shannon entropy for continuous data. As discussed last week, to compute h(X) = −∫ f(x) log f(x) dx we need to estimate the probability density function, choose a step-size, and then hope for the best. If we don’t know the distribution, we can use kernel density estimation, which requires choosing a kernel and a bandwidth. KDE is well-behaved for univariate data, but estimating multivariate densities is very difficult, especially for high dimensionalities.
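
A minimal sketch (mine, not the lecture’s) of that recipe in one dimension, using SciPy’s Gaussian KDE with its default bandwidth (Scott’s rule) and a plain Riemann sum for the integral:

```python
import numpy as np
from scipy.stats import gaussian_kde

# h(X) = -integral f(x) log f(x) dx, with f estimated by KDE.
rng = np.random.default_rng(0)
samples = rng.normal(size=2000)       # true value: 0.5*log(2*pi*e) ~ 1.419 nats

kde = gaussian_kde(samples)           # kernel and bandwidth chosen for us
grid = np.linspace(samples.min() - 3, samples.max() + 3, 10_000)
step = grid[1] - grid[0]              # the "choose a step-size" part
f = np.clip(kde(grid), 1e-300, None)  # avoid log(0)
print(-np.sum(f * np.log(f)) * step)
```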

  12. MIC: Maximal Information Coefficient. A few years back, there was a big stir about MIC, a measure for non-linear correlations between pairs of variables. The main idea in a nutshell: if we want to measure the correlation of real-valued X and Y, why not discretize the data and compute mutual information!? That is, just find discretizations X′ and Y′ such that I(X′; Y′) is maximal, and treat that value as the correlation measure. (Reshef et al., 2011)

  13. MIC in a pic. Given D ⊂ ℝ² and integers x and y, I*(D, x, y) = max I(D | G), with G ranging over all grids of x columns and y rows. Normalise this score by independence: M_D(x, y) = I*(D, x, y) / log min(x, y). And return the maximum: MIC(D) = max_{x·y < B(n)} M_D(x, y).
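
The published algorithm (MINE) uses a dynamic program over grid placements; the brute-force sketch below only unpacks the definition. Assumptions on my part: equi-frequency cut points instead of the full grid search, natural logs throughout, and B(n) = n^0.6 as suggested by Reshef et al.:

```python
import numpy as np

def grid_mi(x, y, nx, ny):
    # Equi-frequency discretization (a simplification: MIC optimizes
    # the cut points too, not just the number of bins).
    xb = np.searchsorted(np.quantile(x, np.linspace(0, 1, nx + 1)[1:-1]), x)
    yb = np.searchsorted(np.quantile(y, np.linspace(0, 1, ny + 1)[1:-1]), y)
    joint = np.zeros((nx, ny))
    np.add.at(joint, (xb, yb), 1)
    joint /= len(x)
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    return np.sum(joint[nz] * np.log(joint[nz] / np.outer(px, py)[nz]))

def mic_sketch(x, y, alpha=0.6):
    n, best = len(x), 0.0
    budget = n ** alpha                      # B(n) = n^0.6
    for nx in range(2, int(budget) + 1):
        for ny in range(2, int(budget) + 1):
            if nx * ny >= budget:            # enforce x*y < B(n)
                continue
            best = max(best, grid_mi(x, y, nx, ny) / np.log(min(nx, ny)))
    return best
```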

  14. Mining with MIC. MIC is strictly defined for pairs of variables, which means… ‘mining’ is easy! We measure MIC for every pair of attributes in our data, and then order the pairs by their MIC score.

  15. MIC: BAD. MIC is a nice idea, but… it is strictly for pairs, relies on heuristic optimization, doesn’t like linear relationships, and doesn’t like noise at all. And those are just a few of its drawbacks… Can we salvage the nice part? (Simon and Tibshirani, 2011)

  16. Cumulative Distributions. F(x) = P(X ≤ x). The cdf can be computed directly from data; no assumptions necessary.
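
Concretely (my sketch, not from the slides), the empirical cdf needs nothing beyond a sort:

```python
import numpy as np

def ecdf(samples):
    """Return a function q -> P(X <= q), estimated directly from data."""
    xs = np.sort(np.asarray(samples, dtype=float))
    return lambda q: np.searchsorted(xs, q, side="right") / len(xs)

F = ecdf(np.random.default_rng(0).normal(size=1000))
print(F(0.0))   # roughly 0.5 for standard-normal samples
```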

  17. Identifying Interacting Subspaces

  18. Cumulative Entropy. Entropy has been defined for cumulative distribution functions! h_CE(X) = −∫_{dom(X)} P(X ≤ x) log P(X ≤ x) dx. As 0 ≤ P(X ≤ x) ≤ 1, we obtain h_CE(X) ≥ 0 (!) (Rao et al., 2004, 2005)

  19. Cumulative Entropy. How do we compute h_CE(X) in practice? Easy. Let X_1 ≤ ⋯ ≤ X_n be i.i.d. random samples of continuous random variable X, in ascending order. Then h_CE(X) = −Σ_{i=1}^{n−1} (X_{i+1} − X_i) (i/n) log(i/n). (Rao et al., 2004, 2005; Di Crescenzo & Longobardi, 2009)
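
That estimator translates directly into code (a sketch; natural log assumed). For Uniform(0,1) the true value is 1/4, which makes a handy sanity check:

```python
import numpy as np

def cumulative_entropy(samples):
    """h_CE(X) = -sum_{i=1}^{n-1} (X_(i+1) - X_(i)) * (i/n) * log(i/n),
    computed over the sorted sample (order statistics)."""
    xs = np.sort(np.asarray(samples, dtype=float))
    n = len(xs)
    p = np.arange(1, n) / n        # empirical P(X <= X_(i)), i = 1..n-1
    return -np.sum(np.diff(xs) * p * np.log(p))

samples = np.random.default_rng(0).uniform(size=100_000)
print(cumulative_entropy(samples))   # close to 0.25 for Uniform(0, 1)
```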

  20. Multivariate Cumulative Entropy (1). First things first. We need h_CE(X | Y) = ∫ h_CE(X | y) p(y) dy, which, in practice, means h_CE(X | Y) = Σ_{y∈Y} h_CE(X | y) p(y), with y a discrete bin of data points over Y, and p(y) = |y| / n. How do we bin Y into y? We can simply cluster Y. (Nguyen et al., 2013)

  21. Multivariate Cumulative Entropy (2). First things first. We need h_CE(X | Y) = ∫ h_CE(X | y) p(y) dy, which, in practice, means h_CE(X | Y) = Σ_{y∈Y} h_CE(X | y) p(y), with y a discrete bin of data points over Y, and p(y) = |y| / n. How do we bin Y into y? Find the discretisation of Y such that h_CE(X | Y) is minimal. (Nguyen et al., 2014)
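
A sketch of the weighted-average form, with equal-frequency bins on Y standing in for the clustering or optimal discretisation of slides 20 and 21 (it reuses cumulative_entropy from the previous sketch):

```python
import numpy as np

def conditional_cumulative_entropy(x, y, n_bins=10):
    """h_CE(X | Y) ~ sum over bins b of (|b|/n) * h_CE(X restricted to b),
    with equal-frequency bins on Y as a stand-in discretisation."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    order = np.argsort(y)                 # group points by their Y value
    total = 0.0
    for chunk in np.array_split(order, n_bins):
        total += (len(chunk) / n) * cumulative_entropy(x[chunk])
    return total
```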

  22. Cumulative Mutual Information. We cannot (realistically) calculate h_CE(X_1, …, X_d) in one go. Yet… entropy has a factorization property, so what we can do is compute Σ_{i=2}^{d} h_CE(X_i) − Σ_{i=2}^{d} h_CE(X_i | X_1, …, X_{i−1}). (Nguyen et al., 2013)
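
Chaining the previous two sketches gives the factorized score. One simplification on my part: I condition only on the immediately preceding attribute, since binning on the joint (X_1, …, X_{i−1}) needs the multivariate discretisation machinery of the papers:

```python
def cmi_score(columns, n_bins=10):
    """sum_{i>=2} h_CE(X_i) - sum_{i>=2} h_CE(X_i | X_{i-1});
    columns is a list of 1-d arrays, one per attribute."""
    score = 0.0
    for i in range(1, len(columns)):
        score += cumulative_entropy(columns[i])
        score -= conditional_cumulative_entropy(columns[i], columns[i - 1], n_bins)
    return score
```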

  23. Mining for Interaction. Super simple: Apriori-style.

  24. Mining interacting attributes. CMI: use the Apriori principle, mine all attribute sets with h_CE below a threshold. (Nguyen et al., 2013a,b)
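
A levelwise sketch of that search. The names and structure are my own, and the pruning is only sound if the score is anti-monotonic, which slide 36 rightly questions:

```python
def mine_low_ce_sets(data, score, threshold):
    """Apriori-style levelwise search for attribute sets S with
    score(data, S) <= threshold; `score` would wrap a CE-based measure."""
    d = data.shape[1]
    level = [(j,) for j in range(d) if score(data, (j,)) <= threshold]
    result = list(level)
    while level:
        # Extend surviving sets in lexicographic order to avoid duplicates.
        candidates = {s + (j,) for s in level for j in range(s[-1] + 1, d)}
        level = [c for c in sorted(candidates) if score(data, c) <= threshold]
        result.extend(level)
    return result
```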

  25. Measuring Multivariate Correlations. MIC is exclusively defined for pairs: the score and the approach do not scale up to higher dimensions. Entrez, MAC: Multivariate Maximal Correlation Analysis. (Nguyen et al., 2014)

  26. Maximal Correlation Analysis. The maximal correlation of a set of real-valued random variables {X_i}_{i=1}^{d} is defined as mCor*(X_1, …, X_d) = max_{f_1, …, f_d} mCor(f_1(X_1), …, f_d(X_d)), where mCor is a correlation measure, each f_i : dom(X_i) → A_i is drawn from a pre-specified class F of functions, and A_i ⊆ ℝ.

  27. Total Correlation. Finds the chain of pairwise grids that minimizes the entropy, and thereby maximizes correlation. The total correlation of a dataset D is I(D) = Σ_{i=1}^{d} H(X_i) − H(X_1, …, X_d). (Nguyen et al., 2014)

  28. Maximal Discretized Correlation. Let’s say our data is real-valued, but we have a discretization grid G; then we have I(D | G) = Σ_i H(X_i^{g_i}) − H(X_1^{g_1}, …, X_d^{g_d}). To find the maximal correlation, we hence need to find that grid G for D such that I(D | G) is maximized. (Nguyen et al., 2014)

  29. Normalizing the Score. However, I(D | G) strongly depends on the number of bins n_i for attribute i. So, we should normalize by an upper bound: I(D | G) ≤ Σ_i log n_i − max_i {log n_i}. (Nguyen et al., 2014)

  30. Normalizing the Score. However, I(D | G) depends on the number of bins n_i for attribute i. So, we should normalize. We know I(D | G) ≤ Σ_i log n_i − max_i {log n_i}, by which we define I_n(D | G) = I(D | G) / (Σ_i log n_i − max_i {log n_i}) as the normalized total correlation. (Nguyen et al., 2014)
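
For a fixed grid, the normalized score is short to compute. A sketch (mine): `binned` is a list of already-discretized columns, so the choice of G happens before this function is called:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    # Shannon entropy (in nats) of a discrete sequence.
    n = len(labels)
    return -sum((c / n) * np.log(c / n) for c in Counter(labels).values())

def normalized_total_correlation(binned):
    """I_n(D|G) = (sum_i H(X_i) - H(X_1..X_d))
                  / (sum_i log n_i - max_i log n_i),
    where n_i is the number of bins of column i (needs d >= 2)."""
    marginals = [entropy(list(col)) for col in binned]
    joint = entropy(list(zip(*binned)))          # rows as tuples
    logs = [np.log(len(set(col))) for col in binned]
    return (sum(marginals) - joint) / (sum(logs) - max(logs))
```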

  31. MAC. After all that, we can now finally introduce MAC: MAC(D) = max over grids G = {g_1, …, g_d} with n_i × n_j < N^{1−ε} for all i ≠ j, of I_n(D | G). How do we compute MAC? How do we choose G? Through cumulative entropy! (Nguyen et al., 2014)

  32. MAC: GOOD. [example plots: Linear, Circle]

  33. MAC: NICE. [example plots: 20% noise, 80% noise]

  34. Mining with MAC. Super simple: Apriori-style.

  35. MAC: PRETTY. [example plots: 20% noise, 80% noise]

  36. Comparability of Scores. So, we use Apriori… but are CMI, MIC, MAC, etc. (anti-)monotonic? Is any meaningful correlation score monotonic?

  37. Spurious Correlations. ρ = 0.985

  38. Correlation does not imply… Correlation means a co-relation is observed, which does not imply a causal relation.

  39. Correlation does not imply… If X and Y are strongly correlated, this may have many reasons. Besides being spurious, it may be that X and Y are both the result of an unobserved process Z. Next week we’ll investigate whether we can somehow tell if X causes Y or vice versa.

  40. Correlation does not imply…

  41. Conclusions. Correlation is almost anything deviating from chance. Measuring multivariate correlation is difficult: especially if you want to be non-parametric, and even more so if you want to measure non-linear interactions. Entropy and mutual information are powerful tools: Shannon entropy for nominal data, cumulative entropy for ordinal data, and smart discretisation for multivariate CE.

  42. Thank you! ρ = 0.870
