Discovering Correlation
Jilles Vreeken, 5 June 2015
Questions of the day
What is correlation, how can we measure it, and how can we discover it?
Correlation
“the relationship between things that happen or change together” (Merriam-Webster)
[figure: scatter plot, r = 0.947]
Correlation
“a relation existing between phenomena or things or between mathematical or statistical variables which tend to vary, be associated, or occur together in a way not expected on the basis of chance alone” (Merriam-Webster)
Good Ol' Pearson
The Pearson product-moment correlation coefficient is one of the most well-known measures for correlation:

ρ(X, Y) = cov(X, Y) / (σ_X σ_Y) = E[(X − μ_X)(Y − μ_Y)] / (σ_X σ_Y)

That is, covariance divided by the product of the standard deviations. Pearson detects only linear correlations.
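A minimal sketch (variable names ours), computing Pearson exactly as defined above, plus a quick demo of why only linear dependence is detected:

```python
# Pearson correlation: covariance divided by the product of the
# standard deviations. A minimal sketch; names are ours.
import numpy as np

def pearson(x, y) -> float:
    """rho(X, Y) = E[(X - mu_X)(Y - mu_Y)] / (sigma_X * sigma_Y)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return cov / (x.std() * y.std())

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
print(pearson(x, 2 * x + rng.normal(scale=0.1, size=1000)))  # ~ +1: linear, detected
print(pearson(x, x ** 2))                                    # ~ 0: non-linear, missed
```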
Pearson in action (Wikipedia, yes really)
[figure: scatter plot, r = 0.998]
Chance alone…
Last week we discussed Shannon entropy and mutual information. Can we use these to measure correlation? Yes, we can! Shannon entropy works very well for discrete data, e.g. for finding low-entropy sets. For continuous-valued data: …
Shannon entropy for continuous data
As discussed last week, to compute

h(X) = −∫ p(x) log p(x) dx

we need to estimate the probability density function, choose a step size, and then hope for the best. If we don't know the distribution, we can use kernel density estimation, which requires choosing a kernel and a bandwidth. KDE is well-behaved for univariate data, but estimating multivariate densities is very difficult, especially in high dimensionalities.
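A minimal sketch of the KDE route, assuming scipy's gaussian_kde with its default (Scott's rule) bandwidth; the integration grid and its limits are our own choices:

```python
# Estimate differential entropy h(X) = -integral p(x) log p(x) dx
# via Gaussian KDE and trapezoidal integration. A sketch; the
# bandwidth and grid are exactly the choices the slide warns about.
import numpy as np
from scipy.stats import gaussian_kde

def kde_entropy(samples, grid_size: int = 2048) -> float:
    kde = gaussian_kde(samples)            # bandwidth: Scott's rule (scipy default)
    lo = samples.min() - 3 * samples.std()
    hi = samples.max() + 3 * samples.std()
    xs = np.linspace(lo, hi, grid_size)    # step size: our choice, "hope for the best"
    p = np.clip(kde(xs), 1e-300, None)     # avoid log(0)
    return -np.trapz(p * np.log(p), xs)

rng = np.random.default_rng(1)
print(kde_entropy(rng.normal(size=5000)))  # true value for N(0,1): 0.5*log(2*pi*e) ~ 1.4189
```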
MIC
MIC: the Maximal Information Coefficient. A few years back there was a big stir about MIC, a measure for non-linear correlations between pairs of variables. The main idea in a nutshell: if we want to measure the correlation of real-valued X and Y, why not discretize the data and compute mutual information!? That is, just find the discretization of X and Y for which the mutual information is maximal, and treat that value as the correlation measure. (Reshef et al, 2011)
MIC in a pic
Given D ⊆ ℝ² and integers x and y,

I*(D, x, y) = max I(D|G)

with G ranging over all grids of x columns and y rows. Normalise this score by independence,

M(D)_{x,y} = I*(D, x, y) / log min(x, y)

and return the maximum:

MIC(D) = max_{xy < B(n)} { M(D)_{x,y} }
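As a toy illustration (not the reference implementation of Reshef et al), here is a drastically simplified MIC in Python. Real MIC optimizes the grid cut points; for brevity we only try equal-frequency bins, so `mic_sketch` (our name) underestimates the true score:

```python
# Simplified MIC: discretize X and Y on an x-by-y grid, compute MI,
# normalize by log min(x, y), take the max over grids with x*y < B(n).
import numpy as np

def mutual_information(cx, cy) -> float:
    """Plug-in MI (in nats) of two integer-coded discrete variables."""
    joint = np.zeros((cx.max() + 1, cy.max() + 1))
    np.add.at(joint, (cx, cy), 1)
    p = joint / joint.sum()
    px, py = p.sum(axis=1, keepdims=True), p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / (px * py)[nz])).sum())

def equal_freq_bins(v, k: int):
    """Rank-based binning into k roughly equal-frequency bins."""
    ranks = np.argsort(np.argsort(v))
    return np.minimum(ranks * k // len(v), k - 1)

def mic_sketch(x, y) -> float:
    n, best = len(x), 0.0
    budget = n ** 0.6                        # the B(n) grid-size budget
    for cols in range(2, int(budget) + 1):
        for rows in range(2, int(budget / cols) + 1):
            mi = mutual_information(equal_freq_bins(x, cols),
                                    equal_freq_bins(y, rows))
            best = max(best, mi / np.log(min(cols, rows)))
    return best
```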
Mining with MIC
MIC is strictly defined for pairs of variables, which means… “mining” is easy! We measure MIC for every pair of attributes in our data, and then order the pairs by their MIC score.
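A sketch of that mining loop; `rank_pairs` is our name, and the fallback |Pearson| scorer is only a stand-in so the snippet runs on its own (plug in an MI-based scorer such as the `mic_sketch` above instead):

```python
# Score every pair of attributes and rank them, best first.
from itertools import combinations
import numpy as np

def rank_pairs(data, score=None):
    """data: n-by-m array; returns [(score, i, j)] sorted descending."""
    if score is None:                  # stand-in scorer for a runnable demo
        score = lambda a, b: abs(np.corrcoef(a, b)[0, 1])
    m = data.shape[1]
    scored = [(score(data[:, i], data[:, j]), i, j)
              for i, j in combinations(range(m), 2)]
    return sorted(scored, reverse=True)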
MIC: BAD
MIC is a nice idea, but… it is strictly for pairs, relies on heuristic optimization, doesn't like linear relationships, and doesn't like noise at all. And those are just a few of its drawbacks… Can we salvage the nice part? (Simon and Tibshirani, 2011)
Cumulative Distributions
F(x) = P(X ≤ x)
The cdf can be computed directly from data; no assumptions necessary.
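A minimal sketch of the empirical cdf (`ecdf` is our name):

```python
# The empirical cdf needs no model: sort the sample and step up
# by 1/n at each point.
import numpy as np

def ecdf(samples):
    xs = np.sort(samples)
    def F(x):                     # F(x) = fraction of samples <= x
        return np.searchsorted(xs, x, side="right") / len(xs)
    return F

F = ecdf(np.random.default_rng(2).normal(size=10_000))
print(F(0.0))   # ~ 0.5 for a standard normal sample
```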
Identifying Interacting Subspaces
Cumulative Entropy
Entropy has also been defined for cumulative distribution functions!

h_CE(X) = −∫ P(X ≤ x) log P(X ≤ x) dx

As 0 ≤ P(X ≤ x) ≤ 1, we obtain h_CE(X) ≥ 0 (!)
(Rao et al, 2004, 2005)
Cumulative Entropy
How do we compute h_CE(X) in practice? Easy. Let X_1 ≤ ⋯ ≤ X_n be i.i.d. random samples of a continuous random variable X; then

h_CE(X) = −Σ_{i=1}^{n−1} (X_{i+1} − X_i) (i/n) log(i/n)

(Rao et al, 2004, 2005; Crescenzo & Longobardi, 2009)
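The estimator translates directly into code; a minimal sketch (names ours):

```python
# Plug-in estimator: sort the sample, then
# h_CE(X) = -sum_{i=1}^{n-1} (x_{i+1} - x_i) * (i/n) * log(i/n).
import numpy as np

def cumulative_entropy(samples) -> float:
    xs = np.sort(np.asarray(samples, dtype=float))
    n = len(xs)
    p = np.arange(1, n) / n             # empirical P(X <= x_i), i = 1 .. n-1
    return float(-np.sum(np.diff(xs) * p * np.log(p)))

print(cumulative_entropy(np.random.default_rng(3).uniform(size=100_000)))
# for Uniform(0,1): h_CE = -integral x log x dx = 1/4
```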
Multivariate Cumulative Entropy
First things first. We need

h_CE(X | Y) = ∫ h_CE(X | y) p(y) dy

which, in practice, means

h_CE(X | Y) = Σ_{y ∈ Y} h_CE(X | y) p(y)

with y a discrete bin of data points over Y, and p(y) = |y| / n. How do we bin Y into y? We can simply cluster Y (Nguyen et al, 2013), or find the discretisation of Y for which h_CE(X | Y) is minimal (Nguyen et al, 2014); a sketch of the practical computation follows below.
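A minimal sketch of that computation, using plain equal-frequency bins over Y instead of the clustering or optimal discretisation from the papers; `conditional_ce` is our name:

```python
# h_CE(X | Y): partition Y into bins, compute h_CE of X within each
# bin, and weight by the bin mass |y|/n.
import numpy as np

def cumulative_entropy(samples) -> float:
    xs = np.sort(np.asarray(samples, dtype=float))
    p = np.arange(1, len(xs)) / len(xs)
    return float(-np.sum(np.diff(xs) * p * np.log(p)))

def conditional_ce(x, y, k: int = 10) -> float:
    """h_CE(X | Y) with Y cut into k equal-frequency bins."""
    n, total = len(y), 0.0
    for chunk in np.array_split(np.argsort(y), k):
        if len(chunk) > 1:
            total += cumulative_entropy(x[chunk]) * len(chunk) / n
    return total
```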
Cumulative Mutual Information
We cannot (realistically) calculate h_CE(X_1, …, X_d) in one go yet… Entropy has a factorization property, so what we can do is compute

Σ_{i=2}^{d} h_CE(X_i) − Σ_{i=2}^{d} h_CE(X_i | X_1, …, X_{i−1})

(Nguyen et al, 2013)
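A rough sketch of this factorized score; we realize the conditioning by intersecting a few equal-frequency bins per earlier attribute, which is cruder than the papers' discretisations, and all names are ours:

```python
# sum_{i>=2} h_CE(X_i) - sum_{i>=2} h_CE(X_i | X_1, ..., X_{i-1})
import numpy as np

def cumulative_entropy(samples) -> float:
    xs = np.sort(np.asarray(samples, dtype=float))
    p = np.arange(1, len(xs)) / len(xs)
    return float(-np.sum(np.diff(xs) * p * np.log(p)))

def bin_ids(v, k):
    return np.minimum(np.argsort(np.argsort(v)) * k // len(v), k - 1)

def cmi_score(data, k: int = 4) -> float:
    n, d = data.shape
    score = 0.0
    for i in range(1, d):
        labels = np.zeros(n, dtype=int)     # joint bin over X_1 .. X_{i-1}
        for j in range(i):
            labels = labels * k + bin_ids(data[:, j], k)
        cond = 0.0
        for lab in np.unique(labels):
            idx = labels == lab
            if idx.sum() > 1:
                cond += cumulative_entropy(data[idx, i]) * idx.sum() / n
        score += cumulative_entropy(data[:, i]) - cond
    return score
```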
Mining for Interaction
Super simple: Apriori-style.
Mining interacting attributes
CMI: use the Apriori principle, mine all attribute sets with h_CE below a given threshold. (Nguyen et al, 2013ab)
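A sketch of the levelwise Apriori-style search (`mine_levelwise` and its parameters are our invention); note it prunes correctly only if the score is (anti-)monotonic, which is exactly the question raised a few slides later:

```python
# Keep an attribute set if its score passes the threshold;
# extend only the survivors to the next level.
from itertools import combinations

def mine_levelwise(num_attrs, score, threshold):
    """score: frozenset of attribute indices -> float; keep if <= threshold."""
    results, frontier = [], [frozenset([i]) for i in range(num_attrs)]
    while frontier:
        survivors = [s for s in frontier if score(s) <= threshold]
        results.extend(s for s in survivors if len(s) > 1)
        # candidate generation: join survivors differing in exactly one attribute
        frontier = list({a | b for a, b in combinations(survivors, 2)
                         if len(a | b) == len(a) + 1})
    return results
```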
Measuring Multivariate Correlations
MIC is exclusively defined for pairs
• its score and approach do not scale up to higher dimensions
Entrez, MAC
• Multivariate Maximal Correlation Analysis
(Nguyen et al, 2014)
Maximal Correlation Analysis
The maximal correlation of a set of real-valued random variables {X_i}_{i=1}^{m} is defined as

mcor*(X_1, …, X_m) = max_{f_1, …, f_m} mcor(f_1(X_1), …, f_m(X_m))

where mcor is a correlation measure, f_i : dom(X_i) → B_i is drawn from a pre-specified class F of functions, and B_i ⊆ ℝ.
Total Correlation
Find the chain of pairwise grids that minimizes the entropy, i.e. that maximizes correlation. The total correlation of a dataset D is

I(D) = Σ_{i=1}^{m} H(X_i) − H(X_1, …, X_m)

(Nguyen et al, 2014)
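The formula translates directly into a plug-in estimate for discrete data; a minimal sketch (names ours):

```python
# Total correlation: I(D) = sum_i H(X_i) - H(X_1, ..., X_m),
# with empirical (plug-in) entropies.
import numpy as np
from collections import Counter

def entropy(counts) -> float:
    p = np.array(list(counts.values()), dtype=float)
    p /= p.sum()
    return float(-(p * np.log(p)).sum())

def total_correlation(columns) -> float:
    """columns: list of equal-length sequences of discrete values."""
    marginals = sum(entropy(Counter(col)) for col in columns)
    joint = entropy(Counter(zip(*columns)))
    return marginals - joint

print(total_correlation([[0, 0, 1, 1], [0, 0, 1, 1]]))  # = log 2: fully correlated
```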
Maximal Discretized Correlation
Let's say our data is real-valued, but we have a discretization grid G; then we have

I(D_G) = Σ_{i=1}^{m} H(X_i^{g_i}) − H(X_1^{g_1}, …, X_m^{g_m})

To find the maximal correlation, we hence need to find that grid G for D such that I(D_G) is maximized. (Nguyen et al, 2014)
Normalizing the Score
However, I(D_G) strongly depends on the number of bins c_i per attribute, so we should normalize by an upper bound. We know

I(D_G) ≤ Σ_{i=1}^{m} log c_i − max({log c_i}_{i=1}^{m})

by which we define

I_n(D_G) = I(D_G) / (Σ_{i=1}^{m} log c_i − max({log c_i}_{i=1}^{m}))

as the normalized total correlation. (Nguyen et al, 2014)
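A minimal sketch of the normalized score, assuming the columns have already been binned (integer-coded) under some grid G; names ours:

```python
# I_n(D_G): divide I(D_G) by its upper bound
# sum_i log(c_i) - max_i log(c_i), with c_i the number of bins.
import numpy as np
from collections import Counter

def _entropy(counter) -> float:
    p = np.array(list(counter.values()), dtype=float)
    p /= p.sum()
    return float(-(p * np.log(p)).sum())

def normalized_total_correlation(binned) -> float:
    """binned: list of integer-coded columns (one per attribute) under grid G."""
    logs = sorted(np.log(len(set(col))) for col in binned)
    bound = sum(logs[:-1])              # sum of log c_i minus the largest term
    i_dg = (sum(_entropy(Counter(col)) for col in binned)
            - _entropy(Counter(zip(*binned))))
    return i_dg / bound if bound > 0 else 0.0
```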
MAC
After all that, we can now finally introduce MAC:

MAC(D) = max_{G = {g_1, …, g_m}, ∀ i ≠ j : c_i × c_j < n^{1−ε}} I_n(D_G)

How do we compute MAC? How do we choose G? Through cumulative entropy! (Nguyen et al, 2014)
MAC: GOOD
[figures: MAC on linear and circle data]
MAC: NICE
[figures: MAC at 20% and 80% noise]
Mining with MAC
Super simple: Apriori-style.
MAC: PRETTY
[figures: results at 20% and 80% noise]
Comparability of Scores
So, we use Apriori… but… are CMI, MIC, MAC, etc. (anti-)monotonic? Is any meaningful correlation score monotonic?
Spurious Correlations
[figure: spuriously correlated time series, r = 0.985]
Correlation does not imply…
Correlation means a co-relation is observed, which does not imply a causal relation.
Correlation does not imply…
If X and Y are strongly correlated, this may have many reasons. Besides being spurious, it may be that X and Y are both the result of an unobserved process Z. Next week we'll investigate whether we can somehow tell if X causes Y or vice versa.
Conclusions
Correlation is almost anything deviating from chance. Measuring multivariate correlation is difficult
• especially if you want to be non-parametric
• even more so if you want to measure non-linear interactions
Entropy and Mutual Information are powerful tools
• Shannon entropy for nominal data
• cumulative entropy for ordinal data
• discretise smartly for multivariate CE
Thank you!
[figure: r = 0.870]