Determining Significance
Jilles Vreeken, 19 June 2015
Question of the day
How can we find things that are interesting with regard to what we already know?
What is interesting?
Something that increases our knowledge about the data.
What is a good result?
Something that reduces our uncertainty about the data (i.e., increases the likelihood of the data).
What is really good?
Something that, in simple terms, strongly reduces our uncertainty about the data (maximise likelihood, but avoid overfitting).
Measuring Uncertainty
We need access to the likelihood of data D given background knowledge B, such that we can calculate the gain for X.
…which distribution should we use?
Measuring Surprise
We need access to the likelihood of result X given background knowledge B, such that we can mine the data for X that have a low likelihood — that are surprising.
This is called the p-value of result X: the p-value corresponds to the Frequentist probability of the result being generated by the null hypothesis.
…which distribution should we use?
Background Knowledge
We do not want to have to choose a distribution. We want to be able to test significance against what we already know. That is, our null hypothesis is 'The results are explained by what we know about the data'.
But, what do we know about the data? And, how do we test against this?
Approach 1: Randomization
1. Mine the original data
2. Mine random data
3. Determine the probability: the fraction of better 'randoms' is the empirical p-value of result X
[figure: the original data and random data #1 … #N, each scored with score(X | D)]
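The three steps above can be sketched as follows. This is a minimal sketch, not the lecture's implementation: `score` and `sample_random_data` are hypothetical callables standing in for the data mining method and the randomization scheme.

```python
import random

def randomization_test(data, score, sample_random_data, n_randoms=999):
    """Empirical p-value of score(data) via randomization testing.

    `score(D)` returns a single number for dataset D (hypothetical);
    `sample_random_data(D)` draws one random dataset preserving the
    chosen background knowledge of D (hypothetical).
    """
    s_orig = score(data)                       # 1. mine the original data
    better = sum(                              # 2. mine N random datasets
        score(sample_random_data(data)) >= s_orig
        for _ in range(n_randoms)
    )
    return (better + 1) / (n_randoms + 1)      # 3. fraction of better randoms
```

For instance, if the score is invariant under the randomization (here: the maximum under permutation), every random dataset scores at least as well and the p-value is 1 — the result is not surprising at all.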
Empirical p-values
Let 𝐸 be our data and 𝐶 our background knowledge. Let 𝑉(𝐶) be the space of all data that satisfies 𝐶. Let 𝑇 ⊆ 𝑉(𝐶) be a uniform random sample of 𝑉(𝐶). Let 𝑆(𝐸) be the single number our data mining method results in (e.g., the frequency of an itemset, the number of frequent itemsets at a chosen minsup, the average value over some area, the clustering error, the compressed size of the data, the accuracy, etc.).
The empirical p-value of 𝑆(𝐸) being 'big' then is
( |{ 𝐸′ ∈ 𝑇 : 𝑆(𝐸′) ≥ 𝑆(𝐸) }| + 1 ) / ( |𝑇| + 1 )
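In code, this formula is a one-liner over the pre-computed scores of the random samples (the names here are illustrative):

```python
def empirical_p(s_orig, random_scores):
    """( |{E' in T : S(E') >= S(E)}| + 1 ) / ( |T| + 1 )"""
    better = sum(1 for s in random_scores if s >= s_orig)
    return (better + 1) / (len(random_scores) + 1)

# e.g. 5 of 9 random scores reach the original score of 5:
# (5 + 1) / (9 + 1) = 0.6
```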
More on empirical p-values
The empirical p-value of 𝑆(𝐸) being 'big' is
( |{ 𝐸′ ∈ 𝑇 : 𝑆(𝐸′) ≥ 𝑆(𝐸) }| + 1 ) / ( |𝑇| + 1 )
We have the +1s to avoid 0s. If 𝑇 = 𝑉(𝐶) this is an exact test, and then the +1s are not needed.
Clearly, the bigger the sample 𝑇 the better: it controls the maximum accuracy, the resolution of the empirical p-value. If you want to measure significance at 𝑝 = 0.05 you need at least 20 samples (and rather, many, many more).
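The resolution point can be checked directly: because of the +1s, the smallest attainable empirical p-value with |𝑇| random samples is 1/(|𝑇| + 1), even when no random sample scores as well as the original data.

```python
def min_empirical_p(n_samples):
    """Smallest attainable empirical p-value with n_samples randoms.

    Best case: no random sample scores as well as the original,
    so the numerator is 0 + 1.
    """
    return 1 / (n_samples + 1)

print(min_empirical_p(19))   # 0.05 -- the bare minimum for testing at 0.05
print(min_empirical_p(999))  # 0.001
```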
Random Data
So, now we just need lots of data sets that maintain our background knowledge, but are completely random otherwise.
How can we get our hands on such data? And how do we sample it uniformly at random?
This depends on the type of data, and the type(s) of background knowledge we want to maintain.
Example: Binary Data
For now, let us simply consider binary data.
Example: Binary Data
Let there be data:
1 1 1 0 1 1 1
0 1 1 0 1 0 1
1 1 1 1 0 0 0
1 1 1 1 0 0 1
0 1 1 1 0 0 0
0 1 1 1 0 1 0
0 0 0 0 1 0 0
(swap randomization, Gionis et al. 2005)
Example: Binary Data
Say we only know the overall density: 27 ones. How do we sample random data?
[the 7×7 binary matrix from the previous slide]
(swap randomization, Gionis et al. 2005)
Example: Binary Data
Didactically, let us instead consider a Monte-Carlo Markov Chain. A very simple scheme:
1. select two cells at random,
2. swap their values,
3. repeat until convergence.
[the 7×7 binary matrix, 27 ones]
(swap randomization, Gionis et al. 2005)
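A minimal sketch of this cell-swap chain, assuming a list-of-lists 0/1 matrix; note it preserves only the overall density (the total number of 1s), not the row or column margins.

```python
import random

def randomize_density(matrix, n_steps=10_000, seed=None):
    """Randomize a 0/1 matrix by repeatedly swapping two random cells.

    Preserves only the total number of 1s (the overall density);
    row and column margins are NOT preserved.
    """
    rng = random.Random(seed)
    m = [row[:] for row in matrix]  # work on a copy
    rows, cols = len(m), len(m[0])
    for _ in range(n_steps):
        r1, c1 = rng.randrange(rows), rng.randrange(cols)
        r2, c2 = rng.randrange(rows), rng.randrange(cols)
        m[r1][c1], m[r2][c2] = m[r2][c2], m[r1][c1]
    return m
```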
Swap Randomization
Margins are easily understandable for binary data; how can we sample data with the same margins?
1 1 1 0 1 1 1 | 6
0 1 1 0 1 0 1 | 4
1 1 1 1 0 0 0 | 4
1 1 1 1 0 0 1 | 5
0 1 1 1 0 0 0 | 3
0 1 1 1 0 1 0 | 4
0 0 0 0 1 0 0 | 1
3 6 6 4 3 2 3 | 27
(swap randomization, Gionis et al. 2005)
Swap Randomization
By MCMC!
1. randomly find a 2×2 submatrix with 1s on one diagonal and 0s on the other,
2. swap its values,
3. repeat until convergence.
Each such swap leaves every row and column margin unchanged.
[before/after matrices, both with row margins 6 4 4 5 3 4 1 and column margins 3 6 6 4 3 2 3]
(swap randomization, Gionis et al. 2005)
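The swap step can be sketched as below, again assuming a list-of-lists 0/1 matrix. Each accepted swap flips a 2×2 'checkerboard' submatrix ([[1,0],[0,1]] ↔ [[0,1],[1,0]]), which is exactly why all margins stay fixed.

```python
import random

def swap_randomize(matrix, n_steps=10_000, seed=None):
    """Swap randomization: an MCMC walk over 0/1 matrices with
    fixed row and column margins (after Gionis et al. 2005).
    """
    rng = random.Random(seed)
    m = [row[:] for row in matrix]
    rows, cols = len(m), len(m[0])
    for _ in range(n_steps):
        r1, r2 = rng.sample(range(rows), 2)   # pick two distinct rows
        c1, c2 = rng.sample(range(cols), 2)   # and two distinct columns
        a, b = m[r1][c1], m[r1][c2]
        c, d = m[r2][c1], m[r2][c2]
        if a == d and b == c and a != b:      # checkerboard: swappable
            m[r1][c1], m[r1][c2] = b, a
            m[r2][c1], m[r2][c2] = d, c
    return m
```

Non-checkerboard picks are simply rejected; this self-loop behaviour is part of the chain, not an error.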
Hopping through Sample Space
[figure: 𝐸 with swap-neighbours 𝐸′₂ … 𝐸′₅]
The neighbours 𝐸′ⱼ ∈ 𝑉(𝐶) of 𝐸 are all reachable with 1 swap from 𝐸.
Subtle issue
For unbiased testing, we need to sample uniformly from 𝑉(𝐶). Are all datasets in 𝑉(𝐶) reachable from 𝐸 by swapping? Can the 'swap-graph' of 𝑉(𝐶) be disconnected?
Subtle issue
For unbiased testing, we need to sample uniformly from 𝑉(𝐶). Are all datasets in 𝑉(𝐶) reachable from 𝐸 by swapping?
Theorem [Ryser '57]: if 𝐴 and 𝐵 are binary matrices with the same row and column margins, then 𝐴 is reachable from 𝐵 with a finite number of swaps.
Hopping through Sample Space A path through this graph is called a chain.
Beware!
Subsequent states in Markov chains are dependent, which means subsequent samples are dependent. This is not a problem if we let the chain converge between drawing samples, but estimating mixing time is hard.
If we simply take the original data as the starting point to sample random data, all samples will be biased.
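A common heuristic (a sketch, not a prescription from the lecture) is to discard a burn-in prefix to move away from the original data, and to run many swaps between consecutive draws to reduce their dependence. Choosing these step counts is exactly the hard mixing-time question; `swap_fn` is an assumed callable performing swap attempts.

```python
def draw_samples(matrix, n_samples, swap_fn, burn_in=50_000, thinning=10_000):
    """Draw n_samples along one swap chain.

    `swap_fn(m, n_steps)` is assumed to perform n_steps swap attempts
    on matrix m and return the result. `burn_in` reduces bias toward
    the original data; `thinning` reduces dependence between draws.
    Both are heuristics, not guarantees of convergence.
    """
    state = swap_fn(matrix, burn_in)         # move away from the original data
    samples = []
    for _ in range(n_samples):
        state = swap_fn(state, thinning)     # decorrelate from previous sample
        samples.append([row[:] for row in state])
    return samples
```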