Maximum Entropy & Subjective Interestingness
Jilles Vreeken
26 June 2015
Questions of the day: How can we find things that are interesting with regard to what we already know? How can we measure subjective interestingness?
What is interesting? Something that increases our knowledge about the data.
What is a good result? Something that reduces our uncertainty about the data (i.e. increases the likelihood of the data).
What is really good? Something that, in simple terms, strongly reduces our uncertainty about the data (maximise likelihood, but avoid overfitting).
Let’s make this visual: picture the universe of all possible datasets, with our dataset D somewhere inside it.
Given what we know (dimensions, margins), only the datasets possible given our current knowledge remain; D is among them.
More knowledge (dimensions, margins, pattern P1) shrinks this set of possible datasets further.
Fewer possibilities (dimensions, margins, patterns P1 and P2) mean less uncertainty about D.
Less uncertainty: once we know the key structure, the possible datasets narrow down to essentially our dataset D.
Maximising certainty: with patterns P1 and P2, the knowledge added by P2 is how much further it shrinks the set of datasets that remain possible around D.
How can we define ‘uncertainty’ and ‘simplicity’? Interpretability and informativeness are intrinsically subjective.
Measuring Uncertainty: We need access to the likelihood of data D given background knowledge B, $p(D \mid B)$, such that we can calculate the gain for X: $p(D \mid B \cup X) - p(D \mid B)$. …which distribution should we use?
Measuring Surprise: We need access to the likelihood of result X given background knowledge B, $p(X \mid B)$, such that we can mine the data for X that have a low likelihood, that are surprising. …which distribution should we use?
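One natural way to make this concrete (a sketch; this particular scoring choice is an assumption, not fixed by the slide) is to score a result X by its self-information under the background distribution, $-\log p(X \mid B)$: the lower the likelihood of X under what we already know, the larger this score, and the more surprising, hence subjectively interesting, the result.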
Approach 2: Maximum Entropy. ‘The best distribution $p^*$ satisfies the background knowledge, but makes no further assumptions.’ (Jaynes 1957; De Bie 2009)
Approach 2: Maximum Entropy. ‘The best distribution $p^*$ satisfies the background knowledge, but makes no further assumptions.’ In other words, $p^*$ assigns the correct probability mass to the background knowledge instances: $p^*$ is a maximum likelihood estimator. (Jaynes 1957; De Bie 2009)
Approach 2: Maximum Entropy. ‘The best distribution $p^*$ satisfies the background knowledge, but makes no further assumptions.’ In other words, $p^*$ spreads probability mass around as evenly as possible: $p^*$ does not have any specific bias. (Jaynes 1957; De Bie 2009)
Approach 2: Maximum Entropy. ‘The best distribution $p^*$ satisfies the background knowledge, but makes no further assumptions.’ Very useful for data mining: unbiased measurement of subjective interestingness. (Jaynes 1957; De Bie 2009)
Constraints and Distributions. Let B be our set of constraints, $B = \{f_1, \dots, f_n\}$. Let C be the set of admissible distributions, $C = \{\, p \in \mathbf{P} \mid \mathbb{E}_p[f_i] = \alpha_i \ \text{for all } f_i \in B \,\}$, where $\alpha_i$ is the required value of constraint $f_i$. We need the most uniformly distributed admissible $p \in C$.
Uniformity and Entropy. Uniformity ↔ entropy: $H(p) = -\sum_{x \in \mathbf{X}} p(X = x) \log p(X = x)$ tells us the entropy of a (discrete) distribution $p$.
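To make the uniformity–entropy link concrete, here is a minimal Python sketch (names and the toy distributions are illustrative) computing the entropy of a discrete distribution; the uniform distribution attains the maximum.

```python
import numpy as np

def entropy(q):
    """Shannon entropy H(q) = -sum_x q(x) log q(x) of a discrete distribution."""
    q = np.asarray(q, dtype=float)
    q = q[q > 0]                      # 0 log 0 is taken to be 0
    return -np.sum(q * np.log2(q))    # in bits; use np.log for nats

print(entropy([0.25, 0.25, 0.25, 0.25]))   # uniform: 2.0 bits, the maximum over 4 outcomes
print(entropy([0.7, 0.1, 0.1, 0.1]))       # ~1.36 bits: less uniform, lower entropy
```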
Maximum Entropy. We want access to the distribution $p^*$ with maximum entropy, $p^*_B = \arg\max_{p \in C} H(p)$, better known as the maximum entropy model for constraint set B.
Maximum Entropy. We want access to the distribution $p^*$ with maximum entropy, $p^*_B = \arg\max_{p \in C} H(p)$, better known as the maximum entropy model for constraint set B. It can be shown that $p^*$ is well defined: there always exists a unique $p^*$ with maximum entropy for any set C of admissible distributions. (That’s not completely true; some esoteric exceptions exist.)
Does this make sense? Any distribution with less-than-maximal entropy must have a reason for this. Less entropy means not-as-uniform-as-possible, that is, undue peaks of probability mass. In other words, reduced entropy = latent assumptions, exactly what we want to avoid!
Optimal worst-case. Recall that through Kraft’s inequality, probability distribution ↔ encoding. The MaxEnt distribution for B gives the minimum worst-case expected encoded length over any distribution that satisfies this background knowledge.
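In symbols (a sketch of this claim; $\mathcal{P}_B$ here denotes the set of distributions consistent with B, a symbol not used on the slide): identifying a distribution $q$ with a code of length $-\log q(D)$, the MaxEnt distribution is the minimax-optimal code,
$$p^{*} = \arg\min_{q} \; \max_{p \in \mathcal{P}_B} \; \mathbb{E}_{D \sim p}\big[-\log q(D)\big], \qquad \min_{q} \max_{p \in \mathcal{P}_B} \mathbb{E}_{p}\big[-\log q(D)\big] = H(p^{*}).$$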
Some examples. An interval? uniform. Mean and variance? Gaussian. Positive, with a given mean? exponential. Discrete, with a given mean? geometric. … But… what about distributions for things like data, patterns, and so on?
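As a worked instance of one of these rows: constraining only the mean $\mu$ of a nonnegative real variable yields the exponential distribution,
$$\max_{p} \; -\!\int_0^\infty p(x)\log p(x)\,dx \quad \text{s.t.} \quad \int_0^\infty x\,p(x)\,dx = \mu,\ \int_0^\infty p(x)\,dx = 1 \;\;\Longrightarrow\;\; p^*(x) = \tfrac{1}{\mu}\,e^{-x/\mu}.$$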
MaxEnt Theory. To use MaxEnt, we need theory for modelling data given background knowledge. Patterns: itemset frequencies (Tatti ’06, Mampaey et al. ’11). Binary data: margins (De Bie ’09), tiles (Tatti & Vreeken ’12). Real-valued data: margins (Kontonasios et al. ’11), sets of cells (Kontonasios et al. ’13).
Finding the MaxEnt distribution. You can find the MaxEnt distribution by solving the following optimisation problem with linear constraints:
$$\max_{p(x)} \; -\sum_x p(x) \log p(x) \quad \text{s.t.} \quad \sum_x p(x) f_i(x) = \alpha_i \ \text{for all } i, \qquad \sum_x p(x) = 1$$
(* for discrete data)
Exponential Form. Let $p$ be a probability density satisfying the constraints $\int p(x) f_i(x)\,dx = \alpha_i$ for $1 \le i \le n$. Then we can write the MaxEnt distribution as
$$p^* = p_{\lambda}(x) \;\propto\; \begin{cases} \exp\big(\lambda_0 + \sum_{f_i \in B} \lambda_i f_i(x)\big) & x \notin \mathcal{Z} \\ 0 & x \in \mathcal{Z} \end{cases}$$
where we choose the lambdas, Lagrange multipliers, to satisfy the constraints, and where $\mathcal{Z}$ is the collection of databases $x$ with $p(x) = 0$ for every distribution $p$ satisfying the constraints. (Csiszár 1975)
Solving the MaxEnt problem. The Lagrangian is
$$L\big(p(x), \mu, \lambda\big) = -\sum_x p(x)\log p(x) + \sum_i \lambda_i \Big(\sum_x p(x) f_i(x) - \alpha_i\Big) + \mu \Big(\sum_x p(x) - 1\Big).$$
We set the derivative w.r.t. $p(x)$ to 0 and get
$$p(x) = \frac{1}{Z(\lambda)} \exp\Big(\sum_i \lambda_i f_i(x)\Big),$$
where $Z(\lambda) = \sum_x \exp\big(\sum_i \lambda_i f_i(x)\big)$ is called the partition function.
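Spelling out the omitted step (standard calculus, not shown on the slide): setting the partial derivative of the Lagrangian with respect to each $p(x)$ to zero gives
$$\frac{\partial L}{\partial p(x)} = -\log p(x) - 1 + \sum_i \lambda_i f_i(x) + \mu = 0 \quad\Longrightarrow\quad p(x) = e^{\mu - 1}\exp\Big(\sum_i \lambda_i f_i(x)\Big),$$
and the normalisation constraint $\sum_x p(x) = 1$ forces $e^{\mu - 1} = 1/Z(\lambda)$.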
En Garde! We may substitute $p(x)$ back into the Lagrangian to obtain the dual objective
$$L(\lambda) = \log Z(\lambda) - \sum_i \lambda_i \alpha_i.$$
Minimizing the dual gives the maximal solution to the original problem. Moreover, it is convex.
Inferring the Model. Because the problem is convex, we can use any convex optimization strategy. Standard approaches include iterative scaling, gradient descent, conjugate gradient, Newton’s method, etc.
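As a minimal numeric sketch of this inference (the toy instance, a die whose mean is constrained to 4.5, and all names are illustrative assumptions), plain gradient descent on the convex dual already suffices:

```python
import numpy as np

# Toy instance: a die (sample space 1..6), one constraint f(x) = x with target mean 4.5.
xs = np.arange(1, 7, dtype=float)
alpha = 4.5                                    # required expectation E_p[f] = 4.5

def q_of(lam):
    """Exponential-form distribution q(x) proportional to exp(lam * f(x))."""
    w = np.exp(lam * xs)
    return w / w.sum()                         # w.sum() is the partition function Z(lam)

lam, step = 0.0, 0.1
for _ in range(2000):
    grad = np.sum(q_of(lam) * xs) - alpha      # gradient of the dual: E_q[f] - alpha
    lam -= step * grad                         # plain gradient descent on the convex dual

q = q_of(lam)
print("lambda =", round(lam, 4))               # ~0.37
print("q* =", np.round(q, 3))                  # mass tilted towards the high faces
print("E_q[f] =", round(float(np.sum(q * xs)), 3))   # ~4.5: the constraint is met
```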
Inferring the Model. Optimization requires calculating probabilities under $p$: for datasets and tiles this is easy; for itemsets and frequencies, however, this is PP-hard.
MaxEnt for Binary Databases. Constraints: the expected row and column margins,
$$\sum_{D \in \{0,1\}^{n \times m}} p(D) \sum_{j=1}^{m} d_{ij} = r_i, \qquad \sum_{D \in \{0,1\}^{n \times m}} p(D) \sum_{i=1}^{n} d_{ij} = c_j.$$
(De Bie 2010)
MaxEnt for Binary Databases. Using the Lagrangian, we can solve for $p(D)$:
$$p(D) = \prod_{i,j} \frac{1}{Z_{ij}(\lambda^r_i, \lambda^c_j)} \exp\big(d_{ij}(\lambda^r_i + \lambda^c_j)\big), \qquad \text{where } Z_{ij}(\lambda^r_i, \lambda^c_j) = \sum_{d_{ij} \in \{0,1\}} \exp\big(d_{ij}(\lambda^r_i + \lambda^c_j)\big).$$
MaxEnt for Binary Databases. Hey! $p(D)$ is a product of independent elements! That’s handy! We did not enforce this property; it is a consequence of MaxEnt. It follows that every element is Bernoulli distributed, with success probability
$$\frac{\exp(\lambda^r_i + \lambda^c_j)}{1 + \exp(\lambda^r_i + \lambda^c_j)}.$$
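A minimal sketch of fitting these multipliers in practice (the helper name maxent_margins and the toy margins are assumptions for illustration): since every cell is independently Bernoulli with success probability $\exp(\lambda^r_i + \lambda^c_j)/(1 + \exp(\lambda^r_i + \lambda^c_j))$, we can adjust the multipliers by simple gradient steps until the expected margins match the required ones.

```python
import numpy as np

def maxent_margins(row_sums, col_sums, n_iter=2000, step=0.1):
    """Fit the MaxEnt model for a binary n x m matrix under *expected* row and
    column margins.  Each cell is independently Bernoulli with success
    probability sigmoid(lr[i] + lc[j]); we move the Lagrange multipliers lr, lc
    by gradient steps until the expected margins match the required ones.
    A rough sketch; dedicated solvers (Newton, iterative scaling) converge faster."""
    r = np.asarray(row_sums, float)
    c = np.asarray(col_sums, float)
    lr = np.zeros(len(r))       # row multipliers lambda^r_i
    lc = np.zeros(len(c))       # column multipliers lambda^c_j
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(lr[:, None] + lc[None, :])))  # cell probabilities
        lr += step * (r - p.sum(axis=1))    # push expected row sums towards r
        lc += step * (c - p.sum(axis=0))    # push expected column sums towards c
    return p

# Toy example: a 3 x 4 binary dataset with the given margins
p = maxent_margins(row_sums=[3, 2, 1], col_sums=[2, 2, 1, 1])
print(np.round(p, 2))
print(p.sum(axis=1), p.sum(axis=0))   # ~[3, 2, 1] and ~[2, 2, 1, 1]
```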
What have you done for me lately? Okay, say we have this $p^*$, what is it useful for? Given $p^*$ we can: sample data from $p^*$ and compute empirical p-values (just like with swap randomization); compute the likelihood of the observed data; and compute how surprising our findings are given $p^*$, via exact p-values.
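To illustrate the first of these uses (a sketch; the cell probabilities, the tile, and the function name are illustrative assumptions), we can sample datasets cell by cell from the fitted Bernoulli model and count how often a tile is covered at least as densely as observed:

```python
import numpy as np
rng = np.random.default_rng(0)

def empirical_pvalue(p_cells, rows, cols, observed_support, n_samples=10_000):
    """Empirical p-value of a tile under the MaxEnt model: how often do the cells
    on rows x cols, sampled independently as Bernoulli(p_cells), contain at least
    as many ones as we observed in the real data?"""
    block = p_cells[np.ix_(rows, cols)]
    hits = 0
    for _ in range(n_samples):
        sample = rng.random(block.shape) < block   # sample just the relevant cells
        if sample.sum() >= observed_support:
            hits += 1
    return (hits + 1) / (n_samples + 1)            # add-one smoothed p-value

# Toy background model: cell probabilities, e.g. as fitted from expected margins
p_cells = np.array([[0.7, 0.6, 0.4, 0.3],
                    [0.5, 0.4, 0.3, 0.2],
                    [0.3, 0.2, 0.1, 0.1]])
# A 2 x 2 tile observed to be fully covered (4 ones) in the real data
print(empirical_pvalue(p_cells, rows=[0, 1], cols=[0, 1], observed_support=4))
```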
Expected vs. Actual. Swap randomization and MaxEnt can both maintain margins. MaxEnt constrains the expected margins; swap randomization constrains the actual margins. Does this matter?
MaxEnt Theory. To use MaxEnt, we need theory for modelling data given background knowledge. Binary data: margins (De Bie ’09), tiles (Tatti & Vreeken ’12). Real-valued data: margins (Kontonasios et al. ’11), arbitrary sets of cells (now). These models allow for iterative mining.
MaxEnt for Real-Valued Data. The current state of the art can incorporate means, variances, and higher-order moments, as well as histogram information, over arbitrary sets of cells. (Kontonasios et al. 2013)