  1. Maximum Entropy & Subjective Interestingness Jilles Vreeken 26 June 2015

  2. Questions of the day How can we find things that are interesting with regard to what we already know? How can we measure subjective interestingness?

  3. What is interesting? something that increases our knowledge about the data

  4. What is a good result? something that reduces our uncertainty about the data (i.e. it increases the likelihood of the data)

  5. What is really good? something that, in simple terms, strongly reduces our uncertainty about the data (maximise likelihood, but avoid overfitting)

  6. Let’s make this visual (figure: the universe of all possible datasets, containing our dataset D)

  7. Given what we know (figure: dimensions, margins; the datasets possible given current knowledge, within all possible datasets, around our dataset D)

  8. More knowledge... (figure: dimensions, margins, pattern P1; all possible datasets; our dataset D)

  9. Fewer possibilities... (figure: dimensions, margins, patterns P1 and P2; all possible datasets; our dataset D)

  10. Less uncertainty. (figure: dimensions, margins, the key structure; all possible datasets; our dataset D)

  11. Maximising certainty (figure: dimensions, margins, patterns P1 and P2; the knowledge added by P2; all possible datasets; our dataset D)

  12. How can we define ‘uncertainty’ and ‘simplicity’? Interpretability and informativeness are intrinsically subjective

  13. Measuring Uncertainty We need access to the likelihood of data D given background knowledge B, q(D ∣ B), such that we can calculate the gain for a result X as q(D ∣ B ∪ X) − q(D ∣ B) …which distribution should we use?

  14. Measuring Surprise We need access to the likelihood of a result X given background knowledge B, q(X ∣ B), such that we can mine the data for results X that have a low likelihood, that are surprising …which distribution should we use?

  15. Approach 2: Maximum Entropy ‘the best distribution q* satisfies the background knowledge, but makes no further assumptions’ (Jaynes 1957; De Bie 2009)

  16. Approach 2: Maximum Entropy ‘the best distribution q* satisfies the background knowledge, but makes no further assumptions’ in other words, q* assigns the correct probability mass to the background knowledge instances: q* is a maximum likelihood estimator (Jaynes 1957; De Bie 2009)

  17. Approach 2: Maximum Entropy ‘the best distribution q* satisfies the background knowledge, but makes no further assumptions’ in other words, q* spreads probability mass around as evenly as possible: q* does not have any specific bias (Jaynes 1957; De Bie 2009)

  18. Approach 2: Maximum Entropy ‘the best distribution q* satisfies the background knowledge, but makes no further assumptions’ very useful for data mining: unbiased measurement of subjective interestingness (Jaynes 1957; De Bie 2009)

  19. Constraints and Distributions Let C be our set of constraints, C = { f_1, …, f_n }. Let 𝒫 be the set of admissible distributions, 𝒫 = { q ∈ 𝒬 : E_q[f_j] = E_q̂[f_j] for f_j ∈ C }, where q̂ is the empirical distribution. We need the most uniformly distributed q ∈ 𝒫

  20. Uniformity and Entropy Uniformity ↔ Entropy: H(q) = −Σ_{x∈𝒳} q(X = x) log q(X = x) gives the entropy of a (discrete) distribution q
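
As a quick illustration (not part of the slides; the helper name entropy is my own), a few lines of Python compute H(q) for a discrete distribution given as a probability vector:

```python
import numpy as np

def entropy(q):
    # Shannon entropy H(q) = -sum_x q(x) log q(x); zero-probability outcomes contribute nothing
    q = np.asarray(q, dtype=float)
    q = q[q > 0]
    return -np.sum(q * np.log(q))

print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform over 4 outcomes: log 4 ~ 1.386, the maximum
print(entropy([0.70, 0.10, 0.10, 0.10]))  # peaked distribution: strictly lower entropy
```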

  21. Maximum Entropy We want access to the distribution q* with maximum entropy, q*_C = argmax_{q∈𝒫} H(q), better known as the maximum entropy model for constraint set C

  22. Maximum Entropy We want access to the distribution q* with maximum entropy, q*_C = argmax_{q∈𝒫} H(q), better known as the maximum entropy model for constraint set C. It can be shown that q* is well defined: there always exists a unique q* with maximum entropy for any admissible set 𝒫 (that’s not completely true, some esoteric exceptions exist)

  23. Does this make sense? Any distribution with less-than-maximal entropy must have a reason for this. Less entropy means not-as-uniform-as-possible, that is, undue peaks of probability mass. That is, reduced entropy = latent assumptions, exactly what we want to avoid!

  24. Optimal worst-case Recall that, through Kraft’s inequality, probability distribution ↔ encoding. The MaxEnt distribution for C gives the minimum worst-case expected encoded length over any distribution that satisfies this background knowledge.

  25. Some examples: an interval? uniform; mean and variance? Gaussian; mean and positive? exponential; mean and discrete? geometric; … But… what about distributions for things like data, patterns, and stuff?

  26. MaxEnt Theory To use MaxEnt, we need theory for modelling data given background knowledge. Binary data: margins (De Bie ’09), tiles (Tatti & Vreeken ’12). Real-valued data: margins (Kontonasios et al. ’11), sets of cells (Kontonasios et al. ’13). Patterns: itemset frequencies (Tatti ’06, Mampaey et al. ’11)

  27. Finding the MaxEnt distribution You can find the MaxEnt distribution by solving the following optimisation problem with linear constraints (for discrete data): max over q(x) of −Σ_x q(x) log q(x), s.t. Σ_x q(x) f_j(x) = β_j for all j, and Σ_x q(x) = 1
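
A minimal sketch of this constrained problem, assuming a tiny discrete domain and a single made-up feature constraint, solved with SciPy's general-purpose SLSQP solver (the names X, F, beta and neg_entropy are illustrative, not from the slides):

```python
import numpy as np
from scipy.optimize import minimize

X = np.arange(4)                      # toy domain: four outcomes 0..3
F = np.array([X], dtype=float)        # one feature per constraint; here f_1(x) = x
beta = np.array([2.0])                # constraint targets, E_q[f_1] = 2.0

def neg_entropy(q):
    q = np.clip(q, 1e-12, 1.0)        # guard against log(0)
    return np.sum(q * np.log(q))      # minimising this maximises entropy

constraints = [{'type': 'eq', 'fun': lambda q: q.sum() - 1.0}]
for f_j, b_j in zip(F, beta):
    constraints.append({'type': 'eq', 'fun': lambda q, f=f_j, b=b_j: f @ q - b})

res = minimize(neg_entropy, x0=np.full(len(X), 1.0 / len(X)),
               bounds=[(0.0, 1.0)] * len(X), constraints=constraints, method='SLSQP')
q_star = res.x                        # the MaxEnt distribution satisfying the constraints
```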

  28. Exponential Form Let q be a probability density satisfying the constraints ∫ q(x) f_j(x) dx = β_j for 1 ≤ j ≤ n. Then we can write the MaxEnt distribution as q*(x) = q_λ(x) ∝ exp( λ_0 + Σ_{f_j∈C} λ_j f_j(x) ) for x ∉ 𝒵, and q*(x) = 0 for x ∈ 𝒵, where we choose the lambdas, the Lagrange multipliers, to satisfy the constraints, and where 𝒵 is the collection of databases x with q(x) = 0 for every admissible q (Csiszár 1975)

  29. Solving the MaxEnt The Lagrangian is L(q, λ, μ) = −Σ_x q(x) log q(x) + Σ_j λ_j ( Σ_x q(x) f_j(x) − β_j ) + μ ( Σ_x q(x) − 1 ). We set the derivative w.r.t. q(x) to 0 and get q(x) = (1 / Z(λ)) exp( Σ_j λ_j f_j(x) ), where Z(λ) = Σ_x exp( Σ_j λ_j f_j(x) ) is called the partition function

  30. En Garde! We may substitute q(x) in the Lagrangian to obtain the dual objective L(λ) = log Z(λ) − Σ_j λ_j β_j. Minimizing the dual gives the maximal solution to the original problem. Moreover, it is convex.
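
A hedged sketch of this dual route, reusing the same toy domain and constraint as the primal sketch above: minimise log Z(λ) − Σ_j λ_j β_j by plain gradient descent, whose gradient is E_λ[f_j] − β_j (all variable names are my own):

```python
import numpy as np

X = np.arange(4)                      # same toy domain as before
F = np.array([X], dtype=float)        # feature matrix, one row per constraint f_j
beta = np.array([2.0])                # constraint targets beta_j

def q_lambda(lam):
    # q_lambda(x) proportional to exp(sum_j lambda_j f_j(x)); normalising gives Z(lambda)
    w = np.exp(lam @ F)
    return w / w.sum()

lam = np.zeros(len(beta))
for _ in range(2000):
    grad = F @ q_lambda(lam) - beta   # gradient of the dual: E_lambda[f_j] - beta_j
    lam -= 0.1 * grad                 # plain gradient step; any convex optimiser works

q_star = q_lambda(lam)                # agrees with the primal solution at convergence
```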

  31. Inferring the Model That the problem is convex means we can use any convex optimization strategy. Standard approaches include iterative scaling, gradient descent, conjugate gradient descent, Newton’s method, etc.

  32. Inferring the Model Optimization requires calculating expected values under the current model q_λ. For datasets and tiles this is easy; for itemsets and frequencies, however, this is PP-hard

  33. MaxEnt for Binary Databases Constraints: the expected row and column margins, Σ_{D ∈ {0,1}^{n×m}} q(D) Σ_{j=1}^{m} D_ij = r_i for every row i, and Σ_{D ∈ {0,1}^{n×m}} q(D) Σ_{i=1}^{n} D_ij = c_j for every column j (De Bie 2010)

  34. MaxEnt for Binary Databases Using the Lagrangian, we can solve this to q(D) = Π_{i,j} (1 / Z_ij(λ^r_i, λ^c_j)) exp( D_ij (λ^r_i + λ^c_j) ), where Z_ij(λ^r_i, λ^c_j) = Σ_{D_ij ∈ {0,1}} exp( D_ij (λ^r_i + λ^c_j) )

  35. MaxEnt for Binary Databases Using the Lagrangian, we can solve this to q(D) = Π_{i,j} (1 / Z_ij(λ^r_i, λ^c_j)) exp( D_ij (λ^r_i + λ^c_j) ). Hey! q(D) is a product of independent elements! That’s handy! We did not enforce this property, it’s a consequence of MaxEnt. Consequently, every element is Bernoulli distributed, with a success probability of exp(λ^r_i + λ^c_j) / (1 + exp(λ^r_i + λ^c_j))
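
A minimal sketch of fitting this model, assuming a small toy binary database and plain gradient descent on the row and column Lagrange multipliers (the names D, lam_r, lam_c and cell_probs are illustrative, not from the slides):

```python
import numpy as np

# toy 3x4 binary database; its row margins are r_i and its column margins c_j
D = np.array([[1, 1, 0, 1],
              [1, 0, 0, 0],
              [0, 1, 1, 0]], dtype=float)
r, c = D.sum(axis=1), D.sum(axis=0)

lam_r = np.zeros(D.shape[0])          # Lagrange multipliers for the row constraints
lam_c = np.zeros(D.shape[1])          # Lagrange multipliers for the column constraints

def cell_probs(lam_r, lam_c):
    # each cell is independently Bernoulli with success probability
    # exp(lam_r_i + lam_c_j) / (1 + exp(lam_r_i + lam_c_j))
    return 1.0 / (1.0 + np.exp(-(lam_r[:, None] + lam_c[None, :])))

for _ in range(5000):                 # gradient descent on the dual of this MaxEnt problem
    P = cell_probs(lam_r, lam_c)
    lam_r -= 0.05 * (P.sum(axis=1) - r)   # push expected row margins towards r
    lam_c -= 0.05 * (P.sum(axis=0) - c)   # push expected column margins towards c

P = cell_probs(lam_r, lam_c)          # now P.sum(axis=1) ~ r and P.sum(axis=0) ~ c
```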

  36. What have you done for me lately? Okay, say we have this q*, what is it useful for? Given q* we can: sample data from q* and compute empirical p-values (just like with swap randomization); compute the likelihood of the observed data; and compute how surprising our findings are given q*, and compute exact p-values
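
Continuing the sketch above, and reusing its D and P, one way to use the fitted q* is to draw sampled databases from the cell probabilities and compute an empirical p-value for an illustrative, made-up statistic:

```python
import numpy as np

rng = np.random.default_rng(0)

def statistic(M):
    # illustrative statistic: number of rows containing both of the first two items
    return int(np.sum(M[:, 0] * M[:, 1]))

observed = statistic(D)                       # D and P come from the previous sketch
samples = [statistic((rng.random(P.shape) < P).astype(float))
           for _ in range(10000)]
# empirical one-sided p-value, with the usual +1 smoothing
p_value = (1 + sum(s >= observed for s in samples)) / (1 + len(samples))
```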

  37. Expected vs. Actual Swap randomization and MaxEnt can both maintain margins. MaxEnt constrains the expected margins. Swap randomization constrains the actual margins. Does this matter?

  38. MaxEnt Theory To use MaxEnt, we need theory for modelling data given background knowledge. Binary data: margins (De Bie ’09), tiles (Tatti & Vreeken ’12). Real-valued data: margins (Kontonasios et al. ’11), arbitrary sets of cells (now). These allow for iterative mining

  39. MaxEnt for Real-Valued Data The current state of the art can incorporate means, variances, and higher-order moments, as well as histogram information over arbitrary sets of cells (Kontonasios et al. 2013)
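
As a small illustration of the real-valued case (assuming only the fact from slide 25 that a known mean and variance give a Gaussian MaxEnt model; the numbers below are made up), the surprisal −log q*(x) of an observed cell value under that model can be computed directly:

```python
import math

def surprisal(x, mean, var):
    # -log q*(x) under the Gaussian, the MaxEnt model for a known mean and variance
    return 0.5 * math.log(2 * math.pi * var) + (x - mean) ** 2 / (2 * var)

print(surprisal(5.1, mean=5.0, var=1.0))   # near the expected value: low surprise
print(surprisal(9.0, mean=5.0, var=1.0))   # far out in the tail: high surprise
```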
