2019 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Modern MDL meets Data Mining: Insight, Theory, and Practice
Jilles Vreeken (CISPA Helmholtz Center for Information Security)
Kenji Yamanishi (The University of Tokyo)
About the presenters
Jilles Vreeken and Kenji Yamanishi
About this tutorial
Approximately 3.5 hours long. An extensive, but incomplete, introduction to MDL theory and to MDL practice in data mining; naturally a bit biased.
Schedule
8:00am Opening
8:10am Introduction to MDL
8:50am MDL in Action
9:30am –––––– break ––––––
10:00am Stochastic Complexity
11:00am MDL in Dynamic Settings
Part 1: Introduction to MDL
Jilles Vreeken
Induction by Simplicity
“The simplest description of an object is the best”
Kolmogorov Complexity
The Kolmogorov complexity of a binary string 𝑦 is the length of the shortest program 𝑧∗ for a universal Turing machine 𝑉 that generates 𝑦 and halts:
𝐿_𝑉(𝑦) = min_𝑧 { 𝑚(𝑧) ∣ 𝑉(𝑧) halts and 𝑉(𝑧) = 𝑦 }
where 𝑚(𝑧) is the length of program 𝑧. (Solomonoff 1960, Kolmogorov 1965, Chaitin 1969)
Ultimately Impractical
Kolmogorov complexity 𝐿(𝑦), or rather, the Kolmogorov-optimal program 𝑦∗, is not computable. We can approximate it from above, but this is not very practical (there are simply not enough students to enumerate all Turing machines). We can also approximate it through off-the-shelf compressors, yet this has serious drawbacks (big-O constants, the question of what structure a compressor rewards, etc.).
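To make the compressor-based approximation a bit more tangible, here is a minimal Python sketch (ours, not code from the tutorial): zlib stands in for an arbitrary off-the-shelf compressor, giving a crude upper bound on description length, and on top of it the Normalized Compression Distance of Li, Vitanyi and colleagues. The function names and test strings are illustrative choices.

```python
# Minimal sketch: approximating description length with an off-the-shelf
# compressor; zlib is a stand-in for any compressor.
import zlib

def compressed_size(x: bytes) -> int:
    """Compressed size of x in bytes: a crude upper bound on its complexity."""
    return len(zlib.compress(x, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance: small when x and y share structure."""
    cx, cy, cxy = compressed_size(x), compressed_size(y), compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

print(ncd(b"abab" * 100, b"abab" * 80))           # low: shared structure
print(ncd(b"abab" * 100, bytes(range(256)) * 4))  # higher: little shared structure
```

The drawbacks mentioned above show up immediately: the result depends on the compressor's window size and on what regularities it happens to exploit.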
A practical variant
A more viable alternative is the Minimum Description Length principle: “the best model is the model that gives the best lossless compression”. There are two ways to motivate MDL; we’ll discuss both at a high level, and then go into more detail on what MDL is and can do.
Two-Part MDL
The Minimum Description Length (MDL) principle: given a set of hypotheses ℋ, the best hypothesis 𝐼 ∈ ℋ for given data 𝐸 is that 𝐼 that minimises 𝑀(𝐼) + 𝑀(𝐸 ∣ 𝐼), in which 𝑀(𝐼) is the length, in bits, of the description of 𝐼, and 𝑀(𝐸 ∣ 𝐼) is the length, in bits, of the description of the data when encoded using 𝐼. (see, e.g., Rissanen 1978, 1983, Grünwald 2007)
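As a rough illustration of the two-part score in practice (our sketch, not the tutorial's code), the following picks a polynomial degree by minimising 𝑀(𝐼) + 𝑀(𝐸 ∣ 𝐼), under the crude assumptions that every real-valued parameter costs a fixed 32 bits and that the residuals are encoded with a Gaussian noise model.

```python
# Crude two-part MDL for picking a polynomial degree (illustrative assumptions:
# 32 bits per real parameter, Gaussian code for the residuals).
import numpy as np

def two_part_score(x, y, degree, bits_per_param=32):
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    sigma2 = max(resid.var(), 1e-12)              # ML estimate of the noise variance
    n = len(y)
    # L(D|H): Gaussian negative log-likelihood of the residuals, in bits
    L_data = 0.5 * n * (np.log(2 * np.pi * sigma2) + 1) / np.log(2)
    # L(H): degree + 1 coefficients plus the noise variance, at fixed precision
    L_model = (degree + 2) * bits_per_param
    return L_model + L_data

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 200)
y = 1 - 2 * x + 3 * x**2 + rng.normal(0, 0.1, x.size)
best = min(range(9), key=lambda d: two_part_score(x, y, d))
print("degree chosen by two-part MDL:", best)     # typically 2 for this data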
Bayesian Learning
Bayes tells us that Pr(𝐼 ∣ 𝐸) = Pr(𝐸 ∣ 𝐼) × Pr(𝐼) / Pr(𝐸). This means we want the 𝐼 that maximises Pr(𝐼 ∣ 𝐸). Since Pr(𝐸) is the same for all models, we have to maximise Pr(𝐸 ∣ 𝐼) × Pr(𝐼), or, equivalently, minimise −log(Pr(𝐸 ∣ 𝐼)) − log(Pr(𝐼)).
From Bayes to MDL
So, Bayesian learning means minimising −log(Pr(𝐸 ∣ 𝐼)) − log(Pr(𝐼)). Shannon tells us that the −log transform takes us from probabilities to optimal prefix-code lengths. This means we are actually minimising 𝑀(𝐼) + 𝑀(𝐸 ∣ 𝐼) for some encoding 𝑀 for 𝐼 resp. 𝐸 ∣ 𝐼 corresponding to the distribution Pr.
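A tiny sketch of this correspondence (our illustration, with made-up priors and likelihoods): converting probabilities to Shannon code lengths with −log2 shows that the hypothesis with the highest posterior is exactly the one with the shortest total code.

```python
# The -log2 transform turns probabilities into (idealised) prefix-code lengths,
# so maximising Pr(D|H) * Pr(H) equals minimising L(H) + L(D|H).
import math

def bits(p: float) -> float:
    """Shannon code length, in bits, for an event of probability p."""
    return -math.log2(p)

# two hypothetical models with priors and likelihoods for the same data
models = {
    "H1": {"prior": 0.2, "likelihood": 1e-3},
    "H2": {"prior": 0.7, "likelihood": 1e-4},
}
for name, m in models.items():
    total = bits(m["prior"]) + bits(m["likelihood"])
    print(name, "L(H) + L(D|H) =", round(total, 2), "bits")
# H1 has the higher posterior Pr(H) * Pr(D|H), and indeed the shorter total code.
```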
Bayesian MDL
If we want to do MDL this way, i.e., being a Bayesian, we need to specify a prior probability Pr(𝑁) on the models, and a conditional probability Pr(𝐸 ∣ 𝑁) on the data given a model. What are reasonable choices?
What Distribution to Use?
For the data this is ‘easy’: a maximum-likelihood or a maximum-entropy model for Pr(𝐸 ∣ 𝑁) makes most sense. For the models this is ‘harder’: we could, e.g., use ‘whatever the expert says is a good distribution’, an uninformative prior on 𝑁, or (a derivative of) the universal prior from algorithmic statistics. These are not easy to compute or query, and are ad hoc. In MDL we say: if we are going to be ad hoc, let us do so openly, and use explicit universal encodings.
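One classic example of such an explicit universal encoding is the universal code for the integers. The sketch below (ours, following the standard construction with the constant c0 ≈ 2.865064) computes its code length in bits; the iterated-log sum means larger integers cost more bits, but no integer is ruled out in advance.

```python
# Universal code length for positive integers:
# L_N(n) = log2(c0) + log2(n) + log2(log2(n)) + ...  (positive terms only)
import math

def universal_integer_bits(n: int) -> float:
    """Code length, in bits, of the universal code for the integers (n >= 1)."""
    assert n >= 1
    c0 = 2.865064
    length = math.log2(c0)
    term = math.log2(n)
    while term > 0:
        length += term
        term = math.log2(term)
    return length

for n in (1, 10, 1000, 10**6):
    print(n, round(universal_integer_bits(n), 2), "bits")
```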
Information Criteria
MDL might make you think of either Akaike’s Information Criterion (AIC), k − ln(Pr(𝐸 ∣ 𝐼)), or the Bayesian Information Criterion (BIC), (k/2) ln(n) − ln(Pr(𝐸 ∣ 𝐼)), where k is the number of parameters and n the number of samples.
Information Criteria
In both criteria the data term −ln(Pr(𝐸 ∣ 𝐼)) is exactly a code length: AIC minimises k + 𝑀(𝐸 ∣ 𝐼), and BIC minimises (k/2) ln(n) + 𝑀(𝐸 ∣ 𝐼).
Information Criteria
That is, both criteria are two-part scores with a fixed model cost: 𝑀_AIC(𝐼) = k and 𝑀_BIC(𝐼) = (k/2) ln(n). We, however, do not assume that all parameters are created equal; we take their complexity into account.
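To contrast the model costs concretely (our sketch; the per-parameter MDL cost below is an illustrative stand-in, not a canonical code): AIC and BIC charge every parameter the same, while a two-part code can charge each parameter according to the precision and magnitude it is actually encoded at.

```python
# Model-cost terms only, expressed in bits.
import math

def l_aic(k: int) -> float:
    """AIC model cost: k nats, expressed in bits."""
    return k / math.log(2)

def l_bic(k: int, n: int) -> float:
    """BIC model cost: (k/2) ln n nats, i.e. (k/2) log2 n bits."""
    return 0.5 * k * math.log2(n)

def l_mdl_params(params, delta=0.01) -> float:
    """Stand-in MDL model cost: each parameter pays for the magnitude it is
    encoded at, here log2(1 + |value|/delta) bits for resolution delta (our choice)."""
    return sum(math.log2(1 + abs(p) / delta) for p in params)

params = [0.5, -2.0, 300.0]
print("AIC :", round(l_aic(len(params)), 1), "bits")
print("BIC :", round(l_bic(len(params), 1000), 1), "bits")
print("MDL :", round(l_mdl_params(params), 1), "bits")
```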
From Kolmogorov to MDL
Both Kolmogorov complexity and MDL are based on compression. Is there a relationship between the two? Yes: we can derive two-part MDL from Kolmogorov complexity. We’ll sketch here how. (see, e.g., Li & Vitanyi 1996, Vereshchagin & Vitanyi 2004 for details)
Objects and Sets
Recall that in Algorithmic Information Theory we are looking for (optimal) descriptions of objects. One way to describe an object is to describe a set of which it is a member, and to point out which of these members it is. In fact, we do this all the time: “the beach” (i.e., the set of all beaches), “over there” (pointing out a specific one).
Algorithmic Statistics
We have a set 𝑇, which we call a model, which has complexity 𝐿(𝑇), and an object 𝑦 ∈ 𝑇; 𝑇 is a model of 𝑦. The complexity of pointing out 𝑦 in 𝑇 is the complexity of 𝑦 given 𝑇, i.e., 𝐿(𝑦 ∣ 𝑇). Obviously, 𝐿(𝑦) ≤ 𝐿(𝑇) + 𝐿(𝑦 ∣ 𝑇).
So?
Algorithmic Information Theory states that every program that outputs 𝑦 and halts encodes the information in 𝑦, and that the smallest such program encodes only the information in 𝑦. If 𝑦 is a data set, i.e., a random sample, we expect it has epistemic structure, the “true” structure, captured by 𝑇, and aleatoric structure, the “accidental” structure, captured by 𝑦 ∣ 𝑇. We are hence interested in that model 𝑇 that minimises 𝐿(𝑇) + 𝐿(𝑦 ∣ 𝑇), which is surprisingly akin to two-part MDL.
More detail
For 𝐿(𝑇), this is simply the length of the shortest program that outputs 𝑇 and halts; i.e., a generative model of 𝑦. For 𝐿(𝑦 ∣ 𝑇), if 𝑦 is a typical element of 𝑇, there is no more efficient way to find 𝑦 in 𝑇 than by an index, i.e., 𝐿(𝑦 ∣ 𝑇) ≈ log |𝑇|.
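A concrete toy instance of 𝐿(𝑇) + 𝐿(𝑦 ∣ 𝑇) (our illustration, not code from the tutorial): take as model 𝑇 the set of all length-n binary strings with exactly k ones; stating k costs about log2(n+1) bits, and pointing out 𝑦 within 𝑇 costs about log2 |𝑇| = log2 C(n, k) bits.

```python
# Two-part code: model = "strings of length n with k ones", index within it.
from math import comb, log2

def two_part_bits(y: str):
    n, k = len(y), y.count("1")
    L_model = log2(n + 1)          # which of the n+1 possible values of k
    L_index = log2(comb(n, k))     # index of y among all strings with k ones
    return L_model, L_index

for y in ("0" * 64, "0" * 60 + "1" * 4, "01" * 32):
    L_T, L_y_T = two_part_bits(y)
    print(f"L(T)={L_T:.1f}  L(y|T)={L_y_T:.1f}  total={L_T + L_y_T:.1f} bits"
          f"  (raw: {len(y)} bits)")
```

Note that for the balanced string the two-part total exceeds the raw 64 bits: this particular model family only captures the number of ones, not any ordering structure.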
Kolmogorov’s Structure Function
This suggests a way to discover the best model. Kolmogorov’s structure function is defined as ℎ_𝑦(𝑗) = min_𝑇 { log |𝑇| ∣ 𝑦 ∈ 𝑇, 𝐿(𝑇) ≤ 𝑗 }. That is, we start with very simple models, in terms of complexity, and gradually work our way up. (see, e.g., Li & Vitanyi 1996, Vereshchagin & Vitanyi 2004)
The MDL function
This suggests a way to discover the best model. Kolmogorov’s structure function ℎ_𝑦(𝑗) = min_𝑇 { log |𝑇| ∣ 𝑦 ∈ 𝑇, 𝐿(𝑇) ≤ 𝑗 } defines the MDL function 𝜇_𝑦(𝑗) = min_𝑇 { 𝐿(𝑇) + log |𝑇| ∣ 𝑦 ∈ 𝑇, 𝐿(𝑇) ≤ 𝑗 }. We try to find the minimum by considering increasingly complex models. (see Vereshchagin & Vitanyi 2004)
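The sweep over increasingly complex models is easy to mimic with a computable stand-in for the model classes (our sketch, under crude assumptions): here the models of growing complexity are Markov chains of increasing order, each charged 32 bits per parameter, and we track the total two-part length as complexity grows.

```python
# Sweep over Markov models of order m for a binary string, tracking
# L(T) + L(y|T); L(T) is 32 bits per context parameter (an assumption),
# L(y|T) the empirical conditional log-loss in bits.
from collections import Counter
from math import log2

def mdl_sweep(y: str, max_order: int = 4, bits_per_param: int = 32):
    scores = {}
    for m in range(max_order + 1):
        ctx_counts, pair_counts = Counter(), Counter()
        for i in range(m, len(y)):
            ctx = y[i - m:i]
            ctx_counts[ctx] += 1
            pair_counts[(ctx, y[i])] += 1
        L_data = -sum(c * log2(c / ctx_counts[ctx])
                      for (ctx, _), c in pair_counts.items())
        L_model = (2 ** m) * bits_per_param    # one Bernoulli parameter per context
        scores[m] = L_model + L_data
    return scores

y = "01" * 512                                 # strongly order-1 structure
for m, total in mdl_sweep(y).items():
    print(f"order {m}: {total:.1f} bits")
# the minimum of the sweep picks order 1 as the best model for this string
```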
Two-Part MDL
The Minimum Description Length (MDL) principle: given a set of hypotheses ℋ, the best hypothesis 𝐼 ∈ ℋ for given data 𝐸 is that 𝐼 that minimises 𝑀(𝐼) + 𝑀(𝐸 ∣ 𝐼), in which 𝑀(𝐼) is the length, in bits, of the description of 𝐼, and 𝑀(𝐸 ∣ 𝐼) is the length, in bits, of the description of the data when encoded using 𝐼. (see, e.g., Rissanen 1978, 1983, Grünwald 2007)